API reference

This page documents the public Python API exposed by hugiml-core. The manual sections in the user guide explain how these APIs fit together in a modeling workflow; the reference below is generated from the source docstrings.

Core estimator

HUGIMLClassifier — C++ accelerated, scikit-learn compatible classifier.

HUGIMLClassifier is the primary public class name. HUGIMLClassifierNative remains as a backward-compatible alias.

Implements the High Utility Gain Interpretable Machine Learning (HUG-IML) algorithm from:

Krishnamoorthy, S. (2024). Interpretable Classifier Models for Decision Support Using High Utility Gain Patterns. IEEE Access, 12, 126088–126107. DOI: 10.1109/ACCESS.2024.3455563

Computationally intensive stages (discretisation, transaction construction, pattern mining, matrix assembly) run at native speed via a compiled C++ extension with optional OpenMP parallelism. The Python layer handles DataFrame ingestion, column-type detection, downstream estimation, explanation methods, monitoring, and drift detection.

Architecture

C++ extension (_hugiml_core):: Discretisation, transaction construction, top-K HUI pattern mining with information-gain filtering, bitmap-accelerated matrix assembly, OpenMP parallel pattern matching.
Python layer:: Column-type detection (prepareXy), NaN/Inf imputation, downstream sklearn estimator (LogisticRegression default, with optional saga/SGD downstream solvers), explanation methods (get_hug_features, get_pattern_info, feature_importances), versioned model serialisation, prediction monitoring, multi-method drift detection, latency SLA enforcement, and graceful degradation under memory pressure.

Quick start

Two usage paths are supported:

Path A — prepareXy (recommended when the full dataset is available upfront):

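from sklearn.model_selection import train_test_split

from hugiml import HUGIMLClassifier

# X_df is a pandas DataFrame of features and y_series the matching labels.
clf = HUGIMLClassifier()
X, y = clf.prepareXy(X_df, y_series)  # detect column types on the full dataset
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y)
clf.fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)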

print(clf.model_summary())
print(clf.feature_importances())

Path B — allCols + origColumns (cross-validation loops):

clf = HUGIMLClassifier(
    allCols=[int_cols, float_cols, cat_cols],
    origColumns=X_df.columns.tolist(),
)
clf.fit(X_train, y_train)

Monitoring and drift detection:

clf.enable_monitoring()
clf.predict_proba(X_new)
print(clf.monitor.report())

drift = clf.detect_drift(X_new)
print(drift)

Versioned serialisation:

clf.save_model("model.hugiml")
clf2 = HUGIMLClassifier.load_model("model.hugiml")

class hugiml.classifier.HUGIMLClassifier(allCols=None, origColumns=None, B=8, L=1, G=0.001, topK=30, base_estimator=None, lr_solver='auto', n_jobs=1, max_predict_ms=None, max_fit_seconds=None, max_mining_seconds=None, verbose=False, adaptive_binning=True, b_candidates=None, min_marginal_gain_ratio=0.02, adaptive_binning_sample_frac=False, adaptive_binning_sample_random_state=42, convert_binary_to_categorical=False, feature_mode='patterns_only', use_hotpath=True, augmented_pair_transforms=True, augmented_pair_mode='interaction_information', ii_partner_size=None, aug_feature_size=10, max_pair_features=10, augmented_pair_max_features=None, topk_budget_strict=False, dense_downstream_max_width=200, execution_mode='audit', interaction_relaxed_mining=False, interaction_relaxed_feature_size=10)

Bases: _EstimatorMixin, _BinningMixin, _TrainingMixin, _FeatureAssemblyMixin, _InterpretationMixin, _PredictionMixin, _InspectionMixin, TransformerMixin, ClassifierMixin, BaseEstimator

HUG-IML interpretable classifier — C++ accelerated, scikit-learn compatible.

Extracts High Utility Gain (HUG) patterns from labelled tabular data, transforms the input into a binary pattern-presence matrix, and fits an interpretable downstream classifier. The mined patterns are human-readable and serve as the primary source of model explanations.
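The transformation into a binary pattern-presence matrix can be pictured with a small pure-Python sketch. It is only an illustration: the "column=value" item encoding, the hand-picked patterns, and the helper names (to_items, pattern_matrix) are assumptions made for this example, whereas HUGIMLClassifier mines its patterns from labelled data and assembles the matrix in the native extension.

```python
# Toy version of "rows -> binary pattern-presence matrix".
# Items are written as "column=value" strings; the patterns are hand-picked
# here, whereas HUGIMLClassifier mines them from labelled data.

rows = [
    {"age": "high", "smoker": "yes"},
    {"age": "high", "smoker": "no"},
    {"age": "low", "smoker": "yes"},
]

patterns = [
    ("age=high",),                 # length-1 pattern (L=1)
    ("age=high", "smoker=yes"),    # length-2 pattern (L=2)
]


def to_items(row):
    """Turn one record into its set of 'column=value' items."""
    return {f"{col}={val}" for col, val in row.items()}


def pattern_matrix(rows, patterns):
    """Entry [i][j] is 1 when row i contains every item of pattern j."""
    item_sets = [to_items(row) for row in rows]
    return [
        [int(all(item in items for item in pattern)) for pattern in patterns]
        for items in item_sets
    ]


matrix = pattern_matrix(rows, patterns)
for row in matrix:
    print(row)
```

Each column of the result corresponds to one human-readable pattern, which is why a coefficient on that column can be read directly as the effect of the pattern.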

Parameters:

allCols (list of 3 lists, optional) – [int_col_names, float_col_names, cat_col_names]. Must be paired with origColumns.
origColumns (list of str, optional) – Ordered column names matching the columns of X passed to fit/predict.
B (int, default 8) – Number of quantile bins per numerical feature when adaptive_binning=False. With adaptive binning enabled, per-feature bin counts are selected from b_candidates.
L (int, default 1) – Maximum HUG pattern length. 1 = singletons; 2 = pairs; -1 = unlimited.
G (float, default 1e-3) – Minimum information-gain threshold.
topK (int, default 30) – Maximum number of patterns to retain. -1 computes automatically.
base_estimator (sklearn estimator, optional) – Downstream classifier trained on the selected representation. Defaults to LogisticRegression. An explicit LogisticRegression using the liblinear solver is fitted directly for binary targets and through one-vs-rest classification for targets with three or more classes.
lr_solver ({"auto", "saga", "sgd"}, default "auto") – Downstream linear classifier used when base_estimator is not supplied. "auto" uses L1-regularized logistic regression: binary classifiers use the liblinear solver and multiclass classifiers use the saga solver. "saga" uses LogisticRegression(solver="saga"). "sgd" uses SGDClassifier(loss="log_loss") so large sparse downstream matrices can be trained with stochastic gradient descent. All built-in choices keep the existing deterministic random_state=0 and max_iter=500 defaults.
n_jobs (int, default 1) – Number of OpenMP threads. -1 uses all available cores.
max_predict_ms (float or None) – Prediction latency budget in milliseconds.
max_fit_seconds (float or None) – Backward-compatible alias for the mining-stage wall-clock budget. Transaction preparation and downstream model fitting are not bounded by this value. Prefer max_mining_seconds for new code.
max_mining_seconds (float or None) – Wall-clock budget, in seconds, for native pattern mining. This is especially useful for explicit high-order bounded mining such as L=4/L=5/larger values. Use 1800 for a 30-minute mining cap. When unset, max_fit_seconds is used for backward compatibility. Partial patterns mined before timeout are retained, and attempt-level details are recorded in mining_audit_log_.
adaptive_binning (bool, default True) – Select per-feature numeric bin counts using supervised information gain.
b_candidates (list of int or None) – Candidate bin counts evaluated when adaptive binning is enabled.
min_marginal_gain_ratio (float, default 0.02) – Elbow threshold for adaptive-binning marginal gain.
adaptive_binning_sample_frac (float or bool, default False) – Fraction of training rows used for adaptive-bin selection. False uses all rows; a float in (0, 1] uses a deterministic stratified sample for selecting edges before applying those edges to all rows.
adaptive_binning_sample_random_state (int, default 42) – Random seed used when adaptive_binning_sample_frac requests a stratified sample.
convert_binary_to_categorical (bool, default False) – When enabled, numeric columns with exactly two observed values are inferred as categorical indicators during automatic column detection. The default keeps them numeric so they remain eligible for numeric interaction and augmented-pair paths. The named performance grids explicitly keep this disabled, while the named interpretability grids enable it for the categorical pattern surface. Explicit allCols metadata takes precedence over this inference option.
feature_mode ({"patterns_only", "original_plus_patterns", "original_plus_interactions"}, default "patterns_only") – Downstream representation used by fit/predict APIs. transform(X) always returns the HUG pattern matrix.
use_hotpath (bool, default True) – Use the fused native L=1 preparation/mining/matrix path when eligible. Disable only for diagnostic equivalence checks against the staged path.
augmented_pair_transforms (bool, default True) – Enable downstream augmented-pair operator features for eligible L >= 2 adaptive-binning models.
augmented_pair_mode ({"interaction_information", "marginal_ig"}, default "interaction_information") – Source-column scorer for augmented-pair features.
ii_partner_size (int or None) – Optional partner-search bound for interaction-information scoring.
aug_feature_size (int, default 10) – Number of source columns retained for augmented-pair candidate generation in interaction-information mode.
max_pair_features (int, default 10) – Source-column budget used by the marginal-IG augmented-pair mode.
augmented_pair_max_features (int or None) – v1.1.11-compatible alias for the augmented-pair source budget. When provided with default new budgets, it maps to both aug_feature_size and max_pair_features.
topk_budget_strict (bool, default False) – Apply one global topK cap across the constructed downstream representation.
dense_downstream_max_width (int, default 200) – Width threshold below which downstream matrices may stay dense.
execution_mode ({"audit", "production"}, default "audit") – Artifact-retention mode.
interaction_relaxed_mining (bool, default False) – Allow interaction-information survivors to participate in native mining as original-feature bins without creating augmented-pair operator columns. Relaxed admission covers the root and its immediate first-child pairing partner; deeper positions receive no new admission exemption. The generic miner still requires the constructed child pattern to clear its joint information-gain gate. Mutually exclusive with augmented-pair transforms at L >= 2.
interaction_relaxed_feature_size (int, default 10) – Survivor-source budget for interaction-relaxed mining.
verbose (bool, default False) – Emit INFO-level log messages during fit.
Attributes (available after fit):

classes (ndarray) – Unique class labels.
n_features_in (int) – Number of input features.
feature_names_in (list or None) – Column names from training data.
cat_cols_mask (ndarray[bool]) – True for categorical columns.
is_int_mask (ndarray[bool]) – True for integer columns.
td (_TransactionDataWrapper) – Discretisation artefacts.
patterns (list) – Mined HUG patterns.
x_train_hup (csr_matrix) – Binary training pattern matrix.
model (Pipeline) – Fitted downstream estimator.
fit_metadata (FitMetadata) – Timings, memory, pattern stats.
monitor (PredictionMonitor or None) – Prediction statistics.
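To make the role of the G threshold more concrete, the sketch below computes the information gain of a 0/1 pattern-presence column against the class labels and compares it with G. It is a simplified illustration: the helper names are hypothetical, the gain is measured in bits, and the comparison `ig >= G` is an assumption. The native miner additionally uses utility-based pruning that is not modelled here.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def information_gain(present, labels):
    """Entropy reduction from splitting rows by a 0/1 pattern-presence flag."""
    n = len(labels)
    gain = entropy(labels)
    for flag in (0, 1):
        subset = [y for p, y in zip(present, labels) if p == flag]
        if subset:
            gain -= (len(subset) / n) * entropy(subset)
    return gain


G = 1e-3  # default threshold from the parameter list (comparison is assumed)
labels = [1, 1, 1, 0, 0, 0, 1, 0]
informative = [1, 1, 1, 0, 0, 0, 1, 0]    # perfectly aligned with the label
uninformative = [1, 0, 1, 0, 1, 0, 0, 1]  # carries no label signal here

for name, col in [("informative", informative), ("uninformative", uninformative)]:
    ig = information_gain(col, labels)
    print(name, round(ig, 4), "kept" if ig >= G else "dropped")
```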

classmethod fast_grid_tune(X_train, y_train, X_val, y_val, param_grid=None, *, base_params=None, scoring='roc_auc', refit_full=False, return_results=True)

Exact cached tuner for the compact adaptive HUGIML grid.

Requirements

adaptive_binning=True for every candidate.
G may vary; the tuner partitions candidates into constant-G cache groups.
Only G, L, topK, and feature_mode vary. B may appear in the grid but is ignored for cache partitioning because adaptive binning chooses per-feature bins and fit() passes sentinel B=2 to the native transaction builder.
max_fit_seconds and max_mining_seconds must be None to guarantee equivalence to the ordinary grid loop; timeout/degradation can make cached mining fits differ from standalone candidates.
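The caching idea behind these requirements can be sketched without the library. The example below expands a small grid and groups candidates first by G and then by (L, topK), so candidates that differ only in feature_mode would share one mining fit. The grid values and the nested-dictionary layout are assumptions for illustration, not the tuner's internal data structures.

```python
from collections import defaultdict
from itertools import product

param_grid = {
    "G": [1e-3, 1e-2],
    "L": [1, 2],
    "topK": [30, 50],
    "feature_mode": ["patterns_only", "original_plus_patterns"],
}

# Expand the grid into one dict per candidate.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Group candidates that could share mining work: same G first, then the
# same (L, topK) mining configuration. feature_mode only changes the
# downstream representation, so it does not split a group in this sketch.
groups = defaultdict(lambda: defaultdict(list))
for cand in candidates:
    groups[cand["G"]][(cand["L"], cand["topK"])].append(cand)

n_mining_fits = sum(len(by_mining) for by_mining in groups.values())
print(len(candidates), "candidates ->", n_mining_fits, "mining fits")
```

In this toy grid, 16 candidates collapse into 8 mining configurations, which is where a cached tuner can save time relative to fitting every candidate from scratch.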

base_params['execution_mode'] defaults to 'production' when not supplied. A tuning sweep evaluates many candidates, and the line-level allocation diagnostics that 'audit' provides are rarely useful for any individual candidate; pass base_params={'execution_mode': 'audit', ...} to opt back in. The setting affects only line-level tracing cost and the post-fit retention of training matrices. It does not change which patterns are mined, the scores, or the rankings, all of which are identical either way. The per-(G, L, topK) cache fits that perform the actual mining are tagged 'audit' internally regardless, so the candidates derived from them keep access to the cached pattern matrix.

With refit_full=False (the default), best_model is one of the lightweight per-candidate objects built during the search. Drift-baseline and rich metadata are never computed for it, regardless of execution_mode, and under the 'production' default its training pattern matrix is not retained either, so get_pattern_info() also raises on it. For a model that will be inspected afterward rather than only used for prediction, call this function again with refit_full=True, or call fit() directly with the selected params.

With refit_full=True, the refit that produces the returned best_model defaults to execution_mode='audit' regardless of what the search candidates used, unless the caller’s own base_params explicitly names an execution_mode. The search benefits from 'production' for speed across many candidates, but the one model handed back for the caller to keep and inspect should support get_pattern_info(), detect_drift(), get_drift_psi(), and feature_importances() without missing-artifact warnings, so it does not inherit the search’s speed-oriented default.

Returns a dict with best_model, best_params, best_score, cv_results, and cache timings. Uses the same scorer as the ordinary grid path for all supported scoring values. During tuning it skips drift-baseline and rich final metadata; set refit_full=True to refit the selected model with normal fit().

Parameters:

X_train (Any)
y_train (Any)
X_val (Any)
y_val (Any)
param_grid (dict[str, list] | str | None)
base_params (dict[str, Any] | None)
scoring (str)
refit_full (bool)
return_results (bool)

Return type:

dict[str, Any]

set_predict_proba_request(*, X_test='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the predict_proba method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to predict_proba if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict_proba.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

X_test (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_test parameter in predict_proba.
self (HUGIMLClassifier)

Returns:

self – The updated object.

Return type:

object

classmethod tune(X, y, *, cv=5, scoring='roc_auc', param_grid=None, refit=True, base_params=None, random_state=42, shuffle=True, cv_splits=None, use_fast_path=True, return_dataframe=True)

Tune HUGIML on full X, y using stratified CV and optional fast-grid caching.

This is the main public convenience API for quick HUGIML model selection. The regular constructor remains a single-configuration estimator; this method owns grid search, cross-validation, aggregation, and optional refit.

Parameters:

X (array-like or DataFrame/Series) – Full training data.
y (array-like or DataFrame/Series) – Target labels aligned with the rows of X.
cv (int or splitter, default=5) – Number of stratified folds, or any sklearn-compatible splitter with split(X, y). Integer cv uses StratifiedKFold.
scoring ({'roc_auc', 'accuracy', 'balanced_accuracy', 'f1', 'f1_macro', 'f1_weighted'}) – Validation metric. ‘roc_auc’ supports binary and multiclass OVR macro AUC.
param_grid (dict, str, or None) – A dict (sklearn-style grid), the name of a grid registered in hugiml.hyperparameter_configs (for example ‘interpretability’), or None to use HUGIMLClassifier.default_param_grid().
refit (bool, default=True) – If True, refit the best configuration on the full X, y with normal fit().
base_params (dict or None) – Constructor parameters shared by every candidate. execution_mode defaults to 'production' for the candidates evaluated during the search, since per-candidate line-level allocation diagnostics are rarely useful mid-sweep; mining results, scores, and the resulting ranking are identical to 'audit' either way. result.best_estimator_, the one model actually returned, is refit under execution_mode='audit' regardless, so get_pattern_info(), detect_drift(), get_drift_psi(), and feature_importances() all work on it without missing-artifact warnings. If base_params explicitly names an execution_mode, that value is used everywhere (search and final refit alike) and the defaulting described here is skipped entirely.
random_state (int or None, default=42) – Random seed for StratifiedKFold when cv is an integer.
shuffle (bool, default=True) – Whether StratifiedKFold shuffles before splitting.
cv_splits (list of (train_idx, val_idx) or None, default=None) – Exact fold indices to use. When supplied, cv, shuffle, and random_state are ignored for split generation, and the same indices are returned in result.cv_splits_ for reuse by other models.
use_fast_path (bool, default=True) – Use exact cached fast-grid evaluation when the grid qualifies; otherwise fall back to ordinary per-candidate evaluation.
return_dataframe (bool, default=True) – Return results_ as a pandas DataFrame when pandas is available.
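The cv_splits argument expects explicit (train_idx, val_idx) pairs. The sketch below builds such pairs with a small, standard-library stratified splitter so the expected shape is visible. The function stratified_splits is a hypothetical helper written for this example; in practice the indices would more commonly come from sklearn's StratifiedKFold.split, and they may need converting to NumPy arrays before being passed in.

```python
import random
from collections import defaultdict


def stratified_splits(y, n_splits=5, seed=42):
    """Return [(train_idx, val_idx), ...] with each class spread over the folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal this class's rows round-robin so every fold gets a fair share.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)

    all_idx = set(range(len(y)))
    return [(sorted(all_idx - set(fold)), sorted(fold)) for fold in folds]


y = [0] * 12 + [1] * 8
cv_splits = stratified_splits(y, n_splits=4)
for train_idx, val_idx in cv_splits:
    print(len(train_idx), len(val_idx), sum(y[i] for i in val_idx))
```

Reusing one fixed list of splits across several models keeps their validation scores directly comparable, which is the use case the cv_splits parameter describes.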

Returns:

GridSearchCV-like result object with best_estimator_, best_params_, best_score_, results_, fast_path_used_, elapsed_seconds_, n_splits_, and cv_splits_.

Return type:

HUGIMLTuneResult

hugiml.classifier.HUGIMLClassifierNative: alias of HUGIMLClassifier

class hugiml.classifier.FitMetadata(n_samples, n_features, n_classes, n_items, n_patterns, n_compound, topK_used, stage_times_ms, total_fit_ms, matrix_density, config, n_augmented_pairs=0, n_downstream_features=0, downstream_feature_counts=<factory>, memory_peak_mb=0.0, memory_rss_mb=0.0, memory_cpp_mb=0.0, openmp_threads=1, degraded=False)

Bases: object

Immutable record of everything that happened during fit().

Parameters:

n_samples (int)
n_features (int)
n_classes (int)
n_items (int)
n_patterns (int)
n_compound (int)
topK_used (int)
stage_times_ms (dict)
total_fit_ms (float)
matrix_density (float)
config (dict)
n_augmented_pairs (int)
n_downstream_features (int)
downstream_feature_counts (dict)
memory_peak_mb (float)
memory_rss_mb (float)
memory_cpp_mb (float)
openmp_threads (int)
degraded (bool)

n_samples, n_features

Training set dimensions.

Type:: int

n_classes

Number of distinct target classes.

Type:: int

n_items

Number of utility-annotated items (bins + categories).

Type:: int

n_patterns

Number of HUG patterns mined and retained.

Type:: int

n_compound

Compound patterns (length > 1).

Type:: int

n_augmented_pairs

Number of augmented pair features retained for the downstream estimator.

Type:: int

n_downstream_features

Number of columns used by the downstream estimator after feature-mode construction and optional strict TopK filtering.

Type:: int

downstream_feature_counts

Counts by downstream feature family, for example original, pattern, and augmented_pair.

Type:: dict

topK_used

Effective topK budget used during mining.

Type:: int

stage_times_ms

Wall-clock milliseconds per fit stage.

Type:: dict[str, float]

total_fit_ms

Total fit wall-clock milliseconds.

Type:: float

matrix_density

Fraction of non-zero entries in the training pattern matrix.

Type:: float

config

Snapshot of (B, L, G, topK) as used.

Type:: dict

memory_peak_mb

Python-traced peak memory during fit.

Type:: float

memory_rss_mb

RSS delta during fit (Unix only).

Type:: float

memory_cpp_mb

Estimated C++ extension memory usage.

Type:: float

openmp_threads

Number of OpenMP threads used.

Type:: int

degraded

True when fit fell back to reduced parameters.

Type:: bool

summary()

Return a single-line human-readable summary of the fit outcome.

Return type:: str
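What an immutable fit record with a one-line summary looks like can be sketched with a frozen dataclass. MiniFitRecord below is a hypothetical, cut-down stand-in with only a few of the fields listed above, and its summary format is invented for the example. It does not reproduce the real FitMetadata.summary() output.

```python
from dataclasses import FrozenInstanceError, dataclass, field


@dataclass(frozen=True)
class MiniFitRecord:
    """Hypothetical, cut-down stand-in for a fit record."""
    n_samples: int
    n_patterns: int
    total_fit_ms: float
    stage_times_ms: dict = field(default_factory=dict)
    degraded: bool = False

    def summary(self) -> str:
        """One-line description; the format is illustrative only."""
        flag = " (degraded)" if self.degraded else ""
        return (
            f"{self.n_samples} rows, {self.n_patterns} patterns, "
            f"{self.total_fit_ms:.1f} ms{flag}"
        )


record = MiniFitRecord(n_samples=1000, n_patterns=30, total_fit_ms=812.4)
print(record.summary())

try:
    record.n_patterns = 31   # frozen dataclasses reject attribute assignment
    mutated = True
except FrozenInstanceError:
    mutated = False
    print("record is read-only")
```

Note that freezing a dataclass only blocks attribute reassignment; a dict-valued field such as stage_times_ms can still be mutated in place.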

class hugiml.classifier.HUGIMLTuneResult(best_estimator_, best_params_, best_score_, results_, fast_path_used_, elapsed_seconds_, n_splits_, scoring, cv_splits_, shuffle, random_state)

Bases: object

Result object returned by HUGIMLClassifier.tune().

Attributes mirror the small subset of GridSearchCV-style fields users need for quick HUGIML tuning while keeping the API lightweight.

Parameters:

best_estimator_ (HUGIMLClassifier)
best_params_ (dict[str, Any])
best_score_ (float)
results_ (Any)
fast_path_used_ (bool)
elapsed_seconds_ (float)
n_splits_ (int)
scoring (str)
cv_splits_ (list[tuple[np.ndarray, np.ndarray]])
shuffle (bool)
random_state (int | None)

RPTE downstream model

Leaf-wise RPTE: a boosted ensemble of shallow trees over HUGIML features, used as an optional downstream estimator for higher-order interactions.

Two backends, selected by enable_lookahead (True / False / “adaptive”, default “adaptive”):

Bounded lookahead (enable_lookahead=True). Each tree is grown natively (see _hugiml_core.rpte_grow_tree in native/rpte_tree.cpp): leaf-wise best-first, trying an ordinary greedy stump at each leaf first and falling back to a bounded depth-two microtree search when the stump’s held-out gain stalls. A microtree’s root is a two-source feature supplied by HUGIML or synthesized on the fly, and its child is a single raw feature or another pair, optionally extended one more level for 5-way/6-way interactions. A candidate is committed only when its probe-set gain clears the configured thresholds and a Bonferroni-corrected statistical-significance bar. This module’s role is orchestration: it drives the boosting loop, resolves HUGIML’s feature catalog into the raw-feature-index form the native engine expects, and renders the returned tree structure into human-readable rules – it performs none of the search itself.

Sequential default (enable_lookahead=False). An ordinary leaf-wise ensemble of sklearn.tree.DecisionTreeClassifier stumps, with the same raw-feature reservation but no augmented-pair/microtree search – see _DefaultRPTEFeatureExtractor / _DefaultRPTEFeatureLR.

“adaptive” picks a backend per fit, from the data (see _max_marginal_gain): if no single raw feature carries real marginal signal on its own (e.g. pure parity), lookahead is worth its extra cost; otherwise the sequential backend is generally at least as accurate for less compute.
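The adaptive rule can be illustrated with a pure-Python sketch that measures the largest single-feature information gain and compares it with a threshold. The functions below are hypothetical stand-ins: the real _max_marginal_gain is not documented on this page, and the use of bits and of a strict `<` comparison are assumptions. The default threshold is taken from adaptive_marginal_gain_threshold=0.005.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def information_gain(feature, labels):
    """Entropy reduction from partitioning rows by a discrete feature."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain


def max_marginal_gain(X, y):
    """Largest single-feature information gain over all columns of X."""
    n_features = len(X[0])
    return max(
        information_gain([row[j] for row in X], y) for j in range(n_features)
    )


def pick_backend(X, y, threshold=0.005):
    """Choose lookahead only when no single feature is informative alone."""
    return "lookahead" if max_marginal_gain(X, y) < threshold else "sequential"


# Parity (XOR): neither column predicts y alone, only their combination does.
X_parity = [[0, 0], [0, 1], [1, 0], [1, 1]] * 5
y_parity = [a ^ b for a, b in X_parity]

# Here column 0 is the label itself, so a greedy stump already finds it.
y_easy = [a for a, _ in X_parity]

print(pick_backend(X_parity, y_parity))
print(pick_backend(X_parity, y_easy))
```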

Both the root and any pair-valued child combine their two raw features by binarizing first when both inputs are binary-like, or combining raw values directly when either is continuous (e.g. GPA or a log-coinsurance-rate, where a hard binary split first would discard graded information the interaction might depend on).

Boosting. Both backends fit an ensemble of trees to the negative gradient of binomial deviance (r_i = y_i - sigmoid(F(x_i))); each tree’s leaf values are the exact per-leaf Newton step, and a tree is kept only if a backtracking line search verifies it lowers deviance.
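One boosting step of this kind can be sketched for a fixed two-leaf partition. The residual is r_i = y_i - sigmoid(F(x_i)) as stated above, and the leaf value is the usual Newton step for binomial deviance, sum(r_i) / sum(p_i (1 - p_i)). The halving schedule in boosted_update is an assumed form of backtracking, and the learning rate is omitted; the library's actual line search may differ.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def deviance(y, F):
    """Binomial deviance (twice the negative log-likelihood) of raw scores F."""
    return -2.0 * sum(
        yi * math.log(sigmoid(fi)) + (1 - yi) * math.log(1.0 - sigmoid(fi))
        for yi, fi in zip(y, F)
    )


def newton_leaf_values(y, F, leaf_of_row):
    """Per-leaf Newton step: sum of residuals over sum of p * (1 - p)."""
    num, den = {}, {}
    for yi, fi, leaf in zip(y, F, leaf_of_row):
        p = sigmoid(fi)
        num[leaf] = num.get(leaf, 0.0) + (yi - p)        # residual r_i
        den[leaf] = den.get(leaf, 0.0) + p * (1.0 - p)   # curvature
    return {leaf: num[leaf] / den[leaf] for leaf in num}


def boosted_update(y, F, leaf_of_row, step=1.0, max_halvings=10):
    """Apply the tree only if some backtracked step size lowers deviance."""
    values = newton_leaf_values(y, F, leaf_of_row)
    before = deviance(y, F)
    for _ in range(max_halvings):
        trial = [fi + step * values[leaf] for fi, leaf in zip(F, leaf_of_row)]
        if deviance(y, trial) < before:
            return trial, step       # tree accepted at this step size
        step *= 0.5
    return F, 0.0                    # tree rejected


y = [1, 1, 0, 1, 0, 0]
F = [0.0] * len(y)                   # start from a zero score (p = 0.5)
leaf_of_row = [0, 0, 0, 1, 1, 1]     # a fixed two-leaf partition
F_new, used_step = boosted_update(y, F, leaf_of_row)
print(used_step, round(deviance(y, F), 3), round(deviance(y, F_new), 3))
```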

Representation roles. Original columns, augmented pairs, and mined patterns of order one or two are eligible for RPTE tree growth. Mined patterns above order two are direct-only sparse terms. They cannot become tree roots, children, or ordinary splits. The final L1 logistic layer receives every accepted tree leaf indicator plus each supplied downstream column that was not selected in an accepted split. If an unused mined pattern of any order has the same positive atom conjunction, raw-feature ownership, and fitted support as a leaf, the direct copy is suppressed and recorded as an alias. Each fitted component is represented once in the final LR.

Split acceptance. A partition’s information gain is, via Wilks’ theorem, asymptotically a chi-squared statistic. Because the lookahead search compares many candidates per leaf and keeps the best, use_statistical_acceptance (default True) Bonferroni-corrects the significance threshold by the actual number of candidates and probe-set size compared at that leaf. See native/rpte_significance.hpp for the calibration this module relies on.
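The acceptance test can be sketched with the standard library alone for the simplest case. Assuming a binary split and a binary target (one degree of freedom), the likelihood-ratio statistic is 2 · N · ln(2) · IG when the gain is measured in bits, and the chi-squared tail probability reduces to erfc(sqrt(x / 2)). The function names and the example numbers are illustrative; the calibration actually used lives in native/rpte_significance.hpp and may differ.

```python
import math


def chi2_sf_df1(x):
    """Survival function of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def split_is_significant(ig_bits, n_probe, n_candidates, alpha=0.05):
    """Wilks-style test with a Bonferroni-corrected threshold."""
    g_stat = 2.0 * n_probe * math.log(2.0) * ig_bits  # likelihood-ratio statistic
    p_value = chi2_sf_df1(g_stat)
    return p_value < alpha / n_candidates, p_value


# The same gain on the same probe set, judged against 1 vs 400 candidates.
accepted_single, p_single = split_is_significant(0.02, n_probe=200, n_candidates=1)
accepted_many, p_many = split_is_significant(0.02, n_probe=200, n_candidates=400)
print(accepted_single, accepted_many)
```

The example shows why the correction matters: a gain that looks significant for a single pre-specified split is no longer convincing once it is the best of several hundred candidates.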

hugiml.rpte_bounded_lookahead_leafwise.leaf_path_conditions(tree, leaf_id, feature_names)

Reconstruct the root-to-leaf conjunction of split tests for one leaf, as a list of (feature_name, "0"|"1") pairs in root-to-leaf order.

Assumes binary (0/1) input features, so every split threshold sits at 0.5: the left child corresponds to feature == 0 and the right child to feature == 1.

Parameters:

tree (DecisionTreeClassifier)
leaf_id (int)
feature_names (list[str])

Return type:

list[tuple[str, str]]
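The path reconstruction that leaf_path_conditions describes can be sketched on plain lists that mimic the array layout of a fitted sklearn tree (children_left, children_right, and feature, with -1 marking a missing child and -2 marking a leaf's feature). The function leaf_path is a hypothetical re-implementation for illustration, not the library's code, and it assumes binary 0/1 features as the docstring does.

```python
def leaf_path(children_left, children_right, feature, leaf_id, feature_names):
    """Return [(feature_name, "0" | "1"), ...] from the root down to leaf_id."""
    # Record each node's parent and which side it hangs on.
    parent = {}
    for node, (left, right) in enumerate(zip(children_left, children_right)):
        if left != -1:                    # -1 marks "no child" (a leaf)
            parent[left] = (node, "0")    # left branch: feature == 0
            parent[right] = (node, "1")   # right branch: feature == 1

    # Walk upward from the leaf, then reverse into root-to-leaf order.
    path = []
    node = leaf_id
    while node in parent:
        node, side = parent[node]
        path.append((feature_names[feature[node]], side))
    return path[::-1]


# Hand-built tree: node 0 splits on f0; its right child (node 2) splits on f1.
children_left = [1, -1, 3, -1, -1]
children_right = [2, -1, 4, -1, -1]
feature = [0, -2, 1, -2, -2]              # -2 marks a leaf in sklearn's layout
conditions = leaf_path(children_left, children_right, feature, 4, ["f0", "f1"])
print(conditions)
```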

hugiml.rpte_bounded_lookahead_leafwise.leaf_path_conditions_with_thresholds(tree, leaf_id, feature_names)

Same root-to-leaf path reconstruction as leaf_path_conditions, but keeps the split’s actual threshold value and direction (is_right: True = right/">" branch) instead of collapsing every split to "0"/"1".

Parameters:

tree (DecisionTreeClassifier)
leaf_id (int)
feature_names (list[str])

Return type:

list[tuple[str, float, bool]]

hugiml.rpte_bounded_lookahead_leafwise.simplify_conditions(conditions)

Drop leaf-path conditions on a mined pattern column whose truth value is already logically guaranteed by an original-feature condition elsewhere in the same leaf path, so every surviving condition carries independent information.

Parameters:: conditions (list[tuple[str, str]])
Return type:: list[tuple[str, str]]

class hugiml.rpte_bounded_lookahead_leafwise.LeafWiseBoundedLookaheadRPTEFeatureExtractor(leaf_config='3xD', depth=4, n_estimators=10, min_samples_leaf=5, learning_rate=0.3, random_state=42, lookahead_child_mode='shared', lookahead_ops=('absolute_difference',), lookahead_beam_width=64, lookahead_probe_fraction=0.25, lookahead_min_probe_ig=0.05, lookahead_min_increment=0.03, greedy_stall_probe_ig=0.02, max_root_thresholds=7, min_probe_leaf=2, min_weighted_probe_gain=0.01, reserve_raw_features=True, min_tree_residual_gain=1e-08, raw_pair_fallback=True, raw_pair_max_candidates=400, aug_child_enabled=True, aug_child_max_candidates=100, use_statistical_acceptance=True, significance_alpha=0.05, enable_lookahead='adaptive', adaptive_marginal_gain_threshold=0.005)

Bases: object

Sequential RPTE ensemble whose component trees grow leaf-wise via the native bounded-lookahead engine (native/rpte_tree.cpp).

Parameters:

leaf_config (str)
depth (int)
n_estimators (int)
min_samples_leaf (int)
learning_rate (float)
random_state (int)
lookahead_child_mode (str)
lookahead_ops (tuple[str, ...] | None)
lookahead_beam_width (int)
lookahead_probe_fraction (float)
lookahead_min_probe_ig (float)
lookahead_min_increment (float)
greedy_stall_probe_ig (float)
max_root_thresholds (int)
min_probe_leaf (int)
min_weighted_probe_gain (float)
reserve_raw_features (bool)
min_tree_residual_gain (float)
raw_pair_fallback (bool)
raw_pair_max_candidates (int)
aug_child_enabled (bool)
aug_child_max_candidates (int)
use_statistical_acceptance (bool)
significance_alpha (float)
enable_lookahead (bool | str)
adaptive_marginal_gain_threshold (float)

leaf_column_descriptors()

Describe extracted leaf columns in the exact matrix column order.

Return type:: list[dict[str, Any]]

leaf_pattern_signatures()

Canonical positive-atom signatures aligned with leaf matrix columns.

Return type:: list[tuple[tuple[object, ...], ...] | None]

leaf_raw_owner_sets()

Raw-feature ownership for each extracted leaf matrix column.

Return type:: list[frozenset[str]]

growth_summary()

Only meaningful when this fit actually used the bounded-lookahead mechanism; returns [] when it used the default backend instead. In that case, use rule_table().

Return type:: list[dict[str, Any]]

rule_table(coefficients, feature_names=None)

Default-mode only (see growth_summary’s docstring for the complementary bounded-lookahead-mode diagnostic).

Parameters:: feature_names (list[str] | None)
Return type:: list[dict[str, object]]

class hugiml.rpte_bounded_lookahead_leafwise.LeafWiseBoundedLookaheadRPTEFeatureLR(leaf_config='3xD', depth=4, n_estimators=10, min_samples_leaf=5, rpte_learning_rate=0.3, lr_C=1.0, lr_penalty='l1', random_state=42, lookahead_child_mode='shared', lookahead_ops=('absolute_difference',), lookahead_beam_width=64, lookahead_probe_fraction=0.25, lookahead_min_probe_ig=0.05, lookahead_min_increment=0.03, greedy_stall_probe_ig=0.02, max_root_thresholds=7, min_probe_leaf=2, min_weighted_probe_gain=0.01, reserve_raw_features=True, min_tree_residual_gain=1e-08, raw_pair_fallback=True, raw_pair_max_candidates=400, aug_child_enabled=True, aug_child_max_candidates=100, use_statistical_acceptance=True, significance_alpha=0.05, enable_lookahead='adaptive', adaptive_marginal_gain_threshold=0.005, hugiml_feature_names=None, hugiml_augmented_catalog=None, hugiml_pattern_provenance=None, hugiml_original_feature_standardization=None)

Bases: ClassifierMixin, BaseEstimator

Sklearn-compatible leaf-wise bounded-look-ahead RPTE + logistic model.

Parameters:

leaf_config (str)
depth (int)
n_estimators (int)
min_samples_leaf (int)
rpte_learning_rate (float)
lr_C (float)
lr_penalty (str)
random_state (int)
lookahead_child_mode (str)
lookahead_ops (tuple[str, ...] | None)
lookahead_beam_width (int)
lookahead_probe_fraction (float)
lookahead_min_probe_ig (float)
lookahead_min_increment (float)
greedy_stall_probe_ig (float)
max_root_thresholds (int)
min_probe_leaf (int)
min_weighted_probe_gain (float)
reserve_raw_features (bool)
min_tree_residual_gain (float)
raw_pair_fallback (bool)
raw_pair_max_candidates (int)
aug_child_enabled (bool)
aug_child_max_candidates (int)
use_statistical_acceptance (bool)
significance_alpha (float)
enable_lookahead (bool | str)
adaptive_marginal_gain_threshold (float)
hugiml_feature_names (list[str] | None)
hugiml_augmented_catalog (list[dict[str, Any]] | None)
hugiml_pattern_provenance (dict[str, tuple[str, ...] | dict[str, Any]] | None)
hugiml_original_feature_standardization (dict[str, dict[str, Any]] | None)

set_hugiml_feature_metadata(feature_names, augmented_catalog, pattern_provenance=None, original_feature_standardization=None)

Convenience setter, equivalent to passing the same four values as constructor arguments (hugiml_feature_names, hugiml_augmented_catalog, hugiml_pattern_provenance, hugiml_original_feature_standardization). Passing them to the constructor is the clone()-safe way to attach this metadata. The setter remains useful for callers that already hold a constructed instance and do not want to rebuild it just to attach metadata, for example HUGIMLRPTEHybrid, which fits this class directly rather than through a cloning meta-estimator, so clone-safety is not a concern there.

Parameters:

feature_names (list[str])
augmented_catalog (list[dict[str, Any]])
pattern_provenance (dict[str, tuple[str, ...] | dict[str, Any]] | None)
original_feature_standardization (dict[str, dict[str, Any]] | None)

leaf_coefficients()

Coefficients assigned to RPTE leaf-indicator columns.

Return type:: ndarray

direct_input_coefficients()

Coefficients assigned to direct source columns after the leaf block.

These are supplied HUGIML downstream columns that were not selected in an accepted RPTE split and therefore remain standalone terms in the final logistic regression.

Return type:: ndarray

representation_alias_table()

Return direct pattern columns suppressed as exact RPTE leaf aliases.

Return type:: list[dict[str, Any]]

rule_table(feature_names=None)

Human-readable root-to-leaf conjunctions for the default backend.

Direct source terms are intentionally not returned by this rules-only method. Use unified_rule_table() to obtain both leaf rules and direct source terms.

Parameters:: feature_names (list[str] | None)
Return type:: list[dict[str, object]]

unified_rule_table(feature_names=None)

Return the fitted prediction explanation surface.

The result uses one schema for bounded-lookahead trees, sequential trees, direct supplied features, raw-feature fallbacks, and constant fallbacks. Callers therefore do not need to branch on the fitted tree backend. growth_summary() remains a separate algorithm-audit view of how a bounded-lookahead tree was constructed.

Each returned row describes one active terminal leaf or direct term. Tree rows include the class label, tree and leaf indices, backend, structured split conditions, raw source lineage, training support, downstream logistic coefficient, centered contribution, and Newton leaf value. Direct-term rows identify supplied features retained by the final sparse logistic layer after tree-use and exact-alias canonicalization.

centered_tree_contribution expresses a leaf coefficient relative to its tree’s support-weighted coefficient baseline. This avoids treating the redundant one-hot leaf parameterization as an absolute importance scale. newton_leaf_value is a tree-construction diagnostic and is not the final prediction effect.

Returns:: Structured, backend-independent explanation rows.
Return type:: list of dict
Parameters:: feature_names (list[str] | None)

Notes

Per-leaf lookahead probe gain is not currently joined into this table. Use growth_summary() to inspect the accepted split-level probe evidence for bounded-lookahead fits.
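
As a rough illustration of the centering described for centered_tree_contribution, the sketch below subtracts a support-weighted mean coefficient from each leaf coefficient of one tree. The function name, the (coefficient, support) layout, and the exact weighting are assumptions made for this example; the library's own computation may differ in detail.

```python
def centered_contributions(leaves):
    """leaves: (coefficient, support) pairs for ONE tree (hypothetical layout)."""
    total_support = sum(support for _, support in leaves)
    if total_support == 0:
        return [0.0 for _ in leaves]
    # Support-weighted mean coefficient acts as the tree's baseline.
    baseline = sum(coef * support for coef, support in leaves) / total_support
    return [coef - baseline for coef, _ in leaves]


leaves = [(0.8, 50), (-0.2, 30), (0.1, 20)]
centered = centered_contributions(leaves)
```

With this definition the support-weighted sum of the centered values is zero, which is why the result can be compared across leaves without depending on the redundant one-hot parameterization.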

unified_rule_tree(feature_names=None, *, condition_space='raw', detail_level='full', precision=5, include_direct_terms=True, include_generation_details=False, class_label=None, tree_index=None)[source]

Return the fitted RPTE representation as ready-to-print flat trees.

Shared root-to-leaf prefixes are merged into a decision-tree-style text layout. Each terminal leaf includes its final LR coefficient, odds multiplier, support, centered contribution, and raw-feature provenance. Direct source terms are grouped after the leaf trees by source family.

condition_space accepts "raw", "downstream", or "both". detail_level accepts "compact" or "full". The returned string can be printed directly or embedded in a report:

print(rpte.unified_rule_tree())

Parameters:

feature_names (list[str] | None)
condition_space (str)
detail_level (str)
precision (int)
include_direct_terms (bool)
include_generation_details (bool)
class_label (Any | None)
tree_index (int | None)

Return type:

str
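
The odds multiplier reported next to each leaf coefficient can be read, under the usual logistic-regression interpretation, as the exponential of that coefficient. The helper below is a hypothetical illustration of that reading and is not part of the hugiml API.

```python
import math


def odds_multiplier(coefficient):
    """Factor by which the odds change when the term is active (LR reading)."""
    return math.exp(coefficient)


multiplier = odds_multiplier(math.log(2.0))  # a coefficient of ln 2 doubles the odds
```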

hugiml.rpte_bounded_lookahead_leafwise.aggregate_rule_table_by_raw_source(rows, allocation='interaction_only')[source]

Aggregate a unified_rule_table() result by raw-feature provenance, producing two complementary views.

downstream_contributions: keyed by each condition’s own family-qualified downstream name (e.g. “orig:age”, “pattern:age=[50,60)”, “augmented_pair:age*income”) – how much each ENGINEERED representation of a raw feature contributes, kept separate even when several representations share the same underlying raw feature(s).

raw_source_contributions: keyed by raw feature name (single features) or a tuple of raw feature names (interactions – any condition whose raw_sources has more than one entry) – how much each underlying raw feature, or interaction between two, contributes in total, aggregated ACROSS every downstream representation that touches it.

allocation controls how one condition’s coefficient is credited to raw_source_contributions when it has more than one raw source:

“interaction_only” (default): the full coefficient goes to the (sourceA, sourceB) INTERACTION entry only – never independently to sourceA’s or sourceB’s own main-effect entry. This is the safer default: crediting the same coefficient to both an interaction AND each of its main effects would double- (or triple-) count it and overstate every source’s apparent importance.

“equal_split”: the coefficient is instead divided evenly across each of the k raw sources’ own main-effect entries (no separate interaction entries at all). Useful when a single, simpler per-raw-feature ranking is wanted and the approximation is acceptable.

Every entry in both views tracks downstream_use_count (how many distinct conditions/leaves touch it) separately from raw_feature_count (how many distinct raw features are involved in that one entry) – conflating the two would make an ensemble look more diverse than it really is, since a single raw feature reused across many leaves is not the same as many distinct raw features each contributing once.

Parameters:

rows (list[dict[str, Any]])
allocation (str)

Return type:

dict[str, Any]
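
The two allocation modes can be sketched in a few lines. This is a simplified stand-in that returns a single flat mapping rather than the two views the real function produces; the "coefficient" key and the function name are assumptions for illustration, while raw_sources follows the field named above.

```python
from collections import defaultdict


def aggregate_by_raw_source(rows, allocation="interaction_only"):
    """Credit each row's coefficient to raw features (illustrative only)."""
    totals = defaultdict(float)
    for row in rows:
        sources = tuple(row["raw_sources"])
        coef = row["coefficient"]
        if len(sources) == 1:
            totals[sources[0]] += coef
        elif allocation == "interaction_only":
            # Whole coefficient goes to the interaction key, never to main effects.
            totals[sources] += coef
        elif allocation == "equal_split":
            # Divide evenly across the k raw sources; no interaction entries.
            for name in sources:
                totals[name] += coef / len(sources)
        else:
            raise ValueError(f"unknown allocation: {allocation!r}")
    return dict(totals)


rows = [
    {"raw_sources": ["age"], "coefficient": 0.5},
    {"raw_sources": ["age", "income"], "coefficient": 0.4},
]
interaction_view = aggregate_by_raw_source(rows, "interaction_only")
split_view = aggregate_by_raw_source(rows, "equal_split")
```

In the interaction view, age keeps 0.5 and the pair (age, income) holds 0.4; in the split view, age rises to 0.7 and income receives 0.2, with no pair entry.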

Text rendering helpers for RPTE prediction evidence.

The structured rule APIs remain the source of record. This module converts those rows into a compact decision-tree-style view that is suitable for notebooks, terminals, reports, and the Governance Studio.

hugiml.rpte_interpretability.format_rpte_rule_tree(rows, *, condition_space='raw', detail_level='full', precision=5, include_direct_terms=True, include_generation_details=False, class_label=None, tree_index=None)[source]

Return RPTE prediction evidence as ready-to-print flat trees.

Shared condition prefixes are merged, one split is shown per indentation level, and each terminal leaf reports its fitted LR coefficient, odds multiplier, support, centered contribution, and raw-feature provenance. Direct source terms are grouped by original, HUG-pattern, and augmented-pair families after the tree sections.

Example:

print(model.rpte_rule_tree())

Parameters:

rows (Sequence[Mapping[str, Any]])
condition_space (Literal['raw', 'downstream', 'both'])
detail_level (Literal['compact', 'full'])
precision (int)
include_direct_terms (bool)
include_generation_details (bool)
class_label (Any | None)
tree_index (int | None)

Return type:

str

hugiml.rpte_interpretability.rpte_rule_tree_sections(rows, *, condition_space='raw', detail_level='full', precision=5, include_generation_details=False, class_label=None, tree_index=None)[source]

Build one flat-tree section per RPTE class and tree.

The returned dictionaries contain title, text, tree metadata, and coefficients. format_rpte_rule_tree is the simpler text API.

Parameters:

rows (Sequence[Mapping[str, Any]])
condition_space (Literal['raw', 'downstream', 'both'])
detail_level (Literal['compact', 'full'])
precision (int)
include_generation_details (bool)
class_label (Any | None)
tree_index (int | None)

Return type:

list[dict[str, Any]]

Complexity and inspection

Uniform structural and inspection counts for fitted classifiers.

The public interface uses three coherent levels:

model units: Coarse active components in the complete fitted model, such as terms, terminal leaves, or rules.
model inspection units: Expanded evidence required to inspect the complete fitted model, such as HUG source elements, all active root-to-leaf conditions, arity-weighted active EBM score cells, or RuleFit literals.
instance inspection units: Expanded evidence reviewed for one prediction. Tree models count the reached contributing path in each tree. HUGIML RPTE adds active direct terms. HUGIML linear models count every active original/direct term and only the active pattern or augmented-pair terms with non-zero row-specific values. EBM terms are expanded by source-feature arity: a main effect contributes one unit, a pairwise interaction two, and a higher-order term its arity.

Calling get_complexity() without a mode returns model inspection units. Use get_instance_inspection_units() to obtain one integer count per row, or pass X to get_complexity_report() for the mean and confidence interval. Optional estimator libraries are recognized through fitted public attributes and are not imported by this module.
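
The arity expansion described for EBM terms amounts to summing the number of source features over the active terms. A minimal sketch, assuming each term is represented as a tuple of feature names (a hypothetical layout, not the library's internal one):

```python
def arity_weighted_units(active_terms):
    """Sum the number of source features across active terms."""
    return sum(len(term) for term in active_terms)


# Two main effects and one pairwise interaction: 1 + 1 + 2 units.
active_terms = [("age",), ("income",), ("age", "income")]
units = arity_weighted_units(active_terms)
```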

hugiml.compute_complexity.get_complexity(model, mode=None, *, X=None, coefficient_tolerance=1e-12, confidence_level=0.95)[source]

Return one of the three complexity measures for a fitted estimator.

Parameters:

model (Any) – A fitted HUGIML estimator, supported baseline estimator, or a pipeline whose final step is a supported estimator.
mode ({"model units", "model inspection units", "instance inspection units"} or None, default=None) – None selects "model inspection units".
X (array-like or DataFrame, optional) – Required for "instance inspection units". The returned value is the arithmetic mean of the row-level counts.
coefficient_tolerance (float, default=1e-12) – Absolute threshold used to identify active fitted coefficients, active leaf outputs, and non-zero row-specific transformed values.
confidence_level (float, default=0.95) – Confidence level used when the instance-level summary is requested.

Returns:

Global measures return integers. Instance inspection units return the mean row-level count. None indicates that the fitted structure is not available in a form that can be counted reliably.

Return type:

int, float, or None

hugiml.compute_complexity.get_complexity_report(model, *, X=None, coefficient_tolerance=1e-12, confidence_level=0.95)[source]

Return model, model-inspection, and optional instance details.

Passing X adds an instance_inspection_units section containing the row count, mean, sample standard deviation, standard error, and two-sided Student-t confidence interval.

Parameters:

model (Any)
X (Any | None)
coefficient_tolerance (float)
confidence_level (float)

Return type:

dict[str, Any] | None
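
The instance-level summary can be reproduced by hand from the row-level counts. The sketch below uses only the standard library and takes the Student-t critical value as an argument, since the standard library has no t distribution; the function name and the t_critical parameter are assumptions for illustration.

```python
import math
import statistics


def summarize_counts(counts, t_critical):
    """Mean, sample std, standard error and t-interval for row-level counts.

    t_critical is the two-sided Student-t quantile for len(counts) - 1
    degrees of freedom, supplied by the caller (hypothetical parameter).
    """
    n = len(counts)
    mean = statistics.fmean(counts)
    std = statistics.stdev(counts) if n > 1 else 0.0  # sample std, ddof = 1
    stderr = std / math.sqrt(n)
    half_width = t_critical * stderr
    return {
        "n": n,
        "mean": mean,
        "std": std,
        "stderr": stderr,
        "ci": (mean - half_width, mean + half_width),
    }


# For 95% two-sided confidence with 4 degrees of freedom, t is about 2.776.
summary = summarize_counts([3, 5, 4, 6, 2], t_critical=2.776)
```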

hugiml.compute_complexity.get_instance_inspection_units(model, X, *, coefficient_tolerance=1e-12)[source]

Return one instance-inspection count for every row in X.

The result is a one-dimensional integer array. None is returned when the estimator does not expose enough fitted structure to identify the row-specific evidence reliably.

Parameters:

model (Any)
X (Any)
coefficient_tolerance (float)

Return type:

ndarray | None

Adaptive binning

Per-feature adaptive binning for HUG-IML — HUGIMLAdaptive.

HUGIMLAdaptive is a thin, sklearn-compatible subclass of HUGIMLClassifier that hard-wires adaptive_binning=True and exposes a simplified constructor (no B, allCols, or origColumns parameters — those are managed internally).

All adaptive-binning mathematics live in hugiml._binning (single source of truth). Both this module and hugiml.classifier import from there; neither imports from the other at module level, so there is no circular dependency.

Adaptive-binning algorithm (three steps)

Per-feature B selection — for each numerical feature, evaluate candidate B values by computing information gain against y and stop when the marginal gain from adding more bins drops below min_marginal_gain_ratio of the gain already achieved (elbow-stopping).
Pre-discretisation — discretise each numerical feature to B_j equal-frequency quantile bins, computed on the training split only. Bin boundaries are stored in _bin_edges_ and reapplied at predict time. Each bin is encoded as a readable string label, e.g. "[12.0,24.0)".
Categorical pass-through — pre-binned columns are treated as categorical by the C++ layer; the global B parameter is set to the sentinel value 2 (no effect on already-categorical columns).
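
The first two steps above can be sketched with the standard library alone. This is a simplified illustration, not the code in hugiml._binning: the quantile estimate, the ascending candidate order, and the exact form of the stopping test are assumptions.

```python
import bisect
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def quantile_edges(values, n_bins):
    """Interior cut points at the k / n_bins empirical quantiles."""
    s = sorted(values)
    return [s[(k * len(s)) // n_bins] for k in range(1, n_bins)]


def bin_ids(values, edges):
    """Index of the half-open bin [edge_k, edge_k+1) containing each value."""
    return [bisect.bisect_right(edges, v) for v in values]


def information_gain(bins, y):
    groups = {}
    for b, label in zip(bins, y):
        groups.setdefault(b, []).append(label)
    conditional = sum(len(g) / len(y) * entropy(g) for g in groups.values())
    return entropy(y) - conditional


def select_b(values, y, candidates, min_marginal_gain_ratio=0.02):
    """Walk candidates in ascending order and stop at the elbow."""
    chosen, chosen_ig = None, 0.0
    for b in sorted(candidates):
        ig = information_gain(bin_ids(values, quantile_edges(values, b)), y)
        if chosen is not None and ig - chosen_ig < min_marginal_gain_ratio * chosen_ig:
            break  # the extra bins add too little information gain
        chosen, chosen_ig = b, ig
    return chosen, chosen_ig


values = list(range(12))
y_step = [1 if v >= 6 else 0 for v in values]       # one threshold: 2 bins suffice
y_band = [1 if 4 <= v < 8 else 0 for v in values]   # middle band: needs 3 bins
b_step, ig_step = select_b(values, y_step, [2, 3, 5])
b_band, ig_band = select_b(values, y_band, [2, 3, 5])
```

For the single-threshold target the search stops at B = 2, because three bins lower the information gain; for the middle-band target it moves on to B = 3 and stops there.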

Non-finite value handling

Non-finite cells (NaN, ±Inf) in any pre-binned column receive np.nan in the label array. The C++ transaction builder skips those cells, generating no item for that (row, feature) pair — semantically “not observed”, with no imputation.
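
A minimal sketch of the label encoding, including the non-finite rule, is shown below. The function name, the edge layout (a sorted list that includes both outer boundaries), and the clamping of out-of-range values into the end bins are assumptions for illustration.

```python
import bisect
import math


def bin_labels(values, edges):
    """Map values to labels like "[12.0,24.0)"; non-finite cells become nan."""
    labels = []
    for v in values:
        if not math.isfinite(v):
            labels.append(math.nan)  # "not observed": no item is generated
            continue
        # Clamp so values at or beyond the outer edges fall in the end bins.
        k = min(max(bisect.bisect_right(edges, v) - 1, 0), len(edges) - 2)
        labels.append(f"[{edges[k]},{edges[k + 1]})")
    return labels


labels = bin_labels(
    [5.0, 12.0, 30.0, float("nan"), float("inf")],
    [0.0, 12.0, 24.0, 36.0],
)
```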

Usage

Example:

from hugiml.adaptive import HUGIMLAdaptive

clf = HUGIMLAdaptive(b_candidates=[3, 5, 7, 10, 15],
                     min_marginal_gain_ratio=0.02,
                     L=2, G=1e-4)
X_enc, y_enc = clf.prepareXy(X_df, y)
X_tr, X_te, y_tr, y_te = train_test_split(X_enc, y_enc, stratify=y_enc)
clf.fit(X_tr, y_tr)

print(clf.per_feature_b_)      # chosen B_j per feature
print(clf.model_summary())
clf.plot_bin_profiles()        # requires matplotlib
clf.ig_heatmap()               # requires matplotlib

Diagnostic plots (plot_bin_profiles, ig_heatmap) and fitted attributes (per_feature_b_, ig_scores_, _bin_edges_) are defined on HUGIMLClassifier and inherited here.

class hugiml.adaptive.HUGIMLAdaptive(b_candidates=None, min_marginal_gain_ratio=0.02, L=1, G=0.005, topK=-1, n_jobs=1, verbose=False, max_fit_seconds=None, interaction_relaxed_mining=False)[source]

Bases: HUGIMLClassifier

HUG-IML with per-feature adaptive binning via elbow-stopping IG search.

Thin subclass of HUGIMLClassifier with adaptive_binning=True hard-wired and a simplified constructor that omits parameters which are managed internally (B, allCols, origColumns).

All public methods, fitted attributes, serialisation, monitoring, drift detection, and explanation helpers are inherited from HUGIMLClassifier. No logic is duplicated.

Parameters:

b_candidates (list of int, optional) – Candidate bin counts to evaluate per feature. Default: [2, 3, 5, 7, 10, 15].
min_marginal_gain_ratio (float, default 0.02) – Stop adding bins when the incremental information gain, relative to the gain at the current level, falls below this fraction. 0.02 means stop when a new candidate adds less than 2% more IG than the previous step. Lower values allow finer bins; higher values enforce coarser bins.
L (int, default 1) – Maximum HUG pattern length. 1 = singletons; 2 = pairs; -1 = unlimited.
G (float, default 5e-3) – Minimum information-gain threshold.
topK (int, default -1) – Maximum number of patterns to retain. -1 computes automatically.
n_jobs (int, default 1) – Number of OpenMP threads. -1 uses all available cores.
verbose (bool, default False) – Emit INFO-level log messages during fit.
max_fit_seconds (float or None) – Wall-clock budget for the pattern-mining stage of fit().
interaction_relaxed_mining (bool)

Attributes (after fit, inherited from HUGIMLClassifier):

per_feature_b_ (dict[str, int]) – Chosen bin count per numerical feature.
ig_scores_ (dict[str, dict[int, float]]) – Full IG score grid {feature: {B: ig_value}} for diagnostics.
_bin_edges_ (dict[str, np.ndarray]) – Quantile edges used during fit, reapplied at predict time.
patterns (list) – Mined HUG patterns.
classes (ndarray) – Unique class labels.
fit_metadata (FitMetadata) – Timings, memory, pattern count stats.

classmethod default_param_grid()[source]

Return the default compact tuning grid inherited from the native classifier.

Return type:: dict[str, list]

get_params(deep=True)[source]

Return the constructor parameters (sklearn protocol).

Only the parameters that HUGIMLAdaptive.__init__ accepts are returned, so sklearn.clone and cross-validation helpers reconstruct the correct subclass.

Parameters:: deep (bool)
Return type:: dict

fit(X_train, y_train)[source]

Fit with per-feature adaptive binning.

Delegates entirely to HUGIMLClassifier.fit with adaptive_binning=True. When X_train is a plain ndarray and prepareXy has supplied column names, names from feature_names_in_ are applied so that feature-name-aware operations (adaptive binning, bin-edge lookup, schema validation) work correctly.

Parameters:

X_train (pd.DataFrame or ndarray)
y_train (array-like of int)

Return type:

self

property clf_: HUGIMLAdaptive

Backward-compatibility alias.

Old code that accessed adaptive_clf.clf_ to reach the inner HUGIMLClassifier now gets self, because HUGIMLAdaptive is a HUGIMLClassifier. All methods and fitted attributes are directly on self.

set_predict_proba_request(*, X_test='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the predict_proba method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to predict_proba if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict_proba.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

X_test (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_test parameter in predict_proba.
self (HUGIMLAdaptive)

Returns:

self – The updated object.

Return type:

object

Metrics

Interpretability-complexity metrics for a fitted HUGIMLClassifierNative.

All functions accept a fitted HUGIMLClassifierNative and (optionally) a data matrix X to compute sample-level statistics. They never re-train the model.

Quick reference

Example:

from hugiml.metrics import compute_all_metrics
m = compute_all_metrics(clf, X_test)
print(m)

Available metrics

n_patterns — total mined patterns.
avg_pattern_length — mean number of items per pattern.
coverage — fraction of samples matched by at least one pattern.
overlap_rate — mean number of patterns active per sample.
top_k_cumulative_contribution(k) — cumulative absolute-coefficient share of top-k patterns.
active_patterns_per_prediction — per-sample array.
explanation_sparsity — fraction of patterns never active on the supplied data.
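
To make the definitions concrete, the sketch below derives several of these quantities from a 0/1 pattern-activation matrix and a coefficient vector. It is an illustration under assumptions, not the library's implementation: the function name and input layout are hypothetical, and overlap_rate follows the attribute definition mean_active_patterns / n_patterns.

```python
def interpretability_summary(active, coefficients, ks=(1, 5, 10, 20, 50)):
    """active: list of rows, each a list of 0/1 flags, one per pattern."""
    n_samples = len(active)
    n_patterns = len(coefficients)
    per_sample = [sum(row) for row in active]
    mean_active = sum(per_sample) / n_samples
    # A pattern is "dead" if it is never active on the supplied rows.
    ever_active = [any(row[j] for row in active) for j in range(n_patterns)]
    magnitudes = sorted((abs(c) for c in coefficients), reverse=True)
    total = sum(magnitudes)
    return {
        "coverage": sum(1 for k in per_sample if k > 0) / n_samples,
        "mean_active_patterns": mean_active,
        "overlap_rate": mean_active / n_patterns,
        "explanation_sparsity": ever_active.count(False) / n_patterns,
        "top_k_cumulative_contribution": {
            k: (sum(magnitudes[:k]) / total if total else 0.0) for k in ks
        },
    }


active = [[1, 0, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [1, 0, 1, 0]]
summary = interpretability_summary(active, [0.5, -0.3, 0.2, 0.0], ks=(1, 2))
```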

class hugiml.metrics.InterpretabilityMetrics(n_patterns=0, avg_pattern_length=0.0, max_pattern_length=0, coverage=0.0, mean_active_patterns=0.0, std_active_patterns=0.0, overlap_rate=0.0, explanation_sparsity=0.0, top_k_cumulative_contribution=<factory>, n_samples=0)[source]

Bases: object

All interpretability metrics for one fitted model + dataset.

Parameters:

n_patterns (int)
avg_pattern_length (float)
max_pattern_length (int)
coverage (float)
mean_active_patterns (float)
std_active_patterns (float)
overlap_rate (float)
explanation_sparsity (float)
top_k_cumulative_contribution (dict)
n_samples (int)

n_patterns

Total number of mined HUG patterns.

Type:: int

avg_pattern_length

Mean items (conditions) per pattern.

Type:: float

max_pattern_length

Length of the longest pattern.

Type:: int

coverage

Fraction of samples covered by at least one active pattern.

Type:: float

mean_active_patterns

Average number of patterns active per sample.

Type:: float

std_active_patterns

Standard deviation of active patterns per sample.

Type:: float

overlap_rate

Normalised overlap, computed as mean_active_patterns / n_patterns: the average share of all patterns that is active on one sample.

Type:: float

explanation_sparsity

Fraction of patterns that are never active on X (“dead” patterns).

Type:: float

top_k_cumulative_contribution

Mapping from k to cumulative share of total absolute coefficient magnitude for the top-k patterns. Keys: [1, 5, 10, 20, 50].

Type:: dict[int, float]

n_samples

Number of rows in X used for sample-level metrics.

Type:: int

to_dict()[source]

Return a flat dict suitable for DataFrame construction.

Return type:: dict

hugiml.metrics.compute_all_metrics(clf, X)[source]

Compute all interpretability metrics in a single call.

Parameters:

clf (fitted HUGIMLClassifierNative)
X (array-like or DataFrame)

Return type:

InterpretabilityMetrics

hugiml.metrics.metrics_dataframe(results)[source]

Convert a mapping of {model_name: InterpretabilityMetrics} to a DataFrame.

Useful for side-by-side comparisons across models or configurations.

Parameters:: results (dict) – Keys are model labels; values are InterpretabilityMetrics instances.
Return type:: pd.DataFrame

Hyperparameter configuration

Centralized hyperparameter grid definitions.

The shared benchmark grids keep HUGIML and baseline configuration in one module for classifier tuning, the benchmark runner, and dashboard Workbench reuse, so recommended search spaces stay aligned across command-line, Python, and UI entry points.

Four named HUGIML grids are provided:

"performance"

LR-only first-pass grid. It uses adaptive binning, searches L and topK, keeps feature_mode="original_plus_patterns", and evaluates G at 0.01 and 0.001. No longer the default (see "performance_ho" below); kept available by name for callers that want to restrict tuning to the built-in logistic-regression branch only.

"interpretability"

Pattern-focused grid. It keeps feature_mode="patterns_only", enables interaction-relaxed mining, and disables augmented-pair transforms so the fitted representation remains a HUG pattern surface.

"interpretability_ho"

Higher-order extension of "interpretability". It preserves the same pattern-only, interaction-relaxed representation and disables augmented pairs, while searching the built-in LR and sequential RPTE downstream branches. This is a controlled test of downstream RPTE effectiveness.

"performance_ho"

Default grid (DEFAULT_HUGIML_GRID_NAME). Higher-order Hybrid grid. It searches an explicit L1-regularized logistic-regression base_estimator together with the adaptive RPTE downstream branch only at leaf_config="3xD". Binary targets fit the logistic estimator directly; targets with three or more classes use one-vs-rest classification with the same estimator configuration. The mining dimensions search L in [1, 2], topK in [50, 100], and G in [0.01, 0.001]. The RPTE estimators are wrapped in sklearn’s OneVsRestClassifier: binary problems still use a single binary RPTE fit, while a K-class problem fits K one-vs-rest adaptive-RPTE models. This produces 8 LR candidates plus 8 RPTE-OvR candidates, for 16 candidates in total. topk_budget_strict=False is fixed for every candidate, keeping the non-strict augmented-pair budget semantics.
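
The candidate count for "performance_ho" follows directly from the mining dimensions listed above. The snippet below only reproduces that arithmetic; the branch labels are hypothetical placeholders and not identifiers from hugiml.hyperparameter_configs.

```python
from itertools import product

mining_grid = {"L": [1, 2], "topK": [50, 100], "G": [0.01, 0.001]}
mining_points = list(product(*mining_grid.values()))  # 2 * 2 * 2 combinations

# One LR candidate and one RPTE one-vs-rest candidate per mining point.
branches = ["lr_l1", "rpte_ovr_3xD"]  # hypothetical labels for illustration
candidates = [(branch, point) for branch in branches for point in mining_points]
```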

The named representation paths set binary-column handling explicitly: augmented-pair grids use convert_binary_to_categorical=False so numeric 0/1 columns remain eligible as pair-transform sources, while interaction-relaxed grids use True so those indicators enter the categorical item surface used by native pattern mining. Direct classifier construction remains user-controlled through the constructor parameter.

Grids that vary base_estimator are supported by fast_grid_tune’s cache. Mining (native pattern discovery) is cached per (G, L, topK) exactly as for any other grid, and the downstream feature matrix built from it is additionally cached per (G, L, topK, feature_mode) and reused across every base_estimator candidate at that key. Only the final step, fitting an estimator on the already-built matrix, repeats per candidate: this is cheap for the LR branch, while the RPTE branch’s own boosting loop is the one part of a candidate this grid cannot avoid recomputing. See classifier._hugiml_prepare_downstream_template_from_cached_base / classifier._hugiml_fit_downstream_estimator_from_template. The underlying RPTE learner is binary, while the grid-level OneVsRestClassifier wrapper provides multiclass support.

BASELINE_MODEL_GRIDS holds the standard non-HUGIML benchmark grids. BUDGETED_BASELINE_MODEL_GRIDS holds the corresponding 200-leaf ensemble grids used by the dashboard benchmark. Models without a registered grid are fitted once with their default estimator settings.

hugiml.hyperparameter_configs.make_l1_logistic_base_estimator()[source]

Create the linear HUGIML base estimator used by named grids.

Return type:: LogisticRegression

hugiml.hyperparameter_configs.get_hugiml_grid(name=None)[source]

Return a copy of the named HUGIML hyperparameter grid.

Parameters:: name (str or None, default None) – A key in HUGIML_GRIDS. None resolves to DEFAULT_HUGIML_GRID_NAME.
Returns:: A fresh copy of the grid so callers can narrow candidate values without mutating the shared definition.
Return type:: dict[str, list]
Raises:: KeyError – If name does not match a known grid.

hugiml.hyperparameter_configs.list_hugiml_grids()[source]

Return the available HUGIML grid names for CLI and UI population.

Return type:: list[str]

hugiml.hyperparameter_configs.get_baseline_grid(model_name)[source]

Return a copy of a standard baseline tuning grid, or None.

Parameters:: model_name (str)
Return type:: dict[str, list] | None

hugiml.hyperparameter_configs.get_budgeted_baseline_grid(model_name)[source]

Return a copy of a 200-leaf ensemble tuning grid, or None.

Parameters:: model_name (str)
Return type:: dict[str, list] | None

Calibration

Calibration evaluation for HUGIMLClassifierNative.

Provides Expected Calibration Error (ECE), Brier score decomposition, reliability diagram data, and calibration curve computation consistent with best practices for interpretable classifiers.

class hugiml.calibration.CalibrationResult(ece, mce, brier_score, brier_reliability, brier_resolution, brier_uncertainty, n_bins, bin_confidences=<factory>, bin_accuracies=<factory>, bin_counts=<factory>)[source]

Bases: object

Calibration evaluation summary for a fitted classifier.

Parameters:

ece (float)
mce (float)
brier_score (float)
brier_reliability (float)
brier_resolution (float)
brier_uncertainty (float)
n_bins (int)
bin_confidences (list[float])
bin_accuracies (list[float])
bin_counts (list[int])

ece

Expected Calibration Error (lower is better; 0 = perfect).

Type:: float

mce

Maximum Calibration Error across all bins.

Type:: float

brier_score

Mean Brier score (lower is better; 0 = perfect).

Type:: float

brier_reliability

Brier reliability component (miscalibration contribution).

Type:: float

brier_resolution

Brier resolution component (sharpness contribution).

Type:: float

brier_uncertainty

Brier uncertainty component (base rate uncertainty).

Type:: float

n_bins

Number of calibration bins used.

Type:: int

bin_confidences

Mean predicted confidence per bin.

Type:: list of float

bin_accuracies

Empirical accuracy per bin.

Type:: list of float

bin_counts

Sample count per bin.

Type:: list of int

summary()[source]

Human-readable calibration summary.

Return type:: str

to_dict()[source]

Return metrics as a plain dictionary.

Return type:: dict

hugiml.calibration.evaluate_calibration(y_true, y_proba, n_bins=10, strategy='uniform')[source]

Compute ECE, MCE, and Brier score decomposition.

Parameters:

y_true (np.ndarray of int, shape (n_samples,)) – True class labels (0 or 1 for binary; multi-class uses one-vs-rest).
y_proba (np.ndarray of float, shape (n_samples,) or (n_samples, n_classes)) – Predicted probabilities. For multi-class, pass the probability of the positive class or use the column for the class of interest.
n_bins (int) – Number of calibration bins.
strategy ({'uniform', 'quantile'}) – Bin strategy: uniform width or equal-frequency.

Return type:

CalibrationResult
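
For orientation, ECE and MCE with uniform-width bins can be computed as below for a binary problem. This is a sketch under assumptions rather than the library's code: it treats y_proba as the positive-class probability, compares each bin's mean predicted probability with its observed positive rate, and uses a hypothetical function name.

```python
def expected_calibration_error(y_true, y_proba, n_bins=10):
    """Return (ece, mce) for binary labels using uniform-width bins."""
    bins = [[] for _ in range(n_bins)]
    for label, p in zip(y_true, y_proba):
        k = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        bins[k].append((label, p))
    n = len(y_true)
    ece, mce = 0.0, 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        accuracy = sum(label for label, _ in members) / len(members)
        confidence = sum(p for _, p in members) / len(members)
        gap = abs(accuracy - confidence)
        ece += len(members) / n * gap  # weight each gap by bin occupancy
        mce = max(mce, gap)
    return ece, mce


ece, mce = expected_calibration_error([0, 0, 1, 1], [0.1, 0.3, 0.7, 0.9], n_bins=2)
```

Here both bins show a gap of 0.2 between mean prediction and observed rate, so ECE and MCE coincide at 0.2.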

hugiml.calibration.reliability_diagram_data(y_true, y_proba, n_bins=10)[source]

Return bin-level data for plotting a reliability diagram.

Parameters:

y_true (np.ndarray)
y_proba (np.ndarray)
n_bins (int)

Returns:

Three parallel lists, one entry per non-empty bin.

Return type:

(mean_predicted, fraction_positives, bin_counts)

hugiml.calibration.brier_decomposition(y_true, y_proba)[source]

Murphy decomposition of the Brier score.

Decomposes Brier = Reliability - Resolution + Uncertainty.

Parameters:

y_true (np.ndarray of {0, 1})
y_proba (np.ndarray of float in [0, 1])

Returns:

All three components as floats.

Return type:

(reliability, resolution, uncertainty)
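
The identity Brier = Reliability - Resolution + Uncertainty can be checked numerically. The sketch below groups samples by their distinct forecast values, in which case the identity holds exactly; with binned forecasts it is only approximate. The function name is hypothetical and the grouping choice is an assumption about how the library forms its groups.

```python
from collections import defaultdict


def murphy_decomposition(y_true, y_proba):
    """Return (reliability, resolution, uncertainty), grouping by forecast value."""
    n = len(y_true)
    base_rate = sum(y_true) / n
    groups = defaultdict(list)
    for label, p in zip(y_true, y_proba):
        groups[p].append(label)
    # Reliability: squared gap between forecast and observed rate, per group.
    reliability = sum(
        len(g) * (p - sum(g) / len(g)) ** 2 for p, g in groups.items()
    ) / n
    # Resolution: how far group rates sit from the overall base rate.
    resolution = sum(
        len(g) * (sum(g) / len(g) - base_rate) ** 2 for g in groups.values()
    ) / n
    uncertainty = base_rate * (1.0 - base_rate)
    return reliability, resolution, uncertainty


y_true = [0, 0, 1, 1, 1, 0]
y_proba = [0.2, 0.2, 0.8, 0.8, 0.2, 0.8]
rel, res, unc = murphy_decomposition(y_true, y_proba)
brier = sum((p - y) ** 2 for y, p in zip(y_true, y_proba)) / len(y_true)
```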

Plots

HUG-IML first-class visualizations using Plotly.

Public API

from hugiml.plots import HUGPlotter

plotter = HUGPlotter(clf)
fig = plotter.plot_marginal_bin_profile("age", X)  # EBM shape-function equivalent
fig = plotter.plot_feature_combinations("age")  # compound patterns for one feature
fig = plotter.plot_feature_importance(top_n=15)
fig = plotter.plot_utility_vs_ig()  # scatter: utility × IG × support
fig = plotter.plot_top_patterns(top_n=20)
fig = plotter.plot_feature_coverage()
fig = plotter.plot_pattern_lengths()
fig = plotter.plot_support_distribution()
fig = plotter.plot_active_patterns(X, sample_idx=0)  # local explanation
html = plotter.plot_dashboard(X)  # full multi-panel HTML, returned as a string

class hugiml.plots.HUGPlotter(clf, height_default=380)[source]

Bases: object

Unified Plotly-based visualization interface for a fitted HUGIMLClassifierNative.

Parameters:

clf (fitted HUGIMLClassifierNative)
height_default (int) – Default figure height.

plot_marginal_bin_profile(feature_name, X=None, height=None, title=None)[source]

1-D HUG profile — EBM shape function equivalent.

For a given feature, shows every singleton pattern bin as a bar (x = bin label, y = utility, colour = information gain). An orange dotted line overlays the training support fraction on the right y-axis, mirroring the dashboard’s “Marginal Bin Profile” card.

Parameters:

feature_name (str)
X (ignored) – Support uses training data stored in clf.x_train_hup_.
height (int, optional)
title (str, optional)

Return type:

plotly.graph_objects.Figure

plot_feature_combinations(feature_name, top_n=25, height=None, title=None)[source]

Compound patterns that include a specific feature.

Each bar = one compound pattern; bars coloured by the number of extra features (+1 = green, +2 = orange, +3 = red), matching the dashboard’s “Feature Combinations” card.

Parameters:

feature_name (str)
top_n (int)
height (int, optional)
title (str, optional)

Return type:

go.Figure

plot_feature_importance(top_n=15, height=None, title=None)[source]

Feature importance: mean utility per feature, coloured by mean IG.

Matches the “Feature Importance” card in the governance dashboard.

Parameters:

top_n (int)
height (int, optional)
title (str, optional)

Return type:

go.Figure

plot_utility_vs_ig(feature_filter=None, height=None, title=None)[source]

Scatter: utility (x) × information gain (y), coloured by support.

Matches the “Utility vs Info Gain” card in the governance dashboard. Optionally filter to patterns containing one feature.

Parameters:

feature_filter (str, optional) – If given, highlight only patterns for this feature.
height (int, optional)
title (str, optional)

Return type:

go.Figure

plot_top_patterns(top_n=20, height=None, title=None)[source]

Horizontal bar chart of top-N patterns by utility, coloured by IG.

Matches the “Top Patterns” card in the governance dashboard.

Parameters:

top_n (int)
height (int, optional)
title (str, optional)

Return type:

go.Figure

plot_feature_coverage(top_n=15, height=None, title=None)[source]

Horizontal bar: how many patterns reference each feature.

Matches the “Feature Coverage” card in the governance dashboard.

Parameters:

top_n (int)
height (int | None)
title (str | None)

Return type:

plotly.graph_objects.Figure

plot_pattern_lengths(height=None, title=None)[source]

Bar chart of pattern length distribution.

Matches the “Pattern Lengths” card in the governance dashboard.

Parameters:

height (int | None)
title (str | None)

Return type:

plotly.graph_objects.Figure

plot_support_distribution(height=None, title=None)[source]

Histogram of pattern support values.

Matches the “Support Distribution” card in the governance dashboard.

Parameters:

height (int | None)
title (str | None)

Return type:

plotly.graph_objects.Figure

plot_active_patterns(X, sample_idx=0, max_patterns=20, height=None, title=None)[source]

Local explanation: active HUG patterns for a single sample.

Shows active patterns sorted by absolute coefficient magnitude, coloured blue for positive coefficients and red for negative coefficients.

Parameters:

X (array-like or DataFrame)
sample_idx (int)
max_patterns (int)
height (int, optional)
title (str, optional)

Return type:

go.Figure

plot_performance_radar(metrics, dataset_name='Dataset', height=None)[source]

Radar / spider chart of classification performance metrics.

Matches the “Performance” card in the governance dashboard.

Parameters:

metrics (dict) – Keys: ‘accuracy’, ‘balanced_accuracy’, ‘roc_auc’, ‘f1’. Values: floats in [0, 1].
dataset_name (str)
height (int, optional)

Return type:

go.Figure

plot_2d_profile(feature_a, feature_b, height=None, title=None)[source]

2-D HUG profile heatmap for compound patterns involving two features.

Parameters:

feature_a (str)
feature_b (str)
height (int, optional)
title (str, optional)

Return type:

go.Figure

plot_dashboard(X, dataset_name='Dataset', feature_names_for_profile=None, output_path=None)[source]

Generate a self-contained multi-panel HTML dashboard.

Produces performance overview, feature importance, utility-vs-IG, top patterns, pattern lengths, support distribution, feature coverage, and per-feature marginal bin profiles.

Parameters:

X (array-like or DataFrame) – Used for active-pattern coverage check.
dataset_name (str)
feature_names_for_profile (list of str, optional) – Which features to include marginal bin profiles for. Defaults to all features that have singleton patterns.
output_path (str, optional) – If given, writes the HTML to this path.

Return type:

str (HTML string)

Governance

Governance artifacts for HUGIMLClassifierNative.

Provides model card generation, audit artifact packaging, and governance metadata consistent with responsible model deployment practices and the HUG-IML paper’s emphasis on interpretability.

class hugiml.governance.ModelCard(model_id, model_type='HUGIMLClassifierNative', paper_reference='Krishnamoorthy, S. (2024). Interpretable Classifier Models for Decision Support Using High Utility Gain Patterns. IEEE Access, 12, 126088-126107. DOI: 10.1109/ACCESS.2024.3455563', license='Apache-2.0', intended_use='', out_of_scope_use='', training_data_description='', evaluation_data_description='', hyperparameters=<factory>, performance_metrics=<factory>, n_patterns=0, n_compound=0, top_patterns=<factory>, limitations=<factory>, ethical_considerations='', created_at=<factory>, framework_version='')[source]

Bases: object

Structured model card for a fitted HUGIMLClassifierNative.

Follows the Google Model Cards framework adapted for rule-based interpretable classifiers.

Parameters:

model_id (str)
model_type (str)
paper_reference (str)
license (str)
intended_use (str)
out_of_scope_use (str)
training_data_description (str)
evaluation_data_description (str)
hyperparameters (dict[str, Any])
performance_metrics (dict[str, Any])
n_patterns (int)
n_compound (int)
top_patterns (list[str])
limitations (list[str])
ethical_considerations (str)
created_at (str)
framework_version (str)

model_id

Unique identifier for this model version.

Type:: str

model_type

Always ‘HUGIMLClassifierNative’.

Type:: str

paper_reference

Citation for the HUG-IML algorithm.

Type:: str

license

Software license.

Type:: str

intended_use

Describe the intended classification task.

Type:: str

out_of_scope_use

Describe uses not covered by this model.

Type:: str

training_data_description

Description of training data.

Type:: str

evaluation_data_description

Description of evaluation data.

Type:: str

hyperparameters

B, L, G, topK as used during training.

Type:: dict

performance_metrics

Accuracy, F1, AUC, ECE, Brier score, etc.

Type:: dict

n_patterns

Number of mined HUG patterns.

Type:: int

n_compound

Number of compound patterns.

Type:: int

top_patterns

Most important patterns.

Type:: list of str

limitations

Known limitations.

Type:: list of str

ethical_considerations

Fairness, bias, and ethical notes.

Type:: str

created_at

ISO 8601 timestamp of creation.

Type:: str

framework_version

hugiml-core version.

Type:: str

to_dict()[source]

Serialize to a plain dictionary.

Return type:: dict

to_json(indent=2)[source]

Serialize to a JSON string.

Parameters:: indent (int)
Return type:: str

to_markdown()[source]

Render the model card as a Markdown document.

Return type:: str

save(path, fmt='json')[source]

Save the model card to a file.

Parameters:

path (str) – Output file path.
fmt ({'json', 'markdown', 'md'}) – Output format.

Return type:

None
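
The `<factory>` defaults in the ModelCard signature suggest a dataclass with default_factory fields. The following is a minimal sketch of that serialisation pattern, not the hugiml implementation: MiniModelCard, its reduced field set, and the hyperparameter values are hypothetical and only illustrate how to_dict(), to_json(), and to_markdown() can be layered on a dataclass.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class MiniModelCard:
    """Hypothetical stand-in for a model card; not the hugiml class."""

    model_id: str
    intended_use: str = ""
    hyperparameters: dict = field(default_factory=dict)
    limitations: list = field(default_factory=list)
    # default_factory gives each instance its own ISO 8601 timestamp (UTC).
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_dict(self) -> dict:
        return asdict(self)

    def to_json(self, indent: int = 2) -> str:
        return json.dumps(self.to_dict(), indent=indent)

    def to_markdown(self) -> str:
        lines = [f"# Model card: {self.model_id}", ""]
        lines.append(f"Intended use: {self.intended_use}")
        lines += ["", "## Limitations"]
        lines += [f"- {item}" for item in self.limitations]
        return "\n".join(lines)


card = MiniModelCard(
    model_id="credit-risk-v1",
    intended_use="Binary credit-risk screening",
    hyperparameters={"B": 5, "L": 2, "G": 0.001, "topK": 100},  # illustrative values
    limitations=["Trained on a single region"],
)
round_tripped = json.loads(card.to_json())
```

Because every field is a plain str, int, list, or dict, the JSON output round-trips without a custom encoder.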

class hugiml.governance.AuditArtifact(model_id, created_at=<factory>, training_hash='', model_card=None, governance=None, fit_metadata=None, pattern_info=None, calibration=None, explainability=None, framework_version='')[source]

Bases: object

Audit record for a model training run.

Captures all information needed for regulatory review or internal audit.

Parameters:

model_id (str)
created_at (str)
training_hash (str)
model_card (dict[str, Any] | None)
governance (dict[str, Any] | None)
fit_metadata (dict[str, Any] | None)
pattern_info (list[dict[str, Any]] | None)
calibration (dict[str, Any] | None)
explainability (dict[str, Any] | None)
framework_version (str)

to_dict()[source]

Return audit artifact fields as a plain dictionary.

Return type:: dict

to_json(indent=2)[source]

Serialise the audit artifact to a JSON string.

Parameters:: indent (int)
Return type:: str

save(path)[source]

Write the audit artifact to a JSON file.

Parameters:: path (str)
Return type:: None

class hugiml.governance.GovernanceMetadata(model_id, owner='', purpose='', data_classification='unclassified', review_status='draft', approved_by=None, approved_at=None, tags=<factory>)[source]

Bases: object

Minimal governance metadata attached to a model instance.

Parameters:

model_id (str)
owner (str)
purpose (str)
data_classification (str)
review_status (str)
approved_by (str | None)
approved_at (str | None)
tags (list[str])

model_id

Type:: str

owner

Person or team responsible for this model.

Type:: str

purpose

Business or scientific purpose.

Type:: str

data_classification

Sensitivity of training data (e.g. ‘public’, ‘internal’, ‘confidential’).

Type:: str

review_status

One of ‘draft’, ‘reviewed’, ‘approved’, ‘deprecated’.

Type:: str

approved_by

Type:: str or None

approved_at

Type:: str or None

tags

Type:: list of str

to_dict()[source]

Return governance metadata as a plain dictionary.

Return type:: dict

to_json(indent=2)[source]

Serialise governance metadata to a JSON string.

Parameters:: indent (int)
Return type:: str

hugiml.governance.generate_model_card(classifier, model_id, *, intended_use='', out_of_scope_use='', training_data_description='', evaluation_data_description='', performance_metrics=None, limitations=None, ethical_considerations='')[source]

Populate a ModelCard from a fitted classifier.

Parameters:

classifier (HUGIMLClassifierNative) – A fitted classifier.
model_id (str) – Unique identifier.
intended_use (str)
out_of_scope_use (str)
training_data_description (str)
evaluation_data_description (str)
performance_metrics (dict[str, Any] | None)
limitations (list[str] | None)
ethical_considerations (str)

Return type:

ModelCard

hugiml.governance.package_audit_artifacts(classifier, model_id, output_dir, *, model_card=None, governance=None, calibration_result=None, explainability_report=None)[source]

Package all audit artifacts for a trained model.

Writes model card, governance metadata, fit metadata, pattern info, and optional calibration/explainability reports to output_dir.

Returns:

Path to the audit manifest JSON file.

Return type:

str

Parameters:

classifier (Any)
model_id (str)
output_dir (str)
model_card (ModelCard | None)
governance (GovernanceMetadata | None)
calibration_result (Any | None)
explainability_report (Any | None)

Explainability

Enterprise explainability for HUGIMLClassifierNative.

Provides SHAP interoperability, feature lineage tracking, explanation stability metrics, and audit artifact generation. The core HUG patterns are human-readable by design; this module adds depth for downstream governance and audit workflows.

class hugiml.explainability.ExplainabilityReport(model_id, n_patterns, n_features, top_patterns=<factory>, feature_lineage=<factory>, model_composition=<factory>, augmented_pair_effects=<factory>, stability=None, shap_available=False)[source]

Bases: object

Full explainability report for a fitted classifier instance.

Contains pattern importances, feature lineage, and stability metrics. Serializable to JSON for audit workflows.

Parameters:

model_id (str)
n_patterns (int)
n_features (int)
top_patterns (list[dict[str, Any]])
feature_lineage (list[dict[str, Any]])
model_composition (dict[str, Any])
augmented_pair_effects (list[dict[str, Any]])
stability (dict[str, Any] | None)
shap_available (bool)

to_json(indent=2)[source]

Serialize the report to a JSON string.

Parameters:: indent (int)
Return type:: str

save(path)[source]

Write the report to a JSON file.

Parameters:: path (str)
Return type:: None

class hugiml.explainability.FeatureLineage(feature_name, feature_type, derived_patterns=<factory>, pattern_indices=<factory>, derived_augmented_pairs=<factory>, total_importance=0.0, pattern_importance=0.0, augmented_pair_importance=0.0, original_feature_importance=0.0)[source]

Bases: object

Provenance record linking an original feature to downstream features.

Parameters:

feature_name (str)
feature_type (str)
derived_patterns (list[str])
pattern_indices (list[int])
derived_augmented_pairs (list[str])
total_importance (float)
pattern_importance (float)
augmented_pair_importance (float)
original_feature_importance (float)

feature_name

Original feature name from the training DataFrame.

Type:: str

feature_type

One of ‘integer’, ‘float’, ‘categorical’.

Type:: str

derived_patterns

Human-readable HUG pattern labels that include this feature.

Type:: list of str

pattern_indices

Indices into the pattern list for each derived pattern.

Type:: list of int

derived_augmented_pairs

Augmented-pair feature names that use this source feature.

Type:: list of str

total_importance

Sum of absolute downstream coefficients for original, HUG pattern, and augmented-pair features linked to this source feature.

Type:: float

pattern_importance

Pattern-only contribution to total_importance.

Type:: float

augmented_pair_importance

Augmented-pair contribution to total_importance.

Type:: float

original_feature_importance

Direct original-feature contribution when original features are included in the downstream estimator.

Type:: float

class hugiml.explainability.ExplanationStabilityMetrics(jaccard_similarity=0.0, rank_correlation=0.0, pattern_overlap_count=0, n_patterns_a=0, n_patterns_b=0, by_feature_type=<factory>)[source]

Bases: object

Stability metrics for pattern-based explanations.

The top-level fields report stability for mined HUG patterns only. When original or augmented-pair downstream features are present, per-feature-type metrics are available in by_feature_type so derived feature stability is not conflated with human-readable pattern-rule stability.

Parameters:

jaccard_similarity (float)
rank_correlation (float)
pattern_overlap_count (int)
n_patterns_a (int)
n_patterns_b (int)
by_feature_type (dict[str, dict[str, float | int]])

class hugiml.explainability.HUGPatternExplainer(classifier)[source]

Bases: object

Enterprise explainability layer over a fitted HUGIMLClassifierNative.

Extracts feature lineage, computes explanation stability, and provides a SHAP-compatible interface where available. Designed to operate on the already-mined HUG patterns without re-running the algorithm.

Parameters:: classifier (HUGIMLClassifierNative) – A fitted classifier instance.

feature_lineage()[source]

Build feature lineage mapping each input feature to its patterns.

Returns:: One entry per original input feature.
Return type:: list of FeatureLineage

explanation_stability(X_a, y_a, X_b, y_b, top_n=20)[source]

Measure explanation stability across two data splits.

Fits two copies of the classifier on split A and split B. The headline metrics compare only mined HUG patterns. Additional metrics are returned by feature type so original features, HUG patterns, and augmented-pair transforms are not mixed into a single stability score.

Parameters:

X_a (split A data)
y_a (split A data)
X_b (split B data)
y_b (split B data)
top_n (int) – How many top patterns to compare.

Return type:

ExplanationStabilityMetrics

generate_report(model_id='hugiml_model', top_n=20)[source]

Generate a complete explainability report.

Parameters:

model_id (str) – Identifier for this model instance.
top_n (int) – Number of top patterns to include.

Return type:

ExplainabilityReport

hugiml.explainability.shap_values_from_pattern_matrix(classifier, X, *, background_samples=100, check_additivity=False, allow_incomplete=False)[source]

Compute SHAP values over the HUG pattern feature space.

Applies SHAP’s LinearExplainer (or KernelExplainer as fallback) on the binary pattern-presence matrix produced by the classifier’s transform() method. The resulting SHAP values are in pattern-space; use aggregate_shap_to_features() to roll them back to original features.

When the fitted downstream estimator also uses original or augmented-pair features, pattern-space SHAP is incomplete relative to the fitted model. In that case this function warns and returns None unless allow_incomplete=True is passed explicitly.

Requires the optional shap package (pip install shap).

Parameters:

classifier (HUGIMLClassifierNative) – A fitted classifier.
X (array-like) – Input data to explain.
background_samples (int) – Number of background samples for KernelExplainer.
check_additivity (bool) – Pass to SHAP’s explain call.
allow_incomplete (bool) – If False, return None when the fitted downstream estimator uses original or augmented-pair features in addition to HUG patterns.

Returns:

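SHAP values in pattern space. Returns None when shap is not installed, or when the fitted downstream estimator uses original or augmented-pair features and allow_incomplete=False.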

Return type:

np.ndarray of shape (n_samples, n_patterns) or None

Monitoring

Operational monitoring for HUGIMLClassifierNative.

Provides thread-safe prediction statistics tracking and multi-method distribution drift detection combining PSI, KL divergence, and label drift monitoring.

class hugiml.monitoring.PredictionMonitor(window_size=1000)[source]

Bases: object

Thread-safe prediction statistics tracker.

Attach to a fitted classifier via clf.enable_monitoring(). Access statistics via clf.monitor.report() or clf.monitor.stats.

Tracks prediction count, confidence distribution, per-class frequency, and latency percentiles over a rolling window.

Parameters:: window_size (int)

reset()[source]

Clear all accumulated statistics.

Return type:: None

record(proba, latency_ms)[source]

Record one batch of predictions.

Parameters:

proba (np.ndarray, shape (n_samples, n_classes)) – Predicted class probabilities.
latency_ms (float) – Wall-clock time for this batch in milliseconds.

Return type:

None

property stats: dict: Current monitoring statistics as a plain dict.

report()[source]

Human-readable monitoring report.

Return type:: str
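
The rolling-window behaviour can be pictured with a bounded deque guarded by a lock. This is a sketch under stated assumptions, not the PredictionMonitor source: RollingStats, its stat keys, and the nearest-rank percentile are hypothetical choices.

```python
import math
import threading
from collections import deque


class RollingStats:
    """Hypothetical rolling-window tracker; not the hugiml implementation."""

    def __init__(self, window_size=1000):
        self._lock = threading.Lock()
        self._confidence = deque(maxlen=window_size)  # oldest entries drop out
        self._latency_ms = deque(maxlen=window_size)
        self._count = 0

    def record(self, proba, latency_ms):
        """proba: list of per-sample probability rows for one batch."""
        with self._lock:
            for row in proba:
                self._confidence.append(max(row))  # top-class probability
            self._latency_ms.append(latency_ms)
            self._count += len(proba)

    @property
    def stats(self):
        with self._lock:
            lat = sorted(self._latency_ms)
            conf = list(self._confidence)
        # Nearest-rank 95th percentile over latencies still in the window.
        p95 = lat[max(0, math.ceil(0.95 * len(lat)) - 1)] if lat else None
        return {
            "n_predictions": self._count,
            "mean_confidence": sum(conf) / len(conf) if conf else None,
            "latency_p95_ms": p95,
        }


monitor = RollingStats(window_size=3)
monitor.record([[0.9, 0.1], [0.4, 0.6]], latency_ms=2.0)
monitor.record([[0.2, 0.8], [0.5, 0.5]], latency_ms=4.0)
snapshot = monitor.stats
```

The total prediction count keeps growing, while the confidence and latency summaries only reflect the most recent window_size entries.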

class hugiml.monitoring.DriftDetector(n_bins=10)[source]

Bases: object

Multi-method distribution drift detector.

Combines Population Stability Index (PSI) and symmetric KL divergence for robust drift assessment. Optionally tracks label drift when ground truth is available.

PSI thresholds:

< 0.1: stable
0.1–0.25: moderate shift
> 0.25: significant drift

Parameters:: n_bins (int) – Number of histogram bins for numerical features.

fit_baseline(X, cat_mask, col_names=None, y=None)[source]

Store training distribution for later comparison.

Parameters:

X (np.ndarray, shape (n, p))
cat_mask (np.ndarray of bool, shape (p,))
col_names (list of str, optional)
y (np.ndarray of int, optional) – Training labels for label-drift baseline.

Return type:

None

compute_psi(X_test)[source]

Compute PSI per numerical feature between training and test.

Return type:: dict mapping column name to PSI value.
Parameters:: X_test (ndarray)
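
For reference, PSI over binned proportions p (training) and q (test) is the sum of (q_i - p_i) * ln(q_i / p_i). The sketch below computes it for one numerical feature. It assumes equal-width bins taken from the training range and a small eps floor for empty bins; hugiml's actual binning and smoothing may differ. The severity helper simply applies the PSI thresholds listed for DriftDetector.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between two samples of one numerical feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            # Clamp so out-of-range test values land in the edge bins.
            idx = min(n_bins - 1, max(0, int((v - lo) / width)))
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def severity(value):
    """Map a PSI value onto the documented threshold bands."""
    if value < 0.1:
        return "stable"
    return "moderate shift" if value <= 0.25 else "significant drift"


train = [i / 100 for i in range(100)]          # roughly uniform on [0, 1)
shifted = [0.5 + i / 200 for i in range(100)]  # mass moved to the upper half

psi_same = psi(train, train)
psi_shifted = psi(train, shifted)
```

Identical samples give a PSI of zero, while the shifted sample empties the lower bins and lands well above the 0.25 band.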

compute_kl(X_test)[source]

Compute symmetric KL divergence per feature.

Return type:: dict mapping column name to KL value.
Parameters:: X_test (ndarray)
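
A symmetric KL divergence can be sketched for a categorical feature as follows. The 0.5 averaging factor, the eps floor for unseen categories, and the renormalisation step are assumptions made for illustration (some definitions omit the 0.5); the function name is hypothetical.

```python
import math
from collections import Counter


def symmetric_kl(train_values, test_values, eps=1e-6):
    """0.5 * (KL(p||q) + KL(q||p)) over the categories seen in either sample."""
    train_counts, test_counts = Counter(train_values), Counter(test_values)
    categories = set(train_counts) | set(test_counts)

    def probs(counts, total):
        # eps keeps unseen categories from producing log(0); then renormalise.
        raw = {c: max(counts[c] / total, eps) for c in categories}
        norm = sum(raw.values())
        return {c: v / norm for c, v in raw.items()}

    p = probs(train_counts, len(train_values))
    q = probs(test_counts, len(test_values))
    kl_pq = sum(p[c] * math.log(p[c] / q[c]) for c in categories)
    kl_qp = sum(q[c] * math.log(q[c] / p[c]) for c in categories)
    return 0.5 * (kl_pq + kl_qp)


train_plan = ["basic"] * 70 + ["plus"] * 30
test_plan = ["basic"] * 40 + ["plus"] * 50 + ["pro"] * 10  # new category appears

kl_forward = symmetric_kl(train_plan, test_plan)
kl_backward = symmetric_kl(test_plan, train_plan)
```

Unlike plain KL divergence, the result does not depend on which sample is treated as the reference.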

compute_label_drift(y_test)[source]

Compute per-class proportion shift between training and test labels.

Returns None when no training label baseline is available.

Parameters:: y_test (ndarray)
Return type:: dict[str, float] | None
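
One plausible reading of "per-class proportion shift" is the test proportion minus the training proportion for each class. The sketch below assumes that signed definition and string keys to match the documented dict[str, float] return type; whether hugiml reports signed or absolute shifts is not specified here.

```python
from collections import Counter


def label_drift(y_train, y_test):
    """Signed per-class proportion shift (test minus train), or None without data."""
    if not y_train or not y_test:
        return None  # mirrors the "no baseline available" case
    train, test = Counter(y_train), Counter(y_test)
    classes = sorted(set(train) | set(test))
    return {
        str(c): test[c] / len(y_test) - train[c] / len(y_train)
        for c in classes
    }


shift = label_drift([0] * 80 + [1] * 20, [0] * 60 + [1] * 40)
```

Here class 1 grows from 20% to 40% of the labels, so its shift is about +0.2 and class 0 moves by about -0.2.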

detect(X_test, y_test=None, threshold=0.1)[source]

Run full multi-method drift detection.

Parameters:

X_test (np.ndarray)
y_test (np.ndarray of int, optional)
threshold (float) – PSI threshold above which a feature is flagged.

Return type:

DriftReport

report(X_test, threshold=0.1)[source]

Return a human-readable drift report string (PSI only).

Parameters:

X_test (ndarray)
threshold (float)

Return type:

str

class hugiml.monitoring.DriftReport(psi, kl_divergence, label_drift, threshold)[source]

Bases: object

Structured result from a drift detection run.

Parameters:

psi (dict)
kl_divergence (dict)
label_drift (dict | None)
threshold (float)

psi

Population Stability Index per feature.

Type:: dict[str, float]

kl_divergence

Symmetric KL divergence per feature.

Type:: dict[str, float]

label_drift

Per-class label proportion shift (requires y_test).

Type:: dict[str, float] or None

overall_psi

Mean PSI across all numerical features.

Type:: float

overall_kl

Mean KL divergence across all numerical features.

Type:: float

drifted_features

Features exceeding the PSI threshold.

Type:: list[str]

severity

One of ‘none’, ‘moderate’, ‘significant’.

Type:: str

to_dict()[source]

Return all drift metrics as a plain dictionary.

Return type:: dict

Multiclass and imbalance

Helpers for three common HUG-IML deployment scenarios:

Multiclass classification — HUGIMLClassifierNative supports multiclass natively via its base_estimator (LogisticRegression with solver='lbfgs' when n_classes > 2). This module provides a MulticlassHUGReport that extracts per-class pattern importances.
Imbalanced data — wraps the classifier in a cost-sensitive or resampling pipeline via make_imbalanced_pipeline.
High-cardinality categoricals — encode_high_cardinality replaces columns with many unique values with target-mean encoding or a frequency encoding before passing data to prepareXy.

class hugiml.multiclass.MulticlassHUGReport(clf)[source]

Bases: object

Per-class pattern importances for a multiclass HUG-IML model.

When the downstream estimator is LogisticRegression with > 2 classes, coef_ has shape (n_classes, n_patterns). This class exposes per-class top patterns.

Parameters:: clf (fitted HUGIMLClassifierNative)

importances_for_class(class_label, top_n=20)[source]

Return the top-N patterns for a specific class.

Parameters:

class_label (class value in clf.classes_)
top_n (int)

Returns:

DataFrame with columns pattern, coefficient, abs_coefficient, support.

Return type:

pd.DataFrame

summary(top_n=10)[source]

Human-readable summary of top patterns per class.

Parameters:: top_n (int)
Return type:: str

hugiml.multiclass.make_imbalanced_pipeline(clf, strategy='class_weight', sampling_ratio=1.0, random_state=42)[source]

Wrap a HUGIMLClassifierNative for use with imbalanced data.

Parameters:

clf (HUGIMLClassifierNative (unfitted))
strategy ({'class_weight', 'smote', 'random_oversample', 'random_undersample'}) –
- class_weight — sets class_weight='balanced' on the downstream LR. Zero overhead; recommended first choice.
- smote — SMOTE oversampling via imbalanced-learn.
- random_oversample — random oversampling via imbalanced-learn.
- random_undersample — random undersampling via imbalanced-learn.
sampling_ratio (float) – Target minority:majority ratio (only for imbalanced-learn strategies).
random_state (int)

Returns:

Fitted wrapper or HUGIMLClassifierNative (for 'class_weight'). The returned object has fit(X, y), predict_proba(X), and predict(X) methods.

Notes

For 'class_weight': returns a copy of clf with base_estimator set to LogisticRegression(class_weight='balanced').

For SMOTE/resampling: returns an ImbalancedHUGPipeline that applies resampling to the pattern matrix (post-transform) inside fit(). HUG patterns are therefore mined on the original distribution (as intended), while the downstream classifier trains on the resampled binary matrix.
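
For intuition about the class_weight strategy: scikit-learn's 'balanced' mode weights each class by n_samples / (n_classes * class_count). The pure-Python sketch below reproduces that formula on a made-up 90/10 label split; the helper name is hypothetical.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weights as used by scikit-learn's class_weight='balanced' heuristic."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


weights = balanced_class_weights([0] * 90 + [1] * 10)
```

The minority class receives a weight of 5.0 and the majority class about 0.56, so both classes contribute equally to the weighted loss in total.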

hugiml.multiclass.encode_high_cardinality(X, y=None, threshold=20, method='target_mean', min_samples_leaf=5, smoothing=1.0, random_state=42)[source]

Replace high-cardinality categorical columns with numerical encodings.

This should be called before prepareXy; the returned mapping can be applied to test data via apply_encoding.

Parameters:

X (pd.DataFrame)
y (array-like, optional) – Required when method='target_mean'.
threshold (int) – Columns with more than this many unique values are considered high-cardinality.
method ({'target_mean', 'frequency', 'ordinal'}) –
- target_mean — replace each category with its mean target value (smoothed towards the global mean). Reduces categories to a single float — most informative for tree/rule-based models.
- frequency — replace with the category’s relative frequency.
- ordinal — assign arbitrary integer codes (fast, no leakage, but loses any ordering meaning).
min_samples_leaf (int) – Minimum observations per category before smoothing kicks in (target_mean only).
smoothing (float) – Smoothing strength (target_mean only).
random_state (int) – Used internally for any random operations.

Returns:

X_encoded (pd.DataFrame (copy — original is unchanged))
encoding_map (dict) – Mapping {column_name: dict_or_array} to apply to unseen data via apply_encoding(X_test, encoding_map).

Return type:

tuple[DataFrame, dict]

Notes

Data-leakage safety: call encode_high_cardinality on the training split only. Use apply_encoding on test/validation data with the map returned from training. Never fit the encoding on combined train+test data.

hugiml.multiclass.apply_encoding(X, encoding_map, fill_value=0.0)[source]

Apply an encoding map (produced by encode_high_cardinality) to new data.

Parameters:

X (pd.DataFrame)
encoding_map (dict (from encode_high_cardinality))
fill_value (float) – Value for unseen categories.

Return type:

pd.DataFrame (copy)
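
The fit-on-train, apply-to-test pattern can be sketched without pandas. The smoothing formula below (a sigmoid weight driven by min_samples_leaf and smoothing) is one common formulation and is an assumption here, not a statement of what encode_high_cardinality computes; fit_target_mean and apply_mapping are hypothetical helper names.

```python
import math
from collections import defaultdict


def fit_target_mean(categories, targets, min_samples_leaf=5, smoothing=1.0):
    """Learn a smoothed category -> mean-target mapping from training data only."""
    prior = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, t in zip(categories, targets):
        sums[cat] += t
        counts[cat] += 1
    mapping = {}
    for cat, n in counts.items():
        # Weight approaches 1 as the category count grows past min_samples_leaf.
        weight = 1.0 / (1.0 + math.exp(-(n - min_samples_leaf) / smoothing))
        mapping[cat] = weight * (sums[cat] / n) + (1.0 - weight) * prior
    return mapping


def apply_mapping(categories, mapping, fill_value=0.0):
    """Encode new data with a fitted mapping; unseen categories get fill_value."""
    return [mapping.get(cat, fill_value) for cat in categories]


train_cats = ["a"] * 10 + ["b"] * 2
train_y = [1] * 8 + [0] * 2 + [1, 1]
global_mean = sum(train_y) / len(train_y)

encoding = fit_target_mean(train_cats, train_y)
encoded_test = apply_mapping(["a", "b", "never_seen"], encoding)
```

The rare category "b" is pulled strongly towards the global mean, while the well-populated category "a" stays close to its own mean of 0.8. The mapping is learned from the training split only, which is what keeps the test split free of target leakage.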

Pattern pruning

Regulated “remove / refit / calibrate” workflow for HUG-IML.

EBMs are valued partly because model terms can be inspected and sometimes edited (e.g. to remove an ethically problematic interaction term). This module gives HUG-IML an analogous controlled editing workflow that is rigorous enough for regulated-domain review cycles.

Workflow

Inspect patterns via clf.feature_importances() or clf.get_pattern_info().
Create a PatternEditor and call remove() with a list of pattern indices (or keyword filters).
Call refit(X_tr, y_tr) to re-train the downstream classifier on the pruned pattern matrix. The C++ mining results are unchanged.
Optionally call calibrate(X_cal, y_cal) to wrap the refitted model with Platt scaling / isotonic regression.
Call finalize() to get a new classifier instance with the edited pattern set baked in, and audit_report() for a JSON audit trail.

Example

from hugiml.pruning import PatternEditor

editor = PatternEditor(clf)
editor.remove([3, 7, 12], reason="pattern references protected attribute 'gender'")
editor.remove_by_keyword("income", reason="unstable feature (high PSI)")
new_clf = editor.refit(X_tr, y_tr).calibrate(X_cal, y_cal).finalize()

print(editor.audit_report())
new_clf.predict_proba(X_te)

class hugiml.pruning.PatternEditor(clf, operator_name='analyst')[source]

Bases: object

Controlled pattern editing with full audit trail.

Parameters:

clf (fitted HUGIMLClassifierNative) – The original model. This object is not mutated; all edits produce a fresh copy stored internally.
operator_name (str) – Human-readable identifier of the person/process making the edits (for the audit trail).

remove(pattern_indices, reason='unspecified')[source]

Remove patterns by index (0-based, relative to the current working set).

Parameters:

pattern_indices (list of int) – Indices into the current pattern list. Use list_patterns() to preview indices.
reason (str) – Audit reason (e.g. ‘protected attribute’, ‘operationally invalid’).

Return type:

self (for method chaining)
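
Because indices are relative to the current working set, they shift after every removal. The sketch below uses a plain list and a hypothetical remove helper (not the PatternEditor API) to show the effect, and to show that the original list stays untouched.

```python
patterns = ["age<=30", "gender=F", "income>50k", "gender=F & age<=30", "tenure>5"]
audit = []


def remove(working, indices, reason):
    """Return a new list without the given indices and log what was dropped."""
    drop = set(indices)
    audit.append({"removed": [working[i] for i in sorted(drop)], "reason": reason})
    return [p for i, p in enumerate(working) if i not in drop]


working = remove(patterns, [1, 3], reason="protected attribute")
# Indices now refer to the 3 remaining patterns, so index 1 is "income>50k".
working = remove(working, [1], reason="unstable feature")
```

Re-listing the working set before each removal, as list_patterns() allows, avoids removing the wrong pattern by a stale index.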

remove_by_keyword(keyword, reason='keyword match', case_sensitive=False)[source]

Remove all patterns whose label contains keyword.

Parameters:

keyword (str)
reason (str)
case_sensitive (bool)

Return type:

self

remove_low_support(min_support=0.01, reason='support below threshold')[source]

Remove patterns with training support below min_support.

Parameters:

min_support (float) – Minimum fraction of training samples (0 to 1).
reason (str)

Return type:

self

refit(X_tr, y_tr, estimator=None)[source]

Refit the downstream classifier on the (pruned) pattern matrix.

The HUG mining results (patterns_) are unchanged; only the downstream Pipeline (model_) is replaced.

Parameters:

X_tr (array-like or DataFrame) – Training data (should be the same split used to fit the original model).
y_tr (array-like)
estimator (sklearn estimator, optional) – If None, uses the original downstream estimator class with the same hyperparameters.

Return type:

self

calibrate(X_cal, y_cal, method='isotonic')[source]

Wrap the refitted downstream model with probability calibration.

Uses sklearn.calibration.CalibratedClassifierCV applied post-fit to a calibration set that should be held out from both training and test.

Parameters:

X_cal (array-like or DataFrame)
y_cal (array-like)
method ({'sigmoid', 'isotonic'})

Return type:

self

finalize()[source]

Return the edited classifier as a new standalone instance.

After calling finalize(), further edits on this editor are blocked. The returned object is a fully independent copy.

Return type:: HUGIMLClassifierNative (edited copy)

list_patterns()[source]

Return editable HUG patterns in the current working model.

PatternEditor edits mined HUG patterns only. Original features and augmented-pair downstream features are visible through list_downstream_features() but are not directly removable by this editor.

Return type:: DataFrame

list_downstream_features()[source]

Return all downstream features with PatternEditor editability.

The returned table includes original features, HUG patterns, and augmented-pair transforms when present. Only rows with feature_type == 'pattern' are directly editable through remove() and related PatternEditor methods.

Return type:: DataFrame

diff()[source]

Return a summary of changes made relative to the original model.

Returns:: dict with keys n_original, n_current, n_removed, removed_patterns
Return type:: dict

audit_report(indent=2)[source]

Return a JSON string describing all edits made.

The report includes operator name, timestamps, reasons, and the diff summary.

Parameters:: indent (int)
Return type:: str

save_audit_report(path)[source]

Write the audit report to a JSON file.

Parameters:: path (str)
Return type:: None

class hugiml.pruning.RemovalRecord(timestamp, pattern_indices, pattern_labels, reason, removed_by)[source]

Bases: object

Audit record for a single pattern-removal action.

Parameters:

timestamp (str)
pattern_indices (list[int])
pattern_labels (list[str])
reason (str)
removed_by (str)

Serialization

Versioned serialization and SBOM generation for HUGIMLClassifier.

Format (v3+ — current writer)

A ZIP archive containing JSON manifests and NumPy array bundles. Built-in LogisticRegression, SGDClassifier, and RPTE downstream models are stored as structured configuration plus fitted NumPy state. OneVsRestClassifier and Pipeline containers are serialized recursively. Estimators without a native serializer continue to use the restricted custom-estimator fallback.

Archive layout:

manifest.json          – format_version, schema_version, timestamp
clf_init.json          – __init__ hyperparameters
clf_fit.json           – scalar / list fitted attributes
patterns.json          – list of {utility, items, ig} dicts
arrays.npz             – cat_cols_mask_, is_int_mask_, classes_
td_config.json         – TransactionDataWrapper non-array state
td_arrays.npz          – TransactionDataWrapper numpy arrays
estimator.json         – downstream estimator class + parameters
estimator_arrays.npz   – downstream estimator numpy arrays
hmac.sig               – HMAC-SHA256 over all content files (hex)

Authentication

Set HUGIML_MODEL_HMAC_KEY (hex-encoded, 32+ bytes) before saving or loading. Files saved without a key have an all-zero hmac.sig and can still be loaded unless HUGIML_REQUIRE_MODEL_HMAC=true is set.
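
To illustrate the idea of an HMAC-SHA256 over archive contents, the sketch below signs every member of an in-memory ZIP except hmac.sig. The member ordering, the inclusion of member names in the digest, and the sign_archive helper are assumptions for illustration; the byte layout hugiml actually signs is not documented here, so this will not reproduce a real hmac.sig.

```python
import hashlib
import hmac
import io
import json
import zipfile


def sign_archive(data: bytes, key_hex: str) -> str:
    """HMAC-SHA256 over every member except hmac.sig, in sorted name order."""
    mac = hmac.new(bytes.fromhex(key_hex), digestmod=hashlib.sha256)
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for name in sorted(n for n in zf.namelist() if n != "hmac.sig"):
            mac.update(name.encode("utf-8"))
            mac.update(zf.read(name))
    return mac.hexdigest()


# Build a tiny archive with two of the documented member names.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("manifest.json", json.dumps({"format_version": 3}))
    zf.writestr("patterns.json", json.dumps([{"utility": 1.0, "items": [0], "ig": 0.1}]))
archive = buffer.getvalue()

key_hex = "ab" * 32  # placeholder 32-byte key for the example only
signature = sign_archive(archive, key_hex)
# compare_digest avoids leaking match position through timing differences.
is_valid = hmac.compare_digest(signature, sign_archive(archive, key_hex))
wrong_key_signature = sign_archive(archive, "cd" * 32)
```

In practice the key would come from the HUGIML_MODEL_HMAC_KEY environment variable rather than a literal in code.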

Backward compatibility (v1/v2)

Models saved with schema version 1 or 2 (the legacy HMAC-pickle format) are still loadable via a restricted Unpickler that permits only known HUG-IML and sklearn modules. v1/v2 writing is not supported.

hugiml.serialization.save_model(clf, path)[source]

Persist a fitted classifier to a v3 ZIP/JSON/NumPy model file.

Parameters:

clf (HUGIMLClassifier) – A fitted classifier.
path (str or Path)

Raises:

HUGIMLSerializationError – When the model is unfitted, a component cannot be serialized, or the write fails.

Return type:

None

hugiml.serialization.load_model(path, expected_type=None)[source]

Load a classifier from a file saved by save_model().

Supports:

v3: ZIP/JSON/NumPy format (default since 2.1)
v1/v2: legacy HMAC-pickle format (read-only; still authenticated)

Parameters:

path (str or Path)
expected_type (type, optional)

Return type:

HUGIMLClassifier

Raises:

HUGIMLVersionError – When schema version is incompatible.
HUGIMLSerializationError – When the file is corrupt, missing, has an invalid HMAC, or contains an unexpected type.

hugiml.serialization.generate_sbom(output_path=None)[source]

Generate a Software Bill of Materials for the installed hugiml-core.

Parameters:: output_path (str, optional)
Return type:: dict — CycloneDX-lite SBOM document.

Telemetry

OpenTelemetry and Prometheus instrumentation for HUGIMLClassifierNative.

Both integrations are strictly optional: if the respective packages are not installed the module degrades gracefully to no-op stubs. Import and use of this module never breaks the classifier itself.

OpenTelemetry

Wraps fit(), predict_proba(), and predict() with OTEL spans and attributes. Set HUGIML_OTEL_ENABLED=1 to activate.

Prometheus

Exposes prediction count, latency histogram, and confidence gauge. Set HUGIML_PROMETHEUS_ENABLED=1 to activate.

Debug logging

All non-fatal telemetry and metrics failures are logged at DEBUG level (logger = logging.getLogger("hugiml.telemetry")) with exc_info=True. Stack traces are therefore available when the root logger is configured at DEBUG, without any user-visible noise at INFO or above.
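
To see these messages without switching the whole application to DEBUG, the named logger can be configured on its own using the standard logging module. The handler and format string below are illustrative choices.

```python
import logging

telemetry_logger = logging.getLogger("hugiml.telemetry")
telemetry_logger.setLevel(logging.DEBUG)

handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
)
telemetry_logger.addHandler(handler)
# Avoid duplicate lines if the root logger also has handlers attached.
telemetry_logger.propagate = False
```

Other loggers keep their existing levels, so only telemetry diagnostics become visible.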

class hugiml.telemetry.HUGIMLTracer[source]

Bases: object

OpenTelemetry tracer wrapper for HUGIMLClassifierNative.

Emits spans for fit, predict_proba, and predict with attributes including n_samples, n_patterns, and latency.

When opentelemetry-api is not installed all operations are no-ops.

classmethod span(name, attributes=None)[source]

Context manager yielding an OTEL span (or no-op).

Parameters:

name (str)
attributes (dict | None)

Return type:

Generator[Any, None, None]
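
The "span or no-op" behaviour can be sketched with contextlib. This is an illustration of the pattern rather than the HUGIMLTracer source: the module-level span function and the tracer name "hugiml.sketch" are hypothetical, and the OpenTelemetry branch only runs when opentelemetry-api is installed.

```python
from contextlib import contextmanager

try:
    from opentelemetry import trace  # optional dependency
except ImportError:
    trace = None


@contextmanager
def span(name, attributes=None):
    """Yield a real OTEL span when the API is installed, otherwise None."""
    if trace is None:
        yield None  # no-op path: the caller's code runs unchanged
        return
    tracer = trace.get_tracer("hugiml.sketch")  # tracer name is illustrative
    with tracer.start_as_current_span(name, attributes=attributes or {}) as active:
        yield active


with span("predict_proba", {"n_samples": 128}) as active_span:
    result = sum(range(10))  # stand-in for the instrumented work
```

Callers write the same with-block in both cases, which is why a missing telemetry package never affects the classifier.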

class hugiml.telemetry.HUGIMLMetrics[source]

Bases: object

Prometheus metrics for HUGIMLClassifierNative.

Exposes:

hugiml_predictions_total counter
hugiml_prediction_latency_seconds histogram
hugiml_confidence_mean gauge
hugiml_drift_psi gauge (per-feature)

When prometheus_client is not installed all metrics are no-ops.

classmethod record_prediction(model_id, n_samples, latency_s, mean_confidence, success=True)[source]

Record prediction metrics.

Parameters:

model_id (str)
n_samples (int)
latency_s (float)
mean_confidence (float)
success (bool)

Return type:

None

classmethod record_drift(model_id, psi_dict)[source]

Update per-feature PSI gauges.

Parameters:

model_id (str)
psi_dict (dict)

Return type:

None

hugiml.telemetry.instrument_classifier(classifier, model_id='default')[source]

Wrap a fitted classifier with telemetry instrumentation.

Patches predict_proba and predict methods in-place to emit OTEL spans and Prometheus metrics. The classifier itself is modified and returned.

Parameters:

classifier (HUGIMLClassifierNative)
model_id (str)

Return type:

The same classifier instance with patched methods.

Exceptions

Structured exception and warning hierarchy for HUG-IML.

Taxonomy:

HUGIMLError (base)
├── HUGIMLFitError          — any failure during fit()
│   ├── HUGIMLMiningError   — pattern mining specifically
│   ├── HUGIMLTimeoutError  — max_fit_seconds exceeded
│   └── HUGIMLMemoryError   — native/Python memory budget exceeded
├── HUGIMLValidationError   — input data / param validation
│   ├── HUGIMLSchemaError   — column mismatch at predict time
│   └── HUGIMLParamError    — bad hyperparameter values / types
├── HUGIMLSerializationError — load/save failures
│   └── HUGIMLVersionError  — schema version incompatibility
└── HUGIMLPredictionError   — failures during predict/transform

HUGIMLWarning (base, UserWarning subclass)
├── HUGIMLConvergenceWarning — model converged to minimal patterns
├── HUGIMLDtypeDriftWarning  — categorical column dtype changed
├── HUGIMLRangeWarning       — feature values outside training range
├── HUGIMLDegradedWarning    — model degraded due to timeout/memory
└── HUGIMLDeprecationWarning — deprecated API usage
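
The extra built-in bases mean that generic handlers written before this hierarchy existed keep working. The sketch below uses local stand-in classes that mirror the documented bases for part of the tree; real code would import the names from hugiml.exceptions instead, and the helper caught_by is hypothetical.

```python
# Local stand-ins mirroring the documented bases; not imported from hugiml.
class HUGIMLError(Exception): ...
class HUGIMLValidationError(HUGIMLError, ValueError): ...
class HUGIMLSchemaError(HUGIMLValidationError): ...
class HUGIMLParamError(HUGIMLValidationError, TypeError): ...


def caught_by(exc):
    """List which generic handlers would catch the given exception."""
    handlers = (ValueError, TypeError, HUGIMLError)
    return [h.__name__ for h in handlers if isinstance(exc, h)]


schema_handlers = caught_by(HUGIMLSchemaError("column mismatch"))
param_handlers = caught_by(HUGIMLParamError("bad hyperparameter"))

# A legacy except-ValueError block still sees the more specific error.
try:
    raise HUGIMLSchemaError("column mismatch at predict time")
except ValueError as exc:
    legacy_handler_saw = type(exc).__name__
```

Catching HUGIMLError covers every library error, while narrower handlers can target one branch of the tree.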

exception hugiml.exceptions.HUGIMLError[source]

Bases: Exception

Base exception for all HUG-IML errors.

exception hugiml.exceptions.HUGIMLFitError[source]

Bases: HUGIMLError

Raised when fit() fails for any reason.

exception hugiml.exceptions.HUGIMLMiningError[source]

Bases: HUGIMLFitError

Raised when pattern mining fails or produces zero patterns.

exception hugiml.exceptions.HUGIMLTimeoutError[source]

Bases: HUGIMLFitError

Raised when fit exceeds max_fit_seconds.

exception hugiml.exceptions.HUGIMLMemoryError[source]

Bases: HUGIMLFitError, MemoryError

Raised when fit cannot safely allocate required memory.

exception hugiml.exceptions.HUGIMLValidationError[source]

Bases: HUGIMLError, ValueError

Raised when input data or configuration is invalid.

Inherits from ValueError for backward compatibility with existing except-ValueError handlers.

exception hugiml.exceptions.HUGIMLSchemaError[source]

Bases: HUGIMLValidationError

Raised when predict-time data does not match training schema (wrong columns, wrong order, wrong count).

exception hugiml.exceptions.HUGIMLParamError[source]

Bases: HUGIMLValidationError, TypeError

Raised when hyperparameters have wrong types or values.

Inherits from TypeError directly and from ValueError through HUGIMLValidationError, for backward compatibility.

exception hugiml.exceptions.HUGIMLSerializationError[source]

Bases: HUGIMLError

Raised when model save/load fails.

exception hugiml.exceptions.HUGIMLVersionError[source]

Bases: HUGIMLSerializationError

Raised when loading a model whose schema version is incompatible.

exception hugiml.exceptions.HUGIMLPredictionError[source]

Bases: HUGIMLError, RuntimeError

Raised when predict/transform fails on a fitted model.

Inherits from RuntimeError for backward compatibility.

exception hugiml.exceptions.HUGIMLWarning[source]

Bases: UserWarning

Base warning for all HUG-IML warnings.

exception hugiml.exceptions.HUGIMLConvergenceWarning[source]

Bases: HUGIMLWarning

Issued when the model converges to a minimal number of patterns (e.g. due to very restrictive G or low-information data).

exception hugiml.exceptions.HUGIMLDtypeDriftWarning[source]

Bases: HUGIMLWarning

Issued when a categorical column is passed as numeric at predict time.

exception hugiml.exceptions.HUGIMLRangeWarning[source]

Bases: HUGIMLWarning

Issued when feature values fall far outside the training range.

exception hugiml.exceptions.HUGIMLDegradedWarning[source]

Bases: HUGIMLWarning

Issued when the model entered degraded mode due to timeout or memory pressure during fit().

exception hugiml.exceptions.HUGIMLDeprecationWarning[source]

Bases: HUGIMLWarning, DeprecationWarning

Issued for deprecated API usage.
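
Since every warning derives from HUGIMLWarning, the standard warnings filters can act on the whole family or on one category. A minimal sketch, using local stand-in classes and a hypothetical noisy_fit function rather than the real library:

```python
import warnings


# Local stand-ins mirroring the documented bases; not imported from hugiml.
class HUGIMLWarning(UserWarning): ...
class HUGIMLDegradedWarning(HUGIMLWarning): ...
class HUGIMLRangeWarning(HUGIMLWarning): ...


def noisy_fit():
    warnings.warn("feature values outside training range", HUGIMLRangeWarning)
    warnings.warn("degraded mode: timeout during fit()", HUGIMLDegradedWarning)
    return "fitted"


with warnings.catch_warnings():
    # Filters added later are checked first, so the specific rule wins.
    warnings.simplefilter("ignore", HUGIMLWarning)         # silence the family
    warnings.simplefilter("error", HUGIMLDegradedWarning)  # but fail on degradation
    try:
        outcome = noisy_fit()
    except HUGIMLDegradedWarning as exc:
        outcome = f"rejected: {exc}"
```

A pipeline that must not ship a degraded model could use this pattern to turn the warning into a hard failure.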

Dashboard modules

Streamlit app for HUGIML Governance Studio.

Run directly:: python -m streamlit run src/hugiml/dashboard/app.py

Model training/scoring helpers for the dashboard.

class hugiml.dashboard.runner.SimpleTuneResult(best_estimator_, best_params_, best_score_, results_, cv_splits_=None, fast_path_used_=False, status_='ok', error_=None)[source]

Bases: object

Parameters:

best_estimator_ (Any | None)
best_params_ (dict)
best_score_ (float | None)
results_ (list[dict])
cv_splits_ (list | None)
fast_path_used_ (bool)
status_ (str)
error_ (str | None)

class hugiml.dashboard.runner.PrunedRepresentationResult(label, estimator, score, rows, kept_columns, removed_columns, family)[source]

Bases: object

Parameters:

label (str)
estimator (Any)
score (float | None)
rows (list[dict])
kept_columns (list[str])
removed_columns (list[str])
family (str)

hugiml.dashboard.runner.fit_hugiml_config(X, y, params=None, cv=5, scoring='roc_auc', random_state=2026, *, raise_on_error=False)[source]

Fit one explicit HUGIML configuration and return a tune-like result.

Candidate configurations can legitimately mine zero patterns for a given dataset/G/L/topK combination. In that case HUGIML may raise “patterns list is empty — nothing to build”. The dashboard should display that as a failed candidate run, not crash.

Parameters:

X (DataFrame)
params (dict | None)
cv (int)
scoring (str)
random_state (int)
raise_on_error (bool)
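
The "failed candidate, not a crash" behaviour can be sketched with a reduced result type. TuneLikeResult, run_candidate, and the "failed" status string are hypothetical; only the field names status_ and error_ and the raise_on_error flag come from the signatures documented here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TuneLikeResult:
    """Hypothetical stand-in carrying only score, status, and error fields."""
    best_score_: Optional[float] = None
    status_: str = "ok"
    error_: Optional[str] = None


def run_candidate(fit_fn, *, raise_on_error=False):
    """Run one candidate fit; turn failures into a result unless asked to raise."""
    try:
        return TuneLikeResult(best_score_=fit_fn())
    except Exception as exc:
        if raise_on_error:
            raise
        # "failed" is an assumed status value, chosen for illustration.
        return TuneLikeResult(status_="failed", error_=str(exc))


def empty_mining():
    raise ValueError("patterns list is empty")


ok = run_candidate(lambda: 0.91)
failed = run_candidate(empty_mining)
```

With raise_on_error=True the same failure propagates, which is more convenient in tests and scripts than in a UI.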

hugiml.dashboard.runner.fit_feature_pruned_hugiml(X, y, base_model=None, remove_features=None, params=None, cv=5, scoring='roc_auc', random_state=2026)[source]

Remove selected original input features, rerun HUGIML, and return the result together with the pruned frame.

Parameters:

X (DataFrame)
base_model (Any | None)
remove_features (list[str] | None)
params (dict | None)
cv (int)
scoring (str)
random_state (int)

hugiml.dashboard.runner.fit_representation_pruned_downstream(base_model, X, y, remove_columns, family, cv=5, scoring='roc_auc', random_state=2026)[source]

Remove selected direct LR representation columns for non-RPTE models.

RPTE models containing leaf indicators are rejected because a plain LR refit on the HUGIML source matrix would discard the fitted leaf block. The RPTE final LR may also contain direct source terms, but those terms cannot be isolated with this generic source-matrix refit.

Parameters:

base_model (Any)
X (DataFrame)
remove_columns (list[str])
family (str)
cv (int)
scoring (str)
random_state (int)

Return type:

tuple[PrunedRepresentationResult, DataFrame]

Experiment Workbench UI for HUGIML Studio.

The workbench is intentionally separated from Governance. It is the place to configure HUGIML and optional comparison models, run evaluations, compare metrics/plots, and promote a fitted HUGIML run into the governance workspace.

class hugiml.dashboard.workbench.ModelSpec(name, category, description, optional_dependency=None)[source]

Bases: object

Parameters:

name (str)
category (str)
description (str)
optional_dependency (str | None)

class hugiml.dashboard.workbench.RuleFitClassifierAdapter(tree_size=4, max_rules=100, random_state=2026)[source]

Bases: object

RuleFit classifier adapter with explicit implementation labelling.

Backends, in order of preference:

Preferred: imodels.RuleFitClassifier
Compatibility: legacy standalone rulefit.RuleFit, when present
Fallback: sklearn-generated tree-leaf rules + logistic regression

The fallback is intentionally labelled as “RuleFit-style fallback” in the UI so users do not confuse it with the official imodels implementation.

Parameters:

tree_size (int)
max_rules (int)
random_state (int)

Display helpers for Streamlit dashboard tables.

hugiml.dashboard.display.dataframe_for_display(df, stringify_mixed_object=True)[source]

Return a Streamlit/Arrow-safe dataframe for display.

Streamlit serializes dataframes through Arrow. Object columns that contain mixed Python types, such as integers/floats plus strings like original_plus_patterns, can trigger ArrowInvalid because Arrow attempts to coerce the column to a numeric type. Audit/config tables often have this shape by design.

This helper preserves numeric columns when they are truly numeric, and stringifies only object columns that contain mixed Python scalar types. Cells containing numpy arrays are converted to their list repr, so pd.isna() is never applied to a multi-element array (its element-wise result raises ValueError when evaluated as a single boolean).

Parameters:

df (Any)
stringify_mixed_object (bool)

Return type:

DataFrame
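
The "stringify only mixed columns" rule can be illustrated on plain Python lists, without pandas or Arrow. The helper stringify_if_mixed is hypothetical and much simpler than the real function, which works on DataFrames and also handles array-valued cells.

```python
def stringify_if_mixed(values):
    """Return values unchanged when one scalar kind is present, else as strings."""
    present = [v for v in values if v is not None]
    kinds = {
        "number" if isinstance(v, (int, float)) and not isinstance(v, bool)
        else type(v).__name__
        for v in present
    }
    if len(kinds) <= 1:
        return list(values)  # homogeneous column: keep native values
    return [None if v is None else str(v) for v in values]


numeric_col = [1, 2.5, None]
config_col = [5, 0.1, "original_plus_patterns"]

kept = stringify_if_mixed(numeric_col)   # ints and floats count as one kind
fixed = stringify_if_mixed(config_col)   # numbers plus a string: stringified
```

Treating ints and floats as one kind is what keeps truly numeric columns numeric.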

High-value governance evidence panels for the HUGIML dashboard.

These panels surface model evidence that is already produced by fitted HUGIML models, without adding new model APIs. Every renderer is deliberately guarded so older model objects or non-HUGIML baselines degrade to an explanatory info banner instead of failing the dashboard.

hugiml.dashboard.components.governance_evidence.adaptive_binning_table(model)[source]

Return long-form adaptive-binning evidence from per-feature B and IG scores.

Parameters:: model (Any)
Return type:: DataFrame

hugiml.dashboard.components.governance_evidence.feature_shape_frame(model, feature, X=None)[source]

Build an EBM-like per-feature shape table from singleton HUG coefficients.

The y-value is the downstream LR coefficient for a singleton pattern bin, i.e. a log-odds contribution. Compound patterns are excluded because their coefficient cannot be assigned to one feature unambiguously. Bins that exist in _bin_edges_ but have no mined singleton pattern are retained with a zero coefficient.

Parameters:

model (Any)
feature (str)
X (Any)

Return type:

DataFrame
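
To make the log-odds reading concrete, the sketch below builds a tiny shape table by hand. The bin labels and coefficient values are invented for illustration and do not come from a fitted model; the zero fill for bins without a singleton pattern follows the description above.

```python
import math

# Illustrative values only; not taken from any fitted model.
bins = ["[0, 38500)", "[38500, 55000)", "[55000, inf)"]
singleton_coef = {"[38500, 55000)": 0.40, "[55000, inf)": -0.25}

# Bins without a mined singleton pattern are kept with a zero coefficient.
shape = [(b, singleton_coef.get(b, 0.0)) for b in bins]


def odds_ratio(coef):
    """A log-odds contribution c multiplies the odds by exp(c)."""
    return math.exp(coef)


ratios = {b: round(odds_ratio(c), 3) for b, c in shape}
```

A coefficient of 0.40 therefore multiplies the odds by roughly 1.49, while a zero coefficient leaves them unchanged.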

hugiml.dashboard.components.governance_evidence.plot_feature_shape(model, X, feature)[source]

Plot an EBM-like HUGIML feature shape using singleton log-odds coefficients.

Parameters:

model (Any)
X (Any)
feature (str)

Return type:

Any | None

hugiml.dashboard.components.governance_evidence.render_feature_effect_profiles(model=None, X=None, *args, **kwargs)[source]

Render EBM-style HUGIML 1-D shape profiles in Representation Audit.

This uses singleton pattern coefficients from feature_importances(). A singleton pattern such as income=[38500, 55000) is a binary item; its downstream logistic-regression coefficient is the marginal log-odds contribution of being in that bin, which is the HUGIML analogue of an EBM shape value. Compound patterns are excluded by design.

Parameters:

model (Any)
X (Any)
args (Any)
kwargs (Any)

Return type:

DataFrame

hugiml.dashboard.components.governance_evidence.render_adaptive_binning_evidence(model=None, X=None, *args, **kwargs)[source]

Render bin-profile/IG evidence for adaptive binning decisions.

Parameters:

model (Any)
X (Any)
args (Any)
kwargs (Any)

Return type:

DataFrame

hugiml.dashboard.components.governance_evidence.survivor_led_patterns_frame(model)[source]

Return mined-pattern audit rows admitted through relaxed interaction evidence.

Parameters:: model (Any)
Return type:: DataFrame

hugiml.dashboard.components.governance_evidence.render_survivor_led_pattern_audit(model=None, *args, **kwargs)[source]

Render audit evidence for patterns surfaced by interaction-relaxed mining.

Parameters:

model (Any)
args (Any)
kwargs (Any)

Return type:

DataFrame

hugiml.dashboard.components.governance_evidence.rpte_rule_evidence_frame(model)[source]

Return the complete fitted RPTE final-LR representation.

The frame contains RPTE leaf indicators and every HUGIML source column carried directly into the final LR. Zero-valued direct coefficients are retained so the audit remains aligned with the fitted coefficient vector.

Parameters:: model (Any)
Return type:: DataFrame

hugiml.dashboard.components.governance_evidence.render_rpte_rule_evidence(model=None, *args, **kwargs)[source]

Render leaf-tree and direct-source evidence separately.

Parameters:

model (Any)
args (Any)
kwargs (Any)

Return type:

DataFrame

hugiml.dashboard.components.governance_evidence.render_augmented_pair_traceability(model=None, *args, **kwargs)[source]

Render audit traceability for augmented pair features.

Parameters:

model (Any)
args (Any)
kwargs (Any)

Return type:

DataFrame

HUGIML Governance Studio Dash interface.

The application keeps workspace selection and page navigation in a pinned header. Workbench provides equal Setup and Results views. Dataset selection, uploads, column roles, training controls, and experiment configuration live only in Setup. Governance exposes its audit pages through the same pinned navigation region.

HUGIML Dashboard launcher for Dash or the lightweight Streamlit interface.

RPTE representation and governance helpers.

The fitted HUGIML/RPTE pipeline has distinct stages: raw inputs -> HUGIML source columns -> RPTE tree splits and leaf indicators -> final LR terms. In the current representation, source columns used by accepted RPTE splits are represented through leaf indicators, while source columns not used by any accepted split are carried directly into the final LR.

hugiml.dashboard.components.rpte_governance.rpte_is_active(model)[source]

Return True when the fitted model exposes an RPTE representation.

Parameters:: model (Any)
Return type:: bool

hugiml.dashboard.components.rpte_governance.rpte_has_tree_representation(model)[source]

Return True when fitted RPTE leaf indicators are part of final LR.

Parameters:: model (Any)
Return type:: bool

hugiml.dashboard.components.rpte_governance.rpte_source_feature_names(model)[source]

Names of HUGIML source columns supplied to the RPTE estimator.

Parameters:: model (Any)
Return type:: list[str]

hugiml.dashboard.components.rpte_governance.rpte_direct_source_terms_frame(model, include_zero=True)[source]

Direct final-LR source terms not selected in accepted RPTE splits.

Fitted estimator attributes retain zero-coefficient terms; unified explanation rows provide structured metadata for non-zero terms.

Parameters:

model (Any)
include_zero (bool)

Return type:

DataFrame

hugiml.dashboard.components.rpte_governance.rpte_split_usage_frame(model)[source]

Summarise source and RPTE-synthesized columns used in accepted splits.

Parameters:: model (Any)
Return type:: DataFrame

hugiml.dashboard.components.rpte_governance.rpte_source_inventory_frame(model)[source]

Inventory every HUGIML source column and its fitted RPTE role.

Parameters:: model (Any)
Return type:: DataFrame

hugiml.dashboard.components.rpte_governance.rpte_raw_input_lineage_frame(model, X=None)[source]

Trace raw inputs into tree-based and direct final-LR terms.

Parameters:

model (Any)
X (DataFrame | None)

Return type:

DataFrame

hugiml.dashboard.components.rpte_governance.rpte_representation_flow_frame(model, X=None)[source]

Fitted representation contract for Governance UI.

Parameters:

model (Any)
X (DataFrame | None)

Return type:

DataFrame

hugiml.dashboard.components.rpte_governance.rpte_model_comparison_row(label, model, score, X=None)[source]

Separated RPTE source, leaf, and direct-term counts for comparisons.

Parameters:

label (str)
model (Any)
score (float | None)
X (DataFrame | None)

Return type:

dict[str, Any]

LLM assistant add-on

Strict action schemas for the optional HUGIML natural-language interface.

The LLM add-on intentionally exposes a narrow, deterministic action surface. Models may propose these actions as JSON, but Python validation decides what can run. No arbitrary code, shell execution, source editing, or package modification action exists in this schema.

class hugiml.llm.schemas.DatasetInfo(name, source, path=None, task_type='binary_classification', target=None, rows=None, features=None, description='', origin_detail='')[source]

Bases: object

A compact dataset registry entry.

Parameters:

name (str)
source (str)
path (str | None)
task_type (str)
target (str | None)
rows (int | None)
features (int | None)
description (str)
origin_detail (str)

class hugiml.llm.schemas.ActionRequest(action, dataset=None, target=None, metric=None, strategy='balanced', params=<factory>, output_format='table', session_id=None, question=None, pattern_indices=<factory>, keyword=None, min_support=None, reason=None, limit=10)[source]

Bases: object

Validated action request proposed by an LLM or by deterministic routing.

Parameters:

action (str)
dataset (str | None)
target (str | None)
metric (str | None)
strategy (str)
params (dict[str, Any])
output_format (str)
session_id (str | None)
question (str | None)
pattern_indices (list[int])
keyword (str | None)
min_support (float | None)
reason (str | None)
limit (int)

classmethod from_dict(data)[source]

Build an ActionRequest from a dict without validating yet.

Validation is intentionally deferred to the caller (normally HUGIMLActionOrchestrator.execute) so that missing-but-recoverable fields, such as a dataset that can be inferred from an active session, can be filled in before validation runs.

Parameters:: data (dict[str, Any])
Return type:: ActionRequest
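
The build-first, validate-later flow can be sketched with a reduced stand-in. The Request class, its validate method, the action names, the dataset name, and the execute helper are all hypothetical; only the ordering (construct, fill recoverable fields, then validate) reflects the description above.

```python
from dataclasses import dataclass
from typing import Optional

ALLOWED_ACTIONS = {"build_model", "explain"}  # hypothetical action names


@dataclass
class Request:
    """Reduced stand-in for ActionRequest with two fields."""
    action: str
    dataset: Optional[str] = None

    @classmethod
    def from_dict(cls, data):
        # Construct only; nothing is validated at this point.
        return cls(action=data.get("action", ""), dataset=data.get("dataset"))

    def validate(self):
        if self.action not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown action: {self.action!r}")
        if self.dataset is None:
            raise ValueError("dataset is required")


def execute(data, active_session_dataset=None):
    request = Request.from_dict(data)
    if request.dataset is None:   # recover the missing field first
        request.dataset = active_session_dataset
    request.validate()            # then validate the completed request
    return request


recovered = execute({"action": "explain"}, active_session_dataset="adult")
```

Had from_dict validated eagerly, the request without a dataset would have been rejected before the session could supply one.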

class hugiml.llm.schemas.ActionResult(ok, action, message, data=<factory>, tables=<factory>, artifacts=<factory>, refusal_reason=None)[source]

Bases: object

Structured result returned by the deterministic orchestrator.

Parameters:

ok (bool)
action (str)
message (str)
data (dict[str, Any])
tables (dict[str, list[dict[str, Any]]])
artifacts (dict[str, str])
refusal_reason (str | None)

Dataset discovery and loading for the optional HUGIML NLP interface.

The registry keeps dataset origins logically separate while presenting a merged view to the user:

llm_builtin: curated small first-run datasets under LLM/datasets/builtin
user: files accepted into LLM/datasets/user with explicit target metadata
benchmark: public/package-backed benchmark datasets, loaded through the existing experiments/benchmark/benchmark_dashboard.py interface when the source checkout is available

class hugiml.llm.dataset_registry.DatasetRegistry(repo_root=None)[source]

Bases: object

Merged dataset registry for LLM-driven HUGIML workflows.

Parameters:: repo_root (str | Path | None)

list_datasets(include_profiles=True, *, include_benchmarks=True)[source]

Return discovered datasets.

include_benchmarks defaults to True for backward compatibility and for command-line discovery. UI surfaces can set it to False for a calmer first-run experience that shows only curated built-in datasets and registered user uploads.

Parameters:

include_profiles (bool)
include_benchmarks (bool)

Return type:

list[DatasetInfo]

register_user_dataset(source_path, *, target_column, dataset_name=None, overwrite=False)[source]

Accept a user dataset after an explicit target-column selection.

This method is intended for the optional chat UI. The uploaded file is not exposed through list_datasets until a target has been selected and persisted in the .target.json sidecar.

Parameters:

source_path (str | Path)
target_column (str)
dataset_name (str | None)
overwrite (bool)

Return type:

DatasetInfo

Runtime profile and local-model detection for the optional NLP interface.

class hugiml.llm.runtime.MemoryInfo(total_gb, available_gb, source)[source]

Bases: object

Parameters:

total_gb (float | None)
available_gb (float | None)
source (str)

class hugiml.llm.runtime.ModelProfile(name, recommended_model, max_context_tokens, planning_mode, branches, lookahead, description)[source]

Bases: object

Parameters:

name (str)
recommended_model (str)
max_context_tokens (int)
planning_mode (str)
branches (int)
lookahead (int)
description (str)

class hugiml.llm.runtime.ModelOption(profile, model, label, min_available_gb, max_context_tokens, max_output_tokens, usage, notes)[source]

Bases: object

One visible local-model choice in the workbench model picker.

min_available_gb is intentionally conservative. It is a UI/runtime guardrail, not a claim about the model file size: browser, OS, Ollama KV cache, dataset profiling, and HUGIML fitting all share the same free RAM.

Parameters:

profile (str)
model (str)
label (str)
min_available_gb (float)
max_context_tokens (int)
max_output_tokens (int)
usage (str)
notes (str)

hugiml.llm.runtime.get_profiles(repo_root=None)[source]

Tier profiles, preferring models.yaml over the hardcoded defaults.

Parameters:: repo_root (str | Path | None)
Return type:: dict[str, ModelProfile]

hugiml.llm.runtime.get_model_catalog(repo_root=None)[source]

Visible local-model catalog used by CLI/UI.

The list is fixed by config and independent of which models are currently pulled in Ollama. UI callers should display every entry, then disable entries that are not feasible for the current machine or not installed.

Parameters:: repo_root (str | Path | None)
Return type:: list[ModelOption]

hugiml.llm.runtime.estimated_model_size_b(model_name)[source]

Best-effort parameter-size estimate parsed from common Ollama tags.

Examples: llama3.2:3b -> 3.0, gemma3:270m -> 0.27. The value is used for RAM/UI guidance; explicitly supported lightweight models such as qwen3:1.7b, gemma3:1b, and llama3.2:1b are allowed even though they are below 3B.

Parameters:: model_name (str)
Return type:: float | None
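
One way such a best-effort parse could work is shown below. The regular expression and the function name estimate_size_b are assumptions, not the library's code; the sketch only reproduces the documented examples and returns None when the tag carries no size.

```python
import re

# A number followed by "b" (billions) or "m" (millions), e.g. "3b", "1.7b", "270m".
_SIZE_RE = re.compile(r"(\d+(?:\.\d+)?)([bm])\b", re.IGNORECASE)


def estimate_size_b(model_name):
    """Parse a parameter count in billions from the tag after the colon."""
    _, _, tag = model_name.partition(":")
    match = _SIZE_RE.search(tag)
    if match is None:
        return None  # tags such as "latest" carry no size information
    value, unit = float(match.group(1)), match.group(2).lower()
    return value if unit == "b" else value / 1000.0


sizes = {name: estimate_size_b(name)
         for name in ("llama3.2:3b", "gemma3:270m", "qwen3:1.7b", "mistral:latest")}
```

Only the part after the colon is inspected, so version digits in the model family name (the "3.2" in llama3.2) are not mistaken for a size.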

hugiml.llm.runtime.is_lightweight_supported_model(model_name)[source]

Return True for explicitly supported sub-3B local LLM choices.

Parameters:: model_name (str)
Return type:: bool

hugiml.llm.runtime.is_below_minimum_llm_model(model_name)[source]

Return True for unsupported tiny models.

The configured tiny models are intentionally allowed for default/light/fallback modes, but arbitrary sub-3B manual models remain disabled because they often produce brittle planning JSON and generic explanations.

Parameters:: model_name (str)
Return type:: bool

hugiml.llm.runtime.get_memory_info()[source]

Best-effort total/available RAM detection without mandatory psutil.

Return type:: MemoryInfo

hugiml.llm.runtime.recommend_profile(memory=None, *, repo_root=None)[source]

Recommend a profile from available RAM, not merely total RAM.

Parameters:

memory (MemoryInfo | None)
repo_root (str | Path | None)

Return type:

ModelProfile

hugiml.llm.runtime.model_availability(option, memory=None, pulled_models=None, *, ollama_ok=True)[source]

Return UI-friendly selectability information for one model option.

Parameters:

option (ModelOption)
memory (MemoryInfo | None)
pulled_models (list[str] | set[str] | tuple[str, ...] | None)
ollama_ok (bool)

Return type:

dict[str, Any]

hugiml.llm.runtime.check_ollama(base_url=None, timeout=2.0, *, use_cache=True)[source]

Return Ollama server/model status. Never starts or downloads models.

Results are cached per base URL for _OLLAMA_STATUS_TTL_SECONDS to avoid redundant round trips when multiple callers check status within the same user-facing turn. Pass use_cache=False to force a fresh check (e.g. immediately after the user starts Ollama or pulls a model).

Parameters:

base_url (str | None)
timeout (float)
use_cache (bool)

Return type:

dict[str, Any]
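
The per-URL caching with a forced-refresh switch can be sketched as below. The TTL value, the cache layout, the cached_status helper, and the shape of the status dict are assumptions; the probe is a fake so that the example never contacts a server.

```python
import time

_TTL_SECONDS = 5.0  # hypothetical value; the real constant is _OLLAMA_STATUS_TTL_SECONDS
_cache = {}         # base_url -> (timestamp, status)


def cached_status(base_url, probe, *, use_cache=True, now=time.monotonic):
    """Return a recent status for base_url, probing only when stale or forced."""
    entry = _cache.get(base_url)
    if use_cache and entry is not None and now() - entry[0] < _TTL_SECONDS:
        return entry[1]
    status = probe(base_url)
    _cache[base_url] = (now(), status)
    return status


calls = []


def fake_probe(url):
    calls.append(url)
    return {"ok": True, "models": []}  # assumed status shape


url = "http://localhost:11434"  # Ollama's default local address
cached_status(url, fake_probe)
cached_status(url, fake_probe)                   # served from the cache
cached_status(url, fake_probe, use_cache=False)  # forced refresh
```

Three lookups result in two probes: the second call is answered from the cache and the third bypasses it.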

Planning helpers for the optional HUGIML natural-language interface.

hugiml.llm.planner.plan_request(user_text, *, model=None, context=None, prefer_llm=True, repo_root=None)[source]

Plan a user request into a validated, schema-safe action request.

Parameters:

user_text (str)
model (str | None)
context (dict[str, Any] | None)
prefer_llm (bool)
repo_root (str | Path | None)

Return type:

ActionRequest | ActionResult

Deterministic executor for validated HUGIML NLP actions.

The orchestrator is the safety boundary. LLMs may propose ActionRequest JSON, but only this module loads data, fits models, tunes hyperparameters, produces tables, prunes patterns, and writes governance artifacts.
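
A safety boundary of this kind is commonly built as an allow-list dispatch: a proposal runs only if its action maps to a registered handler. The sketch below is a generic illustration with hypothetical action names and plain dicts in place of ActionRequest and ActionResult; it is not the orchestrator's actual code.

```python
def _list_datasets(params):
    return {"ok": True, "message": "2 datasets", "refusal_reason": None}


# Only actions registered here can run; the names are hypothetical.
HANDLERS = {"list_datasets": _list_datasets}


def execute(proposal):
    """Run a proposed action only if it maps to a registered handler."""
    action = proposal.get("action")
    handler = HANDLERS.get(action)
    if handler is None:
        return {"ok": False, "message": "refused",
                "refusal_reason": f"action {action!r} is not in the schema"}
    return handler(proposal.get("params", {}))


allowed = execute({"action": "list_datasets"})
refused = execute({"action": "run_shell", "params": {"cmd": "rm -rf /"}})
```

Because unknown actions have no handler, a proposal for shell execution is refused rather than interpreted.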

class hugiml.llm.orchestrator.ModelSession(session_id, dataset, target, info, X_train, X_test, y_train, y_test, model, metrics, created_at=<factory>, artifacts=<factory>)[source]

Bases: object

Parameters:

session_id (str)
dataset (str)
target (str)
info (dict[str, Any])
X_train (DataFrame)
X_test (DataFrame)
y_train (ndarray)
y_test (ndarray)
model (Any)
metrics (dict[str, Any])
created_at (float)
artifacts (dict[str, str])

class hugiml.llm.orchestrator.HUGIMLActionOrchestrator(repo_root=None, session_dir=None, random_state=42)[source]

Bases: object

Execute validated natural-language actions using existing HUGIML APIs.

Parameters:

repo_root (str | Path | None)
session_dir (str | Path | None)
random_state (int)

hugiml.llm.orchestrator.generate_qna_html(output_path, *, title, dataset_description, build_result, explain_result, prune_result, governance_result, prediction_result=None)[source]

Create a standalone, visual Q&A HTML page for an end-to-end use case.

Parameters:

output_path (str | Path)
title (str)
dataset_description (dict[str, Any])
build_result (ActionResult)
explain_result (ActionResult)
prune_result (ActionResult)
governance_result (ActionResult)
prediction_result (ActionResult | None)

Return type:

str

Command-line interface for the optional HUGIML NLP add-on.

After installation, the intended entry point is:

hugiml-llm

which launches the Streamlit UI. Subcommands provide status, dataset listing, one-shot requests, terminal chat, and demo HTML generation. Optional UI/LLM imports are lazy so importing the base package remains lightweight.