API reference
This page documents the public Python API exposed by hugiml-core. The
manual sections in the user guide explain how these APIs fit together in a
modeling workflow; the reference below is generated from the source docstrings.
Core estimator
HUGIMLClassifier — C++ accelerated, scikit-learn compatible classifier.
HUGIMLClassifier is the primary public class name.
HUGIMLClassifierNative remains as a backward-compatible alias.
Implements the High Utility Gain Interpretable Machine Learning (HUG-IML) algorithm from:
Krishnamoorthy, S. (2024). Interpretable Classifier Models for Decision Support Using High Utility Gain Patterns. IEEE Access, 12, 126088–126107. DOI: 10.1109/ACCESS.2024.3455563
Computationally intensive stages (discretisation, transaction construction, pattern mining, matrix assembly) run at native speed via a compiled C++ extension with optional OpenMP parallelism. The Python layer handles DataFrame ingestion, column-type detection, downstream estimation, explanation methods, monitoring, and drift detection.
Architecture
- C++ extension (_hugiml_core):
Discretisation, transaction construction, top-K HUI pattern mining with information-gain filtering, bitmap-accelerated matrix assembly, OpenMP parallel pattern matching.
- Python layer:
Column-type detection (prepareXy), NaN/Inf imputation, downstream sklearn estimator (LogisticRegression default), explanation methods (get_hug_features, get_pattern_info, feature_importances), versioned model serialisation, prediction monitoring, multi-method drift detection, latency SLA enforcement, and graceful degradation under memory pressure.
Quick start
Two usage paths are supported:
Path A — prepareXy (recommended when the full dataset is available upfront):
from sklearn.model_selection import train_test_split
from hugiml import HUGIMLClassifier
clf = HUGIMLClassifier()
# Detect column types on the full dataset, before any split.
X, y = clf.prepareXy(X_df, y_series)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y)
clf.fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)
print(clf.model_summary())
print(clf.feature_importances())
Path B — allCols + origColumns (cross-validation loops):
clf = HUGIMLClassifier(
    allCols=[int_cols, float_cols, cat_cols],
    origColumns=X_df.columns.tolist(),
)
clf.fit(X_train, y_train)
Monitoring and drift detection:
clf.enable_monitoring()
clf.predict_proba(X_new)
print(clf.monitor.report())
drift = clf.detect_drift(X_new)
print(drift)
Versioned serialisation:
clf.save_model("model.hugiml")
clf2 = HUGIMLClassifier.load_model("model.hugiml")
- hugiml.classifier.HUGIMLClassifier
alias of HUGIMLClassifierNative
- class hugiml.classifier.HUGIMLClassifierNative(allCols=None, origColumns=None, B=8, L=1, G=0.001, topK=30, base_estimator=None, n_jobs=1, max_predict_ms=None, max_fit_seconds=None, verbose=False, adaptive_binning=False, b_candidates=None, min_marginal_gain_ratio=0.02, feature_mode='patterns_only', use_hotpath=True, augmented_pair_transforms=True, augmented_pair_max_features=10, topk_budget_strict=False, dense_downstream_max_width=200, execution_mode='audit')[source]
Bases: TransformerMixin, ClassifierMixin, BaseEstimator
HUG-IML interpretable classifier (C++ accelerated, scikit-learn compatible).
Extracts High Utility Gain (HUG) patterns from labelled tabular data, transforms the input into a binary pattern-presence matrix, and fits an interpretable downstream classifier. The mined patterns are human-readable and serve as the primary source of model explanations.
- Parameters:
allCols (list of 3 lists, optional) – [int_col_names, float_col_names, cat_col_names]. Must be paired with origColumns.
origColumns (list of str, optional) – Ordered column names matching the columns of X passed to fit/predict.
B (int, default 8) – Number of quantile bins per numerical feature. Use -1 for supervised auto-selection (maximises IG over [2, 20]).
L (int, default 1) – Maximum HUG pattern length. 1 = singletons; 2 = pairs; -1 = unlimited.
G (float, default 1e-3) – Minimum information-gain threshold.
topK (int, default 30) – Maximum number of patterns to retain. -1 computes automatically.
base_estimator (sklearn estimator, optional) – Downstream classifier trained on the binary pattern matrix. Defaults to LogisticRegression.
n_jobs (int, default 1) – Number of OpenMP threads. -1 uses all available cores.
max_predict_ms (float or None, default None) – Prediction latency budget in milliseconds.
max_fit_seconds (float or None, default None) – Wall-clock budget for the pattern-mining stage of fit(). Transaction preparation and downstream model fitting (e.g. LogisticRegression) are not bounded, so total fit() time may exceed this value. When the budget is exhausted mid-mine, graceful degradation produces a smaller pattern set; if even the minimal fallback cannot finish in time, HUGIMLTimeoutError is raised.
verbose (bool, default False) – Emit INFO-level log messages during fit.
Attributes (available after fit):
classes (ndarray) – Unique class labels.
n_features_in (int) – Number of input features.
feature_names_in (list or None) – Column names from training data.
cat_cols_mask (ndarray[bool]) – True for categorical columns.
is_int_mask (ndarray[bool]) – True for integer columns.
td (_TransactionDataWrapper) – Discretisation artefacts.
patterns (list) – Mined HUG patterns.
x_train_hup (csr_matrix) – Binary training pattern matrix.
model (Pipeline) – Fitted downstream estimator.
fit_metadata (FitMetadata) – Timings, memory, pattern stats.
monitor (PredictionMonitor or None) – Prediction statistics.
adaptive_binning (bool)
b_candidates (list | None)
min_marginal_gain_ratio (float)
feature_mode (str)
use_hotpath (bool)
augmented_pair_transforms (bool)
augmented_pair_max_features (int)
topk_budget_strict (bool)
dense_downstream_max_width (int)
execution_mode (str)
- classmethod from_preset(name, **overrides)[source]
Create a classifier from a named configuration preset.
- Parameters:
name ({'quick', 'balanced', 'thorough'}) – Preset name:
quick: B=5, L=1, G=1e-2, topK=50
balanced: B=7, L=1, G=5e-3, topK=-1
thorough: B=-1, L=2, G=1e-4, topK=-1
overrides (Any)
- Return type:
- classmethod default_param_grid()[source]
Return the default validation grid for compact HUGIML tuning.
The grid uses adaptive binning (B=-1), searches L in {1, 2}, searches feature_mode in {'patterns_only', 'original_plus_patterns'}, keeps G fixed at 1e-3, and searches topK in {30, 50, 100}. For L > 1 and augmented_pair_transforms=True, native augmented-pair transforms are created internally from the top-10 native-IG numeric features and capped to the same topK budget by transform IG.
- Return type:
dict[str, list]
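To get a feel for the size of this grid, the sketch below rebuilds it by hand from the description above and expands it into individual candidates. The dict literal is transcribed from the prose and is an assumption, not the value actually returned by default_param_grid(); expand_grid is a hypothetical helper written for this illustration.

```python
from itertools import product

# Hand-written copy of the grid described above (an assumption, not the
# library's actual return value).
grid = {
    "B": [-1],
    "L": [1, 2],
    "feature_mode": ["patterns_only", "original_plus_patterns"],
    "G": [1e-3],
    "topK": [30, 50, 100],
}


def expand_grid(param_grid):
    """Return one dict per parameter combination."""
    keys = sorted(param_grid)
    return [
        dict(zip(keys, values))
        for values in product(*(param_grid[k] for k in keys))
    ]


candidates = expand_grid(grid)
print(len(candidates))  # 2 * 2 * 3 = 12
```

In real code, prefer calling HUGIMLClassifier.default_param_grid() so the grid stays in sync with the installed version.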
- get_params(deep=True)[source]
Return constructor parameters as a dict (sklearn protocol).
- Parameters:
deep (bool)
- Return type:
dict
- set_params(**params)[source]
Set constructor parameters in-place and return self (sklearn protocol).
- Parameters:
params (Any)
- Return type:
- save_model(path)[source]
Persist the fitted model to a binary file with schema versioning.
- Parameters:
path (str or Path)
- Raises:
- Return type:
None
- classmethod load_model(path)[source]
Load a model previously saved with save_model().
- Parameters:
path (str or Path)
- Return type:
- Raises:
- prepareXy(X, y)[source]
Detect column types and encode the target variable.
Call on the full dataset before any train/test split. Records which columns are integer, float, or categorical, and performs basic label validation.
- Parameters:
X (pd.DataFrame)
y (pd.Series or array-like)
- Returns:
X (pd.DataFrame (copy with string column names))
y (np.ndarray of int64)
- Return type:
tuple[DataFrame, ndarray]
- fit(X, y)[source]
Fit the HUG-IML model on training data.
- Parameters:
- Returns:
self
Thread safety
fit() acquires an exclusive lock, so concurrent fit() calls on the same instance are serialized. predict, predict_proba, and transform only read fitted state and are safe for concurrent use after fit() returns.
- Return type:
- predict_proba(X_test)[source]
Predict class probabilities for X_test.
When max_predict_ms is set, large batches are processed in chunks. Rows exceeding the time budget receive uniform probabilities and a warning is emitted.
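The budget behaviour described above can be pictured with a small stand-alone sketch. It is not the hugiml implementation: predict_proba_with_budget, its chunk_size argument, and the injectable clock are all hypothetical names introduced here, and the real chunking policy may differ.

```python
import time
import warnings


def predict_proba_with_budget(rows, predict_chunk, n_classes, max_predict_ms,
                              chunk_size=256, clock=time.perf_counter):
    """Predict chunk by chunk; chunks that start after the budget has run
    out receive uniform probabilities (illustrative helper only)."""
    uniform = [1.0 / n_classes] * n_classes
    start = clock()
    probs, n_fallback = [], 0
    for i in range(0, len(rows), chunk_size):
        chunk = rows[i:i + chunk_size]
        elapsed_ms = (clock() - start) * 1000.0
        if elapsed_ms > max_predict_ms:
            probs.extend(list(uniform) for _ in chunk)
            n_fallback += len(chunk)
        else:
            probs.extend(predict_chunk(chunk))
    if n_fallback:
        warnings.warn(f"{n_fallback} rows exceeded the latency budget")
    return probs, n_fallback


# Demo with a fake clock (in seconds) so the outcome is deterministic:
# the third chunk starts 12 ms in, past the 10 ms budget.
ticks = iter([0.0, 0.0, 0.004, 0.012])
probs, n_fallback = predict_proba_with_budget(
    rows=list(range(6)),
    predict_chunk=lambda chunk: [[0.9, 0.1] for _ in chunk],
    n_classes=2, max_predict_ms=10, chunk_size=2,
    clock=lambda: next(ticks),
)
print(n_fallback, probs[-1])
```

Because fallback rows carry no information, callers that set max_predict_ms may want to check for the warning before acting on the returned probabilities.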
- predict(X_test)[source]
Predict class labels for X_test.
- Parameters:
X_test (array-like or DataFrame)
- Return type:
np.ndarray, shape (n_samples,)
- get_augmented_pair_transforms()[source]
Return augmented pair transforms used by the downstream estimator.
Each catalog entry includes the raw pair formula, source-feature IG provenance, candidate coverage, unavailable-pair policy, and the standardization parameters used before the downstream estimator sees the feature. Candidate IG is scored on rows where both source values are observed. For selected features, rows where the pair value cannot be computed receive the pair feature’s training reference value before standardization, yielding a neutral standardized value.
- Return type:
list[dict[str, Any]]
- get_augmented_pair_standardization()[source]
Return standardization metadata for augmented pair features.
The returned columns are aligned to get_augmented_pair_transforms() and make the raw-to-estimator transformation explicit.
- Return type:
DataFrame
- explain_augmented_pair_effects()[source]
Explain augmented-pair effects in standardized and raw units.
The downstream estimator is fit on standardized augmented-pair values. This method converts each standardized coefficient back to the raw pair scale and states that the reference value is the training-cohort mean of the observed pair term, not a domain-specific baseline. Candidate scoring uses rows where both source values are observed. For selected features, rows where the pair cannot be computed receive the pair feature’s training reference raw value before standardization, yielding a neutral standardized value for that pair term. HUGIML pattern features keep their native missing-value handling.
For logistic-regression downstream models, coefficient columns are log-odds effects. Product-term effects are expressed on the product scale; changing one individual input does not have a fixed marginal effect because it depends on the current value of the other input.
- Return type:
DataFrame
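The conversion between standardized and raw units can be illustrated with a few lines of arithmetic. This is a minimal sketch under stated assumptions: population standard deviation is used, None marks a pair value that cannot be computed, and the helper names are hypothetical rather than part of hugiml.

```python
import math


def fit_standardizer(raw_values):
    """Mean and standard deviation over the observed (non-None) values."""
    observed = [v for v in raw_values if v is not None]
    mean = sum(observed) / len(observed)
    var = sum((v - mean) ** 2 for v in observed) / len(observed)
    return mean, math.sqrt(var)


def standardize(raw_values, mean, std):
    """Unavailable pairs take the training reference (the mean) first,
    so they land on the neutral standardized value 0."""
    return [((mean if v is None else v) - mean) / std for v in raw_values]


def raw_scale_coefficient(coef_standardized, std):
    """Log-odds change per one raw unit of the pair term."""
    return coef_standardized / std


pair = [2.0, 4.0, None, 6.0, 8.0]  # e.g. a product term x1 * x2
mean, std = fit_standardizer(pair)
z = standardize(pair, mean, std)
coef_raw = raw_scale_coefficient(0.5, std)
print(mean, z[2], round(coef_raw, 4))
```

The raw-scale coefficient applies to the pair term as a whole; as noted above, the effect of changing one input of a product term depends on the current value of the other input.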
- transform(X)[source]
Return the binary HUG pattern matrix for X.
Each column corresponds to one mined pattern. Entry (i, j) is 1 when all items of pattern j appear in row i.
- Parameters:
X (array-like or DataFrame)
- Return type:
csr_matrix, shape (n_samples, n_patterns)
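The matching rule for entry (i, j) can be sketched in pure Python. The real transform runs in the C++ extension and returns a sparse csr_matrix; the dense lists, the to_items helper, and the "feature=label" item encoding below are simplifications assumed for illustration.

```python
def to_items(row):
    """Turn a {feature: label} row into a set of 'feature=label' items,
    skipping cells with no label."""
    return {f"{k}={v}" for k, v in row.items() if v is not None}


def pattern_matrix(rows, patterns):
    """Dense 0/1 matrix: entry (i, j) is 1 when every item of pattern j
    appears in row i."""
    transactions = [to_items(r) for r in rows]
    return [[1 if set(p) <= t else 0 for p in patterns] for t in transactions]


rows = [
    {"age": "[35,50)", "gender": "F"},
    {"age": "[35,50)", "gender": "M"},
    {"age": "[18,35)", "gender": "F"},
]
patterns = [("age=[35,50)",), ("gender=F",), ("age=[35,50)", "gender=F")]
M = pattern_matrix(rows, patterns)
print(M)  # [[1, 1, 1], [1, 0, 0], [0, 1, 0]]
```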
- enable_monitoring(window_size=1000)[source]
Enable prediction monitoring. Access via self.monitor.
- Parameters:
window_size (int)
- Return type:
- detect_drift(X_test, y_test=None, threshold=0.1)[source]
Run multi-method drift detection and return a human-readable report.
Uses PSI + KL divergence. When y_test is provided, also checks label distribution drift.
Notes
Drift metrics are computed on the numeric array retained by the mining path. Fixed-B numeric columns that contained NaN/Inf during training are converted to the categorical bin-label path so missingness is handled consistently at fit/predict time; those columns are therefore not represented as continuous numeric drift baselines. PSI/KL alerts for such columns should be interpreted through pattern/feature-importance diagnostics rather than through detect_drift().
- Parameters:
X_test (array-like or DataFrame)
y_test (array-like, optional)
threshold (float)
- Return type:
str
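For readers unfamiliar with the two statistics, here is a minimal stand-alone sketch of PSI and KL divergence over binned values. The binning scheme, the eps floor, and the comparison against 0.1 are assumptions made for illustration; how detect_drift() bins features and applies threshold internally is not specified on this page.

```python
import math


def bin_fractions(values, edges, eps=1e-6):
    """Fraction of values per bin [edges[i], edges[i+1]); the last bin also
    includes its right edge. Values outside the edges are ignored. eps
    floors empty bins so the logarithms below stay finite."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    total = sum(counts)
    fracs = [max(c / total, eps) for c in counts]
    norm = sum(fracs)
    return [f / norm for f in fracs]


def psi(expected, actual):
    """Population Stability Index between two binned distributions."""
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


def kl_divergence(p, q):
    """KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


edges = [0.0, 0.5, 1.0]
train = bin_fractions([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9], edges)
new = bin_fractions([0.1, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95, 0.99], edges)
score = psi(train, new)
print(round(score, 3), score > 0.1)
```

PSI is symmetric in its two arguments while KL divergence is not, which is one reason to report both.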
- get_drift_psi(X_test)[source]
Return per-feature PSI values as a dict.
See detect_drift() for the fixed-B missing-numeric limitation: columns that were routed to categorical bin labels because they contained NaN/Inf during training do not have meaningful continuous PSI baselines.
- Parameters:
X_test (Any)
- Return type:
dict
- cross_validate_monitored(X, y, cv=None, scoring='roc_auc')[source]
Cross-validation with per-fold monitoring and drift detection.
- Parameters:
X (pd.DataFrame or ndarray)
y (array-like)
cv (int or CV splitter (default: StratifiedKFold(5)))
scoring (str)
- Returns:
dict with keys test_scores, fit_times_ms, fold_monitors, fold_drift, fold_metadata
- Return type:
dict
- get_hug_features()[source]
Return a human-readable label for each mined HUG pattern.
Singleton patterns use the format feature=[lo,hi) for adaptive numerical columns (e.g. age=[35,50)) and feature=value for categorical columns (e.g. gender=F). Compound patterns (L > 1) are comma-separated, e.g. age=[35,50), gender=F.
When adaptive_binning=True and the integer-code path was used, C++ stores bin labels as feature=[k,k+1] (integer range). These are transparently remapped to the original-scale [lo,hi) labels via _adaptive_code_label_map_ so that the output is identical in appearance to the string-path output.
Production mode
This method remains available in execution_mode='production' because it only needs retained pattern labels. get_pattern_info() is intentionally audit-only because it additionally needs the retained training pattern matrix to compute support.
- Return type:
list[str]
- get_transformed_shape()[source]
Return (n_samples, n_patterns) for the training pattern matrix.
In production mode the matrix itself is not retained, but its shape is persisted as lightweight diagnostic metadata.
- Return type:
tuple[int, int]
- get_pattern_info()[source]
Summary DataFrame with one row per mined HUG pattern.
Columns: pattern, utility, information_gain, support.
This is an audit/governance table. Unlike get_hug_features(), it requires the retained training pattern matrix to compute support and therefore raises a clear error in execution_mode='production'.
- Return type:
DataFrame
- get_downstream_features()[source]
Return names aligned with the downstream estimator input columns.
The returned names include a namespace prefix so feature provenance is explicit: orig: for original features, pattern: for mined HUG patterns, and augmented_pair: for augmented pair transforms. When topk_budget_strict=True, the returned list is already filtered to the columns retained by the fitted strict TopK mask.
- Return type:
list[str]
- get_model_composition()[source]
Return downstream feature composition and relevant fitted configuration.
The composition describes the actual feature families entering the downstream estimator after feature-mode construction and optional strict TopK filtering.
- Return type:
dict[str, Any]
- feature_importances()[source]
Map downstream estimator coefficients to final feature names.
Returns a DataFrame sorted by absolute coefficient magnitude. Feature names are aligned to the downstream estimator after feature-mode and optional strict TopK filtering have been applied. The feature_type column distinguishes original features, mined HUG patterns, and augmented pair transforms. pattern_support is populated only for mined HUG patterns; original and augmented-pair features use support_type='not_applicable' and pattern_support=NaN.
- Raises:
AttributeError – When the downstream estimator does not expose coef_ (e.g. non-linear models).
- Return type:
DataFrame
- plot_bin_profiles(figsize=None)[source]
Bar chart of the chosen B per numerical feature (adaptive binning only).
Colour encodes position in the candidate range: blue = coarse end, green = mid, amber/red = fine end.
- Return type:
(fig, ax)
- Raises:
RuntimeError – When called on a non-adaptive or unfitted model.
ImportError – When matplotlib is not installed.
- Parameters:
figsize (tuple | None)
- ig_heatmap(figsize=None)[source]
Heatmap of IG score at every (feature, B) grid point (adaptive binning only).
The chosen B per feature is highlighted with a bounding box.
- Return type:
(fig, ax)
- Raises:
RuntimeError – When called on a non-adaptive or unfitted model, or when ig_scores_ is empty.
ImportError – When matplotlib is not installed.
- Parameters:
figsize (tuple | None)
- classmethod fast_grid_tune(X_train, y_train, X_val, y_val, param_grid=None, *, base_params=None, scoring='roc_auc', refit_full=False, return_results=True)
Exact cached tuner for the compact adaptive HUGIML grid.
Requirements
- adaptive_binning=True for every candidate.
- G may vary; the tuner partitions candidates into fixed-G cache groups.
- Only G, L, topK, and feature_mode vary. B may appear in the grid but is ignored for cache partitioning because adaptive binning chooses per-feature bins and fit() passes sentinel B=2 to the native transaction builder.
- max_fit_seconds must be None to guarantee equivalence to the ordinary grid loop; timeout/degradation can make cached mining fits differ from standalone candidates.
Returns a dict with best_model, best_params, best_score, cv_results, and cache timings. Uses the same scorer as the ordinary grid path for all supported scoring values. During tuning it skips drift-baseline and rich final metadata; set refit_full=True to refit the selected model with normal fit().
- Parameters:
X_train (Any)
y_train (Any)
X_val (Any)
y_val (Any)
param_grid (dict[str, list] | None)
base_params (dict[str, Any] | None)
scoring (str)
refit_full (bool)
return_results (bool)
- Return type:
dict[str, Any]
- set_predict_proba_request(*, X_test='$UNCHANGED$')
Configure whether metadata should be requested to be passed to the predict_proba method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to predict_proba if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to predict_proba.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
X_test (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_test parameter in predict_proba.
self (HUGIMLClassifierNative)
- Returns:
self – The updated object.
- Return type:
object
- classmethod tune(X, y, *, cv=5, scoring='roc_auc', param_grid=None, refit=True, base_params=None, random_state=42, shuffle=True, cv_splits=None, use_fast_path=True, return_dataframe=True)
Tune HUGIML on full X, y using stratified CV and optional fast-grid caching.
This is the main public convenience API for quick HUGIML model selection. The regular constructor remains a single-configuration estimator; this method owns grid search, cross-validation, aggregation, and optional refit.
- Parameters:
X (array-like or DataFrame/Series) – Full training data.
y (array-like or DataFrame/Series) – Full training data.
cv (int or splitter, default=5) – Number of stratified folds, or any sklearn-compatible splitter with split(X, y). Integer cv uses StratifiedKFold.
scoring ({'roc_auc', 'accuracy', 'balanced_accuracy', 'f1', 'f1_macro', 'f1_weighted'}) – Validation metric. ‘roc_auc’ supports binary and multiclass OVR macro AUC.
param_grid (dict or None) – sklearn-style grid. None uses HUGIMLClassifier.default_param_grid().
refit (bool, default=True) – If True, refit the best configuration on the full X, y with normal fit().
base_params (dict or None) – Constructor parameters shared by every candidate.
random_state (int or None, default=42) – Random seed for StratifiedKFold when cv is an integer.
shuffle (bool, default=True) – Whether StratifiedKFold shuffles before splitting.
cv_splits (list of (train_idx, val_idx) or None, default=None) – Exact fold indices to use. When supplied, cv, shuffle, and random_state are ignored for split generation, and the same indices are returned in result.cv_splits_ for reuse by other models.
use_fast_path (bool, default=True) – Use exact cached fast-grid evaluation when the grid qualifies; otherwise fall back to ordinary per-candidate evaluation.
return_dataframe (bool, default=True) – Return results_ as a pandas DataFrame when pandas is available.
- Returns:
GridSearchCV-like result object with best_estimator_, best_params_, best_score_, results_, fast_path_used_, elapsed_seconds_, and n_splits_.
- Return type:
HUGIMLTuneResult
- class hugiml.classifier.FitMetadata(n_samples, n_features, n_classes, n_items, n_patterns, n_compound, topK_used, stage_times_ms, total_fit_ms, matrix_density, config, n_augmented_pairs=0, n_downstream_features=0, downstream_feature_counts=<factory>, memory_peak_mb=0.0, memory_rss_mb=0.0, memory_cpp_mb=0.0, openmp_threads=1, degraded=False)[source]
Bases: object
Immutable record of everything that happened during fit().
- Parameters:
n_samples (int)
n_features (int)
n_classes (int)
n_items (int)
n_patterns (int)
n_compound (int)
topK_used (int)
stage_times_ms (dict)
total_fit_ms (float)
matrix_density (float)
config (dict)
n_augmented_pairs (int)
n_downstream_features (int)
downstream_feature_counts (dict)
memory_peak_mb (float)
memory_rss_mb (float)
memory_cpp_mb (float)
openmp_threads (int)
degraded (bool)
- n_samples, n_features
Training set dimensions.
- Type:
int
- n_classes
Number of distinct target classes.
- Type:
int
- n_items
Number of utility-annotated items (bins + categories).
- Type:
int
- n_patterns
Number of HUG patterns mined and retained.
- Type:
int
- n_compound
Compound patterns (length > 1).
- Type:
int
- n_augmented_pairs
Number of augmented pair features retained for the downstream estimator.
- Type:
int
- n_downstream_features
Number of columns used by the downstream estimator after feature-mode construction and optional strict TopK filtering.
- Type:
int
- downstream_feature_counts
Counts by downstream feature family, for example original, pattern, and augmented_pair.
- Type:
dict
- topK_used
Effective topK budget used during mining.
- Type:
int
- stage_times_ms
Wall-clock milliseconds per fit stage.
- Type:
dict[str, float]
- total_fit_ms
Total fit wall-clock milliseconds.
- Type:
float
- matrix_density
Fraction of non-zero entries in the training pattern matrix.
- Type:
float
- config
Snapshot of (B, L, G, topK) as used.
- Type:
dict
- memory_peak_mb
Python-traced peak memory during fit.
- Type:
float
- memory_rss_mb
RSS delta during fit (Unix only).
- Type:
float
- memory_cpp_mb
Estimated C++ extension memory usage.
- Type:
float
- openmp_threads
Number of OpenMP threads used.
- Type:
int
- degraded
True when fit fell back to reduced parameters.
- Type:
bool
- class hugiml.classifier.HUGIMLTuneResult(best_estimator_, best_params_, best_score_, results_, fast_path_used_, elapsed_seconds_, n_splits_, scoring, cv_splits_, shuffle, random_state)[source]
Bases: object
Result object returned by HUGIMLClassifier.tune().
Attributes mirror the small subset of GridSearchCV-style fields users need for quick HUGIML tuning while keeping the API lightweight.
- Parameters:
best_estimator_ (HUGIMLClassifierNative)
best_params_ (dict[str, Any])
best_score_ (float)
results_ (Any)
fast_path_used_ (bool)
elapsed_seconds_ (float)
n_splits_ (int)
scoring (str)
cv_splits_ (list[tuple[ndarray, ndarray]])
shuffle (bool)
random_state (int | None)
Adaptive binning
Per-feature adaptive binning for HUG-IML — HUGIMLAdaptive.
HUGIMLAdaptive is a thin, sklearn-compatible subclass of
HUGIMLClassifierNative that hard-wires adaptive_binning=True and
exposes a simplified constructor (no B, allCols, or origColumns
parameters — those are managed internally).
All adaptive-binning mathematics live in hugiml._binning (single source
of truth). Both this module and hugiml.classifier import from there;
neither imports from the other at module level, so there is no circular
dependency.
Adaptive-binning algorithm (three steps)
1. Per-feature B selection: for each numerical feature, evaluate candidate B values by computing information gain against y and stop when the marginal gain from adding more bins drops below min_marginal_gain_ratio of the gain already achieved (elbow-stopping).
2. Pre-discretisation: discretise each numerical feature to B_j equal-frequency quantile bins, computed on the training split only. Bin boundaries are stored in _bin_edges_ and reapplied at predict time. Each bin is encoded as a readable string label, e.g. "[12.0,24.0)".
3. Categorical pass-through: pre-binned columns are treated as categorical by the C++ layer; the global B parameter is set to the sentinel value 2 (no effect on already-categorical columns).
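The first step, elbow-stopping selection of B for a single feature, can be sketched as follows. This is an illustrative reading of the rule described above, not the code in hugiml._binning: the quantile cut-point rule, the use of base-2 entropy, and all function names are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def quantile_cuts(values, n_bins):
    """Interior cut points for roughly equal-frequency bins."""
    s = sorted(values)
    return sorted({s[(len(s) * k) // n_bins] for k in range(1, n_bins)})


def information_gain(values, labels, n_bins):
    """IG of the class label given the feature discretised into n_bins."""
    cuts = quantile_cuts(values, n_bins)
    groups = {}
    for v, y in zip(values, labels):
        bin_index = sum(v >= c for c in cuts)
        groups.setdefault(bin_index, []).append(y)
    conditional = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def choose_b(values, labels, b_candidates, min_marginal_gain_ratio=0.02):
    """Walk candidates in ascending order and stop at the first one whose
    extra IG is below min_marginal_gain_ratio times the IG already achieved."""
    best_b, best_ig = None, 0.0
    for b in sorted(b_candidates):
        ig = information_gain(values, labels, b)
        if best_b is not None and ig - best_ig < min_marginal_gain_ratio * best_ig:
            break
        best_b, best_ig = b, ig
    return best_b, best_ig


values = list(range(12))
labels = [0] * 6 + [1] * 6
print(choose_b(values, labels, [2, 3, 5]))  # (2, 1.0)
```

In this toy feature a single split at the midpoint already separates the classes, so the search stops at B=2 instead of paying for finer bins.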
Non-finite value handling
Non-finite cells (NaN, ±Inf) in any pre-binned column receive np.nan
in the label array. The C++ transaction builder skips those cells,
generating no item for that (row, feature) pair — semantically
“not observed”, with no imputation.
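The "not observed" semantics can be shown with a small sketch in which None stands in for the np.nan label. The helper names and the label formatting are hypothetical, and treating values outside the training edges as unlabelled is an assumption made only to keep the example short.

```python
import math


def bin_label(value, edges):
    """Readable '[lo,hi)' label, or None for NaN/±Inf (and, by assumption
    here, for values outside the training edges)."""
    if value is None or not math.isfinite(value):
        return None
    for lo, hi in zip(edges, edges[1:]):
        if lo <= value < hi:
            return f"[{lo},{hi})"
    return None


def build_transaction(row, edges_by_feature):
    """One item per labelled cell; unlabelled cells produce no item."""
    items = []
    for feature, value in row.items():
        label = bin_label(value, edges_by_feature[feature])
        if label is not None:  # skipped cell: no item and no imputation
            items.append(f"{feature}={label}")
    return items


edges = {"age": [18.0, 35.0, 50.0, 65.0], "income": [0.0, 30.0, 60.0]}
transaction = build_transaction({"age": 40.0, "income": float("nan")}, edges)
print(transaction)  # ['age=[35.0,50.0)']
```

A row with a missing cell therefore simply matches fewer patterns; nothing is substituted for the missing value.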
Usage
Example:
from sklearn.model_selection import train_test_split
from hugiml.adaptive import HUGIMLAdaptive
clf = HUGIMLAdaptive(b_candidates=[3, 5, 7, 10, 15],
                     min_marginal_gain_ratio=0.02,
                     L=2, G=1e-4)
X_enc, y_enc = clf.prepareXy(X_df, y)
X_tr, X_te, y_tr, y_te = train_test_split(X_enc, y_enc, stratify=y_enc)
clf.fit(X_tr, y_tr)
print(clf.per_feature_b_) # chosen B_j per feature
print(clf.model_summary())
clf.plot_bin_profiles() # requires matplotlib
clf.ig_heatmap() # requires matplotlib
Diagnostic plots (plot_bin_profiles, ig_heatmap) and fitted
attributes (per_feature_b_, ig_scores_, _bin_edges_) are
defined on HUGIMLClassifierNative and inherited here.
- class hugiml.adaptive.HUGIMLAdaptive(b_candidates=None, min_marginal_gain_ratio=0.02, L=1, G=0.005, topK=-1, n_jobs=1, verbose=False, max_fit_seconds=None)[source]
Bases: HUGIMLClassifierNative
HUG-IML with per-feature adaptive binning via elbow-stopping IG search.
Thin subclass of HUGIMLClassifierNative with adaptive_binning=True hard-wired and a simplified constructor that omits parameters which are managed internally (B, allCols, origColumns).
All public methods, fitted attributes, serialisation, monitoring, drift detection, and explanation helpers are inherited from HUGIMLClassifierNative. No logic is duplicated.
- Parameters:
b_candidates (list of int, optional) – Candidate bin counts to evaluate per feature. Default: [2, 3, 5, 7, 10, 15].
min_marginal_gain_ratio (float, default 0.02) – Stop adding bins when the incremental IG gain relative to the current level falls below this fraction. 0.02 means stop when a new candidate adds less than 2% more IG than the previous step. Lower values allow finer bins; higher values enforce coarser bins.
L (int, default 1) – Maximum HUG pattern length. 1 = singletons; 2 = pairs; -1 = unlimited.
G (float, default 5e-3) – Minimum information-gain threshold.
topK (int, default -1) – Maximum number of patterns to retain. -1 computes automatically.
n_jobs (int, default 1) – Number of OpenMP threads. -1 uses all available cores.
verbose (bool, default False) – Emit INFO-level log messages during fit.
max_fit_seconds (float or None) – Wall-clock budget for the pattern-mining stage of fit().
Attributes (after fit, inherited from HUGIMLClassifierNative):
per_feature_b_ (dict[str, int]) – Chosen bin count per numerical feature.
ig_scores_ (dict[str, dict[int, float]]) – Full IG score grid {feature: {B: ig_value}} for diagnostics.
_bin_edges_ (dict[str, np.ndarray]) – Quantile edges used during fit, reapplied at predict time.
patterns (list) – Mined HUG patterns.
classes (ndarray) – Unique class labels.
fit_metadata (FitMetadata) – Timings, memory, pattern count stats.
- classmethod default_param_grid()[source]
Return the default compact tuning grid inherited from the native classifier.
- Return type:
dict[str, list]
- get_params(deep=True)[source]
Return the constructor parameters (sklearn protocol).
Only the parameters that HUGIMLAdaptive.__init__ accepts are returned, so sklearn.clone and cross-validation helpers reconstruct the correct subclass.
- Parameters:
deep (bool)
- Return type:
dict
- fit(X_train, y_train)[source]
Fit with per-feature adaptive binning.
Delegates entirely to HUGIMLClassifierNative.fit with adaptive_binning=True. When X_train is a plain ndarray and prepareXy has been called previously, column names from feature_names_in_ are applied so that feature-name-aware operations (adaptive binning, bin-edge lookup, schema validation) work correctly.
- Parameters:
X_train (pd.DataFrame or ndarray)
y_train (array-like of int)
- Return type:
self
- property clf_: HUGIMLAdaptive
Backward-compatibility alias.
Old code that accessed adaptive_clf.clf_ to reach the inner HUGIMLClassifierNative now gets self, because HUGIMLAdaptive is a HUGIMLClassifierNative. All methods and fitted attributes are directly on self.
- set_predict_proba_request(*, X_test='$UNCHANGED$')
Configure whether metadata should be requested to be passed to the predict_proba method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to predict_proba if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to predict_proba.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
X_test (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_test parameter in predict_proba.
self (HUGIMLAdaptive)
- Returns:
self – The updated object.
- Return type:
object
Metrics
Interpretability-complexity metrics for a fitted HUGIMLClassifierNative.
All functions accept a fitted HUGIMLClassifierNative and (optionally) a
data matrix X to compute sample-level statistics. They never re-train
the model.
Quick reference
Example:
from hugiml.metrics import compute_all_metrics
m = compute_all_metrics(clf, X_test)
print(m)
Available metrics
- n_patterns: total mined patterns.
- avg_pattern_length: mean number of items per pattern.
- coverage: fraction of samples matched by at least one pattern.
- overlap_rate: mean number of patterns active per sample, normalised by n_patterns.
- top_k_cumulative_contribution(k): cumulative absolute-coefficient share of the top-k patterns.
- active_patterns_per_prediction: per-sample array of active-pattern counts.
- explanation_sparsity: fraction of patterns never active on the supplied data.
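The sample-level metrics in this list can all be derived from the binary pattern matrix. The sketch below computes them on a small dense 0/1 matrix; it is a stand-alone illustration with a hypothetical function name, not the code behind compute_all_metrics, and it takes overlap_rate to be mean_active_patterns / n_patterns as documented for InterpretabilityMetrics.

```python
def sample_level_metrics(matrix):
    """matrix: list of rows, each a list of 0/1 pattern activations."""
    n_samples, n_patterns = len(matrix), len(matrix[0])
    active = [sum(row) for row in matrix]  # active patterns per prediction
    mean_active = sum(active) / n_samples
    column_hits = [sum(row[j] for row in matrix) for j in range(n_patterns)]
    return {
        "active_patterns_per_prediction": active,
        "coverage": sum(a > 0 for a in active) / n_samples,
        "mean_active_patterns": mean_active,
        "overlap_rate": mean_active / n_patterns,
        "explanation_sparsity": sum(h == 0 for h in column_hits) / n_patterns,
    }


M = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],  # no pattern matches this sample
    [0, 1, 1, 0],  # the last pattern is never active ("dead")
]
metrics = sample_level_metrics(M)
print(metrics)
```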
- class hugiml.metrics.InterpretabilityMetrics(n_patterns=0, avg_pattern_length=0.0, max_pattern_length=0, coverage=0.0, mean_active_patterns=0.0, std_active_patterns=0.0, overlap_rate=0.0, explanation_sparsity=0.0, top_k_cumulative_contribution=<factory>, n_samples=0)[source]
Bases: object
All interpretability metrics for one fitted model + dataset.
- Parameters:
n_patterns (int)
avg_pattern_length (float)
max_pattern_length (int)
coverage (float)
mean_active_patterns (float)
std_active_patterns (float)
overlap_rate (float)
explanation_sparsity (float)
top_k_cumulative_contribution (dict)
n_samples (int)
- n_patterns
Total number of mined HUG patterns.
- Type:
int
- avg_pattern_length
Mean items (conditions) per pattern.
- Type:
float
- max_pattern_length
Length of the longest pattern.
- Type:
int
- coverage
Fraction of samples covered by at least one active pattern.
- Type:
float
- mean_active_patterns
Average number of patterns active per sample.
- Type:
float
- std_active_patterns
Standard deviation of active patterns per sample.
- Type:
float
- overlap_rate
Normalised overlap: mean_active_patterns / n_patterns.
- Type:
float
- explanation_sparsity
Fraction of patterns that are never active on X (“dead” patterns).
- Type:
float
- top_k_cumulative_contribution
Mapping from k to cumulative share of total absolute coefficient magnitude for the top-k patterns. Keys: [1, 5, 10, 20, 50].
- Type:
dict[int, float]
- n_samples
Number of rows in X used for sample-level metrics.
- Type:
int
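As a rough illustration of how the sample-level attributes above relate to one another, the sketch below derives them from a small binary activation matrix. It does not call hugiml: the matrix, the coefficients, and the helper name top_k_share are hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

# Hypothetical activation matrix: rows are samples, columns are patterns.
# A 1 means the pattern is active (matched) for that sample.
activations = [
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 0],
]
# Hypothetical downstream coefficients, one per pattern.
coefficients = [0.8, -0.4, 0.2, 0.1]

n_samples = len(activations)
n_patterns = len(activations[0])

active_per_sample = [sum(row) for row in activations]

# Fraction of samples with at least one active pattern.
coverage = sum(1 for a in active_per_sample if a > 0) / n_samples
mean_active_patterns = mean(active_per_sample)
std_active_patterns = pstdev(active_per_sample)  # population std (assumed)
overlap_rate = mean_active_patterns / n_patterns

# A pattern is "dead" when its column is all zeros on the supplied data.
dead = sum(1 for j in range(n_patterns) if not any(row[j] for row in activations))
explanation_sparsity = dead / n_patterns


def top_k_share(coefs, k):
    """Share of total absolute coefficient magnitude held by the k largest."""
    magnitudes = sorted((abs(c) for c in coefs), reverse=True)
    total = sum(magnitudes)
    return sum(magnitudes[:k]) / total if total else 0.0
```

Here one of the four patterns never fires, so a quarter of the explanation vocabulary is unused on this data even though three of the four samples are covered.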
- hugiml.metrics.compute_all_metrics(clf, X)[source]
Compute all interpretability metrics in a single call.
- Parameters:
clf (fitted HUGIMLClassifierNative)
X (array-like or DataFrame)
- Return type:
- hugiml.metrics.metrics_dataframe(results)[source]
Convert a mapping of {model_name: InterpretabilityMetrics} to a DataFrame.
Useful for side-by-side comparisons across models or configurations.
- Parameters:
results (dict) – Keys are model labels; values are InterpretabilityMetrics instances.
- Return type:
pd.DataFrame
Calibration
Calibration evaluation for HUGIMLClassifierNative.
Provides Expected Calibration Error (ECE), Brier score decomposition, reliability diagram data, and calibration curve computation consistent with best practices for interpretable classifiers.
- class hugiml.calibration.CalibrationResult(ece, mce, brier_score, brier_reliability, brier_resolution, brier_uncertainty, n_bins, bin_confidences=<factory>, bin_accuracies=<factory>, bin_counts=<factory>)[source]
Bases: object
Calibration evaluation summary for a fitted classifier.
- Parameters:
ece (float)
mce (float)
brier_score (float)
brier_reliability (float)
brier_resolution (float)
brier_uncertainty (float)
n_bins (int)
bin_confidences (list[float])
bin_accuracies (list[float])
bin_counts (list[int])
- ece
Expected Calibration Error (lower is better; 0 = perfect).
- Type:
float
- mce
Maximum Calibration Error across all bins.
- Type:
float
- brier_score
Mean Brier score (lower is better; 0 = perfect).
- Type:
float
- brier_reliability
Brier reliability component (miscalibration contribution).
- Type:
float
- brier_resolution
Brier resolution component (sharpness contribution).
- Type:
float
- brier_uncertainty
Brier uncertainty component (base rate uncertainty).
- Type:
float
- n_bins
Number of calibration bins used.
- Type:
int
- bin_confidences
Mean predicted confidence per bin.
- Type:
list of float
- bin_accuracies
Empirical accuracy per bin.
- Type:
list of float
- bin_counts
Sample count per bin.
- Type:
list of int
- hugiml.calibration.evaluate_calibration(y_true, y_proba, n_bins=10, strategy='uniform')[source]
Compute ECE, MCE, and Brier score decomposition.
- Parameters:
y_true (np.ndarray of int, shape (n_samples,)) – True class labels (0 or 1 for binary; multi-class uses one-vs-rest).
y_proba (np.ndarray of float, shape (n_samples,) or (n_samples, n_classes)) – Predicted probabilities. For multi-class, pass the probability of the positive class or use the column for the class of interest.
n_bins (int) – Number of calibration bins.
strategy ({'uniform', 'quantile'}) – Bin strategy: uniform width or equal-frequency.
- Return type:
- hugiml.calibration.reliability_diagram_data(y_true, y_proba, n_bins=10)[source]
Return bin-level data for plotting a reliability diagram.
- Parameters:
y_true (np.ndarray)
y_proba (np.ndarray)
n_bins (int)
- Returns:
Three parallel lists, one entry per non-empty bin.
- Return type:
(mean_predicted, fraction_positives, bin_counts)
- hugiml.calibration.brier_decomposition(y_true, y_proba)[source]
Murphy decomposition of the Brier score.
Decomposes Brier = Reliability - Resolution + Uncertainty.
- Parameters:
y_true (np.ndarray of {0, 1})
y_proba (np.ndarray of float in [0, 1])
- Returns:
All three components as floats.
- Return type:
(reliability, resolution, uncertainty)
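The identity Brier = Reliability - Resolution + Uncertainty can be checked numerically. The sketch below groups samples by distinct forecast value, a choice under which the identity holds exactly; whether brier_decomposition groups by distinct value or by bins is not stated here, so treat the grouping and the function name murphy_decomposition as assumptions.

```python
from collections import defaultdict


def murphy_decomposition(y_true, y_proba):
    """Return (reliability, resolution, uncertainty) for binary outcomes."""
    n = len(y_true)
    base_rate = sum(y_true) / n

    # Group outcomes by distinct forecast value.
    groups = defaultdict(list)
    for y, p in zip(y_true, y_proba):
        groups[p].append(y)

    reliability = 0.0
    resolution = 0.0
    for p, ys in groups.items():
        observed = sum(ys) / len(ys)  # observed positive rate in this group
        reliability += len(ys) * (p - observed) ** 2
        resolution += len(ys) * (observed - base_rate) ** 2
    uncertainty = base_rate * (1.0 - base_rate)
    return reliability / n, resolution / n, uncertainty


y_true = [0, 1, 1, 0, 1]
y_proba = [0.2, 0.2, 0.8, 0.8, 0.8]
rel, res, unc = murphy_decomposition(y_true, y_proba)
brier = sum((p - y) ** 2 for y, p in zip(y_true, y_proba)) / len(y_true)
```

With binned forecasts instead of distinct values, the identity holds only approximately because forecasts vary within a bin.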
Plots
HUG-IML first-class visualizations using Plotly.
Public API
from hugiml.plots import HUGPlotter
plotter = HUGPlotter(clf)
fig = plotter.plot_marginal_bin_profile("age", X)  # EBM shape-function equivalent
fig = plotter.plot_feature_combinations("age")  # compound patterns for one feature
fig = plotter.plot_feature_importance(top_n=15)
fig = plotter.plot_utility_vs_ig()  # scatter: utility × IG × support
fig = plotter.plot_top_patterns(top_n=20)
fig = plotter.plot_feature_coverage()
fig = plotter.plot_pattern_lengths()
fig = plotter.plot_support_distribution()
fig = plotter.plot_active_patterns(X, sample_idx=0)  # local explanation
html = plotter.plot_dashboard(X)  # full multi-panel HTML (returned as a string)
- class hugiml.plots.HUGPlotter(clf, height_default=380)[source]
Bases: object
Unified Plotly-based visualization interface for a fitted HUGIMLClassifierNative.
- Parameters:
clf (fitted HUGIMLClassifierNative)
height_default (int) – Default figure height.
- plot_marginal_bin_profile(feature_name, X=None, height=None, title=None)[source]
1-D HUG profile — EBM shape function equivalent.
For a given feature, shows every singleton pattern bin as a bar (x = bin label, y = utility, colour = information gain). An orange dotted line overlays the training support fraction on the right y-axis, mirroring the dashboard’s “Marginal Bin Profile” card.
- Parameters:
feature_name (str)
X (ignored) – Support uses training data stored in clf.x_train_hup_.
height (int, optional)
title (str, optional)
- Return type:
plotly.graph_objects.Figure
- plot_feature_combinations(feature_name, top_n=25, height=None, title=None)[source]
Compound patterns that include a specific feature.
Each bar = one compound pattern; bars coloured by the number of extra features (+1 = green, +2 = orange, +3 = red), matching the dashboard’s “Feature Combinations” card.
- Parameters:
feature_name (str)
top_n (int)
height (int, optional)
title (str, optional)
- Return type:
go.Figure
- plot_feature_importance(top_n=15, height=None, title=None)[source]
Feature importance: mean utility per feature, coloured by mean IG.
Matches the “Feature Importance” card in the governance dashboard.
- Parameters:
top_n (int)
height (int, optional)
title (str, optional)
- Return type:
go.Figure
- plot_utility_vs_ig(feature_filter=None, height=None, title=None)[source]
Scatter: utility (x) × information gain (y), coloured by support.
Matches the “Utility vs Info Gain” card in the governance dashboard. Optionally filter to patterns containing one feature.
- Parameters:
feature_filter (str, optional) – If given, highlight only patterns for this feature.
height (int, optional)
title (str, optional)
- Return type:
go.Figure
- plot_top_patterns(top_n=20, height=None, title=None)[source]
Horizontal bar chart of top-N patterns by utility, coloured by IG.
Matches the “Top Patterns” card in the governance dashboard.
- Parameters:
top_n (int)
height (int, optional)
title (str, optional)
- Return type:
go.Figure
- plot_feature_coverage(top_n=15, height=None, title=None)[source]
Horizontal bar: how many patterns reference each feature.
Matches the “Feature Coverage” card in the governance dashboard.
- Parameters:
top_n (int)
height (int | None)
title (str | None)
- Return type:
plotly.graph_objects.Figure
- plot_pattern_lengths(height=None, title=None)[source]
Bar chart of pattern length distribution.
Matches the “Pattern Lengths” card in the governance dashboard.
- Parameters:
height (int | None)
title (str | None)
- Return type:
plotly.graph_objects.Figure
- plot_support_distribution(height=None, title=None)[source]
Histogram of pattern support values.
Matches the “Support Distribution” card in the governance dashboard.
- Parameters:
height (int | None)
title (str | None)
- Return type:
plotly.graph_objects.Figure
- plot_active_patterns(X, sample_idx=0, max_patterns=20, height=None, title=None)[source]
Local explanation: active HUG patterns for a single sample.
Shows active patterns sorted by absolute coefficient magnitude, coloured blue for positive coefficients and red for negative coefficients.
- Parameters:
X (array-like or DataFrame)
sample_idx (int)
max_patterns (int)
height (int, optional)
title (str, optional)
- Return type:
go.Figure
- plot_performance_radar(metrics, dataset_name='Dataset', height=None)[source]
Radar / spider chart of classification performance metrics.
Matches the “Performance” card in the governance dashboard.
- Parameters:
metrics (dict) – Keys: 'accuracy', 'balanced_accuracy', 'roc_auc', 'f1'. Values: floats in [0, 1].
dataset_name (str)
height (int, optional)
- Return type:
go.Figure
- plot_2d_profile(feature_a, feature_b, height=None, title=None)[source]
2-D HUG profile heatmap for compound patterns involving two features.
- Parameters:
feature_a (str)
feature_b (str)
height (int, optional)
title (str, optional)
- Return type:
go.Figure
- plot_dashboard(X, dataset_name='Dataset', feature_names_for_profile=None, output_path=None)[source]
Generate a self-contained multi-panel HTML dashboard.
Produces performance overview, feature importance, utility-vs-IG, top patterns, pattern lengths, support distribution, feature coverage, and per-feature marginal bin profiles.
- Parameters:
X (array-like or DataFrame) – Used for active-pattern coverage check.
dataset_name (str)
feature_names_for_profile (list of str, optional) – Which features to include marginal bin profiles for. Defaults to all features that have singleton patterns.
output_path (str, optional) – If given, writes the HTML to this path.
- Return type:
str (HTML string)
Governance
Governance artifacts for HUGIMLClassifierNative.
Provides model card generation, audit artifact packaging, and governance metadata consistent with responsible model deployment practices and the HUG-IML paper’s emphasis on interpretability.
- class hugiml.governance.ModelCard(model_id, model_type='HUGIMLClassifierNative', paper_reference='Krishnamoorthy, S. (2024). Interpretable Classifier Models for Decision Support Using High Utility Gain Patterns. IEEE Access, 12, 126088-126107. DOI: 10.1109/ACCESS.2024.3455563', license='Apache-2.0', intended_use='', out_of_scope_use='', training_data_description='', evaluation_data_description='', hyperparameters=<factory>, performance_metrics=<factory>, n_patterns=0, n_compound=0, top_patterns=<factory>, limitations=<factory>, ethical_considerations='', created_at=<factory>, framework_version='')[source]
Bases: object
Structured model card for a fitted HUGIMLClassifierNative.
Follows the Google Model Cards framework adapted for rule-based interpretable classifiers.
- Parameters:
model_id (str)
model_type (str)
paper_reference (str)
license (str)
intended_use (str)
out_of_scope_use (str)
training_data_description (str)
evaluation_data_description (str)
hyperparameters (dict[str, Any])
performance_metrics (dict[str, Any])
n_patterns (int)
n_compound (int)
top_patterns (list[str])
limitations (list[str])
ethical_considerations (str)
created_at (str)
framework_version (str)
- model_id
Unique identifier for this model version.
- Type:
str
- model_type
Always ‘HUGIMLClassifierNative’.
- Type:
str
- paper_reference
Citation for the HUG-IML algorithm.
- Type:
str
- license
Software license.
- Type:
str
- intended_use
Describe the intended classification task.
- Type:
str
- out_of_scope_use
Describe uses not covered by this model.
- Type:
str
- training_data_description
Description of training data.
- Type:
str
- evaluation_data_description
Description of evaluation data.
- Type:
str
- hyperparameters
B, L, G, topK as used during training.
- Type:
dict
- performance_metrics
Accuracy, F1, AUC, ECE, Brier score, etc.
- Type:
dict
- n_patterns
Number of mined HUG patterns.
- Type:
int
- n_compound
Number of compound patterns.
- Type:
int
- top_patterns
Most important patterns.
- Type:
list of str
- limitations
Known limitations.
- Type:
list of str
- ethical_considerations
Fairness, bias, and ethical notes.
- Type:
str
- created_at
ISO 8601 timestamp of creation.
- Type:
str
- framework_version
hugiml-core version.
- Type:
str
- class hugiml.governance.AuditArtifact(model_id, created_at=<factory>, training_hash='', model_card=None, governance=None, fit_metadata=None, pattern_info=None, calibration=None, explainability=None, framework_version='')[source]
Bases: object
Audit record for a model training run.
Captures all information needed for regulatory review or internal audit.
- Parameters:
model_id (str)
created_at (str)
training_hash (str)
model_card (dict[str, Any] | None)
governance (dict[str, Any] | None)
fit_metadata (dict[str, Any] | None)
pattern_info (list[dict[str, Any]] | None)
calibration (dict[str, Any] | None)
explainability (dict[str, Any] | None)
framework_version (str)
- class hugiml.governance.GovernanceMetadata(model_id, owner='', purpose='', data_classification='unclassified', review_status='draft', approved_by=None, approved_at=None, tags=<factory>)[source]
Bases: object
Minimal governance metadata attached to a model instance.
- Parameters:
model_id (str)
owner (str)
purpose (str)
data_classification (str)
review_status (str)
approved_by (str | None)
approved_at (str | None)
tags (list[str])
- model_id
- Type:
str
- owner
Person or team responsible for this model.
- Type:
str
- purpose
Business or scientific purpose.
- Type:
str
- data_classification
Sensitivity of training data (e.g. ‘public’, ‘internal’, ‘confidential’).
- Type:
str
- review_status
One of ‘draft’, ‘reviewed’, ‘approved’, ‘deprecated’.
- Type:
str
- approved_by
- Type:
str or None
- approved_at
- Type:
str or None
- tags
- Type:
list of str
- hugiml.governance.generate_model_card(classifier, model_id, *, intended_use='', out_of_scope_use='', training_data_description='', evaluation_data_description='', performance_metrics=None, limitations=None, ethical_considerations='')[source]
Populate a ModelCard from a fitted classifier.
- Parameters:
classifier (HUGIMLClassifierNative) – A fitted classifier.
model_id (str) – Unique identifier.
intended_use (str)
out_of_scope_use (str)
training_data_description (str)
evaluation_data_description (str)
performance_metrics (dict[str, Any] | None)
limitations (list[str] | None)
ethical_considerations (str)
- Return type:
- hugiml.governance.package_audit_artifacts(classifier, model_id, output_dir, *, model_card=None, governance=None, calibration_result=None, explainability_report=None)[source]
Package all audit artifacts for a trained model.
Writes model card, governance metadata, fit metadata, pattern info, and optional calibration/explainability reports to output_dir.
- Returns:
Path to the audit manifest JSON file.
- Return type:
str
- Parameters:
classifier (Any)
model_id (str)
output_dir (str)
model_card (ModelCard | None)
governance (GovernanceMetadata | None)
calibration_result (Any | None)
explainability_report (Any | None)
Explainability
Enterprise explainability for HUGIMLClassifierNative.
Provides SHAP interoperability, feature lineage tracking, explanation stability metrics, and audit artifact generation. The core HUG patterns are human-readable by design; this module adds depth for downstream governance and audit workflows.
- class hugiml.explainability.ExplainabilityReport(model_id, n_patterns, n_features, top_patterns=<factory>, feature_lineage=<factory>, model_composition=<factory>, augmented_pair_effects=<factory>, stability=None, shap_available=False)[source]
Bases: object
Full explainability report for a fitted classifier instance.
Contains pattern importances, feature lineage, and stability metrics. Serializable to JSON for audit workflows.
- Parameters:
model_id (str)
n_patterns (int)
n_features (int)
top_patterns (list[dict[str, Any]])
feature_lineage (list[dict[str, Any]])
model_composition (dict[str, Any])
augmented_pair_effects (list[dict[str, Any]])
stability (dict[str, Any] | None)
shap_available (bool)
- class hugiml.explainability.FeatureLineage(feature_name, feature_type, derived_patterns=<factory>, pattern_indices=<factory>, derived_augmented_pairs=<factory>, total_importance=0.0, pattern_importance=0.0, augmented_pair_importance=0.0, original_feature_importance=0.0)[source]
Bases: object
Provenance record linking an original feature to downstream features.
- Parameters:
feature_name (str)
feature_type (str)
derived_patterns (list[str])
pattern_indices (list[int])
derived_augmented_pairs (list[str])
total_importance (float)
pattern_importance (float)
augmented_pair_importance (float)
original_feature_importance (float)
- feature_name
Original feature name from the training DataFrame.
- Type:
str
- feature_type
One of ‘integer’, ‘float’, ‘categorical’.
- Type:
str
- derived_patterns
Human-readable HUG pattern labels that include this feature.
- Type:
list of str
- pattern_indices
Indices into the pattern list for each derived pattern.
- Type:
list of int
- derived_augmented_pairs
Augmented-pair feature names that use this source feature.
- Type:
list of str
- total_importance
Sum of absolute downstream coefficients for original, HUG pattern, and augmented-pair features linked to this source feature.
- Type:
float
- pattern_importance
Pattern-only contribution to total_importance.
- Type:
float
- augmented_pair_importance
Augmented-pair contribution to total_importance.
- Type:
float
- original_feature_importance
Direct original-feature contribution when original features are included in the downstream estimator.
- Type:
float
- class hugiml.explainability.ExplanationStabilityMetrics(jaccard_similarity=0.0, rank_correlation=0.0, pattern_overlap_count=0, n_patterns_a=0, n_patterns_b=0, by_feature_type=<factory>)[source]
Bases: object
Stability metrics for pattern-based explanations.
The top-level fields report stability for mined HUG patterns only. When original or augmented-pair downstream features are present, per-feature-type metrics are available in by_feature_type so derived feature stability is not conflated with human-readable pattern-rule stability.
- Parameters:
jaccard_similarity (float)
rank_correlation (float)
pattern_overlap_count (int)
n_patterns_a (int)
n_patterns_b (int)
by_feature_type (dict[str, dict[str, float | int]])
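To make the jaccard_similarity, rank_correlation, and pattern_overlap_count fields concrete, here is a minimal sketch comparing two ranked lists of pattern labels. The labels and helper names are hypothetical, and using Spearman correlation over the shared patterns is an assumption about what "rank correlation" means here, not a statement of how the library computes it.

```python
def jaccard(a, b):
    """Jaccard similarity of two pattern sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def spearman_on_shared(ranked_a, ranked_b):
    """Spearman rank correlation restricted to patterns present in both lists."""
    in_b = set(ranked_b)
    shared = [p for p in ranked_a if p in in_b]
    n = len(shared)
    if n < 2:
        return 0.0
    # Re-rank within the shared subset so both rankings use 0..n-1 without ties.
    rank_a = {p: i for i, p in enumerate(shared)}
    rank_b = {p: i for i, p in enumerate(q for q in ranked_b if q in rank_a)}
    d2 = sum((rank_a[p] - rank_b[p]) ** 2 for p in shared)
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Hypothetical top patterns from models fitted on split A and split B.
top_a = ["age<30", "income>50k", "tenure<2", "region=N"]
top_b = ["income>50k", "age<30", "tenure<2", "plan=basic"]

similarity = jaccard(top_a, top_b)
overlap_count = len(set(top_a) & set(top_b))
rho = spearman_on_shared(top_a, top_b)
```

Jaccard similarity captures whether the same patterns are selected at all, while the rank correlation captures whether the shared patterns keep their relative order.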
- class hugiml.explainability.HUGPatternExplainer(classifier)[source]
Bases: object
Enterprise explainability layer over a fitted HUGIMLClassifierNative.
Extracts feature lineage, computes explanation stability, and provides a SHAP-compatible interface where available. Designed to operate on the already-mined HUG patterns without re-running the algorithm.
- Parameters:
classifier (HUGIMLClassifierNative) – A fitted classifier instance.
- feature_lineage()[source]
Build feature lineage mapping each input feature to its patterns.
- Returns:
One entry per original input feature.
- Return type:
list of FeatureLineage
- explanation_stability(X_a, y_a, X_b, y_b, top_n=20)[source]
Measure explanation stability across two data splits.
Fits two copies of the classifier on split A and split B. The headline metrics compare only mined HUG patterns. Additional metrics are returned by feature type so original features, HUG patterns, and augmented-pair transforms are not mixed into a single stability score.
- Parameters:
X_a (split A data)
y_a (split A data)
X_b (split B data)
y_b (split B data)
top_n (int) – How many top patterns to compare.
- Return type:
- hugiml.explainability.shap_values_from_pattern_matrix(classifier, X, *, background_samples=100, check_additivity=False, allow_incomplete=False)[source]
Compute SHAP values over the HUG pattern feature space.
Applies SHAP's LinearExplainer (or KernelExplainer as fallback) on the binary pattern-presence matrix produced by the classifier's transform() method. The resulting SHAP values are in pattern-space; use aggregate_shap_to_features() to roll them back to original features.
When the fitted downstream estimator also uses original or augmented-pair features, pattern-space SHAP is incomplete relative to the fitted model. In that case this function warns and returns None unless allow_incomplete=True is passed explicitly.
Requires the optional shap package (pip install shap).
- Parameters:
classifier (HUGIMLClassifierNative) – A fitted classifier.
X (array-like) – Input data to explain.
background_samples (int) – Number of background samples for KernelExplainer.
check_additivity (bool) – Pass to SHAP’s explain call.
allow_incomplete (bool) – If False, return None when the fitted downstream estimator uses original or augmented-pair features in addition to HUG patterns.
- Returns:
SHAP values in pattern space. Returns None when shap is not installed, or when pattern-space SHAP would be incomplete and allow_incomplete is False.
- Return type:
np.ndarray of shape (n_samples, n_patterns) or None
Monitoring
Operational monitoring for HUGIMLClassifierNative.
Provides thread-safe prediction statistics tracking and multi-method distribution drift detection combining PSI, KL divergence, and label drift monitoring.
- class hugiml.monitoring.PredictionMonitor(window_size=1000)[source]
Bases: object
Thread-safe prediction statistics tracker.
Attach to a fitted classifier via clf.enable_monitoring(). Access statistics via clf.monitor.report() or clf.monitor.stats.
Tracks prediction count, confidence distribution, per-class frequency, and latency percentiles over a rolling window.
- Parameters:
window_size (int)
- property stats: dict
Current monitoring statistics as a plain dict.
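The idea of a thread-safe tracker over a rolling window can be sketched with a lock and a bounded deque. The class below is a hypothetical stand-in, not PredictionMonitor itself: its name, the record() method, and the keys of the returned dict are all assumptions made for illustration.

```python
import threading
from collections import Counter, deque


class RollingPredictionStats:
    """Hypothetical rolling-window tracker guarded by a lock."""

    def __init__(self, window_size=1000):
        self._lock = threading.Lock()
        # Each entry is (label, confidence, latency_ms); old entries fall off.
        self._window = deque(maxlen=window_size)
        self._count = 0

    def record(self, label, confidence, latency_ms):
        with self._lock:
            self._window.append((label, confidence, latency_ms))
            self._count += 1

    @property
    def stats(self):
        # Copy under the lock, then compute outside it to keep the lock short.
        with self._lock:
            window = list(self._window)
            total = self._count
        if not window:
            return {"n_predictions": total, "window": 0}
        latencies = sorted(lat for _, _, lat in window)
        p95 = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]
        return {
            "n_predictions": total,
            "window": len(window),
            "mean_confidence": sum(c for _, c, _ in window) / len(window),
            "class_counts": dict(Counter(lbl for lbl, _, _ in window)),
            "latency_p95_ms": p95,
        }


monitor = RollingPredictionStats(window_size=3)
for label, conf, lat in [(0, 0.9, 1.0), (1, 0.6, 2.0), (1, 0.8, 3.0), (0, 0.7, 4.0)]:
    monitor.record(label, conf, lat)
snapshot = monitor.stats
```

The total count keeps growing while the window-based statistics only reflect the most recent window_size predictions.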
- class hugiml.monitoring.DriftDetector(n_bins=10)[source]
Bases: object
Multi-method distribution drift detector.
Combines Population Stability Index (PSI) and symmetric KL divergence for robust drift assessment. Optionally tracks label drift when ground truth is available.
- PSI thresholds:
< 0.1: stable
0.1–0.25: moderate shift
> 0.25: significant drift
- Parameters:
n_bins (int) – Number of histogram bins for numerical features.
- fit_baseline(X, cat_mask, col_names=None, y=None)[source]
Store training distribution for later comparison.
- Parameters:
X (np.ndarray, shape (n, p))
cat_mask (np.ndarray of bool, shape (p,))
col_names (list of str, optional)
y (np.ndarray of int, optional) – Training labels for label-drift baseline.
- Return type:
None
- compute_psi(X_test)[source]
Compute PSI per numerical feature between training and test.
- Return type:
dict mapping column name to PSI value.
- Parameters:
X_test (ndarray)
- compute_kl(X_test)[source]
Compute symmetric KL divergence per feature.
- Return type:
dict mapping column name to KL value.
- Parameters:
X_test (ndarray)
- compute_label_drift(y_test)[source]
Compute per-class proportion shift between training and test labels.
Returns None when no training label baseline is available.
- Parameters:
y_test (ndarray)
- Return type:
dict[str, float] | None
- class hugiml.monitoring.DriftReport(psi, kl_divergence, label_drift, threshold)[source]
Bases: object
Structured result from a drift detection run.
- Parameters:
psi (dict)
kl_divergence (dict)
label_drift (dict | None)
threshold (float)
- psi
Population Stability Index per feature.
- Type:
dict[str, float]
- kl_divergence
Symmetric KL divergence per feature.
- Type:
dict[str, float]
- label_drift
Per-class label proportion shift (requires y_test).
- Type:
dict[str, float] or None
- overall_psi
Mean PSI across all numerical features.
- Type:
float
- overall_kl
Mean KL divergence across all numerical features.
- Type:
float
- drifted_features
Features exceeding the PSI threshold.
- Type:
list[str]
- severity
One of ‘none’, ‘moderate’, ‘significant’.
- Type:
str
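For readers unfamiliar with PSI, the following sketch computes it for one numerical feature and maps the value onto the thresholds listed for DriftDetector. The helper names, the epsilon clipping, the clamping of out-of-range values into the end bins, and the mapping of "stable" to the severity label 'none' are assumptions; how DriftReport aggregates per-feature values into a single severity is not shown here.

```python
import math


def histogram(values, edges):
    """Fraction of values per bin; out-of-range values are clamped to end bins."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = 0
        while idx < len(counts) - 1 and v >= edges[idx + 1]:
            idx += 1
        counts[idx] += 1
    return [c / len(values) for c in counts]


def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # avoid log(0) for empty bins
        total += (a - e) * math.log(a / e)
    return total


def severity(value):
    """Map a PSI value onto the documented thresholds (label names assumed)."""
    if value < 0.1:
        return "none"
    if value <= 0.25:
        return "moderate"
    return "significant"


baseline = histogram([0.5, 1.5, 2.5, 3.5], edges=[0, 1, 2, 3, 4])
shifted = [0.1, 0.2, 0.3, 0.4]
score = psi(baseline, shifted)
label = severity(score)
```

Written this way, PSI is the sum of the two directed KL divergences between the binned distributions, which is one reason the two measures tend to move together.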
Multiclass and imbalance
Helpers for three common HUG-IML deployment scenarios:
- Multiclass classification: HUGIMLClassifierNative supports multiclass natively via its base_estimator (LogisticRegression with solver='lbfgs' when n_classes > 2). This module provides a MulticlassHUGReport that extracts per-class pattern importances.
- Imbalanced data: make_imbalanced_pipeline wraps the classifier in a cost-sensitive or resampling pipeline.
- High-cardinality categoricals: encode_high_cardinality replaces columns that have many unique values with a target-mean, frequency, or ordinal encoding before the data is passed to prepareXy.
- class hugiml.multiclass.MulticlassHUGReport(clf)[source]
Bases: object
Per-class pattern importances for a multiclass HUG-IML model.
When the downstream estimator is LogisticRegression with > 2 classes, coef_ has shape (n_classes, n_patterns). This class exposes per-class top patterns.
- Parameters:
clf (fitted HUGIMLClassifierNative)
- hugiml.multiclass.make_imbalanced_pipeline(clf, strategy='class_weight', sampling_ratio=1.0, random_state=42)[source]
Wrap a HUGIMLClassifierNative for use with imbalanced data.
- Parameters:
clf (HUGIMLClassifierNative (unfitted))
strategy ({'class_weight', 'smote', 'random_oversample', 'random_undersample'}) –
class_weight: sets class_weight='balanced' on the downstream LR. Zero overhead; recommended first choice.
smote: SMOTE oversampling via imbalanced-learn.
random_oversample: random oversampling via imbalanced-learn.
random_undersample: random undersampling via imbalanced-learn.
sampling_ratio (float) – Target minority:majority ratio (only for imbalanced-learn strategies).
random_state (int)
- Returns:
Fitted wrapper or HUGIMLClassifierNative (for 'class_weight'). The returned object has fit(X, y), predict_proba(X), and predict(X) methods.
Notes
For 'class_weight': returns a copy of clf with base_estimator set to LogisticRegression(class_weight='balanced'). For SMOTE/resampling: returns an ImbalancedHUGPipeline that applies resampling to the pattern matrix (post-transform) inside fit(). This ensures the HUG patterns are mined on the original distribution (as intended) while the downstream classifier trains on the resampled binary matrix.
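The resampling step described in these notes operates on the binary pattern matrix rather than on the raw features. A minimal stdlib sketch of random oversampling on such a matrix follows; it stands in for the imbalanced-learn samplers, handles only two classes, and the function name and data are hypothetical.

```python
import random


def random_oversample(rows, labels, sampling_ratio=1.0, random_state=42):
    """Duplicate minority rows until minority/majority reaches sampling_ratio."""
    rng = random.Random(random_state)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    # Binary case only: the smaller class is the minority.
    minority, majority = sorted(by_class, key=lambda c: len(by_class[c]))
    target = int(sampling_ratio * len(by_class[majority]))
    extra = max(0, target - len(by_class[minority]))

    new_rows = list(rows) + [rng.choice(by_class[minority]) for _ in range(extra)]
    new_labels = list(labels) + [minority] * extra
    return new_rows, new_labels


# Hypothetical post-transform pattern matrix (patterns already mined).
pattern_matrix = [[1, 0], [1, 1], [0, 0], [1, 0], [0, 1], [1, 1]]
y = [0, 0, 0, 0, 1, 1]
X_res, y_res = random_oversample(pattern_matrix, y)
```

Because only the downstream training matrix is duplicated, the set of mined patterns (the columns) is identical before and after resampling.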
- hugiml.multiclass.encode_high_cardinality(X, y=None, threshold=20, method='target_mean', min_samples_leaf=5, smoothing=1.0, random_state=42)[source]
Replace high-cardinality categorical columns with numerical encodings.
This should be called before prepareXy; the returned mapping can be applied to test data via apply_encoding.
- Parameters:
X (pd.DataFrame)
y (array-like, optional) – Required when method='target_mean'.
threshold (int) – Columns with more than this many unique values are considered high-cardinality.
method ({'target_mean', 'frequency', 'ordinal'}) –
target_mean: replace each category with its mean target value (smoothed towards the global mean). Reduces categories to a single float; most informative for tree/rule-based models.
frequency: replace with the category's relative frequency.
ordinal: assign arbitrary integer codes (fast, no leakage, but loses any ordering meaning).
min_samples_leaf (int) – Minimum observations per category before smoothing kicks in (target_mean only).
smoothing (float) – Smoothing strength (target_mean only).
random_state (int) – Used internally for any random operations.
- Returns:
X_encoded (pd.DataFrame) – A copy; the original is unchanged.
encoding_map (dict) – Mapping {column_name: dict_or_array} to apply to unseen data via apply_encoding(X_test, encoding_map).
- Return type:
tuple[DataFrame, dict]
Notes
Data-leakage safety: call encode_high_cardinality on the training split only. Use apply_encoding on test/validation data with the map returned from training. Never fit the encoding on combined train+test data.
- hugiml.multiclass.apply_encoding(X, encoding_map, fill_value=0.0)[source]
Apply an encoding map (produced by encode_high_cardinality) to new data.
- Parameters:
X (pd.DataFrame)
encoding_map (dict) – The map returned by encode_high_cardinality.
fill_value (float) – Value for unseen categories.
- Return type:
pd.DataFrame (copy)
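The fit-on-train, apply-on-test discipline for target-mean encoding can be sketched without pandas. The smoothing formula below (a sigmoid weight driven by min_samples_leaf and smoothing) is a common choice and is an assumption here, as are the function names; the library's exact formula is not documented in this section.

```python
import math
from collections import defaultdict


def fit_target_mean(categories, targets, min_samples_leaf=5, smoothing=1.0):
    """Learn {category: smoothed target mean} from training data only."""
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1

    mapping = {}
    for c, n in counts.items():
        # Weight approaches 1 for well-populated categories, 0 for rare ones.
        weight = 1.0 / (1.0 + math.exp(-(n - min_samples_leaf) / smoothing))
        mapping[c] = weight * (sums[c] / n) + (1.0 - weight) * global_mean
    return mapping


def apply_target_mean(categories, mapping, fill_value=0.0):
    """Encode new data with a previously fitted map; unseen values get fill_value."""
    return [mapping.get(c, fill_value) for c in categories]


# Hypothetical training column: "a" is common, "b" is rare.
train_cats = ["a"] * 10 + ["b"] * 2
train_y = [1] * 8 + [0] * 2 + [1, 0]
encoding = fit_target_mean(train_cats, train_y)
encoded_test = apply_target_mean(["a", "zzz"], encoding)
```

The rare category "b" is pulled most of the way towards the global mean, while the well-populated category "a" stays close to its own mean.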
Pattern pruning
Regulated “remove / refit / calibrate” workflow for HUG-IML.
EBMs are valued partly because model terms can be inspected and sometimes edited (e.g. to remove an ethically problematic interaction term). This module gives HUG-IML an analogous controlled editing workflow that is rigorous enough for regulated-domain review cycles.
Workflow
1. Inspect patterns via clf.feature_importances() or clf.get_pattern_info().
2. Create a PatternEditor and call remove() with a list of pattern indices (or remove_by_keyword() for keyword filters).
3. Call refit(X_tr, y_tr) to re-train the downstream classifier on the pruned pattern matrix. The C++ mining results are unchanged.
4. Optionally call calibrate(X_cal, y_cal) to wrap the refitted model with Platt scaling / isotonic regression.
5. Call finalize() to get a new classifier instance with the edited pattern set baked in, and audit_report() for a JSON audit trail.
Example
from hugiml.pruning import PatternEditor
editor = PatternEditor(clf)
editor.remove([3, 7, 12], reason="pattern references protected attribute 'gender'")
editor.remove_by_keyword("income", reason="unstable feature (high PSI)")
new_clf = editor.refit(X_tr, y_tr).calibrate(X_cal, y_cal).finalize()
print(editor.audit_report())
new_clf.predict_proba(X_te)
- class hugiml.pruning.PatternEditor(clf, operator_name='analyst')[source]
Bases: object
Controlled pattern editing with full audit trail.
- Parameters:
clf (fitted HUGIMLClassifierNative) – The original model. This object is not mutated; all edits produce a fresh copy stored internally.
operator_name (str) – Human-readable identifier of the person/process making the edits (for the audit trail).
- remove(pattern_indices, reason='unspecified')[source]
Remove patterns by index (0-based, relative to the current working set).
- Parameters:
pattern_indices (list of int) – Indices into the current pattern list. Use list_patterns() to preview indices.
reason (str) – Audit reason (e.g. 'protected attribute', 'operationally invalid').
- Return type:
self (for method chaining)
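Because indices are relative to the current working set, they shift after every removal. The toy sketch below shows the effect on a plain list; it does not use PatternEditor, and the pattern labels, the remove helper, and the audit_log structure are hypothetical.

```python
patterns = ["age<30", "gender=F", "income>50k", "tenure<2", "region=N"]
audit_log = []


def remove(working, indices, reason):
    """Return a new list without the given 0-based indices and log the removal."""
    drop = set(indices)
    audit_log.append({"removed": [working[i] for i in sorted(drop)], "reason": reason})
    return [p for i, p in enumerate(working) if i not in drop]


# The first call removes index 1 of the original list ("gender=F").
step1 = remove(patterns, [1], reason="protected attribute")
# The second call also uses index 1, but it now refers to "income>50k".
step2 = remove(step1, [1], reason="unstable feature")
```

Re-checking indices with list_patterns() between successive remove() calls avoids removing the wrong pattern after an earlier edit has shifted the numbering.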
- remove_by_keyword(keyword, reason='keyword match', case_sensitive=False)[source]
Remove all patterns whose label contains keyword.
- Parameters:
keyword (str)
reason (str)
case_sensitive (bool)
- Return type:
self
- remove_low_support(min_support=0.01, reason='support below threshold')[source]
Remove patterns with training support below min_support.
- Parameters:
min_support (float) – Minimum fraction of training samples (0 to 1).
reason (str)
- Return type:
self
- refit(X_tr, y_tr, estimator=None)[source]
Refit the downstream classifier on the (pruned) pattern matrix.
The HUG mining results (patterns_) are unchanged; only the downstream Pipeline (model_) is replaced.
- Parameters:
X_tr (array-like or DataFrame) – Training data (should be the same split used to fit the original model).
y_tr (array-like)
estimator (sklearn estimator, optional) – If None, uses the original downstream estimator class with the same hyperparameters.
- Return type:
self
- calibrate(X_cal, y_cal, method='isotonic')[source]
Wrap the refitted downstream model with probability calibration.
Uses sklearn.calibration.CalibratedClassifierCV applied post-fit to a calibration set that should be held out from both training and test.
- Parameters:
X_cal (array-like or DataFrame)
y_cal (array-like)
method ({'sigmoid', 'isotonic'})
- Return type:
self
- finalize()[source]
Return the edited classifier as a new standalone instance.
After calling finalize(), further edits on this editor are blocked. The returned object is a fully independent copy.
- Return type:
HUGIMLClassifierNative (edited copy)
- list_patterns()[source]
Return editable HUG patterns in the current working model.
PatternEditor edits mined HUG patterns only. Original features and augmented-pair downstream features are visible through list_downstream_features() but are not directly removable by this editor.
- Return type:
DataFrame
- list_downstream_features()[source]
Return all downstream features with PatternEditor editability.
The returned table includes original features, HUG patterns, and augmented-pair transforms when present. Only rows with feature_type == 'pattern' are directly editable through remove() and related PatternEditor methods.
- Return type:
DataFrame
- diff()[source]
Return a summary of changes made relative to the original model.
- Returns:
dict with keys n_original, n_current, n_removed, removed_patterns.
- Return type:
dict
Serialization
Versioned serialization and SBOM generation for HUGIMLClassifierNative.
Format (v3 — default)
A ZIP archive containing JSON manifests and NumPy array bundles. No
pickle is required to round-trip the model, eliminating the gadget-chain
attack surface that exists in any pickle-based format.
Archive layout:
manifest.json – format_version, schema_version, timestamp
clf_init.json – __init__ hyperparameters
clf_fit.json – scalar / list fitted attributes
patterns.json – list of {utility, items, ig} dicts
arrays.npz – cat_cols_mask_, is_int_mask_, classes_
td_config.json – TransactionDataWrapper non-array state
td_arrays.npz – TransactionDataWrapper numpy arrays
estimator.json – downstream estimator class + parameters
estimator_arrays.npz – downstream estimator numpy arrays
hmac.sig – HMAC-SHA256 over all content files (hex)
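Because a v3 file is an ordinary ZIP archive, its JSON members can be inspected with the standard library. The sketch below builds a small stand-in archive in memory and reads its manifest back. The manifest values are invented for illustration, read_manifest is a hypothetical helper, and nothing here verifies hmac.sig or reflects how load_model() is implemented.

```python
import io
import json
import zipfile

# Build a stand-in archive in memory. Only the member names and the three
# manifest keys come from the layout above; the values are made up.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr(
        "manifest.json",
        json.dumps(
            {
                "format_version": 3,
                "schema_version": 3,
                "timestamp": "1970-01-01T00:00:00Z",
            }
        ),
    )
    zf.writestr("patterns.json", json.dumps([]))


def read_manifest(source):
    """Return (manifest dict, sorted member names) for a ZIP path or file object."""
    with zipfile.ZipFile(source) as zf:
        with zf.open("manifest.json") as fh:
            manifest = json.load(fh)
        return manifest, sorted(zf.namelist())


buffer.seek(0)
manifest, members = read_manifest(buffer)
print(manifest["format_version"], members)
```

Inspection of this kind is read-only and useful for debugging; loading a model for prediction should still go through load_model() so that authentication is applied.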
Authentication
Set HUGIML_MODEL_HMAC_KEY (hex-encoded, 32+ bytes) before saving or
loading. Files saved without a key have an all-zero hmac.sig and can
still be loaded unless HUGIML_REQUIRE_MODEL_HMAC=true is set.
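A key of the required shape can be generated with the standard library. The first half of the sketch relies only on what is stated above (hex encoding, at least 32 bytes). The second half shows HMAC-SHA256 in general: the sign helper and the payload are hypothetical, and the way hugiml orders and concatenates archive members before signing is not documented here.

```python
import hashlib
import hmac
import os
import secrets

# 32 random bytes, hex-encoded (64 characters).
key_hex = secrets.token_hex(32)
os.environ["HUGIML_MODEL_HMAC_KEY"] = key_hex


def sign(key_hex, payload):
    """Hex HMAC-SHA256 of payload under a hex-encoded key (hypothetical helper)."""
    return hmac.new(bytes.fromhex(key_hex), payload, hashlib.sha256).hexdigest()


payload = b'{"format_version": 3}'
signature = sign(key_hex, payload)

# Verification recomputes the tag and compares in constant time.
is_valid = hmac.compare_digest(signature, sign(key_hex, payload))
is_tampered_valid = hmac.compare_digest(signature, sign(key_hex, payload + b" "))
```

In a deployment the key would normally come from a secret store rather than being generated at run time, since the same key must be available when the model is loaded.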
Backward compatibility (v1/v2)
Models saved with schema version 1 or 2 (the legacy HMAC-pickle format) are still loadable via a restricted Unpickler that permits only known HUG-IML and sklearn modules. v1/v2 writing is not supported.
- hugiml.serialization.save_model(clf, path)[source]
Persist a fitted classifier to a v3 ZIP/JSON/NumPy model file.
- Parameters:
clf (HUGIMLClassifierNative) – A fitted classifier.
path (str or Path)
- Raises:
HUGIMLSerializationError – When the model is unfitted, a component cannot be serialized, or the write fails.
- Return type:
None
- hugiml.serialization.load_model(path, expected_type=None)[source]
Load a classifier from a file saved by save_model().
Supports:
* v3: ZIP/JSON/NumPy format (default since 2.1)
* v1/v2: legacy HMAC-pickle format (read-only; still authenticated)
- Parameters:
path (str or Path)
expected_type (type, optional)
- Return type:
- Raises:
HUGIMLVersionError – When schema version is incompatible.
HUGIMLSerializationError – When the file is corrupt, missing, has an invalid HMAC, or contains an unexpected type.
Telemetry
OpenTelemetry and Prometheus instrumentation for HUGIMLClassifierNative.
Both integrations are strictly optional: if the respective packages are not installed the module degrades gracefully to no-op stubs. Import and use of this module never breaks the classifier itself.
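The no-op degradation described above is commonly built on a guarded import. The following is a minimal sketch of that pattern, not the module's actual code: _NoOpCounter, PREDICTIONS, record_prediction and the metric name are hypothetical names chosen for the example.

```python
try:
    from prometheus_client import Counter
except ImportError:

    class _NoOpCounter:
        """Stand-in that accepts the same calls and does nothing."""

        def __init__(self, *args, **kwargs):
            pass

        def labels(self, *args, **kwargs):
            return self

        def inc(self, amount=1):
            pass

    Counter = _NoOpCounter

# registry=None keeps the demo metric out of the global default registry.
PREDICTIONS = Counter(
    "hugiml_demo_predictions",
    "Demo prediction counter",
    ["model_id"],
    registry=None,
)


def record_prediction(model_id):
    """Count one prediction; silently does nothing without prometheus_client."""
    PREDICTIONS.labels(model_id=model_id).inc()


record_prediction("default")
```

Callers never branch on whether the optional package is present, which is what keeps the classifier usable when telemetry dependencies are missing.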
OpenTelemetry
Wraps fit(), predict_proba(), and predict() with OTEL spans and attributes.
Set HUGIML_OTEL_ENABLED=1 to activate.
Prometheus
Exposes prediction count, latency histogram, and confidence gauge.
Set HUGIML_PROMETHEUS_ENABLED=1 to activate.
Debug logging
All non-fatal telemetry and metrics failures are logged at DEBUG level
(logger = logging.getLogger("hugiml.telemetry")) with exc_info=True
so that stack traces are available when the root logger is configured at
DEBUG without any user-visible noise at INFO or above.
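As an alternative to lowering the root logger, DEBUG output can be enabled for this one logger. The sketch below uses only the standard logging module and the logger name given above. The simulated failure stands in for whatever the library would log, and the in-memory stream is used only so the result can be inspected.

```python
import io
import logging

stream = io.StringIO()  # use sys.stderr or a file handler in real use
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))

telemetry_log = logging.getLogger("hugiml.telemetry")
telemetry_log.setLevel(logging.DEBUG)
telemetry_log.addHandler(handler)
telemetry_log.propagate = False  # avoid duplicate lines via the root logger

# Simulate a non-fatal telemetry failure logged with exc_info=True.
try:
    raise RuntimeError("exporter unavailable")
except RuntimeError:
    telemetry_log.debug("metrics update failed", exc_info=True)

output = stream.getvalue()
print(output)
```

Scoping the level to hugiml.telemetry keeps stack traces available during troubleshooting without turning on DEBUG output for every other library in the process.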
- class hugiml.telemetry.HUGIMLTracer[source]
Bases: object
OpenTelemetry tracer wrapper for HUGIMLClassifierNative.
Emits spans for fit, predict_proba, and predict with attributes including n_samples, n_patterns, and latency.
When opentelemetry-api is not installed, all operations are no-ops.
- class hugiml.telemetry.HUGIMLMetrics[source]
Bases: object
Prometheus metrics for HUGIMLClassifierNative.
- Exposes:
hugiml_predictions_total counter
hugiml_prediction_latency_seconds histogram
hugiml_confidence_mean gauge
hugiml_drift_psi gauge (per-feature)
When prometheus_client is not installed, all metrics are no-ops.
- hugiml.telemetry.instrument_classifier(classifier, model_id='default')[source]
Wrap a fitted classifier with telemetry instrumentation.
Patches predict_proba and predict methods in-place to emit OTEL spans and Prometheus metrics. The classifier itself is modified and returned.
- Parameters:
classifier (HUGIMLClassifierNative)
model_id (str)
- Return type:
The same classifier instance with patched methods.
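In-place patching of this kind can be illustrated with a small stand-alone sketch. It is not the library's implementation: patch_with_timing, DummyClassifier and the callback are hypothetical, and a plain list stands in for spans and metrics.

```python
import functools
import time


def patch_with_timing(obj, method_name, on_call):
    """Replace obj.<method_name> with a wrapper that reports call latency."""
    original = getattr(obj, method_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return original(*args, **kwargs)
        finally:
            on_call(method_name, time.perf_counter() - start)

    # The instance attribute shadows the class method on this object only.
    setattr(obj, method_name, wrapper)
    return obj


class DummyClassifier:
    def predict(self, X):
        return [0 for _ in X]


calls = []
clf = DummyClassifier()
same = patch_with_timing(clf, "predict", lambda name, secs: calls.append((name, secs)))
result = clf.predict([[1.0], [2.0]])
```

Because the object is modified rather than copied, every existing reference to the classifier sees the instrumented methods, while other instances of the same class are unaffected.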
Exceptions
Structured exception and warning hierarchy for HUG-IML.
Taxonomy:
HUGIMLError (base)
├── HUGIMLFitError — any failure during fit()
│ ├── HUGIMLMiningError — pattern mining specifically
│ ├── HUGIMLTimeoutError — max_fit_seconds exceeded
│ └── HUGIMLMemoryError — native/Python memory budget exceeded
├── HUGIMLValidationError — input data / param validation
│ ├── HUGIMLSchemaError — column mismatch at predict time
│ └── HUGIMLParamError — bad hyperparameter values / types
├── HUGIMLSerializationError — load/save failures
│ └── HUGIMLVersionError — schema version incompatibility
└── HUGIMLPredictionError — failures during predict/transform
HUGIMLWarning (base, UserWarning subclass)
├── HUGIMLConvergenceWarning — model converged to minimal patterns
├── HUGIMLDtypeDriftWarning — categorical column dtype changed
├── HUGIMLRangeWarning — feature values outside training range
├── HUGIMLDegradedWarning — model degraded due to timeout/memory
└── HUGIMLDeprecationWarning — deprecated API usage
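The hierarchy has two practical consequences: subclasses must be caught before their parents, and handlers written against ValueError keep working. The sketch below demonstrates both with local stand-in classes that mirror the documented names and bases; real code would import them from hugiml.exceptions, and the classify helper is hypothetical.

```python
# Local stand-ins mirroring the documented bases (not the real classes).
class HUGIMLError(Exception):
    pass


class HUGIMLValidationError(HUGIMLError, ValueError):
    pass


class HUGIMLParamError(HUGIMLValidationError, TypeError):
    pass


class HUGIMLSerializationError(HUGIMLError):
    pass


class HUGIMLVersionError(HUGIMLSerializationError):
    pass


def classify(exc):
    """Raise exc and report which handler catches it."""
    try:
        raise exc
    except HUGIMLVersionError:        # most specific first
        return "version"
    except HUGIMLSerializationError:  # parent of HUGIMLVersionError
        return "serialization"
    except ValueError:                # legacy handler still matches
        return "value-error"
    except HUGIMLError:               # catch-all for the library
        return "hugiml"


print(classify(HUGIMLVersionError("schema 9")))
print(classify(HUGIMLParamError("K must be positive")))
```

Swapping the first two handlers would route version errors to the generic serialization branch, which is why the more specific class is listed first.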
- exception hugiml.exceptions.HUGIMLError[source]
Bases: Exception
Base exception for all HUG-IML errors.
- exception hugiml.exceptions.HUGIMLFitError[source]
Bases: HUGIMLError
Raised when fit() fails for any reason.
- exception hugiml.exceptions.HUGIMLMiningError[source]
Bases: HUGIMLFitError
Raised when pattern mining fails or produces zero patterns.
- exception hugiml.exceptions.HUGIMLTimeoutError[source]
Bases: HUGIMLFitError
Raised when fit exceeds max_fit_seconds.
- exception hugiml.exceptions.HUGIMLMemoryError[source]
Bases: HUGIMLFitError, MemoryError
Raised when fit cannot safely allocate required memory.
- exception hugiml.exceptions.HUGIMLValidationError[source]
Bases: HUGIMLError, ValueError
Raised when input data or configuration is invalid.
Inherits from ValueError for backward compatibility with existing except-ValueError handlers.
- exception hugiml.exceptions.HUGIMLSchemaError[source]
Bases: HUGIMLValidationError
Raised when predict-time data does not match training schema (wrong columns, wrong order, wrong count).
- exception hugiml.exceptions.HUGIMLParamError[source]
Bases: HUGIMLValidationError, TypeError
Raised when hyperparameters have wrong types or values.
Inherits from both TypeError and ValueError for backward compatibility.
- exception hugiml.exceptions.HUGIMLSerializationError[source]
Bases: HUGIMLError
Raised when model save/load fails.
- exception hugiml.exceptions.HUGIMLVersionError[source]
Bases: HUGIMLSerializationError
Raised when loading a model whose schema version is incompatible.
- exception hugiml.exceptions.HUGIMLPredictionError[source]
Bases: HUGIMLError, RuntimeError
Raised when predict/transform fails on a fitted model.
Inherits from RuntimeError for backward compatibility.
- exception hugiml.exceptions.HUGIMLWarning[source]
Bases: UserWarning
Base warning for all HUG-IML warnings.
- exception hugiml.exceptions.HUGIMLConvergenceWarning[source]
Bases: HUGIMLWarning
Issued when the model converges to a minimal number of patterns (e.g. due to very restrictive G or low-information data).
- exception hugiml.exceptions.HUGIMLDtypeDriftWarning[source]
Bases: HUGIMLWarning
Issued when a categorical column is passed as numeric at predict time.
- exception hugiml.exceptions.HUGIMLRangeWarning[source]
Bases: HUGIMLWarning
Issued when feature values fall far outside the training range.
- exception hugiml.exceptions.HUGIMLDegradedWarning[source]
Bases: HUGIMLWarning
Issued when the model entered degraded mode due to timeout or memory pressure during fit().
- exception hugiml.exceptions.HUGIMLDeprecationWarning[source]
Bases: HUGIMLWarning, DeprecationWarning
Issued for deprecated API usage.