mercurial.utils.validation module

Validation utilities: cross‑validation, regularization, early stopping.

class mercurial.utils.validation.CrossValidator(n_splits: int = 5, shuffle: bool = True, random_seed: int = 42)

Bases: object

k‑fold cross‑validation for Atlas case simulations.

Splits cases into training and validation folds, runs simulations, and returns performance metrics per fold.

Methods

validate(case_functions, parameter_sets[, ...])

Perform cross‑validation.

validate(case_functions: Dict[str, Callable], parameter_sets: List[Dict[str, Any]], metric: Callable[[Dict], float] = <function CrossValidator.<lambda>>) → Dict[str, Any]

Perform cross‑validation.

Parameters:

case_functions (dict): Mapping case_name -> function that runs the simulation and returns a results dict.
parameter_sets (list of dict): Parameter configurations to evaluate (e.g., different λ values).
metric (callable): Function that computes an accuracy or score from the simulation results.

Returns:

results (dict): Contains fold metrics, mean scores, and standard deviations per parameter set.
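The fold structure described above can be sketched as follows. This is an illustration only, not the CrossValidator implementation: `kfold_indices` and `cross_validate_sketch` are hypothetical helpers, and the sketch assumes that folds are formed over case names, that each parameter set is passed to the case function as keyword arguments, and that the reported spread is the population standard deviation. None of these details is stated in this reference.

```python
import random
import statistics
from typing import Any, Callable, Dict, List


def kfold_indices(n_items: int, n_splits: int = 5, shuffle: bool = True,
                  random_seed: int = 42) -> List[List[int]]:
    """Split range(n_items) into n_splits validation folds of near-equal size."""
    indices = list(range(n_items))
    if shuffle:
        random.Random(random_seed).shuffle(indices)
    # Fold i takes every n_splits-th index starting at position i.
    return [indices[i::n_splits] for i in range(n_splits)]


def cross_validate_sketch(
    case_functions: Dict[str, Callable[..., Dict[str, Any]]],
    parameter_sets: List[Dict[str, Any]],
    metric: Callable[[Dict[str, Any]], float],
    n_splits: int = 5,
) -> List[Dict[str, Any]]:
    """Score every parameter set on each validation fold (hypothetical helper)."""
    names = sorted(case_functions)
    folds = kfold_indices(len(names), n_splits)
    summary = []
    for params in parameter_sets:
        fold_scores = []
        for fold in folds:
            # Assumption: each case function accepts the parameters as keywords.
            scores = [metric(case_functions[names[i]](**params)) for i in fold]
            fold_scores.append(statistics.mean(scores))
        summary.append({
            "params": params,
            "fold_scores": fold_scores,
            "mean": statistics.mean(fold_scores),
            "std": statistics.pstdev(fold_scores),
        })
    return summary


# Toy cases: each "simulation" reports an accuracy that decreases with lam.
cases = {f"case_{k}": (lambda lam, k=k: {"accuracy": 1.0 - lam * (k + 1) / 10})
         for k in range(10)}
results = cross_validate_sketch(cases, [{"lam": 0.1}, {"lam": 0.5}],
                                metric=lambda r: r["accuracy"], n_splits=5)
```

The sketch requires at least n_splits cases, since an empty fold has no mean score.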

class mercurial.utils.validation.EarlyStopping(patience: int = 5, min_delta: float = 0.0001)

Bases: object

Early stopping to prevent overfitting during iterative calibration. Stops when validation loss stops improving.

Methods

step(current_loss, current_params)

Update state.

reset

reset()

step(current_loss: float, current_params: Any) → bool

Update state. Returns True if training should continue, False if it should stop.
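A patience-based stopping rule consistent with the description above might look like the sketch below. `EarlyStoppingSketch` is a hypothetical stand-in, not the library class. It assumes that an improvement counts only when the loss drops by more than min_delta, that the best parameters are retained, and that training stops once patience consecutive steps pass without improvement.

```python
from typing import Any, Optional


class EarlyStoppingSketch:
    """Patience-based early stopping (illustrative, not the library class)."""

    def __init__(self, patience: int = 5, min_delta: float = 1e-4) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.reset()

    def reset(self) -> None:
        """Forget the best loss, the best parameters, and the stall counter."""
        self.best_loss = float("inf")
        self.best_params: Optional[Any] = None
        self.stalled_steps = 0

    def step(self, current_loss: float, current_params: Any) -> bool:
        """Return True to continue training, False to stop."""
        if current_loss < self.best_loss - self.min_delta:
            # Improvement larger than min_delta: record it and reset the counter.
            self.best_loss = current_loss
            self.best_params = current_params
            self.stalled_steps = 0
        else:
            self.stalled_steps += 1
        return self.stalled_steps < self.patience


stopper = EarlyStoppingSketch(patience=2, min_delta=0.01)
losses = [1.0, 0.8, 0.799, 0.798, 0.5]
steps_run = 0
for epoch, loss in enumerate(losses):
    steps_run += 1
    if not stopper.step(loss, {"epoch": epoch}):
        break
```

In this run the third and fourth losses improve on 0.8 by less than min_delta, so the loop stops before the final value of 0.5 is seen. This is the usual trade-off of a small patience value.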

class mercurial.utils.validation.Regularization(lambda_reg: float = 0.01)

Bases: object

L2 regularization (ridge) for free energy or loss function. Adds λ * ||θ||₂² to the objective.

Methods

gradient(parameters)

Gradient of penalty: 2λ * θ.

penalty(parameters)

Compute L2 penalty: λ * Σ θ_i².

gradient(parameters: ndarray) → ndarray

Gradient of penalty: 2λ * θ.

penalty(parameters: ndarray) → float

Compute L2 penalty: λ * Σ θ_i².
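The two formulas can be checked against each other numerically. The sketch below uses plain Python lists in place of ndarray arguments so that it has no dependencies. The function names are hypothetical and are not part of the Regularization class.

```python
from typing import List, Sequence


def l2_penalty(parameters: Sequence[float], lambda_reg: float = 0.01) -> float:
    """L2 penalty: lambda_reg * sum(theta_i ** 2)."""
    return lambda_reg * sum(theta * theta for theta in parameters)


def l2_gradient(parameters: Sequence[float], lambda_reg: float = 0.01) -> List[float]:
    """Gradient of the penalty: 2 * lambda_reg * theta_i for each component."""
    return [2.0 * lambda_reg * theta for theta in parameters]


def finite_difference_gradient(parameters: Sequence[float], lambda_reg: float = 0.01,
                               eps: float = 1e-6) -> List[float]:
    """Central-difference estimate of the gradient, used only as a sanity check."""
    estimate = []
    for i in range(len(parameters)):
        plus = list(parameters)
        minus = list(parameters)
        plus[i] += eps
        minus[i] -= eps
        estimate.append(
            (l2_penalty(plus, lambda_reg) - l2_penalty(minus, lambda_reg)) / (2.0 * eps)
        )
    return estimate


theta = [1.0, -2.0, 3.0]
penalty = l2_penalty(theta, lambda_reg=0.1)   # 0.1 * (1 + 4 + 9)
analytic = l2_gradient(theta, lambda_reg=0.1)
numeric = finite_difference_gradient(theta, lambda_reg=0.1)
```

The analytic and finite-difference gradients agree to within rounding error, as expected for a quadratic penalty.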

mercurial.utils.validation.bootstrap_confidence_intervals(scores: List[float], n_resamples: int = 1000, ci: float = 0.95) → Tuple[float, float, float]

Compute a bootstrap confidence interval for the mean of scores. Returns (mean, lower, upper).
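One common way to obtain such an interval is the percentile bootstrap, sketched below. Whether the library uses the percentile method or another variant is not stated here. `bootstrap_ci_sketch` and its seed argument are assumptions made for reproducibility.

```python
import random
import statistics
from typing import List, Tuple


def bootstrap_ci_sketch(scores: List[float], n_resamples: int = 1000,
                        ci: float = 0.95, seed: int = 0) -> Tuple[float, float, float]:
    """Percentile bootstrap CI for the mean; returns (mean, lower, upper)."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    resampled_means = sorted(
        statistics.mean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - ci) / 2.0
    lower = resampled_means[int(alpha * n_resamples)]
    upper = resampled_means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return statistics.mean(scores), lower, upper


fold_scores = [0.81, 0.79, 0.84, 0.80, 0.83]
mean, lower, upper = bootstrap_ci_sketch(fold_scores)
```

With only a handful of fold scores the interval is coarse, because every resampled mean is an average of the same five values.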

mercurial.utils.validation.generate_permuted_scores(model_func, X: ndarray, y: ndarray, n_permutations: int = 100, cv_folds: int = 5) → List[List[float]]

Generate permuted scores by shuffling the target variable y.

Parameters:

model_func (callable): Function that takes (X_train, y_train, X_test) and returns predictions.
X (np.ndarray): Feature matrix.
y (np.ndarray): Target values (to be shuffled).
n_permutations (int): Number of permutations.
cv_folds (int): Number of cross‑validation folds.

Returns:

list of list: For each permutation, a list of fold scores.
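The procedure can be sketched in pure Python as follows. The scoring rule (negative mean absolute error), the interleaved fold assignment, and the names `permuted_scores_sketch` and `mean_model` are assumptions for illustration. The reference above does not specify how the library scores each fold.

```python
import random
import statistics
from typing import Callable, List, Sequence


def permuted_scores_sketch(
    model_func: Callable[[list, list, list], list],
    X: Sequence, y: Sequence[float],
    n_permutations: int = 100, cv_folds: int = 5, seed: int = 0,
) -> List[List[float]]:
    """For each shuffle of y, return one score per cross-validation fold."""
    rng = random.Random(seed)
    n = len(y)
    folds = [list(range(i, n, cv_folds)) for i in range(cv_folds)]
    all_scores = []
    for _ in range(n_permutations):
        y_perm = list(y)
        rng.shuffle(y_perm)  # break the link between X and y
        fold_scores = []
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in range(n) if i not in held_out]
            preds = model_func([X[i] for i in train_idx],
                               [y_perm[i] for i in train_idx],
                               [X[i] for i in test_idx])
            # Assumption: score is negative mean absolute error (higher is better).
            mae = statistics.mean(abs(p - y_perm[i]) for p, i in zip(preds, test_idx))
            fold_scores.append(-mae)
        all_scores.append(fold_scores)
    return all_scores


def mean_model(X_train, y_train, X_test):
    """Trivial model: predict the training mean for every test row."""
    mu = statistics.mean(y_train)
    return [mu] * len(X_test)


X = [[float(i)] for i in range(20)]
y = [2.0 * i for i in range(20)]
scores = permuted_scores_sketch(mean_model, X, y, n_permutations=10, cv_folds=5)
```

The nested list has the shape expected by the shuffled_scores argument of permutation_test: one inner list of fold scores per permutation.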

mercurial.utils.validation.learning_curve_analysis(train_scores: List[float], val_scores: List[float], train_sizes: List[int], baseline_error: float | None = None) → Dict

Analyze learning curves for underfitting signs using relative metrics.

Parameters:

train_scores (list): Training scores (e.g., MAE) at different training set sizes.
val_scores (list): Validation scores.
train_sizes (list): Number of training samples used.
baseline_error (float, optional): Error of a trivial baseline (e.g., predicting the mean). If provided, it is used to judge underfitting.

Returns:

dict: Contains:

- 'converged': bool (whether the validation score has plateaued)
- 'gap': float (final train - val gap)
- 'underfitting_suspected': bool (whether the model is not clearly better than the baseline)
- 'final_val_score': float
- 'improvement_over_baseline': float (if a baseline is provided)

mercurial.utils.validation.permutation_test(actual_scores: List[float], shuffled_scores: List[List[float]], n_permutations: int = 1000) → Dict[str, float]

Perform a permutation test to assess statistical significance.

Parameters:

actual_scores (list of float): Model accuracy scores on the original data (e.g., cross‑validation folds).
shuffled_scores (list of list of float): For each permutation, a list of scores (same length as actual_scores) obtained by shuffling the relationship between inputs and outputs.
n_permutations (int): Number of permutations performed.

Returns:

dict: Contains 'p_value' (two‑tailed), 'mean_shuffled', 'std_shuffled', 'original_mean', and 'is_significant' (True if p < 0.05).
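A two-tailed permutation p-value over fold means could be computed as in the sketch below. The exact statistic used by the library is not documented here. This version assumes that each score list is summarized by its mean, that extremeness is measured as distance from the mean of the shuffled results, and that the p-value uses the add-one correction (extreme + 1) / (n + 1). `permutation_test_sketch` is a hypothetical name.

```python
import statistics
from typing import Any, Dict, List


def permutation_test_sketch(actual_scores: List[float],
                            shuffled_scores: List[List[float]],
                            alpha: float = 0.05) -> Dict[str, Any]:
    """Two-tailed permutation test on the mean score (illustrative only)."""
    original_mean = statistics.mean(actual_scores)
    null_means = [statistics.mean(scores) for scores in shuffled_scores]
    mean_shuffled = statistics.mean(null_means)
    std_shuffled = statistics.pstdev(null_means)
    # Count permutations at least as far from the null center as the real result.
    observed = abs(original_mean - mean_shuffled)
    extreme = sum(1 for m in null_means if abs(m - mean_shuffled) >= observed)
    p_value = (extreme + 1) / (len(null_means) + 1)
    return {
        "p_value": p_value,
        "mean_shuffled": mean_shuffled,
        "std_shuffled": std_shuffled,
        "original_mean": original_mean,
        "is_significant": p_value < alpha,
    }


actual = [0.90, 0.88, 0.92]
# 99 synthetic permutations whose scores hover around chance level (0.5).
shuffled = [[0.5 + 0.01 * (k % 5 - 2)] * 3 for k in range(99)]
outcome = permutation_test_sketch(actual, shuffled)
```

Under the add-one correction the smallest attainable p-value is 1 / (n + 1), so at least 20 permutations are needed before a result can fall below 0.05.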

mercurial.utils.validation.underfitting_detection(train_errors: List[float], val_errors: List[float], baseline_error: float, threshold_ratio: float = 0.8) → Dict

Detect underfitting by comparing model performance to baseline.

Parameters:

train_errors (list): Training errors (e.g., MAE) for each fold or epoch.
val_errors (list): Validation errors (same length).
baseline_error (float): Error of a trivial baseline (e.g., predicting the mean).
threshold_ratio (float): If the model's validation error exceeds baseline_error * threshold_ratio, the model is considered underfitting (the baseline is better or comparable).

Returns:

dict: Contains:

- 'is_underfitting': bool
- 'reason': str
- 'model_val_error': float
- 'baseline_error': float
- 'improvement_ratio': float (baseline / model error; >1 means the model is better)
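The threshold rule above can be sketched as follows. The sketch assumes that the per-fold or per-epoch errors are summarized by their mean, which this reference does not specify. `underfitting_sketch` is a hypothetical name, and the wording of the 'reason' string is invented for illustration.

```python
import statistics
from typing import Any, Dict, List


def underfitting_sketch(train_errors: List[float], val_errors: List[float],
                        baseline_error: float,
                        threshold_ratio: float = 0.8) -> Dict[str, Any]:
    """Flag underfitting when validation error is not clearly below the baseline."""
    # Assumption: fold or epoch errors are summarized by their mean.
    model_train_error = statistics.mean(train_errors)
    model_val_error = statistics.mean(val_errors)
    limit = baseline_error * threshold_ratio
    is_underfitting = model_val_error > limit
    if is_underfitting:
        reason = (f"validation error {model_val_error:.3f} exceeds "
                  f"{threshold_ratio} x baseline ({limit:.3f}); "
                  f"training error is {model_train_error:.3f}")
    else:
        reason = f"validation error {model_val_error:.3f} is below {limit:.3f}"
    return {
        "is_underfitting": is_underfitting,
        "reason": reason,
        "model_val_error": model_val_error,
        "baseline_error": baseline_error,
        "improvement_ratio": baseline_error / model_val_error,
    }


weak = underfitting_sketch([0.88, 0.90, 0.86], [0.90, 0.95, 0.85], baseline_error=1.0)
good = underfitting_sketch([0.35, 0.40, 0.38], [0.40, 0.50, 0.45], baseline_error=1.0)
```

With the default threshold_ratio of 0.8, a model must beat the baseline error by more than 20% to avoid being flagged.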