pysubgroup package ¶

statistic_types = ('size_sg', 'size_dataset', 'positives_sg', 'positives_dataset', 'size_complement', 'relative_size_sg', 'relative_size_complement', 'coverage_sg', 'coverage_complement', 'target_share_sg', 'target_share_complement', 'target_share_dataset', 'lift')¶

class pysubgroup.binary_target.ChiSquaredQF(direction='both', min_instances=5, stat='chi2')[source]¶

Bases: SimplePositivesQF

ChiSquaredQF tests for statistical independence of a subgroup against its complement.

Calculates the chi-squared statistic or p-value to measure the significance of the difference between the subgroup and its complement.

static chi_squared_qf(instances_dataset, positives_dataset, instances_subgroup, positives_subgroup, min_instances=5, bidirect=True, direction_positive=True, index=0)[source]¶

Perform chi-squared test of statistical independence.

Tests whether a subgroup is statistically independent from its complement (see scipy.stats.chi2_contingency).

Parameters:

instances_dataset (int) – Total number of instances in the dataset.
positives_dataset (int) – Total number of positive instances in the dataset.
instances_subgroup (int) – Number of instances in the subgroup.
positives_subgroup (int) – Number of positive instances in the subgroup.
min_instances (int, optional) – Minimum number of instances required; returns -inf if there are fewer.
bidirect (bool, optional) – If True, both directions are considered interesting.
direction_positive (bool, optional) – If bidirect is False, specifies the direction.
index (int, optional) – Whether to return statistic (0) or p-value (1).

Returns:

Chi-squared statistic or p-value, depending on the index parameter.

Return type:

float
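To make the test concrete, here is a minimal sketch of a Pearson chi-squared test on the 2x2 table of subgroup vs. complement and positives vs. negatives, using only the standard library. It is not the library's implementation, which refers to scipy.stats.chi2_contingency. The name `chi_squared_sketch`, the absence of a continuity correction, the treatment of empty margins, and the omission of the direction parameters are all assumptions made for illustration.

```python
import math


def chi_squared_sketch(instances_dataset, positives_dataset,
                       instances_subgroup, positives_subgroup,
                       min_instances=5, index=0):
    """Return the chi-squared statistic (index=0) or the p-value (index=1)."""
    if instances_subgroup < min_instances:
        return float("-inf")
    # 2x2 table: rows are positives/negatives, columns are subgroup/complement.
    a = positives_subgroup
    b = positives_dataset - positives_subgroup
    c = instances_subgroup - positives_subgroup
    d = (instances_dataset - instances_subgroup) - b
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        # Assumption: an empty margin is treated as "no evidence of dependence".
        statistic = 0.0
    else:
        statistic = instances_dataset * (a * d - b * c) ** 2 / denominator
    # With one degree of freedom the survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return (statistic, p_value)[index]


stat = chi_squared_sketch(100, 50, 20, 15)
p_value = chi_squared_sketch(100, 50, 20, 15, index=1)
too_small = chi_squared_sketch(100, 50, 3, 3)
```

For the example counts (20 of 100 instances in the subgroup, 15 of them positive, 50 positives overall) the statistic is 6.25, which corresponds to a p-value of roughly 0.012.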

static chi_squared_qf_weighted(subgroup, data, weighting_attribute, effective_sample_size=0, min_instances=5)[source]¶

Perform chi-squared test for weighted data.

Parameters:

subgroup – The subgroup for which to calculate the statistic.
data (pandas DataFrame) – The dataset.
weighting_attribute (str) – The attribute used for weighting.
effective_sample_size (int, optional) – Effective sample size.
min_instances (int, optional) – Minimum required instances.

Returns:

The p-value from the chi-squared test.

Return type:

float

evaluate(subgroup, target, data, statistics=None)[source]¶

Evaluate the quality of the subgroup using the chi-squared test.

Parameters:

subgroup – The subgroup to evaluate.
target (BinaryTarget) – The target definition.
data (pandas DataFrame) – The dataset.
statistics (any, optional) – Unused in this implementation.

Returns:

The chi-squared statistic or p-value.

Return type:

float

class pysubgroup.binary_target.GeneralizationAware_StandardQF(a, optimistic_estimate_strategy='default')[source]¶

Bases: GeneralizationAwareQF_stats, BoundedInterestingnessMeasure

Generalization-Aware Standard Quality Function.

Extends the StandardQF to consider generalizations during subgroup discovery, providing methods for optimistic estimates and aggregate statistics.

difference_based_agg_function(stats_subgroup, list_of_pairs)[source]¶

Aggregate statistics using the difference-based strategy.

Parameters:

stats_subgroup – Statistics of the current subgroup.
list_of_pairs – List of (stats, agg_tuple) for all generalizations.

Returns:

Aggregate statistics tuple.

Return type:

namedtuple

difference_based_optimistic_estimate(subgroup, target, data, statistics)[source]¶

Compute the optimistic estimate using the difference-based strategy.

Parameters:

subgroup – The subgroup for which to compute the estimate.
target (BinaryTarget) – The target definition.
data (pandas DataFrame) – The dataset.
statistics (any) – Current statistics.

Returns:

The optimistic estimate of the quality value.

Return type:

float

difference_based_read_p(agg_tuple)[source]¶

Read p, the share of positive instances, from the aggregate tuple using the difference-based strategy.

Parameters:: agg_tuple – The aggregate statistics tuple.
Returns:: The maximum percentage of positives.
Return type:: float

evaluate(subgroup, target, data, statistics=None)[source]¶

Evaluate the quality of the subgroup considering generalizations.

Parameters:

subgroup – The subgroup to evaluate.
target (BinaryTarget) – The target definition.
data (pandas DataFrame) – The dataset.
statistics (any, optional) – Unused in this implementation.

Returns:

The computed quality value.

Return type:

float

class ga_sQF_agg_tuple(max_p, min_delta_negatives, min_negatives)¶

Bases: tuple

max_p¶: Alias for field number 0

min_delta_negatives¶: Alias for field number 1

min_negatives¶: Alias for field number 2

max_based_aggregate_statistics(stats_subgroup, list_of_pairs)[source]¶

Aggregate statistics using the maximum-based strategy.

Parameters:

stats_subgroup – Statistics of the current subgroup.
list_of_pairs – List of (stats, agg_tuple) for all generalizations.

Returns:

The aggregated statistics.

max_based_optimistic_estimate(subgroup, target, data, statistics=None)[source]¶

Compute the optimistic estimate using the maximum-based strategy.

Parameters:

subgroup – The subgroup for which to compute the estimate.
target (BinaryTarget) – The target definition.
data (pandas DataFrame) – The dataset.
statistics (any, optional) – Unused in this implementation.

Returns:

The optimistic estimate of the quality value.

Return type:

float

max_based_read_p(agg_tuple)[source]¶

Read p, the share of positive instances, from the aggregate tuple using the maximum-based strategy.

Parameters:: agg_tuple – The aggregate statistics tuple.
Returns:: The ratio of positives in the aggregate statistics.
Return type:: float

class pysubgroup.binary_target.LiftQF[source]¶

Bases: StandardQF

Lift Quality Function.

LiftQF is a StandardQF with a=0. The quality is therefore the difference between the share of positives in the subgroup and in the dataset, regardless of the relative size of the subgroup.

class pysubgroup.binary_target.SimpleBinomialQF[source]¶

Bases: StandardQF

Simple Binomial Quality Function.

SimpleBinomialQF is a StandardQF with a=0.5. It is an order-equivalent approximation of the full binomial test if the subgroup size is much smaller than the size of the entire dataset.

class pysubgroup.binary_target.SimplePositivesQF[source]¶

Quality function for binary targets based on positive instances.

calculate_constant_statistics(data, target)[source]¶

Calculate statistics that remain constant for the dataset.

Parameters:

data (pandas DataFrame) – The dataset.
target (BinaryTarget) – The target definition.

Raises:

AssertionError – If the target is not an instance of BinaryTarget.

calculate_statistics(subgroup, target, data, statistics=None)[source]¶

Calculate statistics specific to the subgroup.

Parameters:

subgroup – The subgroup for which to calculate statistics.
target (BinaryTarget) – The target definition.
data (pandas DataFrame) – The dataset.
statistics (any, optional) – Unused in this implementation.

Returns:

Contains size_sg and positives_count for the subgroup.

Return type:

namedtuple

gp_get_null_vector()[source]¶

Get a null vector for initialization in GP-Growth algorithms.

Returns:: Zero-initialized array of size 2.
Return type:: numpy.ndarray

gp_get_params(_cover_arr, v)[source]¶

Extract parameters from the statistics vector.

Parameters:

_cover_arr – Unused parameter.
v (numpy.ndarray) – Statistics vector.

Returns:

Contains size_sg and positives_count.

Return type:

namedtuple

gp_get_stats(row_index)[source]¶

Get statistics for a single row (used in GP-Growth algorithms).

Parameters:: row_index (int) – The index of the row.
Returns:: Array containing [1, positives[row_index]].
Return type:: numpy.ndarray

gp_merge(left, right)[source]¶

Merge two statistics vectors by summing them.

Parameters:

left (numpy.ndarray) – Left statistics vector.
right (numpy.ndarray) – Right statistics vector.
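The interplay of gp_get_null_vector, gp_get_stats and gp_merge can be sketched with plain lists instead of NumPy arrays. The function names ending in `_sketch` are hypothetical, and merging in place is an assumption, since the text above does not say whether the library returns a new vector.

```python
def gp_get_stats_sketch(positives, row_index):
    # One instance contributes size 1 and its 0/1 target value.
    return [1, int(positives[row_index])]


def gp_merge_sketch(left, right):
    # Element-wise sum, accumulated into the left vector (assumption: in place).
    for i, value in enumerate(right):
        left[i] += value
    return left


positives = [True, False, True, True]
stats = [0, 0]  # null vector of size 2
for row_index in range(len(positives)):
    gp_merge_sketch(stats, gp_get_stats_sketch(positives, row_index))

size_sg, positives_count = stats
```

After all four rows are merged, the vector holds the subgroup size in the first slot and the number of positives in the second.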

property gp_requires_cover_arr¶

Indicate whether the GP-Growth algorithm requires a cover array.

Returns:: False, since cover array is not required.
Return type:: bool

gp_size_sg(stats)[source]¶

Get the size of the subgroup from the statistics.

Parameters:: stats (numpy.ndarray) – Statistics vector.
Returns:: Size of the subgroup.
Return type:: int

gp_to_str(stats)[source]¶

Convert statistics to a string representation.

Parameters:: stats (numpy.ndarray) – Statistics vector.
Returns:: String representation of the statistics.
Return type:: str

tpl¶: alias of PositivesQF_parameters

class pysubgroup.binary_target.StandardQF(a)[source]¶

Bases: SimplePositivesQF, BoundedInterestingnessMeasure

StandardQF which weights the relative size against the difference in averages.

The StandardQF is a general form of quality function that, for different values of a, is order-equivalent to many popular quality measures.
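A minimal sketch of this family, assuming the commonly used form (relative size)^a * (share of positives in the subgroup - share of positives in the dataset). The name `standard_qf_sketch` and the NaN result for an empty subgroup are assumptions, not the library's documented behavior.

```python
def standard_qf_sketch(a, instances_dataset, positives_dataset,
                       instances_subgroup, positives_subgroup):
    if instances_subgroup == 0:
        # Assumption: an empty subgroup has no defined quality.
        return float("nan")
    p_subgroup = positives_subgroup / instances_subgroup
    p_dataset = positives_dataset / instances_dataset
    relative_size = instances_subgroup / instances_dataset
    return relative_size ** a * (p_subgroup - p_dataset)


# The same subgroup scored with the three exponents named in this module.
lift_like = standard_qf_sketch(0.0, 100, 50, 20, 15)      # a = 0
binomial_like = standard_qf_sketch(0.5, 100, 50, 20, 15)  # a = 0.5
wracc_like = standard_qf_sketch(1.0, 100, 50, 20, 15)     # a = 1
```

A larger exponent penalizes small subgroups more strongly: for this subgroup, which covers 20% of the data, the score falls from 0.25 at a=0 to 0.05 at a=1.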

evaluate(subgroup, target, data, statistics=None)[source]¶

Evaluate the quality of the subgroup using the standard quality function.

Parameters:

subgroup – The subgroup to evaluate.
target (BinaryTarget) – The target definition.
data (pandas DataFrame) – The dataset.
statistics (any, optional) – Unused in this implementation.

Returns:

The computed quality value.

Return type:

float

optimistic_estimate(subgroup, target, data, statistics=None)[source]¶

Compute the optimistic estimate of the quality function.

Parameters:

subgroup – The subgroup for which to compute the optimistic estimate.
target (BinaryTarget) – The target definition.
data (pandas DataFrame) – The dataset.
statistics (any, optional) – Unused in this implementation.

Returns:

The optimistic estimate of the quality value.

Return type:

float

optimistic_generalisation(subgroup, target, data, statistics=None)[source]¶

Compute the optimistic generalization of the quality function.

Parameters:

subgroup – The subgroup for which to compute the optimistic generalization.
target (BinaryTarget) – The target definition.
data (pandas DataFrame) – The dataset.
statistics (any, optional) – Unused in this implementation.

Returns:

The optimistic generalization of the quality value.

Return type:

float

static standard_qf(a, instances_dataset, positives_dataset, instances_subgroup, positives_subgroup)[source]¶

Compute the standard quality function.

Parameters:

a (float) – Exponent that trades off the relative size against the difference in means.
instances_dataset (int) – Total number of instances in the dataset.
positives_dataset (int) – Total number of positive instances in the dataset.
instances_subgroup (int) – Number of instances in the subgroup.
positives_subgroup (int) – Number of positive instances in the subgroup.

Returns:

The computed quality value.

Return type:

float

class pysubgroup.binary_target.WRAccQF[source]¶

Bases: StandardQF

Weighted Relative Accuracy Quality Function.

WRAccQF is a StandardQF with a=1. It is order-equivalent to the difference in the observed and expected number of positive instances.
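The order-equivalence can be seen with a short derivation. It assumes the standard form of the quality function, and the symbols are introduced here for illustration only: N instances and P positives in the dataset, n instances and p positives in the subgroup.

```latex
\begin{align*}
q_{a=1}(sg) &= \frac{n}{N}\left(\frac{p}{n} - \frac{P}{N}\right) \\
            &= \frac{1}{N}\left(p - n\,\frac{P}{N}\right)
\end{align*}
```

Here p is the observed number of positives and n P / N is the number expected for a subgroup of size n under the dataset's share of positives. Since 1/N is the same positive constant for every subgroup, ranking by this value and ranking by the difference of observed and expected positives give the same order.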

pysubgroup.constraints module¶

class pysubgroup.constraints.ContainsValueConstraint(attribute_name, value)[source]¶

Bases: object

A constraint that ensures a subgroup contains in its cover at least one instance that has a specified value for a specified attribute.

attribute_name¶: The attribute that needs to contain the specified value in at least one instance.

value¶: The value that needs to be present in the specified attribute in at least one instance.
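The check described here amounts to an existence test over the cover. A minimal sketch, assuming the cover is available as a list of row dictionaries; `contains_value` and the example rows are hypothetical and not part of the library.

```python
def contains_value(cover_rows, attribute_name, value):
    # True as soon as one covered instance has the requested value.
    return any(row[attribute_name] == value for row in cover_rows)


cover = [
    {"sex": "female", "pclass": 1},
    {"sex": "female", "pclass": 3},
]
has_third_class = contains_value(cover, "pclass", 3)
has_second_class = contains_value(cover, "pclass", 2)
```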

property is_monotone¶

Indicates whether the constraint is monotone.

Returns:: True if the constraint is monotone, False otherwise.
Return type:: bool

is_satisfied(subgroup, statistics=None, data=None)[source]¶

Checks if the subgroup satisfies the constraint.

Parameters:

subgroup – The subgroup to be evaluated.
statistics – Precomputed statistics for the subgroup (optional).
data – The dataset being analyzed (optional).

Returns:

True if the subgroup’s cover contains at least one instance that has the specified value for the specified attribute (as defined during object construction), False otherwise.

Return type:

bool

class pysubgroup.constraints.MinSupportConstraint(min_support)[source]¶

Bases: object

A constraint that ensures a subgroup has at least a minimum support.

min_support¶

The minimum number of instances that a subgroup must cover.

Type:: int

gp_is_satisfied(node)[source]¶

Checks if a node satisfies the constraint in the GP-Growth algorithm.

Parameters:

node – The node to be evaluated.

Returns:

True if the node’s size is at least the minimum support, False otherwise.

Return type:

bool

gp_prepare(qf)[source]¶

Prepares the constraint for the GP-Growth algorithm by accessing the size function.

Parameters:: qf – The quality function used in the GP-Growth algorithm.

property is_monotone¶

Indicates whether the constraint is monotone.

Returns:: True if the constraint is monotone, False otherwise.
Return type:: bool

is_satisfied(subgroup, statistics=None, data=None)[source]¶

Checks if the subgroup satisfies the minimum support constraint.

Parameters:

subgroup – The subgroup to be evaluated.
statistics – Precomputed statistics for the subgroup (optional).
data – The dataset being analyzed (optional).

Returns:

True if the subgroup’s size is at least the minimum support, False otherwise.

Return type:

bool

class pysubgroup.constraints.MinUniqueValuesConstraint(attribute_name, min_unique_values)[source]¶

Bases: object

A constraint that ensures a subgroup contains in its cover a minimum number of unique values for a specified attribute.

attribute_name¶: The attribute that needs to contain at least the specified number of values.

min_unique_values¶: The minimum number of unique values that must be present in the attribute in a subgroup cover.
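A minimal sketch of this check, again assuming the cover is a list of row dictionaries; `has_min_unique_values` and the example rows are hypothetical and not part of the library.

```python
def has_min_unique_values(cover_rows, attribute_name, min_unique_values):
    # Collect the distinct values of the attribute within the cover.
    distinct = {row[attribute_name] for row in cover_rows}
    return len(distinct) >= min_unique_values


cover = [
    {"purpose": "car", "age": 31},
    {"purpose": "education", "age": 45},
    {"purpose": "car", "age": 52},
]
two_purposes = has_min_unique_values(cover, "purpose", 2)
three_purposes = has_min_unique_values(cover, "purpose", 3)
```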

property is_monotone¶

Indicates whether the constraint is monotone.

Returns:: True if the constraint is monotone, False otherwise.
Return type:: bool

is_satisfied(subgroup, statistics=None, data=None)[source]¶

Checks if the subgroup satisfies the constraint.

Parameters:

subgroup – The subgroup to be evaluated.
statistics – Precomputed statistics for the subgroup (optional).
data – The dataset being analyzed (optional).

Returns:

True if the subgroup’s cover contains the minimum number of unique values for the specified attribute (as defined during object construction), False otherwise.

Return type:

bool

pysubgroup.datasets module¶

This module provides functions to load example datasets for testing and demonstration purposes. The datasets included are the German Credit Data and the Titanic dataset.

pysubgroup.datasets.get_credit_data()[source]¶

Load the German Credit Data dataset.

The dataset is provided in ARFF format and includes various attributes related to creditworthiness.

Returns:: A DataFrame containing the credit data.
Return type:: pandas.DataFrame

pysubgroup.datasets.get_titanic_data()[source]¶

Load the Titanic dataset.

The dataset includes information about the passengers on the Titanic, such as age, sex, class, and survival status.

Returns:: A DataFrame containing the Titanic data.
Return type:: pandas.DataFrame

pysubgroup.fi_target module¶

Created on 29.09.2017

@author: lemmerfn

This module defines the FITarget and related quality functions for frequent itemset mining using the pysubgroup package.

class pysubgroup.fi_target.AreaQF[source]¶

Bases: SimpleCountQF

Quality function that evaluates subgroups based on their area.

The area is computed as the size of the subgroup multiplied by the number of items (selectors) in its description, i.e. its depth.
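As a small illustration of the area computation, assuming a subgroup description is represented as a tuple of selector strings; `area_quality` and the selector strings are hypothetical.

```python
def area_quality(size_sg, selectors):
    # Area = number of covered instances times number of items in the description.
    return size_sg * len(selectors)


area = area_quality(40, ("age<30", "sex=female"))
empty_description_area = area_quality(100, ())
```

One consequence of this definition is that a description with no items has area 0, however many instances it covers.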

evaluate(subgroup, target, data, statistics=None)[source]¶

Evaluate the quality of the subgroup.

Parameters:

subgroup – The subgroup to evaluate.
target – The target definition.
data – The dataset.
statistics (any, optional) – Previously computed statistics.

Returns:

The area of the subgroup (size_sg * depth).

Return type:

int

class pysubgroup.fi_target.CountQF[source]¶

Bases: SimpleCountQF, BoundedInterestingnessMeasure

Quality function that evaluates subgroups based on their size.

Extends SimpleCountQF and BoundedInterestingnessMeasure.

evaluate(subgroup, target, data, statistics=None)[source]¶

Evaluate the quality of the subgroup.

Parameters:

subgroup – The subgroup to evaluate.
target – The target definition.
data – The dataset.
statistics (any, optional) – Previously computed statistics.

Returns:

The size of the subgroup.

Return type:

int

optimistic_estimate(subgroup, target, data, statistics=None)[source]¶

Compute the optimistic estimate of the quality function.

Parameters:

subgroup – The subgroup for which to compute the optimistic estimate.
target – The target definition.
data – The dataset.
statistics (any, optional) – Previously computed statistics.

Returns:

The size of the subgroup.

Return type:

int

class pysubgroup.fi_target.FITarget[source]¶

Bases: BaseTarget

Target class for frequent itemset mining.

Represents the target for mining frequent itemsets, extending the BaseTarget class from pysubgroup.

calculate_statistics(subgroup_description, data, cached_statistics=None)[source]¶

Calculate statistics for the subgroup.

Parameters:

subgroup_description – The description of the subgroup.
data – The dataset.
cached_statistics (dict, optional) – Previously computed statistics.

Returns:

A dictionary containing ‘size_sg’ and ‘size_dataset’.

Return type:

dict

get_attributes()[source]¶: Return an empty list as attributes are not used in FITarget.

get_base_statistics(subgroup, data)[source]¶

Compute the base statistics for the subgroup.

Parameters:

subgroup – The subgroup for which to compute statistics.
data – The dataset.

Returns:

The size of the subgroup.

Return type:

int

statistic_types = ('size_sg', 'size_dataset')¶

class pysubgroup.fi_target.SimpleCountQF[source]¶

Quality function that counts the number of instances in a subgroup.

Provides basic counting functionality, useful for frequent itemset mining.

calculate_constant_statistics(data, target)[source]¶

Calculate statistics that remain constant for the dataset.

Parameters:

data – The dataset.
target – The target definition (unused in this implementation).

calculate_statistics(subgroup_description, target, data, statistics=None)[source]¶

Calculate statistics specific to the subgroup.

Parameters:

subgroup_description – The description of the subgroup.
target – The target definition (unused in this implementation).
data – The dataset.
statistics (any, optional) – Unused in this implementation.

Returns:

Contains ‘size_sg’ for the subgroup.

Return type:

namedtuple

gp_get_null_vector()[source]¶

Get a null vector for initialization in GP-Growth algorithms.

Returns:: A dictionary with ‘size_sg’ set to 0.
Return type:: dict

gp_get_params(_cover_arr, v)[source]¶

Extract parameters from the statistics dictionary.

Parameters:

_cover_arr – Unused parameter.
v (dict) – Statistics dictionary.

Returns:

Contains ‘size_sg’ from the statistics.

Return type:

namedtuple

gp_get_stats(_)[source]¶

Get statistics for a single instance (used in GP-Growth algorithms).

Returns:: A dictionary with ‘size_sg’ set to 1.
Return type:: dict

gp_merge(left, right)[source]¶

Merge two statistics dictionaries by summing ‘size_sg’.

Parameters:

left (dict) – Left statistics dictionary.
right (dict) – Right statistics dictionary.

gp_requires_cover_arr = False¶

gp_size_sg(stats)[source]¶

Get the size of the subgroup from the statistics.

Parameters:: stats (dict) – Statistics dictionary.
Returns:: Size of the subgroup.
Return type:: int

gp_to_str(stats)[source]¶

Convert statistics to a string representation.

Parameters:: stats (dict) – Statistics dictionary.
Returns:: String representation of ‘size_sg’.
Return type:: str

tpl¶: alias of CountQF_parameters

pysubgroup.gp_growth module¶

class pysubgroup.gp_growth.GpGrowth(mode='b_u')[source]¶

Bases: object

Implementation of the GP-Growth algorithm.

GP-Growth is a generalization of FP-Growth and SD-Map capable of working with different Exceptional Model Mining targets on top of Frequent Itemset Mining and Subgroup Discovery.

This class provides methods to perform pattern mining using GP-Growth, supporting both bottom-up (‘b_u’) and top-down (‘t_d’) modes.

GP_node¶

Structure representing a node in the GP-tree.

Type:: namedtuple

minSupp¶

Minimum support threshold (currently unused).

Type:: int

tqdm¶

Function for progress bars (default is identity function).

Type:: function

depth¶

Maximum depth of the search.

Type:: int

mode¶

Mode of the algorithm (‘b_u’ for bottom-up, ‘t_d’ for top-down).

Type:: str

constraints_monotone¶

List of monotonic constraints.

Type:: list

results¶

List to store the resulting subgroups.

Type:: list

task¶

The subgroup discovery task to execute.

Type:: SubgroupDiscoveryTask

add_if_required(prefix, gp_stats)[source]¶

Adds a pattern to the result set if it meets the quality threshold.

Parameters:

prefix (tuple) – The current pattern (tuple of class indices).
gp_stats – The aggregated statistics for the pattern.

calculate_quality_function_for_patterns(task, results, arrs)[source]¶

Calculates the quality function for the given patterns.

Parameters:

task (SubgroupDiscoveryTask) – The task containing the quality function.
results (list) – List of patterns with their aggregated parameters.
arrs (ndarray) – The coverage arrays of the selectors.

Returns:

A list of tuples containing quality, indices, and statistics.

Return type:

list

check_constraints(node)[source]¶

Checks if a node satisfies all monotonic constraints.

Parameters:: node – The node to check.
Returns:: True if the node satisfies all constraints, False otherwise.
Return type:: bool

check_tree_is_ordered(root, prefix=None)[source]¶

Verifies that the nodes of a tree are sorted in ascending order.

Parameters:

root (GP_node) – The root node of the tree.
prefix (list) – The current path prefix.

Returns:

A set of class labels in the tree.

Return type:

set

convert_results_to_subgroups(results, selectors_sorted)[source]¶

Converts patterns (indices) to actual subgroups.

Parameters:

results (list) – List of results containing qualities, indices, and statistics.
selectors_sorted (list) – The list of sorted selectors.

Returns:

A list of tuples containing quality, subgroup, and statistics.

Return type:

list

create_copy_of_path(nodes, new_nodes, stats)[source]¶

Creates a copy of a path in the tree, updating statistics.

Parameters:

nodes (list) – The list of nodes in the path.
new_nodes (dict) – Dictionary to store new nodes.
stats – The statistics to merge into the nodes.

create_copy_of_tree_top_down(from_root, nodes=None, parent=None, is_valid_class=None)[source]¶

Creates a copy of the tree starting from a specific root in top-down mode.

Parameters:

from_root (GP_node) – The root node to copy from.
nodes (list) – List to store the new nodes.
parent (GP_node) – The parent of the new root node.
is_valid_class (dict) – Dictionary indicating valid classes.

Returns:

The new root node of the copied subtree.

Return type:

GP_node

create_initial_tree(arrs)[source]¶

Creates the initial GP-tree from the coverage arrays.

Parameters:

arrs (ndarray) – A 2D NumPy array where each column corresponds to the coverage array of a selector.

Returns:

A tuple containing:

root (GP_node): The root node of the tree.
nodes (list): A list of all nodes in the tree.

Return type:

tuple

create_new_tree_from_nodes(nodes)[source]¶

Creates a new tree from a list of nodes for recursive mining.

Parameters:: nodes (list) – List of nodes to build the new tree from.
Returns:: A dictionary mapping class labels to nodes in the new tree.
Return type:: defaultdict

execute(task)[source]¶

Executes the GP-Growth algorithm on the given task.

Parameters:: task (SubgroupDiscoveryTask) – The subgroup discovery task to execute.
Returns:: The result of the subgroup discovery.
Return type:: SubgroupDiscoveryResult

get_nodes_upwards(node)[source]¶

Retrieves all nodes from a given node up to the root.

Parameters:: node (GP_node) – The starting node.
Returns:: A list of nodes from the given node up to the root.
Return type:: list

get_stats_for_class(cls_nodes)[source]¶

Aggregates statistics for each class label.

Parameters:: cls_nodes (defaultdict) – Dictionary mapping class labels to nodes.
Returns:: A dictionary mapping class labels to aggregated statistics.
Return type:: dict

get_top_down_tree_for_class(cls_nodes, cls, is_valid_class)[source]¶

Creates a subtree for a specific class in top-down mode.

Parameters:

cls_nodes (defaultdict) – Dictionary mapping class labels to nodes.
cls (int) – The class label to create the subtree for.
is_valid_class (dict) – Dictionary indicating valid classes.

Returns:

A tuple containing:

base_root (GP_node): The root of the new subtree.
nodes (list): A list of nodes in the new subtree.

Return type:

tuple

merge_trees_top_down(nodes, mutable_root, from_root, is_valid_class)[source]¶

Merges two trees in top-down mode.

Parameters:

nodes (list) – List of nodes in the mutable tree.
mutable_root (GP_node) – The root of the mutable tree to merge into.
from_root (GP_node) – The root of the tree to merge from.
is_valid_class (dict) – Dictionary indicating valid classes.

nodes_to_cls_nodes(nodes)[source]¶

Groups nodes by their class labels.

Parameters:: nodes (list) – List of nodes to group.
Returns:: A dictionary mapping class labels to lists of nodes.
Return type:: defaultdict
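The grouping step can be sketched with collections.defaultdict. The `Node` namedtuple below is a simplified, hypothetical stand-in for GP_node, and `group_by_class` is an illustrative name.

```python
from collections import defaultdict, namedtuple

# Hypothetical, reduced node: only the class label and a statistics payload.
Node = namedtuple("Node", ["cls", "stats"])


def group_by_class(nodes):
    cls_nodes = defaultdict(list)
    for node in nodes:
        cls_nodes[node.cls].append(node)
    return cls_nodes


nodes = [Node(0, [3, 2]), Node(2, [1, 1]), Node(0, [4, 0])]
cls_nodes = group_by_class(nodes)
```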

normal_insert(root, nodes, new_stats, classes)[source]¶

Inserts a transaction into the GP-tree.

Parameters:

root (GP_node) – The root node of the tree.
nodes (list) – List of all nodes in the tree.
new_stats – The statistics associated with the transaction.
classes (array-like) – The class labels (selectors) present in the transaction.

Returns:

The leaf node where the transaction ends.

Return type:

GP_node
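The insertion can be pictured as walking a prefix tree and creating missing children along the way. This is a minimal sketch, not the library's code: the `Node` class is hypothetical (the attribute list above describes GP_node as a namedtuple), and merging the statistics only into the final node of the path is an assumption.

```python
class Node:
    """Hypothetical tree node used only for this sketch."""

    def __init__(self, cls, parent=None):
        self.cls = cls          # class label (selector index)
        self.parent = parent
        self.children = {}      # class label -> child Node
        self.stats = [0, 0]     # [size, positives], starts as the null vector


def insert_transaction(root, nodes, new_stats, classes):
    node = root
    for cls in classes:
        child = node.children.get(cls)
        if child is None:
            child = Node(cls, parent=node)
            node.children[cls] = child
            nodes.append(child)
        node = child
    # Assumption: statistics are merged only into the node where the path ends.
    node.stats = [old + new for old, new in zip(node.stats, new_stats)]
    return node


root, nodes = Node(cls=None), []
leaf = insert_transaction(root, nodes, [1, 1], (0, 2))
insert_transaction(root, nodes, [1, 0], (0, 2))  # same path, reuses both nodes
inner = insert_transaction(root, nodes, [1, 1], (0,))
```

Transactions that share a prefix share nodes, so the three insertions above create only two nodes.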

prepare_selectors(search_space, data)[source]¶

Prepares the selectors by computing their coverage arrays and filtering based on constraints.

Parameters:

search_space (list) – The list of selectors to consider.
data (DataFrame) – The dataset to be analyzed.

Returns:

A tuple containing:

selectors_sorted (list): The sorted list of selectors after filtering.
arrs (ndarray): A 2D NumPy array where each column corresponds to the coverage array of a selector.

Return type:

tuple

recurse(cls_nodes, prefix, is_single_path=False)[source]¶

Recursively mines patterns in bottom-up mode.

Parameters:

cls_nodes (defaultdict) – Dictionary mapping class labels to nodes.
prefix (tuple) – The current pattern prefix.
is_single_path (bool) – Flag indicating if the current path is a single path.

recurse_top_down(cls_nodes, root, depth_in=0)[source]¶

Recursively mines patterns in top-down mode.

Parameters:

cls_nodes (defaultdict) – Dictionary mapping class labels to nodes.
root (GP_node) – The current root node.
depth_in (int) – The current depth in the recursion.

Returns:

A list of patterns with their aggregated statistics.

Return type:

list

remove_selectors_with_low_optimistic_estimate(s, search_space_size)[source]¶

Removes selectors from the list that have an optimistic estimate below the minimum required quality.

Parameters:

s (list) – List of selectors with their size and coverage arrays.
search_space_size (int) – The size of the initial search space.

setup(task)[source]¶

Prepares the algorithm by setting up the task, depth, constraints, and quality function.

Parameters:: task (SubgroupDiscoveryTask) – The task to execute.

setup_constraints(constraints, qf)[source]¶

Prepares constraints for use in the algorithm.

Parameters:

constraints (list) – List of constraints to apply.
qf – The quality function used in the task.

setup_from_quality_function(qf)[source]¶

Sets up function pointers from the quality function.

Parameters:: qf – The quality function used in the task.

to_file(task, path)[source]¶

Writes the tree to a file in a specific format.

Parameters:

task (SubgroupDiscoveryTask) – The task containing the quality function.
path (str or Path) – The file path to write to.

pysubgroup.gp_growth.identity(x, *args, **kwargs)[source]¶

Identity function used as a placeholder for tqdm when progress bars are not needed.

Parameters:

x – The input value to return.
*args – Variable length argument list.
**kwargs – Arbitrary keyword arguments.

Returns:

The input value x.
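A minimal sketch of how such a placeholder can stand in for a progress-bar wrapper: extra arguments such as a description are accepted and ignored, so the calling code does not change. The names `identity_sketch` and `progress` are illustrative.

```python
def identity_sketch(x, *args, **kwargs):
    # Ignore all extra arguments and hand the input back unchanged.
    return x


progress = identity_sketch  # could be swapped for a real progress-bar function

total = 0
for value in progress(range(5), desc="rows", total=5):
    total += value
```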

pysubgroup.measures module¶

Created on 28.04.2016

@author: lemmerfn

class pysubgroup.measures.AbstractInterestingnessMeasure[source]¶

Bases: ABC

ensure_statistics(subgroup, target, data, statistics=None)[source]¶

class pysubgroup.measures.BoundedInterestingnessMeasure[source]¶: Bases: AbstractInterestingnessMeasure

class pysubgroup.measures.CombinedInterestingnessMeasure(measures, weights=None)[source]¶

calculate_constant_statistics(data, target)[source]¶

calculate_statistics(subgroup, target, data, cached_statistics=None)[source]¶

evaluate(subgroup, target, data, statistics=None)[source]¶

evaluate_from_statistics(instances_dataset, positives_dataset, instances_subgroup, positives_subgroup)[source]¶

optimistic_estimate(subgroup, target, data, statistics=None)[source]¶

class pysubgroup.measures.CountCallsInterestingMeasure(qf)[source]¶

calculate_statistics(sg, target, data, statistics=None)[source]¶

class pysubgroup.measures.GeneralizationAwareQF(qf)[source]¶

A class that computes the generalization-aware quality function as follows: qf_ga(sg) = qf(sg) - max_{g in generalizations(sg)} qf(g), where the maximum ranges over the generalizations g of the subgroup sg.
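The subtraction described above can be sketched in a few lines. This sketch rests on several assumptions: a subgroup is modelled as a frozenset of selector names, its generalizations are taken to be all proper subsets, and a subgroup without generalizations is compared against 0. None of the function names come from the library.

```python
from itertools import combinations


def generalizations(selectors):
    # Assumption: every proper subset of the selectors is a generalization.
    items = sorted(selectors)
    for r in range(len(items)):
        for combo in combinations(items, r):
            yield frozenset(combo)


def generalization_aware_quality(selectors, base_quality):
    sg = frozenset(selectors)
    best_generalization = max(
        (base_quality(g) for g in generalizations(sg)), default=0.0
    )
    return base_quality(sg) - best_generalization


# Toy base qualities for illustration only.
qualities = {
    frozenset(): 0.0,
    frozenset({"a"}): 0.3,
    frozenset({"b"}): 0.1,
    frozenset({"a", "b"}): 0.35,
}
ga_quality = generalization_aware_quality({"a", "b"}, qualities.__getitem__)
```

In the toy data the combined subgroup scores 0.35, but its generalization "a" already reaches 0.3, so only 0.05 is credited to the more specific description.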

calculate_constant_statistics(data, target)[source]¶

calculate_statistics(subgroup, target, data, statistics=None)[source]¶

evaluate(subgroup, target, data, statistics=None)[source]¶

class ga_tuple(subgroup_quality, generalisation_quality)¶

Bases: tuple

generalisation_quality¶: Alias for field number 1

subgroup_quality¶: Alias for field number 0

get_qual_and_previous_qual(subgroup, target, data)[source]¶

class pysubgroup.measures.GeneralizationAwareQF_stats(qf)[source]¶

An abstract base class that implements the aggregation of statistics of generalizations.

calculate_constant_statistics(data, target)[source]¶

calculate_statistics(subgroup, target, data, statistics=None)[source]¶

evaluate(subgroup, target, data, statistics=None)[source]¶

ga_tuple¶: alias of ga_stats_tuple

get_stats_and_previous_stats(subgroup, target, data)[source]¶

pysubgroup.measures.maximum_statistic_filter(result_set, statistic, maximum)[source]¶

pysubgroup.measures.minimum_quality_filter(result_set, minimum)[source]¶

pysubgroup.measures.minimum_statistic_filter(result_set, statistic, minimum, data)[source]¶

pysubgroup.measures.overlap_filter(result_set, data, similarity_level=0.9)[source]¶

pysubgroup.measures.overlaps_list(sg, list_of_sgs, data, similarity_level=0.9)[source]¶

pysubgroup.measures.unique_attributes(result_set, data)[source]¶

pysubgroup.model_predictions_target module¶

Created on 16.08.2025

@author: Tom Siegl

class pysubgroup.model_predictions_target.ARLQF(label_column: str, positive_label_value: any, subgroup_class_balance_weight: float = 0, subgroup_size_weight: float = 0)[source]¶

Bases: BaseSoftClassifierPerformanceQF

A quality function which scores binary soft classifier performance in a subgroup based on the difference of the classifier’s average ranking loss (ARL) on the subgroup cover vs. the entire dataset. If the classifier performs worse on the subgroup (i.e. it has a greater ARL) compared to the entire dataset, then the quality is positive.

Weighting factors are provided to let the subgroup size and class balance influence the quality.

The overall quality is captured by the formula q = (ARL(subgroup) - ARL(dataset)) * |subgroup|^(size_weight) * class_balance(subgroup)^(class_balance_weight).

Implementation of phi^{rasl}_{alpha, beta} from the paper [“SubROC: AUC-Based Discovery of Exceptional Subgroup Performance for Binary Classifiers”](https://doi.org/10.48550/arXiv.2505.11283).
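The way the two weights enter the quality can be sketched as follows. The sketch takes the loss values and the class balance as plain numbers, because their exact definitions are not given here; `weighted_loss_quality` and its parameter names are hypothetical.

```python
def weighted_loss_quality(loss_subgroup, loss_dataset, size_subgroup,
                          class_balance_subgroup,
                          size_weight=0.0, class_balance_weight=0.0):
    # Positive when the loss on the subgroup exceeds the loss on the dataset.
    difference = loss_subgroup - loss_dataset
    return (difference
            * size_subgroup ** size_weight
            * class_balance_subgroup ** class_balance_weight)


unweighted = weighted_loss_quality(0.30, 0.20, 50, 0.4)
size_weighted = weighted_loss_quality(0.30, 0.20, 50, 0.4, size_weight=1.0)
```

With both weights at their default of 0 the factors equal 1 and the quality reduces to the plain difference in loss. A positive size weight scales that difference by the subgroup size.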

class pysubgroup.model_predictions_target.BaseSoftClassifierPerformanceQF(performance_measure, performance_measure_type: Literal['score', 'loss'], performance_measure_bound=None, performance_measure_constraints: list[any] = [], subgroup_class_balance_weight: float = 0, subgroup_size_weight: float = 0)[source]¶

calculate_constant_statistics(data: DataFrame, target: SoftClassifierTarget)[source]¶: Called once per search execution. It should perform any preparation that is necessary before the search runs.

calculate_statistics(subgroup, target: SoftClassifierTarget, data: DataFrame, statistics={})[source]¶: Calculate the necessary statistics. The resulting statistics object is passed on to the evaluate and optimistic_estimate functions.

evaluate(subgroup, target: SoftClassifierTarget, data: DataFrame, statistics=None)[source]¶: Return the quality calculated from the statistics.

optimistic_estimate(subgroup, target: SoftClassifierTarget, data: DataFrame, statistics=None)[source]¶: Return the optimistic estimate if one is available; otherwise return infinity.

class pysubgroup.model_predictions_target.PRAUCQF(label_column: str, positive_label_value: any, subgroup_class_balance_weight: float = 0, subgroup_size_weight: float = 0)[source]¶

Bases: BaseSoftClassifierPerformanceQF

A quality function which scores binary soft classifier performance in a subgroup based on the difference of the classifier’s Area Under the Precision-Recall Curve (PR AUC) on the subgroup cover vs. the entire dataset. If the classifier performs worse on the subgroup (i.e. it has a lower PR AUC) compared to the entire dataset, then the quality is positive.

Weighting factors are provided to let the subgroup size and class balance influence the quality.

The overall quality is captured by the formula q = (PRAUC(dataset) - PRAUC(subgroup)) * |subgroup|^(size_weight) * class_balance(subgroup)^(class_balance_weight).

Implementation of phi^{rPRAUC}_{alpha, beta} from the paper [“SubROC: AUC-Based Discovery of Exceptional Subgroup Performance for Binary Classifiers”](https://doi.org/10.48550/arXiv.2505.11283).

class pysubgroup.model_predictions_target.ROCAUCQF(label_column: str, subgroup_class_balance_weight: float = 0, subgroup_size_weight: float = 0)[source]¶

Bases: BaseSoftClassifierPerformanceQF

A quality function which scores binary soft classifier performance in a subgroup based on the difference of the classifier’s Area Under the Receiver Operating Characteristic Curve (ROC AUC) on the subgroup cover vs. the entire dataset. If the classifier performs worse on the subgroup (i.e. it has a lower ROC AUC) compared to the entire dataset, then the quality is positive.

Weighting factors are provided to let the subgroup size and class balance influence the quality.

The overall quality is captured by the formula q = (ROCAUC(dataset) - ROCAUC(subgroup)) * |subgroup|^(size_weight) * class_balance(subgroup)^(class_balance_weight).

Implementation of phi^{rROCAUC}_{alpha, beta} from the paper [“SubROC: AUC-Based Discovery of Exceptional Subgroup Performance for Binary Classifiers”](https://doi.org/10.48550/arXiv.2505.11283).
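For intuition about the underlying performance measure, ROC AUC equals the probability that a randomly chosen positive instance is scored higher than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by comparing all pairs; the function name is hypothetical and this is not how the library computes the measure, nor does it reproduce the quality function itself.

```python
def roc_auc_pairwise(y_true, y_pred):
    """Hypothetical helper: ROC AUC as the share of (positive, negative)
    pairs in which the positive instance receives the higher score."""
    pos = [s for s, y in zip(y_pred, y_true) if y == 1]
    neg = [s for s, y in zip(y_pred, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC is undefined without both classes")
    # A tie between a positive and a negative score counts as half a win.
    wins = sum(1.0 if p > n else (0.5 if p == n else 0.0)
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Three of the four (positive, negative) pairs are ordered correctly.
auc = roc_auc_pairwise([0, 1, 0, 1], [0.1, 0.4, 0.6, 0.9])
```

The quadratic pair comparison is only meant to make the definition visible; sorting-based implementations are preferable for real data.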

class pysubgroup.model_predictions_target.SoftClassifierTarget(label_column='label', prediction_column='prediction')[source]¶

Bases: object

Minimal target concept implementation to select label and prediction columns for binary soft classifier performance measures.

calculate_statistics(subgroup, data: DataFrame, statistics={})[source]¶

get_target_columns(data: DataFrame)[source]¶: Select the label and prediction columns from object initialization.

statistic_types = ()¶

pysubgroup.model_predictions_target.average_ranking_loss(y_true, y_pred)[source]¶

Implementation of the Average Ranking Loss (ARL) performance measure for binary soft classifiers based on the definitions in the paper [“Understanding Where Your Classifier Does (Not) Work – The SCaPE Model Class for EMM”](https://doi.org/10.1109/ICDM.2014.10).

Parameters:

y_true – Binary Labels, must be ordered to match y_pred.
y_pred – Predicted Scores, must be in ascending order.
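As an illustration of what a ranking loss measures, the following sketch assumes that the ARL is the mean, over all positive instances, of the number of negative instances that receive a strictly higher score. This reading is an assumption: tie handling and any normalization in the library or the cited paper may differ, and the function name is hypothetical.

```python
def average_ranking_loss_sketch(y_true, y_pred):
    """Hypothetical helper: mean number of negatives ranked above each positive."""
    positives = [s for s, y in zip(y_pred, y_true) if y == 1]
    negatives = [s for s, y in zip(y_pred, y_true) if y == 0]
    if not positives:
        raise ValueError("the loss is undefined without positive instances")
    # Penalty of a positive: how many negatives were scored strictly higher.
    penalties = [sum(1 for n in negatives if n > p) for p in positives]
    return sum(penalties) / len(positives)


# Inputs follow the documented convention: y_pred is in ascending order.
arl = average_ranking_loss_sketch([0, 1, 0, 1], [0.1, 0.4, 0.6, 0.9])
```

Here the positive scored 0.4 is outranked by one negative (0.6) and the positive scored 0.9 by none, so the mean penalty is 0.5. A perfect ranking yields 0.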

pysubgroup.model_predictions_target.pr_auc_score(y_true, y_pred)[source]¶

Area Under the Precision-Recall Curve (PR AUC) performance measure for binary soft classifiers.

Parameters:

y_true – Binary Labels, must be ordered to match y_pred.
y_pred – Predicted Scores.

pysubgroup.model_target module¶

class pysubgroup.model_target.EMM_Likelihood(model)[source]¶

Exceptional Model Mining likelihood-based interestingness measure.

This class computes the difference in likelihoods between a subgroup model and the inverse (complement) model, providing a measure of how exceptional the subgroup is with respect to the entire dataset.

calculate_constant_statistics(data, target)[source]¶

Calculate statistics that remain constant over all subgroups.

Parameters:

data – The dataset as a pandas DataFrame.
target – The target variable (unused in this context).

calculate_statistics(subgroup, target, data, statistics=None)[source]¶

Calculate statistics specific to a subgroup.

Parameters:

subgroup – The subgroup description.
target – The target variable (unused in this context).
data – The dataset as a pandas DataFrame.
statistics – Previously calculated statistics (optional).

Returns:

An EMM_Likelihood.tpl namedtuple containing model parameters, subgroup likelihood, inverse likelihood, and subgroup size.

evaluate(subgroup, target, data, statistics=None)[source]¶

Evaluate the interestingness of a subgroup.

Parameters:

subgroup – The subgroup description.
target – The target variable (unused in this context).
data – The dataset as a pandas DataFrame.
statistics – Previously calculated statistics (optional).

Returns:

The difference between subgroup likelihood and inverse likelihood.

get_tuple(sg_size, params, cover_arr)[source]¶

Compute the likelihoods for the subgroup and its complement.

Parameters:

sg_size – Size of the subgroup.
params – Model parameters obtained from fitting the subgroup.
cover_arr – Boolean array indicating the instances in the subgroup.

Returns:

An EMM_Likelihood.tpl namedtuple with the computed statistics.

gp_get_params(cover_arr, v)[source]¶

Get parameters for GP-Growth algorithm.

Parameters:

cover_arr – Boolean array indicating the instances in the subgroup.
v – Statistics vector from GP-Growth.

Returns:

An EMM_Likelihood.tpl namedtuple with the computed statistics.

property gp_requires_cover_arr¶

Indicate whether the GP-Growth algorithm requires a cover array.

Returns:: True, since the cover array is required.

tpl¶: alias of EMM_Likelihood

class pysubgroup.model_target.PolyRegression_ModelClass(x_name='x', y_name='y', degree=1)[source]¶

Bases: object

Polynomial Regression Model Class for Exceptional Model Mining.

Provides methods to fit a polynomial regression model to a subgroup and compute likelihoods for Exceptional Model Mining.

calculate_constant_statistics(data, target)[source]¶

Calculate statistics that remain constant over all subgroups.

Parameters:

data – The dataset as a pandas DataFrame.
target – The target variable (unused in this context).

fit(subgroup, data=None)[source]¶

Fit the polynomial regression model to the subgroup data.

Parameters:

subgroup – The subgroup description.
data – The dataset as a pandas DataFrame (optional).

Returns:

Contains regression coefficients and subgroup size.

Return type:

beta_tuple

gp_get_null_vector()[source]¶

Get a null vector for initialization in the GP-Growth algorithm.

Returns:: Zero-initialized array of size 5.
Return type:: numpy.ndarray

gp_get_params(v)[source]¶

Extract model parameters from the statistics vector.

Parameters:: v (numpy.ndarray) – Statistics vector.
Returns:: Contains regression coefficients and subgroup size.
Return type:: beta_tuple

gp_get_stats(row_index)[source]¶

Get statistics for a single row (used in GP-Growth algorithm).

Parameters:: row_index (int) – Index of the row in the dataset.
Returns:: Statistics vector for the given row.
Return type:: numpy.ndarray

static gp_merge(u, v)[source]¶

Merge two statistics vectors for the GP-Growth algorithm.

Parameters:

u (numpy.ndarray) – Left statistics vector.
v (numpy.ndarray) – Right statistics vector.
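The GP-Growth methods above rely on statistics vectors that can be built per row, merged, and turned into model parameters. The sketch below illustrates that idea for a degree-1 fit. The five-element layout (count and raw sums) and all function names are assumptions chosen for illustration; the library may store different quantities in its vector.

```python
def row_stats(x, y):
    """Statistics vector for a single row: (n, sum x, sum y, sum x^2, sum xy)."""
    return (1.0, x, y, x * x, x * y)


def merge(u, v):
    """Raw sums are additive, so merging two vectors is element-wise addition."""
    return tuple(a + b for a, b in zip(u, v))


def line_params(v):
    """Recover slope and intercept of the least-squares line from the sums."""
    n, sx, sy, sxx, sxy = v
    denom = n * sxx - sx * sx
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return slope, intercept, n


# Start from a null vector and merge in three points lying on y = 2x + 1.
stats = (0.0, 0.0, 0.0, 0.0, 0.0)
for x, y in [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]:
    stats = merge(stats, row_stats(x, y))
slope, intercept, size = line_params(stats)
```

Because merging is plain addition, the statistics of any subgroup can be assembled from its rows in any order, which is what makes this kind of vector usable in a pattern-growth search.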

property gp_requires_cover_arr¶

Indicate whether the GP-Growth algorithm requires a cover array.

Returns:: False, since the cover array is not required.

gp_size_sg(stats)[source]¶

Get the size of the subgroup from the statistics.

Parameters:: stats (numpy.ndarray) – Statistics vector.
Returns:: Size of the subgroup.
Return type:: float

gp_to_str(stats)[source]¶

Convert statistics to a string representation.

Parameters:: stats (numpy.ndarray) – Statistics vector.
Returns:: String representation of the statistics.
Return type:: str

likelihood(stats, sg)[source]¶

Compute the likelihoods for the subgroup instances.

Parameters:

stats (beta_tuple) – Regression parameters and subgroup size.
sg (numpy.ndarray) – Boolean array indicating subgroup instances.

Returns:

Likelihood values for the subgroup instances.

Return type:

numpy.ndarray

loglikelihood(stats, sg)[source]¶

Compute the log-likelihoods for the subgroup instances.

Parameters:

stats (beta_tuple) – Regression parameters and subgroup size.
sg (numpy.ndarray) – Boolean array indicating subgroup instances.

Returns:

Log-likelihood values for the subgroup instances.

Return type:

numpy.ndarray

class pysubgroup.model_target.beta_tuple(beta, size_sg)¶

Bases: tuple

beta¶: Alias for field number 0

size_sg¶: Alias for field number 1
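A tuple type with these two fields behaves like the following standard-library namedtuple. This is a sketch of equivalent behavior, not a copy of the library's definition, and the example values are made up.

```python
from collections import namedtuple

# Same field names and order as documented: beta is field 0, size_sg is field 1.
beta_tuple = namedtuple("beta_tuple", ["beta", "size_sg"])

fitted = beta_tuple(beta=(1.0, 2.0), size_sg=42)
coefficients, size = fitted  # unpacks like a plain tuple
```

Fields can be read by name (`fitted.beta`) or by position (`fitted[0]`), which is why the documentation lists them as aliases for field numbers.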

pysubgroup.numeric_target module¶

This module defines the NumericTarget and associated quality functions for subgroup discovery when the target variable is numeric.

class pysubgroup.numeric_target.GeneralizationAware_StandardQFNumeric(a, invert=False, estimator='default', centroid='mean')[source]¶

Bases: GeneralizationAwareQF_stats

Generalization-Aware Standard Quality Function for Numeric Targets.

Extends StandardQFNumeric to consider generalizations during subgroup discovery, providing methods for optimistic estimates and aggregate statistics.

aggregate_statistics(stats_subgroup, list_of_pairs)[source]¶

Aggregate statistics from generalizations.

Parameters:

stats_subgroup – Statistics of the current subgroup.
list_of_pairs – List of (stats, agg_stats) tuples from generalizations.

Returns:

The aggregated statistics.

evaluate(subgroup, target, data, statistics=None)[source]¶

Evaluate the quality of the subgroup considering generalizations.

Parameters:

subgroup – The subgroup to evaluate.
target (NumericTarget) – The target definition.
data (pandas.DataFrame) – The dataset.
statistics (any, optional) – Previously computed statistics.

Returns:

The computed quality value.

class pysubgroup.numeric_target.NumericTarget(target_variable)[source]¶

Bases: object

Target class for numeric variables in subgroup discovery.

Represents a target where the variable of interest is numeric, and computes statistics such as mean, median, standard deviation within subgroups.

calculate_statistics(subgroup, data, cached_statistics=None)[source]¶

Calculate various statistics for the subgroup and dataset.

Parameters:

subgroup – The subgroup for which to calculate statistics.
data (pandas.DataFrame) – The dataset.
cached_statistics (dict, optional) – Previously computed statistics.

Returns:

A dictionary containing statistical measures.

Return type:

dict

get_attributes()[source]¶

Get a list of attribute names used by the target.

Returns:: A list containing the target variable name.
Return type:: list

get_base_statistics(subgroup, data)[source]¶

Compute basic statistics for the subgroup and dataset.

Parameters:

subgroup – The subgroup for which to compute statistics.
data (pandas.DataFrame) – The dataset.

Returns:

(instances_dataset, mean_dataset, instances_subgroup, mean_sg)

statistic_types = ('size_sg', 'size_dataset', 'mean_sg', 'mean_dataset', 'std_sg', 'std_dataset', 'median_sg', 'median_dataset', 'max_sg', 'max_dataset', 'min_sg', 'min_dataset', 'mean_lift', 'median_lift')¶

class pysubgroup.numeric_target.StandardQFNumeric(a, invert=False, estimator='default', centroid='mean')[source]¶

Standard Quality Function for numeric targets.

This quality function computes interestingness of subgroups based on the difference between subgroup mean (or median) and dataset mean (or median), weighted by the size of the subgroup raised to the power of ‘a’.

a¶

Exponent to trade off between subgroup size and difference in means.

Type:: float

invert¶

Whether to invert the quality function (not used currently).

Type:: bool

estimator¶

Strategy for optimistic estimation (‘sum’, ‘max’, ‘order’).

Type:: str

centroid¶

Central tendency measure (‘mean’, ‘median’, ‘sorted_median’).

Type:: str
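The description of this quality function can be read as subgroup size raised to the power a, multiplied by the difference between the subgroup centroid and the dataset centroid. The sketch below works through that arithmetic for the mean. The function name is hypothetical and the exact form used by the library is an assumption here.

```python
def standard_qf_numeric_sketch(a, mean_dataset, instances_subgroup, mean_sg):
    """Hypothetical helper: size ** a * (mean_sg - mean_dataset)."""
    return instances_subgroup ** a * (mean_sg - mean_dataset)


# The same subgroup (25 instances, mean 12 vs. dataset mean 10) under three exponents.
q_mean_only = standard_qf_numeric_sketch(0, 10.0, 25, 12.0)    # size is ignored
q_balanced = standard_qf_numeric_sketch(0.5, 10.0, 25, 12.0)   # sqrt(25) * 2
q_size_heavy = standard_qf_numeric_sketch(1, 10.0, 25, 12.0)   # 25 * 2
```

Small values of a reward a large shift in the mean even for small subgroups, while a = 1 rewards the total deviation accumulated over all covered instances.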

class Max_Estimator(qf)[source]¶

Bases: object

Estimator for optimistic estimate using maximum value strategy.

This estimator calculates the optimistic estimate from the maximum target value in the subgroup and the number of subgroup instances whose target value is greater than the dataset centroid.

From Florian Lemmerich’s Dissertation [section 4.2.2.1, Theorem 4 (page 82)]:

\[oe(sg) = n_{>\mu_0}^a (T^{\max}(sg) - \mu_0)\]
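The bound above can be evaluated directly from the target values of a subgroup. The following is a minimal sketch with a hypothetical function name; returning 0 when no value exceeds the dataset centroid is an assumption made for this example.

```python
def max_estimate_sketch(target_values_sg, mean_dataset, a):
    """Hypothetical helper: n_above ** a * (max value - dataset mean)."""
    n_above = sum(1 for t in target_values_sg if t > mean_dataset)
    if n_above == 0:
        return 0.0  # assumption: no refinement can exceed the dataset mean
    return n_above ** a * (max(target_values_sg) - mean_dataset)


# Two values (6 and 9) lie above the dataset mean of 5; the maximum is 9.
estimate = max_estimate_sketch([1.0, 4.0, 6.0, 9.0], mean_dataset=5.0, a=1)
```

The bound is optimistic because it pretends that every instance above the dataset mean has the maximum value; the best real refinement here, {6, 9}, only reaches 2 * (7.5 - 5) = 5.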

calculate_constant_statistics(data, target)[source]¶

Calculate constant statistics needed for estimation.

Parameters:

data (pandas.DataFrame) – The dataset.
target (NumericTarget) – The target definition.

get_data(data, target)[source]¶

Prepare data for estimation (no changes for this estimator).

Parameters:

data (pandas.DataFrame) – The dataset.
target (NumericTarget) – The target definition.

Returns:

The unmodified dataset.

Return type:

pandas.DataFrame

get_estimate(subgroup, sg_size, sg_centroid, cover_arr, _)[source]¶

Compute the optimistic estimate for the subgroup.

Parameters:

subgroup – The subgroup description.
sg_size (int) – Size of the subgroup.
sg_centroid (float) – Mean or median of the subgroup.
cover_arr (numpy.ndarray) – Boolean array indicating subgroup instances.
_ – Unused parameter.

Returns:

The optimistic estimate.

class MeanOrdering_Estimator(qf)[source]¶

Bases: object

Estimator for optimistic estimate using mean ordering strategy.

This estimator sorts the target values and computes the optimal subgroup by considering prefixes of the sorted list.
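The prefix idea can be sketched as follows, assuming the values are sorted in descending order and the quality of a prefix of length k is k ** a * (mean of the prefix - dataset mean). Both the sort direction and the function name are assumptions for illustration; the library's optimized implementation may be organized differently.

```python
def mean_ordering_estimate_sketch(target_values_sg, mean_dataset, a):
    """Hypothetical helper: best quality over all prefixes of the sorted values."""
    values = sorted(target_values_sg, reverse=True)
    best = float("-inf")
    running_sum = 0.0
    for k, value in enumerate(values, start=1):
        running_sum += value
        # Quality of the hypothetical subgroup made of the k largest values.
        quality = k ** a * (running_sum / k - mean_dataset)
        best = max(best, quality)
    return best


# Prefixes {9}, {9, 6}, {9, 6, 4}, {9, 6, 4, 1}; the best one is {9, 6}.
estimate = mean_ordering_estimate_sketch([1.0, 4.0, 6.0, 9.0], mean_dataset=5.0, a=1)
```

For a fixed size k, the k largest values give the highest possible mean, so checking every prefix covers the best subset of each size. This makes the estimate tight at the cost of sorting.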

calculate_constant_statistics(data, target)[source]¶

Set up the estimation function, possibly using Numba for speed.

Parameters:

data (pandas.DataFrame) – The dataset.
target (NumericTarget) – The target definition.

get_data(data, target)[source]¶

Prepare data by sorting according to the target variable.

Parameters:

data (pandas.DataFrame) – The dataset.
target (NumericTarget) – The target definition.

Returns:

The sorted dataset.

Return type:

pandas.DataFrame

get_estimate(subgroup, sg_size, sg_mean, cover_arr, target_values_sg)[source]¶

Compute the optimistic estimate for the subgroup.

Parameters:

subgroup – The subgroup description.
sg_size (int) – Size of the subgroup.
sg_mean (float) – Mean of the subgroup.
cover_arr (numpy.ndarray) – Boolean array indicating subgroup instances.
target_values_sg (numpy.ndarray) – Target values in the subgroup.

Returns:

The optimistic estimate.

get_estimate_numpy(values_sg, _, mean_dataset)[source]¶

Compute the optimistic estimate using NumPy.

Parameters:

values_sg (numpy.ndarray) – Sorted target values in the subgroup.
_ – Unused parameter.
mean_dataset (float) – Mean of the dataset.

Returns:

The optimistic estimate.

class Summation_Estimator(qf)[source]¶

Bases: object

Estimator for optimistic estimate using summation strategy.

This estimator calculates the optimistic estimate as the quality of a hypothetical subgroup of maximal size that contains only the instances with a value greater than the dataset mean.

From Florian Lemmerich’s Dissertation [section 4.2.2.1, Theorem 2 (page 81)]:

\[oe(sg) = \sum_{x \in sg, T(x)>\mu_0} (T(x) - \mu_0)\]
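Following the description of this estimator, the bound adds up the excess over the dataset mean for every subgroup instance whose target value lies above that mean. The sketch below assumes this reading and the case a = 1; the function name is hypothetical.

```python
def summation_estimate_sketch(target_values_sg, mean_dataset):
    """Hypothetical helper: sum of (value - dataset mean) over values above the mean."""
    return sum(t - mean_dataset for t in target_values_sg if t > mean_dataset)


# Only 6 and 9 exceed the dataset mean of 5, contributing 1 and 4.
estimate = summation_estimate_sketch([1.0, 4.0, 6.0, 9.0], mean_dataset=5.0)
```

Instances at or below the dataset mean can only lower the sum, so a refinement that drops all of them is the best case for a = 1.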

calculate_constant_statistics(data, target)[source]¶

Calculate constant statistics needed for estimation.

Parameters:

data (pandas.DataFrame) – The dataset.
target (NumericTarget) – The target definition.

get_data(data, target)[source]¶

Prepare data for estimation (no changes for this estimator).

Parameters:

data (pandas.DataFrame) – The dataset.
target (NumericTarget) – The target definition.

Returns:

The unmodified dataset.

Return type:

pandas.DataFrame

get_estimate(subgroup, sg_size, sg_centroid, cover_arr, _)[source]¶

Compute the optimistic estimate for the subgroup.

Parameters:

subgroup – The subgroup description.
sg_size (int) – Size of the subgroup.
sg_centroid (float) – Mean or median of the subgroup.
cover_arr (numpy.ndarray) – Boolean array indicating subgroup instances.
_ – Unused parameter.

Returns:

The optimistic estimate.

calculate_constant_statistics(data, target)[source]¶

Calculate statistics that remain constant for the dataset.

Parameters:

data (pandas.DataFrame) – The dataset.
target (NumericTarget) – The target definition.

calculate_statistics(subgroup, target, data, statistics=None)[source]¶

Calculate statistics specific to the subgroup.

Parameters:

subgroup – The subgroup for which to calculate statistics.
target (NumericTarget) – The target definition.
data (pandas.DataFrame) – The dataset.
statistics (any, optional) – Unused in this implementation.

Returns:

Contains size_sg, mean or median, and estimate.

Return type:

namedtuple

evaluate(subgroup, target, data, statistics=None)[source]¶

Evaluate the quality of the subgroup using the standard quality function.

Parameters:

subgroup – The subgroup to evaluate.
target (NumericTarget) – The target definition.
data (pandas.DataFrame) – The dataset.
statistics (any, optional) – Previously computed statistics.

Returns:

The computed quality value.

mean_tpl¶: alias of StandardQFNumeric_parameters

median_tpl¶: alias of StandardQFNumeric_median_parameters

optimistic_estimate(subgroup, target, data, statistics=None)[source]¶

Compute the optimistic estimate of the quality function.

Parameters:

subgroup – The subgroup for which to compute the optimistic estimate.
target (NumericTarget) – The target definition.
data (pandas.DataFrame) – The dataset.
statistics (any, optional) – Previously computed statistics.

Returns:

The optimistic estimate of the quality value.

static standard_qf_numeric(a, _, mean_dataset, instances_subgroup, mean_sg)[source]¶

Compute the standard quality function for numeric targets.

Parameters:

a (float) – Exponent for weighting the subgroup size.
_ – Unused parameter (size of dataset).
mean_dataset (float) – Mean of the target variable in the dataset.
instances_subgroup (int) – Number of instances in the subgroup.
mean_sg (float) – Mean of the target variable in the subgroup.

Returns:

The computed quality value.

tpl¶: alias of StandardQFNumeric_parameters

class pysubgroup.numeric_target.StandardQFNumericMedian[source]¶

Quality function for numeric targets using median (deprecated).

Note

This class is no longer supported. Use StandardQFNumeric with centroid='median' instead.

tpl¶: alias of StandardQFNumericMedian_parameters

class pysubgroup.numeric_target.StandardQFNumericTscore(invert=False)[source]¶

Quality function for numeric targets using T-score.

calculate_constant_statistics(data, target)[source]¶

Calculate statistics that remain constant for the dataset.

Parameters:

data (pandas.DataFrame) – The dataset.
target (NumericTarget) – The target definition.

calculate_statistics(subgroup, target, data, statistics=None)[source]¶

Calculate statistics specific to the subgroup.

Parameters:

subgroup – The subgroup for which to calculate statistics.
target (NumericTarget) – The target definition.
data (pandas.DataFrame) – The dataset.
statistics (any, optional) – Unused in this implementation.

Returns:

Contains size_sg, mean, std, and estimate.

Return type:

namedtuple

evaluate(subgroup, target, data, statistics=None)[source]¶

Evaluate the quality of the subgroup using the T-score.

Parameters:

subgroup – The subgroup to evaluate.
target (NumericTarget) – The target definition.
data (pandas.DataFrame) – The dataset.
statistics (any, optional) – Previously computed statistics.

Returns:

The computed T-score.

optimistic_estimate(subgroup, target, data, statistics=None)[source]¶

Compute the optimistic estimate of the quality function.

Parameters:

subgroup – The subgroup for which to compute the optimistic estimate.
target – The target definition.
data – The dataset.
statistics (any, optional) – Previously computed statistics.

Returns:

The optimistic estimate of the quality value.

static t_score(mean_dataset, instances_subgroup, mean_sg, std_sg)[source]¶

Compute the T-score for the subgroup.

Parameters:

mean_dataset (float) – Mean of the dataset.
instances_subgroup (int) – Number of instances in the subgroup.
mean_sg (float) – Mean of the subgroup.
std_sg (float) – Standard deviation of the subgroup.

Returns:

The computed T-score.

tpl¶: alias of StandardQFNumericTscore_parameters
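The parameters of t_score match the ingredients of a one-sample t-statistic. The sketch below assumes that form, sqrt(n) * (mean_sg - mean_dataset) / std_sg, and returns 0 for a zero standard deviation. Both choices are assumptions, and the function name is hypothetical.

```python
import math


def t_score_sketch(mean_dataset, instances_subgroup, mean_sg, std_sg):
    """Hypothetical helper: one-sample t-statistic of the subgroup mean."""
    if std_sg == 0:
        return 0.0  # assumption: avoid dividing by zero for constant subgroups
    return math.sqrt(instances_subgroup) * (mean_sg - mean_dataset) / std_sg


# 16 instances, mean 12 vs. dataset mean 10, standard deviation 4.
score = t_score_sketch(mean_dataset=10.0, instances_subgroup=16, mean_sg=12.0, std_sg=4.0)
```

Compared with the standard quality function, dividing by the standard deviation penalizes subgroups whose mean shift comes with a large spread.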

pysubgroup.numeric_target.calc_sorted_median(arr)[source]¶

Calculate the median of a sorted array.

Parameters:: arr (numpy.ndarray) – A sorted array.
Returns:: The median value.
Return type:: float
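When the input is already sorted, the median can be read off by index without sorting again. The following is a minimal sketch of that idea using a plain list; the function name and the error raised for empty input are assumptions.

```python
def sorted_median_sketch(arr):
    """Hypothetical helper: median of an already sorted sequence."""
    n = len(arr)
    if n == 0:
        raise ValueError("median of an empty sequence is undefined")
    mid = n // 2
    if n % 2 == 1:
        return float(arr[mid])            # odd length: the middle element
    return (arr[mid - 1] + arr[mid]) / 2  # even length: mean of the two middle elements


odd_median = sorted_median_sketch([1, 3, 5])
even_median = sorted_median_sketch([1, 3, 5, 7])
```

Skipping the sort matters when the median is recomputed for many subgroups of data that was sorted once up front.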

pysubgroup.numeric_target.read_mean(tpl)[source]¶

Extract the mean value from a namedtuple.

Parameters:: tpl (namedtuple) – A namedtuple containing a ‘mean’ field.
Returns:: The mean value.
Return type:: float

pysubgroup.numeric_target.read_median(tpl)[source]¶

Extract the median value from a namedtuple.

Parameters:: tpl (namedtuple) – A namedtuple containing a ‘median’ field.
Returns:: The median value.
Return type:: float

pysubgroup.permutation_test module¶

class pysubgroup.permutation_test.NegativeClassCountRandomSelector(size, negative_class_count, np_rng, positive_class_indices, negative_class_indices)[source]¶

Bases: object

A selector that covers a random subset of the given indices, such that the number of covered instances as well as the number of negative instances are always the same.

covers(data_instance)[source]¶

select()[source]¶: Randomize the cover of this selector.

property selectors¶

set_descriptions(size, negative_class_count, *args, **kwargs)[source]¶

pysubgroup.permutation_test.permutation_test(qf: any, result: any, target: SoftClassifierTarget, data: DataFrame, num_random_samples: int, np_rng: Generator = Generator(PCG64), max_random_sampling_retries: int = 10, alpha: float = 0.05, pos_label: any = 1, neg_label: any = 0, multitest_correction_method: str = 'fdr_by', tqdm: any = None)[source]¶

Test the subgroups in the result for statistical significance by comparison to qualities of random samples from the data. Random samples are drawn such that the number of instances from each class in the sample is the same as in the tested subgroup.

Only for SoftClassifierTargets.

Parameters:

qf – Quality function to use as the test statistic.
result – ps.SubgroupDiscoveryResult object holding the subgroups to test.
target – Target concept to use in the quality function.
data – Dataset to compute all qualities from. The qualities of the given subgroups are also recomputed on this data for the test.
num_random_samples – How many random samples to draw. More samples give a finer resolution of the resulting p-values.
np_rng – Random generator object to use for drawing the samples. Use this to get reproducible results.
max_random_sampling_retries – Maximum number of times to redraw each sample. A sample is redrawn when the quality is undefined on it.
pos_label – Which value in the dataset to count as a positive class.
neg_label – Which value in the dataset to count as a negative class.
multitest_correction_method – Method used to correct the p-values for multiple comparisons. Refer to statsmodels.stats.multitest.multipletests for all possible values.

Returns:

p_values_raw – Uncorrected p-values for each subgroup.
reject – Test result after multiple testing correction.
p_values_corrected – P-values after multiple testing correction.
qualities – Subgroup qualities on the testing data.
samples – The full random sample of qualities that was generated for each subgroup.
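The core of such a test can be sketched with the standard library alone: draw random covers that keep the class counts of the tested subgroup, then compare the observed quality with the qualities of those random covers. The function names are hypothetical, and the "+1" correction in the p-value is a common convention assumed here, not necessarily what the library computes.

```python
import random


def draw_class_preserving_sample(pos_indices, neg_indices, n_pos, n_neg, rng):
    """Hypothetical helper: random cover with fixed counts of each class."""
    return rng.sample(pos_indices, n_pos) + rng.sample(neg_indices, n_neg)


def permutation_p_value(observed_quality, sample_qualities):
    """Hypothetical helper: share of random qualities at least as large as the
    observed one, with a +1 correction so the p-value is never exactly 0."""
    at_least = sum(1 for q in sample_qualities if q >= observed_quality)
    return (1 + at_least) / (1 + len(sample_qualities))


rng = random.Random(0)           # seeded for reproducibility
pos_indices = list(range(0, 10))   # made-up indices of positive instances
neg_indices = list(range(10, 30))  # made-up indices of negative instances
cover = draw_class_preserving_sample(pos_indices, neg_indices, 3, 5, rng)

# One of four made-up random qualities reaches the observed value of 0.9.
p_value = permutation_p_value(0.9, [0.1, 0.5, 0.95, 0.3])
```

With n random samples the smallest attainable p-value under this convention is 1 / (n + 1), which is why more samples allow finer distinctions.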

pysubgroup.refinement_operator module¶

class pysubgroup.refinement_operator.RefinementOperator[source]¶

Bases: object

Base class for refinement operators.

class pysubgroup.refinement_operator.StaticGeneralizationOperator(selectors)[source]¶

Bases: object

Refinement operator for static generalization.

This operator generalizes subgroups by adding selectors from a predefined list, ensuring that each selector is used in a specific order.

refinements(sG)[source]¶

Generate refinements of the given subgroup.

Parameters:: sG – The subgroup to refine.
Returns:: A generator of refined subgroups.

class pysubgroup.refinement_operator.StaticSpecializationOperator(selectors)[source]¶

Bases: object

Refinement operator for static specialization.

This operator specializes subgroups by adding selectors in a predefined order, ensuring that each attribute is used only once in a subgroup description.

refinements(subgroup)[source]¶

Generate refinements of the given subgroup.

Parameters:: subgroup – The subgroup to refine.
Returns:: A generator of refined subgroups.
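The specialization rule described above can be sketched with plain tuples. In this illustration a selector is an (attribute, value) pair and a subgroup description is a tuple of selectors. The representation, the function name, and the rule of only appending selectors that come later in the list are assumptions, not the library's code.

```python
def specializations(description, selectors):
    """Hypothetical helper: yield descriptions extended by one later selector,
    skipping attributes that the description already uses."""
    # Only look at selectors after the last one used, which fixes the order.
    start = selectors.index(description[-1]) + 1 if description else 0
    used_attributes = {attribute for attribute, _ in description}
    for selector in selectors[start:]:
        if selector[0] not in used_attributes:
            yield description + (selector,)


selectors = [("sex", "f"), ("sex", "m"), ("age", "<30"), ("age", ">=30")]
first_level = list(specializations((), selectors))
second_level = list(specializations((("sex", "f"),), selectors))
```

Fixing the order means every conjunction is generated exactly once, and skipping used attributes avoids contradictory descriptions such as sex = f AND sex = m.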

pysubgroup.representations module¶

class pysubgroup.representations.BitSetRepresentation(df, selectors_to_patch)[source]¶

Bases: RepresentationBase

Representation class that uses bitsets for selectors and conjunctions.

Conjunction¶: alias of BitSet_Conjunction

Disjunction¶: alias of BitSet_Disjunction

patch_classes()[source]¶: Patch class-level attributes before entering the context.

patch_selector(sel)[source]¶

Patch a selector by computing its bitset representation.

Parameters:: sel – Selector to patch.

class pysubgroup.representations.BitSet_Conjunction(*args, **kwargs)[source]¶

Bases: Conjunction

Conjunction subclass that uses bitsets for representation.

Provides efficient computation of the conjunction using numpy boolean arrays.

append_and(to_append)[source]¶

Append a selector using logical AND and update the representation.

Parameters:: to_append – Selector to append.

compute_representation()[source]¶

Compute the bitset representation of the conjunction.

Returns:: Numpy boolean array representing the instances covered by the conjunction.

n_instances = 0¶

property size_sg¶: Size of the subgroup represented by the conjunction.
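The idea behind a bitset representation is that each selector's cover is computed once as a boolean mask, and a conjunction is then the element-wise AND of those masks. The sketch below uses plain Python lists instead of numpy arrays to stay dependency-free; the function name is hypothetical.

```python
def conjunction_cover(selector_masks, n_instances):
    """Hypothetical helper: element-wise AND over precomputed boolean masks."""
    cover = [True] * n_instances  # an empty conjunction covers every instance
    for mask in selector_masks:
        cover = [c and m for c, m in zip(cover, mask)]
    return cover


mask_a = [True, True, False, True]
mask_b = [True, False, False, True]
cover = conjunction_cover([mask_a, mask_b], n_instances=4)
size_sg = sum(cover)  # subgroup size is the number of set bits
```

Appending one more selector only needs a single additional AND against the existing cover, which is what makes this representation efficient during search.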

class pysubgroup.representations.BitSet_Disjunction(*args, **kwargs)[source]¶

Bases: Disjunction

Disjunction subclass that uses bitsets for representation.

Provides efficient computation of the disjunction using numpy boolean arrays.

append_or(to_append)[source]¶

Append a selector using logical OR and update the representation.

Parameters:: to_append – Selector to append.

compute_representation()[source]¶

Compute the bitset representation of the disjunction.

Returns:: Numpy boolean array representing the instances covered by the disjunction.

property size_sg¶: Size of the subgroup represented by the disjunction.

class pysubgroup.representations.NumpySetRepresentation(df, selectors_to_patch)[source]¶

Bases: RepresentationBase

Representation class that uses numpy arrays for selectors and conjunctions.

Conjunction¶: alias of NumpySet_Conjunction

patch_classes()[source]¶: Patch class-level attributes before entering the context.

patch_selector(sel)[source]¶

Patch a selector by computing its numpy array representation.

Parameters:: sel – Selector to patch.

class pysubgroup.representations.NumpySet_Conjunction(*args, **kwargs)[source]¶

Bases: Conjunction

Conjunction subclass that uses numpy arrays for set representation.

all_set = None¶

append_and(to_append)[source]¶

Append a selector using logical AND and update the representation.

Parameters:: to_append – Selector to append.

compute_representation()[source]¶

Compute the numpy array representation of the conjunction.

Returns:: Numpy array of indices representing the instances covered by the conjunction.

property size_sg¶: Size of the subgroup represented by the conjunction.

class pysubgroup.representations.RepresentationBase(new_conjunction, selectors_to_patch)[source]¶

Bases: object

Base class for different representation strategies.

Provides methods to patch selectors and manage class-level patches. Can be used as a context manager to ensure patches are applied and removed properly.

patch_all_selectors()[source]¶: Patch all selectors in the selectors_to_patch list.

patch_classes()[source]¶

Patch the required classes.

Can be overridden by subclasses to patch class-level attributes or methods.

patch_selector(sel)[source]¶

Patch a single selector.

This method should be implemented by subclasses.

undo_patch_classes()[source]¶

Undo patches applied to classes.

Can be overridden by subclasses to remove class-level patches.
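The patch-and-undo pattern described for this base class maps naturally onto Python's context manager protocol. The sketch below shows the general pattern with hypothetical classes; it does not reproduce the library's attributes or method names.

```python
class Greeter:
    """Hypothetical class standing in for a class that gets patched."""
    greeting = "hello"


class PatchSketch:
    """Hypothetical context manager that patches one class attribute."""

    def __init__(self, cls, attribute, value):
        self.cls, self.attribute, self.value = cls, attribute, value
        self._original = None

    def __enter__(self):
        # Apply the class-level patch and remember the original value.
        self._original = getattr(self.cls, self.attribute)
        setattr(self.cls, self.attribute, self.value)
        return self

    def __exit__(self, exc_type, exc, tb):
        # Undo the patch, also when the body raised an exception.
        setattr(self.cls, self.attribute, self._original)
        return False  # do not swallow exceptions


with PatchSketch(Greeter, "greeting", "patched"):
    inside = Greeter.greeting
outside = Greeter.greeting
```

Tying the undo step to `__exit__` guarantees that class-level changes do not leak out of the `with` block.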

class pysubgroup.representations.SetRepresentation(df, selectors_to_patch)[source]¶

Bases: RepresentationBase

Representation class that uses sets for selectors and conjunctions.

Conjunction¶: alias of Set_Conjunction

patch_classes()[source]¶: Patch class-level attributes before entering the context.

patch_selector(sel)[source]¶

Patch a selector by computing its set representation.

Parameters:: sel – Selector to patch.

class pysubgroup.representations.Set_Conjunction(*args, **kwargs)[source]¶

Bases: Conjunction

Conjunction subclass that uses sets for representation.

all_set = {}¶

append_and(to_append)[source]¶

Append a selector using logical AND and update the representation.

Parameters:: to_append – Selector to append.

compute_representation()[source]¶

Compute the set representation of the conjunction.

Returns:: Set of indices representing the instances covered by the conjunction.

property size_sg¶: Size of the subgroup represented by the conjunction.

pysubgroup.subgroup_description module¶

Created on 28.04.2016

@author: lemmerfn

class pysubgroup.subgroup_description.BooleanExpressionBase[source]¶

Bases: ABC

Base class for boolean expressions (conjunctions and disjunctions).

abstractmethod append_and(to_append)[source]¶: Append a selector or expression using logical AND.

abstractmethod append_or(to_append)[source]¶: Append a selector or expression using logical OR.

class pysubgroup.subgroup_description.Conjunction(selectors)[source]¶

Bases: BooleanExpressionBase

Conjunction of selectors (logical AND).

append_and(to_append)[source]¶: Append a selector or conjunction using logical AND.

append_or(to_append)[source]¶: Append a selector or expression using logical OR (not supported).

covers(instance)[source]¶

Determine which instances are covered by the conjunction.

Parameters:: instance – pandas DataFrame containing the data.
Returns:: A boolean array indicating which instances are covered.

property depth¶: Return the number of selectors in the conjunction.

static from_str(s)[source]¶

Create a Conjunction from a string representation.

Parameters:: s – String representation of the conjunction.
Returns:: A Conjunction instance.

pop_and()[source]¶: Remove and return the last selector added using AND.

pop_or()[source]¶: Pop operation for OR is not supported in Conjunction.

property selectors¶: Return the selectors in the conjunction as a tuple.
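How a conjunction combines the covers of its selectors can be sketched with two small classes. These are hypothetical stand-ins that operate on a list of dictionaries rather than a pandas DataFrame, and they are not the library's Conjunction or EqualitySelector.

```python
class EqualitySketch:
    """Hypothetical selector: attribute == value."""

    def __init__(self, attribute_name, attribute_value):
        self.attribute_name = attribute_name
        self.attribute_value = attribute_value

    def covers(self, rows):
        return [row[self.attribute_name] == self.attribute_value for row in rows]


class ConjunctionSketch:
    """Hypothetical conjunction: a row is covered if every selector covers it."""

    def __init__(self, selectors):
        self.selectors = tuple(selectors)

    @property
    def depth(self):
        return len(self.selectors)

    def covers(self, rows):
        cover = [True] * len(rows)
        for selector in self.selectors:
            cover = [c and s for c, s in zip(cover, selector.covers(rows))]
        return cover


rows = [{"sex": "f", "age": 25}, {"sex": "m", "age": 25}, {"sex": "f", "age": 40}]
conjunction = ConjunctionSketch([EqualitySketch("sex", "f"), EqualitySketch("age", 25)])
cover = conjunction.covers(rows)
```

Only the first row satisfies both selectors, so it is the only one covered; the depth of this conjunction is 2.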

class pysubgroup.subgroup_description.DNF(selectors=None)[source]¶

Bases: Disjunction

Disjunctive Normal Form expression.

append_and(to_append)[source]¶: Append a selector using logical AND to all conjunctions.

append_or(to_append)[source]¶: Append a selector or conjunction using logical OR.

pop_and()[source]¶: Remove and return the last selector added using AND from all conjunctions.

class pysubgroup.subgroup_description.Disjunction(selectors=None)[source]¶

Bases: BooleanExpressionBase

Disjunction of selectors (logical OR).

append_and(to_append)[source]¶: Append a selector or expression using logical AND (not supported).

append_or(to_append)[source]¶: Append a selector or disjunction using logical OR.

covers(instance)[source]¶

Determine which instances are covered by the disjunction.

Parameters:: instance – pandas DataFrame containing the data.
Returns:: A boolean array indicating which instances are covered.

property selectors¶: Return the selectors in the disjunction as a tuple.

class pysubgroup.subgroup_description.EqualitySelector(*args, **kwargs)[source]¶

Bases: SelectorBase

Selector that checks for equality with a specific value.

property attribute_name¶: Name of the attribute.

property attribute_value¶: Value of the attribute to compare for equality.

classmethod compute_descriptions(attribute_name, attribute_value, selector_name)[source]¶: Compute the descriptions (hash, query, string) for the selector.

covers(data)[source]¶

Determine which instances in data are covered by this selector.

Parameters:: data – pandas DataFrame containing the data.
Returns:: A boolean array indicating which instances are covered.

static from_str(s)[source]¶

Create an EqualitySelector from a string representation.

Parameters:: s – String representation of the selector.
Returns:: An EqualitySelector instance.

property selectors¶: Return the selector itself as a tuple (for compatibility).

set_descriptions(attribute_name, attribute_value, selector_name=None)[source]¶: Set the descriptions (query, string, hash) for the selector.

class pysubgroup.subgroup_description.IntervalSelector(*args, **kwargs)[source]¶

Bases: SelectorBase

Selector that checks if a value is within an interval.

property attribute_name¶: Name of the attribute.

classmethod compute_descriptions(attribute_name, lower_bound, upper_bound, selector_name=None)[source]¶: Compute the descriptions (hash, query, string) for the interval selector.

classmethod compute_string(attribute_name, lower_bound, upper_bound, rounding_digits)[source]¶: Compute the string representation of the interval selector.

covers(data_instance)[source]¶

Determine which instances are covered by this interval selector.

Parameters:: data_instance – pandas DataFrame containing the data.
Returns:: A boolean array indicating which instances are within the interval.

static from_str(s)[source]¶

Create an IntervalSelector from a string representation.

Parameters:: s – String representation of the interval selector.
Returns:: An IntervalSelector instance.

property lower_bound¶: Lower bound of the interval (inclusive).

property selectors¶: Return the selector itself as a tuple (for compatibility).

set_descriptions(attribute_name, lower_bound, upper_bound, selector_name=None)[source]¶: Set the descriptions (hash, query, string) for the interval selector.

property upper_bound¶: Upper bound of the interval (exclusive).
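
Since the lower bound is documented as inclusive and the upper bound as exclusive, the selector describes a half-open interval. A minimal sketch of that membership test, using a hypothetical helper `interval_covers` on a plain list of numbers:

```python
def interval_covers(values, lower_bound, upper_bound):
    """Half-open interval test: lower_bound <= value < upper_bound."""
    return [lower_bound <= value < upper_bound for value in values]


values = [0.0, 1.0, 2.5, 3.0]
print(interval_covers(values, 1.0, 3.0))            # [False, True, True, False]
# An unbounded side can be expressed with an infinite bound.
print(interval_covers(values, float("-inf"), 1.0))  # [True, False, False, False]
```

Half-open intervals tile the number line without gaps or overlaps, so adjacent bins never cover the same value twice.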

class pysubgroup.subgroup_description.NegatedSelector(*args, **kwargs)[source]¶

Bases: SelectorBase

Selector that negates another selector.

property attribute_name¶: Name of the attribute.

covers(data_instance)[source]¶

Determine which instances are not covered by the underlying selector.

Parameters:: data_instance – pandas DataFrame containing the data.
Returns:: A boolean array indicating which instances are not covered.

property selectors¶: Return the selector itself as a tuple (for compatibility).

set_descriptions(selector)[source]¶: Set the descriptions (query, hash) for the negated selector.

class pysubgroup.subgroup_description.SelectorBase(*args, **kwargs)[source]¶

Bases: ABC

Base class for selectors, ensuring each selector instance is unique.

abstractmethod set_descriptions(*args, **kwargs)[source]¶: Set the descriptions for the selector.
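
One common way to make equal selectors share a single instance is to cache instances in `__new__`, keyed by the constructor arguments. The class below is a hypothetical sketch of that pattern; the source does not show how SelectorBase achieves uniqueness, so this should not be read as its implementation.

```python
class InternedSelector:
    """Hypothetical selector whose instances are cached by their arguments."""

    _instances = {}

    def __new__(cls, attribute_name, attribute_value):
        key = (cls.__name__, attribute_name, attribute_value)
        if key not in cls._instances:
            instance = super().__new__(cls)
            instance.attribute_name = attribute_name
            instance.attribute_value = attribute_value
            cls._instances[key] = instance
        return cls._instances[key]


a = InternedSelector("sex", "f")
b = InternedSelector("sex", "f")
c = InternedSelector("sex", "m")
print(a is b, a is c)  # True False
```

With interned instances, identity checks and hashing are cheap, which helps when the same selector appears in many candidate subgroups.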

pysubgroup.subgroup_description.create_nominal_selectors(data, ignore=None)[source]¶

Create equality selectors for nominal attributes.

Parameters:

data – pandas DataFrame containing the data.
ignore – List of attribute names to ignore.

Returns:

List of EqualitySelector instances.

pysubgroup.subgroup_description.create_nominal_selectors_for_attribute(data, attribute_name, dtypes=None)[source]¶

Create equality selectors for a nominal attribute.

Parameters:

data – pandas DataFrame containing the data.
attribute_name – Name of the attribute.
dtypes – Data types of the data columns.

Returns:

List of EqualitySelector instances for the attribute.

pysubgroup.subgroup_description.create_numeric_selectors(data, nbins=5, intervals_only=True, weighting_attribute=None, ignore=None)[source]¶

Create selectors for numeric attributes.

Parameters:

data – pandas DataFrame containing the data.
nbins – Number of bins to use for discretization.
intervals_only – If True, only create interval selectors.
weighting_attribute – Optional attribute for weighting.
ignore – List of attribute names to ignore.

Returns:

List of numeric selectors.

pysubgroup.subgroup_description.create_numeric_selectors_for_attribute(data, attr_name, nbins=5, intervals_only=True, weighting_attribute=None)[source]¶

Create selectors for a numeric attribute.

Parameters:

data – pandas DataFrame containing the data.
attr_name – Name of the attribute.
nbins – Number of bins to use for discretization.
intervals_only – If True, only create interval selectors.
weighting_attribute – Optional attribute for weighting.

Returns:

List of numeric selectors for the attribute.

pysubgroup.subgroup_description.create_selectors(data, nbins=5, intervals_only=True, ignore=None)[source]¶

Create a list of selectors for all attributes in the data.

Parameters:

data – pandas DataFrame containing the data.
nbins – Number of bins to use for numeric attributes.
intervals_only – If True, only create interval selectors for numeric attributes.
ignore – List of attribute names to ignore.

Returns:

List of selectors.

pysubgroup.subgroup_description.get_cover_array_and_size(subgroup, data_len=None, data=None)[source]¶

Compute the cover array and its size for a given subgroup.

Parameters:

subgroup – The subgroup for which to compute the cover array and size.
data_len – Optional length of the data.
data – Optional data.

Returns:

Tuple of (cover array, size).

pysubgroup.subgroup_description.get_size(subgroup, data_len=None, data=None)[source]¶

Compute the size of the cover array for a given subgroup.

Parameters:

subgroup – The subgroup for which to compute the size.
data_len – Optional length of the data.
data – Optional data.

Returns:

Size of the cover array.

pysubgroup.subgroup_description.pandas_sparse_eq(col, value)[source]¶

Compare a pandas sparse column to a value.

Parameters:

col – pandas Series with SparseArray data.
value – The value to compare with.

Returns:

A pandas SparseArray of booleans indicating where col equals value.

pysubgroup.subgroup_description.remove_target_attributes(selectors, target)[source]¶

Remove selectors that are based on target attributes.

Parameters:

selectors – List of selectors.
target – The target object with get_attributes method.

Returns:

List of selectors not based on target attributes.
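
The filtering described above amounts to dropping every selector whose attribute appears in a set of excluded names. A small sketch, assuming a hypothetical `Selector` namedtuple with an `attribute_name` field in place of pysubgroup's selector classes:

```python
from collections import namedtuple

# Hypothetical stand-in for a selector object.
Selector = namedtuple("Selector", ["attribute_name", "attribute_value"])


def drop_selectors_on(selectors, attribute_names):
    """Keep only selectors whose attribute is not in attribute_names."""
    blocked = set(attribute_names)
    return [s for s in selectors if s.attribute_name not in blocked]


selectors = [Selector("age", 30), Selector("survived", 1), Selector("sex", "f")]
kept = drop_selectors_on(selectors, ["survived"])
print([s.attribute_name for s in kept])  # ['age', 'sex']
```

Removing selectors on the target attribute avoids trivial subgroups that merely restate the target.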

pysubgroup.utils module¶

Created on 02.05.2016

@author: lemmerfn

class pysubgroup.utils.BaseTarget[source]¶

Bases: object

Base class for defining targets in subgroup discovery.

Provides a method to check if all required statistics are present.

all_statistics_present(cached_statistics)[source]¶

Checks if all required statistics are present in the cached statistics.

Parameters:: cached_statistics (dict) – The dictionary of cached statistics.
Returns:: True if all required statistics are present, False otherwise.
Return type:: bool
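
A check of this kind can be sketched as a membership test over the required statistic names. The tuple `REQUIRED_STATISTICS` and the function name below are assumptions for illustration; a concrete target would supply its own list of names.

```python
# Hypothetical list of required statistic names.
REQUIRED_STATISTICS = ("size_sg", "positives_sg")


def has_all_statistics(cached_statistics, required=REQUIRED_STATISTICS):
    """True only if cached_statistics is a dict containing every required key."""
    return isinstance(cached_statistics, dict) and all(
        name in cached_statistics for name in required
    )


print(has_all_statistics({"size_sg": 10, "positives_sg": 4}))  # True
print(has_all_statistics({"size_sg": 10}))                     # False
print(has_all_statistics(None))                                # False
```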

class pysubgroup.utils.SubgroupDiscoveryResult(results, task)[source]¶

Bases: object

Represents the result of a subgroup discovery task.

Contains methods to convert results to different formats.

to_dataframe(statistics_to_show=None, autoround=False, include_target=False)[source]¶

Converts the results to a pandas DataFrame.

Parameters:

statistics_to_show (list, optional) – The statistics to include in the DataFrame.
autoround (bool) – If True, automatically rounds numerical columns.
include_target (bool) – If True, includes the target in the DataFrame.

Returns:

A pandas DataFrame representing the results.

Return type:

DataFrame

to_descriptions(include_stats=False)[source]¶

Converts the results to a list of subgroup descriptions.

Parameters:: include_stats (bool) – If True, includes statistics in the output.
Returns:: A list of subgroup descriptions.
Return type:: list

to_latex(statistics_to_show=None, escape_underscore=True)[source]¶

Converts the results to a LaTeX-formatted table.

Parameters:

statistics_to_show (list, optional) – The statistics to include in the LaTeX table.
escape_underscore (bool) – If True, escapes underscores in strings.

Returns:

A string containing the LaTeX-formatted table.

Return type:

str

to_table(statistics_to_show=None, print_header=True, include_target=False)[source]¶

Converts the results to a table format.

Parameters:

statistics_to_show (list, optional) – The statistics to include in the table.
print_header (bool) – If True, includes a header row.
include_target (bool) – If True, includes the target in the table.

Returns:

A list of rows representing the table.

Return type:

list

pysubgroup.utils.add_if_required(result, sg, quality, task: SubgroupDiscoveryTask, check_for_duplicates=False, statistics=None, explicit_result_set_size=None)[source]¶

Adds a subgroup to the result set if it meets the required quality and constraints.

Important

Only add or remove subgroups from result using heappop and heappush, so that the heap ordering by quality is preserved.

Parameters:

result (list) – The current list of subgroups (heap).
sg – The subgroup to potentially add.
quality (float) – The quality of the subgroup.
task (SubgroupDiscoveryTask) – The task containing parameters and constraints.
check_for_duplicates (bool) – If True, checks for duplicates before adding.
statistics (optional) – Precomputed statistics for the subgroup.
explicit_result_set_size (int, optional) – Overrides the task’s result_set_size.

Returns:

None
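
The heap discipline in the note above can be sketched with the standard `heapq` module: the result list is a min-heap of `(quality, description)` tuples, so the weakest kept subgroup is always at index 0. The function names and the flat parameters (`result_set_size`, `min_quality`) are simplifications assumed for this example; the library passes a task object instead.

```python
import heapq


def sketch_minimum_quality(result, result_set_size, min_quality):
    """Quality a candidate must reach: min_quality until the heap is full,
    afterwards the quality of the weakest kept entry."""
    if len(result) < result_set_size:
        return min_quality
    return result[0][0]


def sketch_add(result, description, quality, result_set_size,
               min_quality=float("-inf")):
    """Push a candidate if it is good enough, then trim to result_set_size."""
    if quality < sketch_minimum_quality(result, result_set_size, min_quality):
        return
    heapq.heappush(result, (quality, description))
    while len(result) > result_set_size:
        heapq.heappop(result)  # drops the lowest-quality entry


result = []
for name, quality in [("a", 0.1), ("b", 0.5), ("c", 0.3), ("d", 0.05), ("e", 0.4)]:
    sketch_add(result, name, quality, result_set_size=3)

print(sorted(result, reverse=True))  # [(0.5, 'b'), (0.4, 'e'), (0.3, 'c')]
```

Mutating the list by any other means (for example `append` or `sort` followed by slicing) can break the heap invariant, after which `result[0]` is no longer guaranteed to be the weakest entry.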

pysubgroup.utils.conditional_invert(val, invert)[source]¶

Conditionally inverts a value based on a boolean flag.

Parameters:

val (float) – The value to potentially invert.
invert (bool) – If True, the value is inverted.

Returns:

The (possibly inverted) value.

Return type:

float

pysubgroup.utils.count_bits(bitset_as_int)[source]¶

Counts the number of set bits (1s) in a bitset represented as an integer.

Parameters:: bitset_as_int (int) – The bitset represented as an integer.
Returns:: The number of set bits.
Return type:: int
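
For a non-negative Python integer used as a bitset, the number of set bits can be obtained from its binary string. This is a minimal sketch with a hypothetical function name, not necessarily how the library counts.

```python
def count_set_bits(bitset_as_int):
    """Count the 1-bits of a non-negative integer via its binary string."""
    return bin(bitset_as_int).count("1")


print(count_set_bits(0b101101))  # 4
print(count_set_bits(0))         # 0
```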

pysubgroup.utils.create_subgroup_with_representation(data, selectors)[source]¶

Create an object representing the conjunction of the given selectors, including a bitmask indicating which instances in the dataset are covered.

Parameters:

data – The dataset on which to evaluate the cover.
selectors – List of selectors that form the conjunction.

pysubgroup.utils.derive_effective_sample_size(weights)[source]¶

Calculates the effective sample size for weighted data.

Parameters:: weights (array-like) – The weights assigned to the samples.
Returns:: The effective sample size.
Return type:: float
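
The source does not state which formula is used. A widely used definition is Kish's effective sample size, (sum of weights)^2 / (sum of squared weights), and the sketch below assumes that definition; the function name is hypothetical.

```python
def kish_effective_sample_size(weights):
    """Kish's formula: (sum w)^2 / sum(w^2). Assumes at least one nonzero weight."""
    total = sum(weights)
    total_of_squares = sum(w * w for w in weights)
    return total * total / total_of_squares


print(kish_effective_sample_size([1, 1, 1, 1]))  # 4.0
print(kish_effective_sample_size([3, 1]))        # 1.6
```

Equal weights give back the plain sample size, while uneven weights reduce it.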

pysubgroup.utils.equal_frequency_discretization(data, attribute_name, nbins=5, weighting_attribute=None)[source]¶

Discretizes a numerical attribute into bins with approximately equal frequency.

Parameters:

data (DataFrame) – The dataset containing the attribute to discretize.
attribute_name (str) – The name of the attribute to discretize.
nbins (int) – The number of bins to create.
weighting_attribute (str, optional) – An optional attribute to weight the instances.

Returns:

A list of cutpoints defining the bins.

Return type:

list
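
To make the idea of equal-frequency cutpoints concrete, here is an unweighted sketch on a plain list of numbers. It picks the value at every n/nbins-th position of the sorted data and skips duplicates; the library's exact choice of cutpoints and its handling of weights may differ, and the function name is hypothetical.

```python
def equal_frequency_cutpoints(values, nbins=5):
    """Return up to nbins - 1 cutpoints that split sorted values into
    bins of roughly equal size (unweighted sketch)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        return []
    cutpoints = []
    for i in range(1, nbins):
        candidate = ordered[(i * n) // nbins]
        if candidate not in cutpoints:  # repeated values yield fewer bins
            cutpoints.append(candidate)
    return cutpoints


print(equal_frequency_cutpoints(range(1, 11), nbins=5))  # [3, 5, 7, 9]
```

Read as half-open intervals, these cutpoints split the ten values into five bins of two values each.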

pysubgroup.utils.find_set_bits(bitset_as_int)[source]¶

Finds the indices of set bits in a bitset represented as an integer.

Parameters:: bitset_as_int (int) – The bitset represented as an integer.
Yields:: int – The index of each set bit.
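
A generator of this kind can be sketched by shifting the integer right one bit at a time. The sketch assumes a non-negative input and numbers bits from the least significant one (index 0); whether the library uses the same ordering is not stated in the source.

```python
def iter_set_bits(bitset_as_int):
    """Yield indices of 1-bits, least significant bit first.
    Assumes bitset_as_int >= 0 (a negative value would never reach zero)."""
    index = 0
    while bitset_as_int:
        if bitset_as_int & 1:
            yield index
        bitset_as_int >>= 1
        index += 1


print(list(iter_set_bits(0b101101)))  # [0, 2, 3, 5]
```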

pysubgroup.utils.float_formatter(x, digits=2)[source]¶

Formats a float to a specified number of decimal places.

Parameters:

x (float) – The value to format.
digits (int) – The number of decimal places.

Returns:

The formatted string.

Return type:

str

pysubgroup.utils.intersect_of_ordered_list(list_1, list_2)[source]¶

Computes the intersection of two ordered lists.

Parameters:

list_1 (list) – The first ordered list.
list_2 (list) – The second ordered list.

Returns:

The intersection of the two lists.

Return type:

list
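
When both lists are sorted in ascending order, their intersection can be computed in a single pass with two indices. The sketch below shows that standard merge-style technique under a hypothetical name; it is not taken from the library's source.

```python
def intersect_sorted(list_1, list_2):
    """Two-pointer intersection of two ascending lists, in O(len1 + len2)."""
    i = j = 0
    result = []
    while i < len(list_1) and j < len(list_2):
        if list_1[i] < list_2[j]:
            i += 1
        elif list_1[i] > list_2[j]:
            j += 1
        else:
            result.append(list_1[i])
            i += 1
            j += 1
    return result


print(intersect_sorted([1, 3, 4, 7, 9], [2, 3, 7, 8, 9]))  # [3, 7, 9]
```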

pysubgroup.utils.is_categorical_attribute(data, attribute_name)[source]¶

Determines if an attribute in the dataset is categorical.

Parameters:

data (DataFrame) – The dataset.
attribute_name (str) – The name of the attribute.

Returns:

True if the attribute is categorical, False otherwise.

Return type:

bool

pysubgroup.utils.is_numerical_attribute(data, attribute_name)[source]¶

Determines if an attribute in the dataset is numerical.

Parameters:

data (DataFrame) – The dataset.
attribute_name (str) – The name of the attribute.

Returns:

True if the attribute is numerical, False otherwise.

Return type:

bool

pysubgroup.utils.minimum_required_quality(result, task)[source]¶

Determines the minimum quality required for a subgroup to be considered for inclusion in the result set.

Parameters:

result (list) – The current list of subgroups (heap).
task (SubgroupDiscoveryTask) – The task containing parameters like result_set_size and min_quality.

Returns:

The minimum required quality for a subgroup to be added to the result set.

Return type:

float

pysubgroup.utils.overlap(sg, another_sg, data)[source]¶

Calculates the Jaccard similarity between two subgroups based on their coverage.

Parameters:

sg – The first subgroup.
another_sg – The second subgroup.
data (DataFrame) – The dataset.

Returns:

The Jaccard similarity between the two subgroups.

Return type:

float
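
The Jaccard similarity of two covers is the number of instances covered by both subgroups divided by the number covered by at least one. A minimal sketch on boolean lists follows; the function name is hypothetical, and returning 0.0 when neither subgroup covers anything is an assumption of this example.

```python
def jaccard_of_covers(cover_a, cover_b):
    """|A and B| / |A or B| for two equally long boolean cover lists."""
    intersection = sum(1 for a, b in zip(cover_a, cover_b) if a and b)
    union = sum(1 for a, b in zip(cover_a, cover_b) if a or b)
    return intersection / union if union else 0.0  # assumed convention


cover_a = [True, True, False, False]
cover_b = [True, False, True, False]
print(jaccard_of_covers(cover_a, cover_b))  # one shared instance out of three covered
```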

pysubgroup.utils.perc_formatter(x)[source]¶

Formats a float as a percentage string with one decimal place.

Parameters:: x (float) – The value to format.
Returns:: The formatted percentage string.
Return type:: str

pysubgroup.utils.powerset(iterable, max_length=None)[source]¶

Generates the power set (all possible combinations) of an iterable up to a maximum length.

Parameters:

iterable (iterable) – The iterable to generate combinations from.
max_length (int, optional) – The maximum length of combinations.

Returns:

An iterator over the power set of the iterable.

Return type:

iterator
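
A length-bounded power set can be built from `itertools.combinations`, following the well-known recipe in the itertools documentation. In this sketch the empty combination is included and `max_length` is treated as inclusive; both choices are assumptions, and the library's boundaries may differ.

```python
from itertools import chain, combinations


def bounded_powerset(iterable, max_length=None):
    """Lazily yield all combinations of sizes 0..max_length (inclusive)."""
    items = list(iterable)
    if max_length is None:
        max_length = len(items)
    return chain.from_iterable(
        combinations(items, r) for r in range(max_length + 1)
    )


print(list(bounded_powerset("abc", max_length=2)))
# [(), ('a',), ('b',), ('c',), ('a', 'b'), ('a', 'c'), ('b', 'c')]
```

Because the result is an iterator, large search spaces are enumerated lazily rather than materialized up front.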

pysubgroup.utils.prepare_subgroup_discovery_result(result, task)[source]¶

Filters and sorts the result set of subgroups according to the task parameters.

Parameters:

result (list) – The list of subgroups (heap).
task (SubgroupDiscoveryTask) – The task containing parameters like result_set_size and min_quality.

Returns:

The filtered and sorted list of subgroups.

Return type:

list
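
Filtering and sorting a result heap can be sketched in three steps: drop entries below the quality threshold, sort by quality in descending order, and keep the best `result_set_size` entries. The function name, the flat parameters, and the choice of a strict comparison against `min_quality` are assumptions of this example.

```python
def sketch_prepare_result(result, result_set_size, min_quality):
    """result: list of (quality, description) tuples, in any order."""
    kept = [entry for entry in result if entry[0] > min_quality]  # strictness assumed
    kept.sort(key=lambda entry: entry[0], reverse=True)
    return kept[:result_set_size]


heap = [(0.05, "d"), (0.3, "c"), (0.5, "b"), (0.4, "e")]
print(sketch_prepare_result(heap, result_set_size=2, min_quality=0.1))
# [(0.5, 'b'), (0.4, 'e')]
```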

pysubgroup.utils.remove_selectors_with_attributes(selector_list, attribute_list)[source]¶

Removes selectors that are based on specified attributes.

Parameters:

selector_list (list) – The list of selectors to filter.
attribute_list (list) – The list of attribute names to remove selectors for.

Returns:

The filtered list of selectors.

Return type: