pysubgroup package

Submodules

pysubgroup.algorithms module

Created on 29.04.2016

@author: lemmerfn

class pysubgroup.algorithms.Apriori(representation_type=None, combination_name='Conjunction', use_numba=True)[source]

Bases: object

Implementation of the Apriori algorithm for subgroup discovery.

This class provides methods to perform level-wise search for subgroups using the Apriori algorithm.

execute(task)[source]

Executes the Apriori algorithm on the given task.

Parameters:

task – The subgroup discovery task to be executed.

Returns:

A SubgroupDiscoveryResult containing the discovered subgroups.

get_next_level(promising_candidates)[source]

Generates the next level of candidates based on the current promising candidates.

Parameters:

promising_candidates – A list of promising candidate selectors.

Returns:

A list of new candidate selectors for the next level.
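The level-wise candidate generation described above can be illustrated with a minimal sketch. This is not pysubgroup's implementation: it assumes candidates are sorted tuples of selector names, and `next_level` is a hypothetical helper written only for this example.

```python
from itertools import combinations


def next_level(promising):
    """Join k-item candidates that share a (k-1)-item prefix."""
    promising = sorted(promising)
    known = set(promising)
    candidates = []
    for i, left in enumerate(promising):
        for right in promising[i + 1:]:
            # Sorted order keeps equal prefixes adjacent, so stop at the first mismatch.
            if left[:-1] != right[:-1]:
                break
            joined = left + (right[-1],)
            # Apriori pruning: every k-item subset must itself be promising.
            if all(sub in known for sub in combinations(joined, len(left))):
                candidates.append(joined)
    return candidates


level_2 = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]
level_3 = next_level(level_2)
```

Here `("b", "c", "d")` is discarded because its subset `("c", "d")` is not among the promising candidates.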

get_next_level_candidates(task, result, next_level_candidates)[source]

Evaluates candidates at the current level and filters promising ones for the next level.

Parameters:
  • task – The subgroup discovery task.

  • result – The current list of discovered subgroups.

  • next_level_candidates – List of subgroups to be evaluated at the current level.

Returns:

A list of promising candidates (selectors) for the next level.

get_next_level_candidates_vectorized(task, result, next_level_candidates)[source]

Vectorized evaluation of candidates at the current level to filter promising ones for the next level.

Parameters:
  • task – The subgroup discovery task.

  • result – The current list of discovered subgroups.

  • next_level_candidates – List of subgroups to be evaluated at the current level.

Returns:

A list of promising candidates (selectors) for the next level.

get_next_level_numba(promising_candidates)[source]

Generates the next level of candidates using Numba for acceleration.

Parameters:

promising_candidates – A list of promising candidate selectors.

Returns:

A list of new candidate selectors for the next level.

class pysubgroup.algorithms.BeamSearch(beam_width=20, beam_width_adaptive=False)[source]

Bases: object

Implements the Beam Search algorithm for subgroup discovery.

execute(task)[source]

Executes the Beam Search algorithm on the given task.

Parameters:

task – The subgroup discovery task to be executed.

Returns:

A SubgroupDiscoveryResult containing the discovered subgroups.
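The general idea of a beam search can be sketched as follows. This is a simplified illustration under stated assumptions, not the class's actual code: descriptions are tuples of selector names, `ROWS` is a made-up toy dataset, and `wracc` and `beam_search` are hypothetical helpers.

```python
# Toy data: (set of selectors that cover the row, binary target value).
ROWS = [
    ({"young", "male"}, 1),
    ({"young", "female"}, 1),
    ({"old", "male"}, 0),
    ({"old", "female"}, 1),
    ({"young", "male"}, 1),
    ({"old", "male"}, 0),
]


def wracc(description):
    """Weighted relative accuracy of a conjunction of selectors."""
    covered = [target for items, target in ROWS if set(description) <= items]
    if not covered:
        return float("-inf")
    share_all = sum(target for _, target in ROWS) / len(ROWS)
    return len(covered) / len(ROWS) * (sum(covered) / len(covered) - share_all)


def beam_search(selectors, quality, beam_width=2, depth=2):
    """Refine only the `beam_width` best descriptions of each level."""
    beam = [()]  # start from the empty description
    best = []
    for _ in range(depth):
        expanded = set()
        for description in beam:
            for selector in selectors:
                if selector not in description:
                    expanded.add(tuple(sorted(description + (selector,))))
        scored = sorted(((quality(d), d) for d in expanded), reverse=True)
        beam = [d for _, d in scored[:beam_width]]
        best = sorted(set(best) | set(scored), reverse=True)[:beam_width]
    return best


best = beam_search(["female", "male", "old", "young"], wracc)
```

Because only the beam is refined, the search is heuristic: a good description whose generalizations all score poorly can be missed.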

class pysubgroup.algorithms.BestFirstSearch[source]

Bases: object

Implements the Best-First Search algorithm for subgroup discovery.

execute(task)[source]

Executes the Best-First Search algorithm on the given task.

Parameters:

task – The subgroup discovery task to be executed.

Returns:

A SubgroupDiscoveryResult containing the discovered subgroups.

class pysubgroup.algorithms.DFS(apply_representation=None)[source]

Bases: object

Depth-first search with look-ahead for a provided data structure.

execute(task)[source]

Executes the DFS algorithm on the given task.

Parameters:

task – The subgroup discovery task to be executed.

Returns:

A SubgroupDiscoveryResult containing the discovered subgroups.

search_internal(task, result, sg)[source]

Recursively searches for subgroups in a depth-first manner.

Parameters:
  • task – The subgroup discovery task.

  • result – The current list of discovered subgroups.

  • sg – The current subgroup being evaluated.

class pysubgroup.algorithms.DFSNumeric[source]

Bases: object

Implements a specialized DFS algorithm for numeric quality functions.

execute(task)[source]

Executes the DFSNumeric algorithm on the given task.

Parameters:

task – The subgroup discovery task to be executed.

Returns:

A SubgroupDiscoveryResult containing the discovered subgroups.

search_internal(task, prefix, modification_set, result, bitset)[source]

Recursively searches in a depth-first manner for numeric quality functions.

Parameters:
  • task – The subgroup discovery task.

  • prefix – The current list of selectors in the subgroup description.

  • modification_set – The remaining selectors to consider.

  • result – The current list of discovered subgroups.

  • bitset – The current bitset representing the subgroup.

Returns:

The updated list of discovered subgroups.

tpl

alias of size_mean_parameters

class pysubgroup.algorithms.GeneralisingBFS[source]

Bases: object

Implements a Generalizing Best-First Search algorithm for subgroup discovery.

execute(task)[source]

Executes the Generalizing Best-First Search algorithm on the given task.

Parameters:

task – The subgroup discovery task to be executed.

Returns:

A SubgroupDiscoveryResult containing the discovered subgroups.

class pysubgroup.algorithms.SimpleDFS[source]

Bases: object

Implements a simple Depth-First Search algorithm for subgroup discovery. It is the most elementary, and therefore probably the slowest, of the algorithm implementations.

execute(task, use_optimistic_estimates=True)[source]

Executes the Simple DFS algorithm on the given task.

Parameters:
  • task – The subgroup discovery task to be executed.

  • use_optimistic_estimates – Whether to use optimistic estimates for pruning.

Returns:

A SubgroupDiscoveryResult containing the discovered subgroups.

search_internal(task, prefix, modification_set, result, use_optimistic_estimates)[source]

Recursively searches for subgroups in a depth-first manner.

Parameters:
  • task – The subgroup discovery task.

  • prefix – The current list of selectors in the subgroup description.

  • modification_set – The remaining selectors to consider.

  • result – The current list of discovered subgroups.

  • use_optimistic_estimates – Whether to use optimistic estimates for pruning.

Returns:

The updated list of discovered subgroups.
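The role of `prefix`, `modification_set`, and optimistic-estimate pruning can be sketched in a few lines. This is an illustration under assumptions, not the library's code: the toy `ROWS`, the WRAcc-style `quality`, the bound in `optimistic_estimate`, and the extra `k` and `depth` arguments are all hypothetical.

```python
import heapq

# Toy data: (set of selectors that cover the row, binary target value).
ROWS = [
    ({"young", "male"}, 1),
    ({"young", "female"}, 1),
    ({"old", "male"}, 0),
    ({"old", "female"}, 1),
    ({"young", "male"}, 1),
    ({"old", "male"}, 0),
]
N = len(ROWS)
P = sum(target for _, target in ROWS)


def count(prefix):
    """Return (size, positives) of the rows covered by all selectors in prefix."""
    covered = [target for items, target in ROWS if set(prefix) <= items]
    return len(covered), sum(covered)


def quality(n, p):
    return n / N * (p / n - P / N)


def optimistic_estimate(p):
    # Upper bound for refinements: keep all positives, drop all negatives.
    return p / N * (1 - P / N)


def search(prefix, modification_set, result, k=2, depth=2):
    n, p = count(prefix)
    if n == 0:
        return result
    if prefix:
        entry = (quality(n, p), tuple(prefix))
        if len(result) < k:
            heapq.heappush(result, entry)
        else:
            heapq.heappushpop(result, entry)  # keep only the k best
    if len(prefix) == depth:
        return result
    # Prune: no refinement can beat the current k-th best quality.
    if len(result) == k and optimistic_estimate(p) <= result[0][0]:
        return result
    for i, selector in enumerate(modification_set):
        search(prefix + [selector], modification_set[i + 1:], result, k, depth)
    return result


top = sorted(search([], ["female", "male", "old", "young"], []), reverse=True)
```

Passing only `modification_set[i + 1:]` to the recursive call ensures each conjunction is visited once, regardless of selector order.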

class pysubgroup.algorithms.SimpleSearch(show_progress=True)[source]

Bases: object

Implements a simple exhaustive search algorithm for subgroup discovery.

execute(task)[source]

Executes the Simple Search algorithm on the given task.

Parameters:

task – The subgroup discovery task to be executed.

Returns:

A SubgroupDiscoveryResult containing the discovered subgroups.

class pysubgroup.algorithms.SubgroupDiscoveryTask(data, target, search_space, qf, result_set_size=10, depth=3, min_quality=-inf, constraints=None)[source]

Bases: object

Encapsulates all parameters required to perform standard subgroup discovery.

pysubgroup.algorithms.constraints_satisfied(constraints, subgroup, statistics=None, data=None)[source]

Checks if all constraints are satisfied for a given subgroup.

Parameters:
  • constraints – A list of constraints to check.

  • subgroup – The subgroup to be evaluated.

  • statistics – Precomputed statistics for the subgroup (optional).

  • data – The dataset to be analyzed (optional).

Returns:

True if all constraints are satisfied, False otherwise.
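Conceptually, such a check is a conjunction over the `is_satisfied` results of the individual constraints. The sketch below illustrates that idea; `MinSize` and `all_satisfied` are hypothetical stand-ins, and the subgroup is modelled simply as a list of covered row indices, which is an assumption of this example.

```python
class MinSize:
    """Hypothetical constraint: the cover must contain at least `minimum` rows."""

    def __init__(self, minimum):
        self.minimum = minimum

    def is_satisfied(self, subgroup, statistics=None, data=None):
        return len(subgroup) >= self.minimum


def all_satisfied(constraints, subgroup, statistics=None, data=None):
    """True only if every constraint accepts the subgroup."""
    return all(c.is_satisfied(subgroup, statistics, data) for c in constraints)


cover = [0, 1, 2]  # indices of the rows covered by some subgroup
accepted = all_satisfied([MinSize(2), MinSize(3)], cover)
rejected = all_satisfied([MinSize(2), MinSize(5)], cover)
```

An empty constraint list is trivially satisfied, because `all` of an empty sequence is `True`.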

pysubgroup.binary_target module

Created on 29.09.2017

@author: lemmerfn

class pysubgroup.binary_target.BinaryTarget(target_attribute=None, target_value=None, target_selector=None)[source]

Bases: BaseTarget

Binary target for classic subgroup discovery with boolean targets.

Stores the target attribute and value, and computes various statistics related to the target within a subgroup.

calculate_statistics(subgroup, data, cached_statistics=None)[source]

Calculate various statistics for the subgroup.

Parameters:
  • subgroup – The subgroup for which to calculate statistics.

  • data (pandas DataFrame) – The dataset.

  • cached_statistics (dict, optional) – Previously computed statistics.

Returns:

A dictionary containing various statistical measures.

Return type:

dict

covers(instance)[source]

Determine whether the target selector covers the given instance.

Parameters:

instance (pandas DataFrame) – The data instance to check.

Returns:

Boolean array indicating coverage.

Return type:

numpy.ndarray

get_attributes()[source]

Get the attribute names used in the target.

Returns:

A tuple containing the attribute name.

Return type:

tuple

get_base_statistics(subgroup, data)[source]

Compute basic statistics for the target within the subgroup and dataset.

Parameters:
  • subgroup – The subgroup for which to compute statistics.

  • data (pandas DataFrame) – The dataset.

Returns:

Contains instances_dataset, positives_dataset, instances_subgroup, positives_subgroup.

Return type:

tuple

statistic_types = ('size_sg', 'size_dataset', 'positives_sg', 'positives_dataset', 'size_complement', 'relative_size_sg', 'relative_size_complement', 'coverage_sg', 'coverage_complement', 'target_share_sg', 'target_share_complement', 'target_share_dataset', 'lift')

class pysubgroup.binary_target.ChiSquaredQF(direction='both', min_instances=5, stat='chi2')[source]

Bases: SimplePositivesQF

ChiSquaredQF tests for statistical independence of a subgroup against its complement.

Calculates the chi-squared statistic or p-value to measure the significance of the difference between the subgroup and its complement.

static chi_squared_qf(instances_dataset, positives_dataset, instances_subgroup, positives_subgroup, min_instances=5, bidirect=True, direction_positive=True, index=0)[source]

Perform chi-squared test of statistical independence.

Tests whether a subgroup is statistically independent from its complement (see scipy.stats.chi2_contingency).

Parameters:
  • instances_dataset (int) – Total number of instances in the dataset.

  • positives_dataset (int) – Total number of positive instances in the dataset.

  • instances_subgroup (int) – Number of instances in the subgroup.

  • positives_subgroup (int) – Number of positive instances in the subgroup.

  • min_instances (int, optional) – Minimum number of required instances; -inf is returned if there are fewer.

  • bidirect (bool, optional) – If True, both directions are considered interesting.

  • direction_positive (bool, optional) – If bidirect is False, specifies the direction.

  • index (int, optional) – Whether to return statistic (0) or p-value (1).

Returns:

Chi-squared statistic or p-value, depending on the index parameter.

Return type:

float
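For intuition, the test can be written out by hand for the 2x2 table of subgroup versus complement and positives versus negatives. The sketch below is a pure-Python illustration, not the library's code. It assumes Pearson's statistic without continuity correction, one degree of freedom, and non-zero row and column totals; `chi_squared_2x2` is a hypothetical name.

```python
import math


def chi_squared_2x2(instances_dataset, positives_dataset,
                    instances_subgroup, positives_subgroup,
                    min_instances=5, index=0):
    """Return the chi-squared statistic (index=0) or the p-value (index=1)."""
    if instances_subgroup < min_instances:
        return float("-inf")
    # Rows: subgroup, complement. Columns: positives, negatives.
    observed = [
        [positives_subgroup, instances_subgroup - positives_subgroup],
        [positives_dataset - positives_subgroup,
         (instances_dataset - instances_subgroup)
         - (positives_dataset - positives_subgroup)],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / instances_dataset
            statistic += (observed[i][j] - expected) ** 2 / expected
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return (statistic, p_value)[index]


statistic = chi_squared_2x2(100, 40, 20, 15)
p_value = chi_squared_2x2(100, 40, 20, 15, index=1)
```

A large statistic, or equivalently a small p-value, indicates that the target share inside the subgroup differs from the share in its complement.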

static chi_squared_qf_weighted(subgroup, data, weighting_attribute, effective_sample_size=0, min_instances=5)[source]

Perform chi-squared test for weighted data.

Parameters:
  • subgroup – The subgroup for which to calculate the statistic.

  • data (pandas DataFrame) – The dataset.

  • weighting_attribute (str) – The attribute used for weighting.

  • effective_sample_size (int, optional) – Effective sample size.

  • min_instances (int, optional) – Minimum required instances.

Returns:

The p-value from the chi-squared test.

Return type:

float

evaluate(subgroup, target, data, statistics=None)[source]

Evaluate the quality of the subgroup using the chi-squared test.

Parameters:
  • subgroup – The subgroup to evaluate.

  • target (BinaryTarget) – The target definition.

  • data (pandas DataFrame) – The dataset.

  • statistics (any, optional) – Unused in this implementation.

Returns:

The chi-squared statistic or p-value.

Return type:

float

class pysubgroup.binary_target.GeneralizationAware_StandardQF(a, optimistic_estimate_strategy='default')[source]

Bases: GeneralizationAwareQF_stats, BoundedInterestingnessMeasure

Generalization-Aware Standard Quality Function.

Extends the StandardQF to consider generalizations during subgroup discovery, providing methods for optimistic estimates and aggregate statistics.

difference_based_agg_function(stats_subgroup, list_of_pairs)[source]

Aggregate statistics using the difference-based strategy.

Parameters:
  • stats_subgroup – Statistics of the current subgroup.

  • list_of_pairs – List of (stats, agg_tuple) for all generalizations.

Returns:

Aggregate statistics tuple.

Return type:

namedtuple

difference_based_optimistic_estimate(subgroup, target, data, statistics)[source]

Compute the optimistic estimate using the difference-based strategy.

Parameters:
  • subgroup – The subgroup for which to compute the estimate.

  • target (BinaryTarget) – The target definition.

  • data (pandas DataFrame) – The dataset.

  • statistics (any) – Current statistics.

Returns:

The optimistic estimate of the quality value.

Return type:

float

difference_based_read_p(agg_tuple)[source]

Read p, the share of positives, from the aggregate tuple using the difference-based strategy.

Parameters:

agg_tuple – The aggregate statistics tuple.

Returns:

The maximum percentage of positives.

Return type:

float

evaluate(subgroup, target, data, statistics=None)[source]

Evaluate the quality of the subgroup considering generalizations.

Parameters:
  • subgroup – The subgroup to evaluate.

  • target (BinaryTarget) – The target definition.

  • data (pandas DataFrame) – The dataset.

  • statistics (any, optional) – Unused in this implementation.

Returns:

The computed quality value.

Return type:

float

class ga_sQF_agg_tuple(max_p, min_delta_negatives, min_negatives)

Bases: tuple

max_p

Alias for field number 0

min_delta_negatives

Alias for field number 1

min_negatives

Alias for field number 2

max_based_aggregate_statistics(stats_subgroup, list_of_pairs)[source]

Aggregate statistics using the maximum-based strategy.

Parameters:
  • stats_subgroup – Statistics of the current subgroup.

  • list_of_pairs – List of (stats, agg_tuple) for all generalizations.

Returns:

The aggregated statistics.

max_based_optimistic_estimate(subgroup, target, data, statistics=None)[source]

Compute the optimistic estimate using the maximum-based strategy.

Parameters:
  • subgroup – The subgroup for which to compute the estimate.

  • target (BinaryTarget) – The target definition.

  • data (pandas DataFrame) – The dataset.

  • statistics (any, optional) – Unused in this implementation.

Returns:

The optimistic estimate of the quality value.

Return type:

float

max_based_read_p(agg_tuple)[source]

Read p, the share of positives, from the aggregate tuple using the maximum-based strategy.

Parameters:

agg_tuple – The aggregate statistics tuple.

Returns:

The ratio of positives in the aggregate statistics.

Return type:

float

class pysubgroup.binary_target.LiftQF[source]

Bases: StandardQF

Lift Quality Function.

LiftQF is a StandardQF with a=0. Thus it treats the difference in ratios as the quality without caring about the relative size of a subgroup.

class pysubgroup.binary_target.SimpleBinomialQF[source]

Bases: StandardQF

Simple Binomial Quality Function.

SimpleBinomialQF is a StandardQF with a=0.5. It is an order-equivalent approximation of the full binomial test if the subgroup size is much smaller than the size of the entire dataset.

class pysubgroup.binary_target.SimplePositivesQF[source]

Bases: AbstractInterestingnessMeasure

Quality function for binary targets based on positive instances.

calculate_constant_statistics(data, target)[source]

Calculate statistics that remain constant for the dataset.

Parameters:
  • data (pandas DataFrame) – The dataset.

  • target (BinaryTarget) – The target definition.

Raises:

AssertionError – If the target is not an instance of BinaryTarget.

calculate_statistics(subgroup, target, data, statistics=None)[source]

Calculate statistics specific to the subgroup.

Parameters:
  • subgroup – The subgroup for which to calculate statistics.

  • target (BinaryTarget) – The target definition.

  • data (pandas DataFrame) – The dataset.

  • statistics (any, optional) – Unused in this implementation.

Returns:

Contains size_sg and positives_count for the subgroup.

Return type:

namedtuple

gp_get_null_vector()[source]

Get a null vector for initialization in GP-Growth algorithms.

Returns:

Zero-initialized array of size 2.

Return type:

numpy.ndarray

gp_get_params(_cover_arr, v)[source]

Extract parameters from the statistics vector.

Parameters:
  • _cover_arr – Unused parameter.

  • v (numpy.ndarray) – Statistics vector.

Returns:

Contains size_sg and positives_count.

Return type:

namedtuple

gp_get_stats(row_index)[source]

Get statistics for a single row (used in GP-Growth algorithms).

Parameters:

row_index (int) – The index of the row.

Returns:

Array containing [1, positives[row_index]].

Return type:

numpy.ndarray

gp_merge(left, right)[source]

Merge two statistics vectors by summing them.

Parameters:
  • left – Left statistics vector.

  • right – Right statistics vector.

property gp_requires_cover_arr

Indicate whether the GP-Growth algorithm requires a cover array.

Returns:

False, since cover array is not required.

Return type:

bool

gp_size_sg(stats)[source]

Get the size of the subgroup from the statistics.

Parameters:

stats (numpy.ndarray) – Statistics vector.

Returns:

Size of the subgroup.

Return type:

int

gp_to_str(stats)[source]

Convert statistics to a string representation.

Parameters:

stats (numpy.ndarray) – Statistics vector.

Returns:

String representation of the statistics.

Return type:

str
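Taken together, the gp_* methods describe a small aggregation protocol: a zero vector, a per-row statistics vector, and a merge by summation. The sketch below mimics that protocol with plain Python lists instead of NumPy arrays. The free functions and the `positives` list are illustrative stand-ins, not the class's actual methods.

```python
def gp_get_null_vector():
    """Neutral element: zero instances, zero positives."""
    return [0, 0]


def gp_get_stats(positives, row_index):
    """Statistics of one row: it counts as one instance and 0 or 1 positives."""
    return [1, int(positives[row_index])]


def gp_merge(left, right):
    """Combine two statistics vectors by element-wise summation."""
    return [l + r for l, r in zip(left, right)]


positives = [True, False, True, True]  # hypothetical target column
cover = [0, 2, 3]                      # rows covered by some pattern

stats = gp_get_null_vector()
for row_index in cover:
    stats = gp_merge(stats, gp_get_stats(positives, row_index))

size_sg, positives_count = stats
```

Because the merge is a plain sum, partial results from different branches of a tree can be combined in any order.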

tpl

alias of PositivesQF_parameters

class pysubgroup.binary_target.StandardQF(a)[source]

Bases: SimplePositivesQF, BoundedInterestingnessMeasure

StandardQF which weights the relative size against the difference in averages.

The StandardQF is a general form of quality function which for different values of ‘a’ is order equivalent to many popular quality measures.

evaluate(subgroup, target, data, statistics=None)[source]

Evaluate the quality of the subgroup using the standard quality function.

Parameters:
  • subgroup – The subgroup to evaluate.

  • target (BinaryTarget) – The target definition.

  • data (pandas DataFrame) – The dataset.

  • statistics (any, optional) – Unused in this implementation.

Returns:

The computed quality value.

Return type:

float

optimistic_estimate(subgroup, target, data, statistics=None)[source]

Compute the optimistic estimate of the quality function.

Parameters:
  • subgroup – The subgroup for which to compute the optimistic estimate.

  • target (BinaryTarget) – The target definition.

  • data (pandas DataFrame) – The dataset.

  • statistics (any, optional) – Unused in this implementation.

Returns:

The optimistic estimate of the quality value.

Return type:

float
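An optimistic estimate is an upper bound on the quality of every refinement of a subgroup. The text above does not say which bound is used. The sketch below assumes one common choice for this family of quality functions with a between 0 and 1: the best conceivable refinement keeps all positives of the subgroup and drops all its negatives. All function names are hypothetical, and the brute-force check is included only to make the bound tangible.

```python
def standard_qf(a, n_all, p_all, n_sg, p_sg):
    return (n_sg / n_all) ** a * (p_sg / n_sg - p_all / n_all)


def optimistic_estimate(a, n_all, p_all, p_sg):
    # Hypothetical refinement containing exactly the positives of the subgroup.
    return (p_sg / n_all) ** a * (1 - p_all / n_all)


def best_refinement_quality(a, n_all, p_all, n_sg, p_sg):
    """Brute force over every possible (positives, negatives) count of a refinement."""
    negatives_sg = n_sg - p_sg
    return max(
        standard_qf(a, n_all, p_all, p + n, p)
        for p in range(p_sg + 1)
        for n in range(negatives_sg + 1)
        if p + n > 0
    )


# Dataset: 100 instances, 40 positives. Subgroup: 20 instances, 15 positives.
bounds = {a: optimistic_estimate(a, 100, 40, 15) for a in (0, 0.5, 1)}
best = {a: best_refinement_quality(a, 100, 40, 20, 15) for a in (0, 0.5, 1)}
```

A search algorithm can skip all refinements of a subgroup whose bound does not exceed the lowest quality currently in the result set.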

optimistic_generalisation(subgroup, target, data, statistics=None)[source]

Compute the optimistic generalization of the quality function.

Parameters:
  • subgroup – The subgroup for which to compute the optimistic generalization.

  • target (BinaryTarget) – The target definition.

  • data (pandas DataFrame) – The dataset.

  • statistics (any, optional) – Unused in this implementation.

Returns:

The optimistic generalization of the quality value.

Return type:

float

static standard_qf(a, instances_dataset, positives_dataset, instances_subgroup, positives_subgroup)[source]

Compute the standard quality function.

Parameters:
  • a (float) – Exponent to trade-off the relative size with difference in means.

  • instances_dataset (int) – Total number of instances in the dataset.

  • positives_dataset (int) – Total number of positive instances in the dataset.

  • instances_subgroup (int) – Number of instances in the subgroup.

  • positives_subgroup (int) – Number of positive instances in the subgroup.

Returns:

The computed quality value.

Return type:

float
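As a worked example, the quality can be computed as the relative subgroup size raised to the power a, multiplied by the difference between the target share in the subgroup and in the dataset. The sketch below is a stand-alone re-statement of that formula, not the library's code; returning NaN for an empty subgroup is a choice made for this example.

```python
def standard_qf(a, instances_dataset, positives_dataset,
                instances_subgroup, positives_subgroup):
    if instances_subgroup == 0:
        return float("nan")  # assumption of this sketch: undefined for empty subgroups
    relative_size = instances_subgroup / instances_dataset
    share_subgroup = positives_subgroup / instances_subgroup
    share_dataset = positives_dataset / instances_dataset
    return relative_size ** a * (share_subgroup - share_dataset)


# Dataset: 100 instances, 40 positives. Subgroup: 20 instances, 15 positives.
q_a0 = standard_qf(0, 100, 40, 20, 15)     # difference in target shares only
q_a05 = standard_qf(0.5, 100, 40, 20, 15)  # square root of the relative size as weight
q_a1 = standard_qf(1, 100, 40, 20, 15)     # relative size as weight
```

Larger values of a favour larger subgroups; with a=0 the subgroup size has no influence on the quality.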

class pysubgroup.binary_target.WRAccQF[source]

Bases: StandardQF

Weighted Relative Accuracy Quality Function.

WRAccQF is a StandardQF with a=1. It is order-equivalent to the difference in the observed and expected number of positive instances.
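The order equivalence follows from a one-line rearrangement: (n/N) * (p/n - P/N) equals (p - n*P/N) / N, where n*P/N is the number of positives expected in a subgroup of size n. The sketch below checks this numerically; both helper functions are hypothetical and exist only for this example.

```python
def wracc(n_all, p_all, n_sg, p_sg):
    return (n_sg / n_all) * (p_sg / n_sg - p_all / n_all)


def excess_positives(n_all, p_all, n_sg, p_sg):
    """Observed minus expected number of positives in the subgroup."""
    return p_sg - n_sg * p_all / n_all


# Dataset: 100 instances, 40 positives. Two candidate subgroups.
sg_a = (20, 15)
sg_b = (50, 25)
scaled_a = wracc(100, 40, *sg_a) * 100
scaled_b = wracc(100, 40, *sg_b) * 100
```

Since the dataset size is constant during a search, dividing by it does not change the ranking of subgroups.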

pysubgroup.constraints module

class pysubgroup.constraints.ContainsValueConstraint(attribute_name, value)[source]

Bases: object

A constraint that ensures a subgroup contains in its cover at least one instance that has a specified value for a specified attribute.

attribute_name

The attribute that needs to contain the specified value in at least one instance.

value

The value that needs to be present in the specified attribute in at least one instance.

property is_monotone

Indicates whether the constraint is monotone.

Returns:

True if the constraint is monotone, False otherwise.

Return type:

bool

is_satisfied(subgroup, statistics=None, data=None)[source]

Checks if the subgroup satisfies the constraint.

Parameters:
  • subgroup – The subgroup to be evaluated.

  • statistics – Precomputed statistics for the subgroup (optional).

  • data – The dataset being analyzed (optional).

Returns:

True if the subgroup’s cover contains at least one instance that has the specified value for the specified attribute (as defined during object construction), False otherwise.

Return type:

bool

class pysubgroup.constraints.MinSupportConstraint(min_support)[source]

Bases: object

A constraint that ensures a subgroup has at least a minimum support.

min_support

The minimum number of instances that a subgroup must cover.

Type:

int

gp_is_satisfied(node)[source]

Checks if a node satisfies the constraint in the GP-Growth algorithm.

Parameters:

node – The node to be evaluated.

Returns:

True if the node’s size is at least the minimum support, False otherwise.

Return type:

bool

gp_prepare(qf)[source]

Prepares the constraint for the GP-Growth algorithm by accessing the size function.

Parameters:

qf – The quality function used in the GP-Growth algorithm.

property is_monotone

Indicates whether the constraint is monotone.

Returns:

True if the constraint is monotone, False otherwise.

Return type:

bool

is_satisfied(subgroup, statistics=None, data=None)[source]

Checks if the subgroup satisfies the minimum support constraint.

Parameters:
  • subgroup – The subgroup to be evaluated.

  • statistics – Precomputed statistics for the subgroup (optional).

  • data – The dataset being analyzed (optional).

Returns:

True if the subgroup’s size is at least the minimum support, False otherwise.

Return type:

bool

class pysubgroup.constraints.MinUniqueValuesConstraint(attribute_name, min_unique_values)[source]

Bases: object

A constraint that ensures a subgroup contains in its cover a minimum number of unique values for a specified attribute.

attribute_name

The attribute that needs to contain at least the specified number of values.

min_unique_values

The minimum number of unique values that must be present in the attribute in a subgroup cover.

property is_monotone

Indicates whether the constraint is monotone.

Returns:

True if the constraint is monotone, False otherwise.

Return type:

bool

is_satisfied(subgroup, statistics=None, data=None)[source]

Checks if the subgroup satisfies the constraint.

Parameters:
  • subgroup – The subgroup to be evaluated.

  • statistics – Precomputed statistics for the subgroup (optional).

  • data – The dataset being analyzed (optional).

Returns:

True if the subgroup’s cover contains the minimum number of unique values for the specified attribute (as defined during object construction), False otherwise.

Return type:

bool

pysubgroup.datasets module

This module provides functions to load example datasets for testing and demonstration purposes. The datasets included are the German Credit Data and the Titanic dataset.

pysubgroup.datasets.get_credit_data()[source]

Load the German Credit Data dataset.

The dataset is provided in ARFF format and includes various attributes related to creditworthiness.

Returns:

A DataFrame containing the credit data.

Return type:

pandas.DataFrame

pysubgroup.datasets.get_titanic_data()[source]

Load the Titanic dataset.

The dataset includes information about the passengers on the Titanic, such as age, sex, class, and survival status.

Returns:

A DataFrame containing the Titanic data.

Return type:

pandas.DataFrame

pysubgroup.fi_target module

Created on 29.09.2017

@author: lemmerfn

This module defines the FITarget and related quality functions for frequent itemset mining using the pysubgroup package.

class pysubgroup.fi_target.AreaQF[source]

Bases: SimpleCountQF

Quality function that evaluates subgroups based on their area.

The area is computed as the size of the subgroup multiplied by the number of contained items.

evaluate(subgroup, target, data, statistics=None)[source]

Evaluate the quality of the subgroup.

Parameters:
  • subgroup – The subgroup to evaluate.

  • target – The target definition.

  • data – The dataset.

  • statistics (any, optional) – Previously computed statistics.

Returns:

The area of the subgroup (size_sg * depth).

Return type:

int
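For illustration, the area of an itemset can be computed directly on a small list of transactions. The data and the `area` helper below are hypothetical; the itemset's length plays the role of the depth.

```python
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"milk"},
    {"bread", "butter"},
]


def area(itemset):
    """Number of covering transactions multiplied by the number of items."""
    size_sg = sum(1 for transaction in TRANSACTIONS if itemset <= transaction)
    return size_sg * len(itemset)


area_pair = area({"bread", "milk"})
area_single = area({"milk"})
area_triple = area({"bread", "butter", "milk"})
```

Unlike a plain count, the area can grow when items are added, as long as the cover shrinks more slowly than the itemset grows.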

class pysubgroup.fi_target.CountQF[source]

Bases: SimpleCountQF, BoundedInterestingnessMeasure

Quality function that evaluates subgroups based on their size.

Extends SimpleCountQF and BoundedInterestingnessMeasure.

evaluate(subgroup, target, data, statistics=None)[source]

Evaluate the quality of the subgroup.

Parameters:
  • subgroup – The subgroup to evaluate.

  • target – The target definition.

  • data – The dataset.

  • statistics (any, optional) – Previously computed statistics.

Returns:

The size of the subgroup.

Return type:

int

optimistic_estimate(subgroup, target, data, statistics=None)[source]

Compute the optimistic estimate of the quality function.

Parameters:
  • subgroup – The subgroup for which to compute the optimistic estimate.

  • target – The target definition.

  • data – The dataset.

  • statistics (any, optional) – Previously computed statistics.

Returns:

The size of the subgroup.

Return type:

int

class pysubgroup.fi_target.FITarget[source]

Bases: BaseTarget

Target class for frequent itemset mining.

Represents the target for mining frequent itemsets, extending the BaseTarget class from pysubgroup.

calculate_statistics(subgroup_description, data, cached_statistics=None)[source]

Calculate statistics for the subgroup.

Parameters:
  • subgroup_description – The description of the subgroup.

  • data – The dataset.

  • cached_statistics (dict, optional) – Previously computed statistics.

Returns:

A dictionary containing ‘size_sg’ and ‘size_dataset’.

Return type:

dict

get_attributes()[source]

Return an empty list as attributes are not used in FITarget.

get_base_statistics(subgroup, data)[source]

Compute the base statistics for the subgroup.

Parameters:
  • subgroup – The subgroup for which to compute statistics.

  • data – The dataset.

Returns:

The size of the subgroup.

Return type:

int

statistic_types = ('size_sg', 'size_dataset')

class pysubgroup.fi_target.SimpleCountQF[source]

Bases: AbstractInterestingnessMeasure

Quality function that counts the number of instances in a subgroup.

Provides basic counting functionality, useful for frequent itemset mining.

calculate_constant_statistics(data, target)[source]

Calculate statistics that remain constant for the dataset.

Parameters:
  • data – The dataset.

  • target – The target definition (unused in this implementation).

calculate_statistics(subgroup_description, target, data, statistics=None)[source]

Calculate statistics specific to the subgroup.

Parameters:
  • subgroup_description – The description of the subgroup.

  • target – The target definition (unused in this implementation).

  • data – The dataset.

  • statistics (any, optional) – Unused in this implementation.

Returns:

Contains ‘size_sg’ for the subgroup.

Return type:

namedtuple

gp_get_null_vector()[source]

Get a null vector for initialization in GP-Growth algorithms.

Returns:

A dictionary with ‘size_sg’ set to 0.

Return type:

dict

gp_get_params(_cover_arr, v)[source]

Extract parameters from the statistics dictionary.

Parameters:
  • _cover_arr – Unused parameter.

  • v (dict) – Statistics dictionary.

Returns:

Contains ‘size_sg’ from the statistics.

Return type:

namedtuple

gp_get_stats(_)[source]

Get statistics for a single instance (used in GP-Growth algorithms).

Returns:

A dictionary with ‘size_sg’ set to 1.

Return type:

dict

gp_merge(left, right)[source]

Merge two statistics dictionaries by summing ‘size_sg’.

Parameters:
  • left (dict) – Left statistics dictionary.

  • right (dict) – Right statistics dictionary.

gp_requires_cover_arr = False

gp_size_sg(stats)[source]

Get the size of the subgroup from the statistics.

Parameters:

stats (dict) – Statistics dictionary.

Returns:

Size of the subgroup.

Return type:

int

gp_to_str(stats)[source]

Convert statistics to a string representation.

Parameters:

stats (dict) – Statistics dictionary.

Returns:

String representation of ‘size_sg’.

Return type:

str

tpl

alias of CountQF_parameters

pysubgroup.gp_growth module

class pysubgroup.gp_growth.GpGrowth(mode='b_u')[source]

Bases: object

Implementation of the GP-Growth algorithm.

GP-Growth is a generalization of FP-Growth and SD-Map capable of working with different Exceptional Model Mining targets on top of Frequent Itemset Mining and Subgroup Discovery.

This class provides methods to perform pattern mining using GP-Growth, supporting both bottom-up (‘b_u’) and top-down (‘t_d’) modes.

GP_node

Structure representing a node in the GP-tree.

Type:

namedtuple

minSupp

Minimum support threshold (currently unused).

Type:

int

tqdm

Function for progress bars (default is identity function).

Type:

function

depth

Maximum depth of the search.

Type:

int

mode

Mode of the algorithm (‘b_u’ for bottom-up, ‘t_d’ for top-down).

Type:

str

constraints_monotone

List of monotonic constraints.

Type:

list

results

List to store the resulting subgroups.

Type:

list

task

The subgroup discovery task to execute.

Type:

SubgroupDiscoveryTask

add_if_required(prefix, gp_stats)[source]

Adds a pattern to the result set if it meets the quality threshold.

Parameters:
  • prefix (tuple) – The current pattern (tuple of class indices).

  • gp_stats – The aggregated statistics for the pattern.

calculate_quality_function_for_patterns(task, results, arrs)[source]

Calculates the quality function for the given patterns.

Parameters:
  • task (SubgroupDiscoveryTask) – The task containing the quality function.

  • results (list) – List of patterns with their aggregated parameters.

  • arrs (ndarray) – The coverage arrays of the selectors.

Returns:

A list of tuples containing quality, indices, and statistics.

Return type:

list

check_constraints(node)[source]

Checks if a node satisfies all monotonic constraints.

Parameters:

node – The node to check.

Returns:

True if the node satisfies all constraints, False otherwise.

Return type:

bool

check_tree_is_ordered(root, prefix=None)[source]

Verifies that the nodes of a tree are sorted in ascending order.

Parameters:
  • root (GP_node) – The root node of the tree.

  • prefix (list) – The current path prefix.

Returns:

A set of class labels in the tree.

Return type:

set

convert_results_to_subgroups(results, selectors_sorted)[source]

Converts patterns (indices) to actual subgroups.

Parameters:
  • results (list) – List of results containing qualities, indices, and statistics.

  • selectors_sorted (list) – The list of sorted selectors.

Returns:

A list of tuples containing quality, subgroup, and statistics.

Return type:

list

create_copy_of_path(nodes, new_nodes, stats)[source]

Creates a copy of a path in the tree, updating statistics.

Parameters:
  • nodes (list) – The list of nodes in the path.

  • new_nodes (dict) – Dictionary to store new nodes.

  • stats – The statistics to merge into the nodes.

create_copy_of_tree_top_down(from_root, nodes=None, parent=None, is_valid_class=None)[source]

Creates a copy of the tree starting from a specific root in top-down mode.

Parameters:
  • from_root (GP_node) – The root node to copy from.

  • nodes (list) – List to store the new nodes.

  • parent (GP_node) – The parent of the new root node.

  • is_valid_class (dict) – Dictionary indicating valid classes.

Returns:

The new root node of the copied subtree.

Return type:

GP_node

create_initial_tree(arrs)[source]

Creates the initial GP-tree from the coverage arrays.

Parameters:

arrs (ndarray) – A 2D NumPy array where each column corresponds to the coverage array of a selector.

Returns:

A tuple containing:
  • root (GP_node): The root node of the tree.

  • nodes (list): A list of all nodes in the tree.

Return type:

tuple

create_new_tree_from_nodes(nodes)[source]

Creates a new tree from a list of nodes for recursive mining.

Parameters:

nodes (list) – List of nodes to build the new tree from.

Returns:

A dictionary mapping class labels to nodes in the new tree.

Return type:

defaultdict

execute(task)[source]

Executes the GP-Growth algorithm on the given task.

Parameters:

task (SubgroupDiscoveryTask) – The subgroup discovery task to execute.

Returns:

The result of the subgroup discovery.

Return type:

SubgroupDiscoveryResult

get_nodes_upwards(node)[source]

Retrieves all nodes from a given node up to the root.

Parameters:

node (GP_node) – The starting node.

Returns:

A list of nodes from the given node up to the root.

Return type:

list
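
As a rough illustration of the upward walk described above, the following sketch follows parent links until it runs out of ancestors. The `Node` class is a hypothetical stand-in for GP_node, and including the root in the returned list is an assumption, not something this reference confirms.

```python
class Node:
    """Hypothetical stand-in for GP_node: a class label plus a parent link."""

    def __init__(self, cls, parent=None):
        self.cls = cls
        self.parent = parent


def get_nodes_upwards(node):
    """Collect nodes from `node` towards the root by following parent links."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path


# Build a three-node chain: root -> a -> b.
root = Node(cls=-1)
a = Node(cls=0, parent=root)
b = Node(cls=2, parent=a)
upward_labels = [n.cls for n in get_nodes_upwards(b)]
print(upward_labels)
```

The list starts at the given node, so reversing it yields the path in root-to-leaf order.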

get_stats_for_class(cls_nodes)[source]

Aggregates statistics for each class label.

Parameters:

cls_nodes (defaultdict) – Dictionary mapping class labels to nodes.

Returns:

A dictionary mapping class labels to aggregated statistics.

Return type:

dict

get_top_down_tree_for_class(cls_nodes, cls, is_valid_class)[source]

Creates a subtree for a specific class in top-down mode.

Parameters:
  • cls_nodes (defaultdict) – Dictionary mapping class labels to nodes.

  • cls (int) – The class label to create the subtree for.

  • is_valid_class (dict) – Dictionary indicating valid classes.

Returns:

A tuple containing:
  • base_root (GP_node): The root of the new subtree.

  • nodes (list): A list of nodes in the new subtree.

Return type:

tuple

merge_trees_top_down(nodes, mutable_root, from_root, is_valid_class)[source]

Merges two trees in top-down mode.

Parameters:
  • nodes (list) – List of nodes in the mutable tree.

  • mutable_root (GP_node) – The root of the mutable tree to merge into.

  • from_root (GP_node) – The root of the tree to merge from.

  • is_valid_class (dict) – Dictionary indicating valid classes.

nodes_to_cls_nodes(nodes)[source]

Groups nodes by their class labels.

Parameters:

nodes (list) – List of nodes to group.

Returns:

A dictionary mapping class labels to lists of nodes.

Return type:

defaultdict
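
A minimal sketch of this grouping, assuming each node exposes its class label as a `cls` attribute (the `Node` tuple below is hypothetical and only carries what the example needs):

```python
from collections import defaultdict, namedtuple

# Hypothetical node type: a class label and some statistics.
Node = namedtuple("Node", ["cls", "stats"])


def nodes_to_cls_nodes(nodes):
    """Group nodes by class label, keeping the input order within each group."""
    cls_nodes = defaultdict(list)
    for node in nodes:
        cls_nodes[node.cls].append(node)
    return cls_nodes


nodes = [Node(0, 3), Node(1, 2), Node(0, 5)]
grouped = nodes_to_cls_nodes(nodes)
print({cls: [n.stats for n in group] for cls, group in grouped.items()})
```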

normal_insert(root, nodes, new_stats, classes)[source]

Inserts a transaction into the FP-tree.

Parameters:
  • root (GP_node) – The root node of the tree.

  • nodes (list) – List of all nodes in the tree.

  • new_stats – The statistics associated with the transaction.

  • classes (array-like) – The class labels (selectors) present in the transaction.

Returns:

The leaf node where the transaction ends.

Return type:

GP_node
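
The insertion step can be pictured with a small FP-tree style sketch. Everything here is an assumption made for illustration: the `Node` class is hypothetical, statistics are plain integer counts, and merging is simple addition along the inserted path.

```python
class Node:
    """Hypothetical stand-in for GP_node."""

    def __init__(self, cls, parent=None):
        self.cls = cls
        self.parent = parent
        self.children = {}  # class label -> child Node
        self.stats = 0      # assumed: statistics are a simple count


def normal_insert(root, nodes, new_stats, classes):
    """Insert one transaction, reusing shared prefixes and merging statistics."""
    node = root
    for cls in classes:
        child = node.children.get(cls)
        if child is None:
            child = Node(cls, parent=node)
            node.children[cls] = child
            nodes.append(child)      # register every newly created node
        child.stats += new_stats     # merge step, assumed to be addition
        node = child
    return node                      # leaf where the transaction ends


root = Node(cls=None)
nodes = []
leaf_a = normal_insert(root, nodes, 1, [0, 1])
leaf_b = normal_insert(root, nodes, 1, [0, 2])
print(root.children[0].stats, leaf_a.cls, leaf_b.cls, len(nodes))
```

Both transactions share the prefix `[0]`, so only three nodes are created and the shared node accumulates the statistics of both.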

prepare_selectors(search_space, data)[source]

Prepares the selectors by computing their coverage arrays and filtering based on constraints.

Parameters:
  • search_space (list) – The list of selectors to consider.

  • data (DataFrame) – The dataset to be analyzed.

Returns:

A tuple containing:
  • selectors_sorted (list): The sorted list of selectors after filtering.

  • arrs (ndarray): A 2D NumPy array where each column corresponds to the

    coverage array of a selector.

Return type:

tuple

recurse(cls_nodes, prefix, is_single_path=False)[source]

Recursively mines patterns in bottom-up mode.

Parameters:
  • cls_nodes (defaultdict) – Dictionary mapping class labels to nodes.

  • prefix (tuple) – The current pattern prefix.

  • is_single_path (bool) – Flag indicating if the current path is a single path.

recurse_top_down(cls_nodes, root, depth_in=0)[source]

Recursively mines patterns in top-down mode.

Parameters:
  • cls_nodes (defaultdict) – Dictionary mapping class labels to nodes.

  • root (GP_node) – The current root node.

  • depth_in (int) – The current depth in the recursion.

Returns:

A list of patterns with their aggregated statistics.

Return type:

list

remove_selectors_with_low_optimistic_estimate(s, search_space_size)[source]

Removes selectors from the list that have an optimistic estimate below the minimum required quality.

Parameters:
  • s (list) – List of selectors with their size and coverage arrays.

  • search_space_size (int) – The size of the initial search space.

setup(task)[source]

Prepares the algorithm by setting up the task, depth, constraints, and quality function.

Parameters:

task (SubgroupDiscoveryTask) – The task to execute.

setup_constraints(constraints, qf)[source]

Prepares constraints for use in the algorithm.

Parameters:
  • constraints (list) – List of constraints to apply.

  • qf – The quality function used in the task.

setup_from_quality_function(qf)[source]

Sets up function pointers from the quality function.

Parameters:

qf – The quality function used in the task.

to_file(task, path)[source]

Writes the tree to a file in a specific format.

Parameters:
  • task (SubgroupDiscoveryTask) – The task containing the quality function.

  • path (str or Path) – The file path to write to.

pysubgroup.gp_growth.identity(x, *args, **kwargs)[source]

Identity function used as a placeholder for tqdm when progress bars are not needed.

Parameters:
  • x – The input value to return.

  • *args – Variable length argument list.

  • **kwargs – Arbitrary keyword arguments.

Returns:

The input value x.
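
A short sketch of how such a placeholder can be swapped in for tqdm. The `run` helper and its `show_progress` flag are hypothetical and exist only for this example; tqdm is imported only when a progress bar is requested.

```python
def identity(x, *args, **kwargs):
    """Return the first argument unchanged, ignoring everything else."""
    return x


def run(items, show_progress=False):
    """Sum the items, optionally wrapping the iteration in a progress bar."""
    if show_progress:
        from tqdm import tqdm as progress  # optional dependency
    else:
        progress = identity
    total = 0
    # The same call works for both wrappers because identity ignores `desc`.
    for item in progress(items, desc="summing"):
        total += item
    return total


print(run([1, 2, 3]))
```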

pysubgroup.measures module

Created on 28.04.2016

@author: lemmerfn

class pysubgroup.measures.AbstractInterestingnessMeasure[source]

Bases: ABC

ensure_statistics(subgroup, target, data, statistics=None)[source]
class pysubgroup.measures.BoundedInterestingnessMeasure[source]

Bases: AbstractInterestingnessMeasure

class pysubgroup.measures.CombinedInterestingnessMeasure(measures, weights=None)[source]

Bases: BoundedInterestingnessMeasure

calculate_constant_statistics(data, target)[source]
calculate_statistics(subgroup, target, data, cached_statistics=None)[source]
evaluate(subgroup, target, data, statistics=None)[source]
evaluate_from_statistics(instances_dataset, positives_dataset, instances_subgroup, positives_subgroup)[source]
optimistic_estimate(subgroup, target, data, statistics=None)[source]
class pysubgroup.measures.CountCallsInterestingMeasure(qf)[source]

Bases: BoundedInterestingnessMeasure

calculate_statistics(sg, target, data, statistics=None)[source]
class pysubgroup.measures.GeneralizationAwareQF(qf)[source]

Bases: AbstractInterestingnessMeasure

Computes a generalization-aware quality: the quality of a subgroup minus the maximum quality among its generalizations, i.e. qf_ga(sg) = qf(sg) - max_{g in generalizations(sg)} qf(g).
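
The subtraction described above can be sketched on a toy lookup table. This is an illustration only: descriptions are modelled as frozensets of selector names, generalizations are taken to be all proper sub-conjunctions including the empty description, and the qualities are made-up numbers. None of these choices is confirmed by this reference.

```python
from itertools import combinations


def generalizations(selectors):
    """Yield every proper sub-conjunction, including the empty description."""
    selectors = tuple(sorted(selectors))
    for r in range(len(selectors)):
        for combo in combinations(selectors, r):
            yield frozenset(combo)


def generalization_aware_quality(selectors, base_quality):
    """Base quality minus the best base quality among all generalizations.

    Expects a non-empty description, since the empty one has no generalization.
    """
    sg = frozenset(selectors)
    best_general = max(base_quality[g] for g in generalizations(sg))
    return base_quality[sg] - best_general


# Made-up base qualities for a two-selector description and its generalizations.
base_quality = {
    frozenset(): 0.0,
    frozenset({"age<30"}): 0.20,
    frozenset({"smoker"}): 0.05,
    frozenset({"age<30", "smoker"}): 0.25,
}
ga = generalization_aware_quality({"age<30", "smoker"}, base_quality)
print(ga)
```

The conjunction scores 0.25, but its best generalization already reaches 0.20, so only about 0.05 remains.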

calculate_constant_statistics(data, target)[source]
calculate_statistics(subgroup, target, data, statistics=None)[source]
evaluate(subgroup, target, data, statistics=None)[source]
class ga_tuple(subgroup_quality, generalisation_quality)

Bases: tuple

generalisation_quality

Alias for field number 1

subgroup_quality

Alias for field number 0

get_qual_and_previous_qual(subgroup, target, data)[source]
class pysubgroup.measures.GeneralizationAwareQF_stats(qf)[source]

Bases: AbstractInterestingnessMeasure

An abstract base class that implements aggregation of stats of generalisations

calculate_constant_statistics(data, target)[source]
calculate_statistics(subgroup, target, data, statistics=None)[source]
evaluate(subgroup, target, data, statistics=None)[source]
ga_tuple

alias of ga_stats_tuple

get_stats_and_previous_stats(subgroup, target, data)[source]
pysubgroup.measures.maximum_statistic_filter(result_set, statistic, maximum)[source]
pysubgroup.measures.minimum_quality_filter(result_set, minimum)[source]
pysubgroup.measures.minimum_statistic_filter(result_set, statistic, minimum, data)[source]
pysubgroup.measures.overlap_filter(result_set, data, similarity_level=0.9)[source]
pysubgroup.measures.overlaps_list(sg, list_of_sgs, data, similarity_level=0.9)[source]
pysubgroup.measures.unique_attributes(result_set, data)[source]

pysubgroup.model_predictions_target module

Created on 16.08.2025

@author: Tom Siegl

class pysubgroup.model_predictions_target.ARLQF(label_column: str, positive_label_value: any, subgroup_class_balance_weight: float = 0, subgroup_size_weight: float = 0)[source]

Bases: BaseSoftClassifierPerformanceQF

A quality function which scores binary soft classifier performance in a subgroup based on the difference of the classifier’s average ranking loss (ARL) on the subgroup cover vs. the entire dataset. If the classifier performs worse on the subgroup (i.e. it has a greater ARL) compared to the entire dataset, then the quality is positive.

Weighting factors are provided to let the subgroup size and class balance influence the quality.

The overall quality is captured by the formula q = (ARL(subgroup) - ARL(dataset)) * |subgroup|^(size_weight) * class_balance(subgroup)^(class_balance_weight).

Implementation of phi^{rasl}_{alpha, beta} from the paper [“SubROC: AUC-Based Discovery of Exceptional Subgroup Performance for Binary Classifiers”](https://doi.org/10.48550/arXiv.2505.11283).
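
The quality formula above can be written out directly for a loss-type measure such as ARL. In this sketch the performance values and the class balance are passed in as plain numbers, because how they are computed is not specified here; the function and argument names are hypothetical.

```python
def weighted_performance_quality(perf_subgroup, perf_dataset, size_subgroup,
                                 class_balance, size_weight=0.0,
                                 class_balance_weight=0.0):
    """Performance difference, scaled by subgroup size and class balance."""
    return ((perf_subgroup - perf_dataset)
            * size_subgroup ** size_weight
            * class_balance ** class_balance_weight)


# With both weights at 0 the quality is the plain performance difference.
q_plain = weighted_performance_quality(0.30, 0.10, 50, 0.4)
# A size weight of 1 multiplies the difference by the subgroup size.
q_sized = weighted_performance_quality(0.30, 0.10, 50, 0.4, size_weight=1.0)
print(q_plain, q_sized)
```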

class pysubgroup.model_predictions_target.BaseSoftClassifierPerformanceQF(performance_measure, performance_measure_type: Literal['score', 'loss'], performance_measure_bound=None, performance_measure_constraints: list[any] = [], subgroup_class_balance_weight: float = 0, subgroup_size_weight: float = 0)[source]

Bases: BoundedInterestingnessMeasure

calculate_constant_statistics(data: DataFrame, target: SoftClassifierTarget)[source]

This function is called once per search execution and should perform any preparation that is needed before the search starts.

calculate_statistics(subgroup, target: SoftClassifierTarget, data: DataFrame, statistics={})[source]

Calculates the necessary statistics. The resulting statistics object is passed on to the evaluate and optimistic_estimate functions.

evaluate(subgroup, target: SoftClassifierTarget, data: DataFrame, statistics=None)[source]

Returns the quality calculated from the statistics.

optimistic_estimate(subgroup, target: SoftClassifierTarget, data: DataFrame, statistics=None)[source]

Returns the optimistic estimate if one is available, otherwise infinity.

class pysubgroup.model_predictions_target.PRAUCQF(label_column: str, positive_label_value: any, subgroup_class_balance_weight: float = 0, subgroup_size_weight: float = 0)[source]

Bases: BaseSoftClassifierPerformanceQF

A quality function which scores binary soft classifier performance in a subgroup based on the difference of the classifier’s Area Under the Precision-Recall Curve (PR AUC) on the subgroup cover vs. the entire dataset. If the classifier performs worse on the subgroup (i.e. it has a lower PR AUC) compared to the entire dataset, then the quality is positive.

Weighting factors are provided to let the subgroup size and class balance influence the quality.

The overall quality is captured by the formula q = (PRAUC(dataset) - PRAUC(subgroup)) * |subgroup|^(size_weight) * class_balance(subgroup)^(class_balance_weight).

Implementation of phi^{rPRAUC}_{alpha, beta} from the paper [“SubROC: AUC-Based Discovery of Exceptional Subgroup Performance for Binary Classifiers”](https://doi.org/10.48550/arXiv.2505.11283).

class pysubgroup.model_predictions_target.ROCAUCQF(label_column: str, subgroup_class_balance_weight: float = 0, subgroup_size_weight: float = 0)[source]

Bases: BaseSoftClassifierPerformanceQF

A quality function which scores binary soft classifier performance in a subgroup based on the difference of the classifier’s Area Under the Receiver Operating Characteristic Curve (ROC AUC) on the subgroup cover vs. the entire dataset. If the classifier performs worse on the subgroup (i.e. it has a lower ROC AUC) compared to the entire dataset, then the quality is positive.

Weighting factors are provided to let the subgroup size and class balance influence the quality.

The overall quality is captured by the formula q = (ROCAUC(dataset) - ROCAUC(subgroup)) * |subgroup|^(size_weight) * class_balance(subgroup)^(class_balance_weight).

Implementation of phi^{rROCAUC}_{alpha, beta} from the paper [“SubROC: AUC-Based Discovery of Exceptional Subgroup Performance for Binary Classifiers”](https://doi.org/10.48550/arXiv.2505.11283).

class pysubgroup.model_predictions_target.SoftClassifierTarget(label_column='label', prediction_column='prediction')[source]

Bases: object

Minimal target concept implementation to select label and prediction columns for binary soft classifier performance measures.

calculate_statistics(subgroup, data: DataFrame, statistics={})[source]
get_target_columns(data: DataFrame)[source]

Select the label and prediction columns from object initialization.

statistic_types = ()
pysubgroup.model_predictions_target.average_ranking_loss(y_true, y_pred)[source]

Implementation of the Average Ranking Loss (ARL) performance measure for binary soft classifiers based on the definitions in the paper [“Understanding Where Your Classifier Does (Not) Work – The SCaPE Model Class for EMM”](https://doi.org/10.1109/ICDM.2014.10).

Parameters:
  • y_true – Binary Labels, must be ordered to match y_pred.

  • y_pred – Predicted Scores, must be in ascending order.

pysubgroup.model_predictions_target.pr_auc_score(y_true, y_pred)[source]

Area Under the Precision-Recall Curve (PR AUC) performance measure for binary soft classifiers.

Parameters:
  • y_true – Binary Labels, must be ordered to match y_pred.

  • y_pred – Predicted Scores.

pysubgroup.model_target module

class pysubgroup.model_target.EMM_Likelihood(model)[source]

Bases: AbstractInterestingnessMeasure

Exceptional Model Mining likelihood-based interestingness measure.

This class computes the difference in likelihoods between a subgroup model and the inverse (complement) model, providing a measure of how exceptional the subgroup is with respect to the entire dataset.

calculate_constant_statistics(data, target)[source]

Calculate statistics that remain constant over all subgroups.

Parameters:
  • data – The dataset as a pandas DataFrame.

  • target – The target variable (unused in this context).

calculate_statistics(subgroup, target, data, statistics=None)[source]

Calculate statistics specific to a subgroup.

Parameters:
  • subgroup – The subgroup description.

  • target – The target variable (unused in this context).

  • data – The dataset as a pandas DataFrame.

  • statistics – Previously calculated statistics (optional).

Returns:

An EMM_Likelihood.tpl namedtuple containing model parameters, subgroup likelihood, inverse likelihood, and subgroup size.

evaluate(subgroup, target, data, statistics=None)[source]

Evaluate the interestingness of a subgroup.

Parameters:
  • subgroup – The subgroup description.

  • target – The target variable (unused in this context).

  • data – The dataset as a pandas DataFrame.

  • statistics – Previously calculated statistics (optional).

Returns:

The difference between subgroup likelihood and inverse likelihood.

get_tuple(sg_size, params, cover_arr)[source]

Compute the likelihoods for the subgroup and its complement.

Parameters:
  • sg_size – Size of the subgroup.

  • params – Model parameters obtained from fitting the subgroup.

  • cover_arr – Boolean array indicating the instances in the subgroup.

Returns:

An EMM_Likelihood.tpl namedtuple with the computed statistics.

gp_get_params(cover_arr, v)[source]

Get parameters for GP-Growth algorithm.

Parameters:
  • cover_arr – Boolean array indicating the instances in the subgroup.

  • v – Statistics vector from GP-Growth.

Returns:

An EMM_Likelihood.tpl namedtuple with the computed statistics.

property gp_requires_cover_arr

Indicate whether the GP-Growth algorithm requires a cover array.

Returns:

True, since the cover array is required.

tpl

alias of EMM_Likelihood

class pysubgroup.model_target.PolyRegression_ModelClass(x_name='x', y_name='y', degree=1)[source]

Bases: object

Polynomial Regression Model Class for Exceptional Model Mining.

Provides methods to fit a polynomial regression model to a subgroup and compute likelihoods for Exceptional Model Mining.

calculate_constant_statistics(data, target)[source]

Calculate statistics that remain constant over all subgroups.

Parameters:
  • data – The dataset as a pandas DataFrame.

  • target – The target variable (unused in this context).

fit(subgroup, data=None)[source]

Fit the polynomial regression model to the subgroup data.

Parameters:
  • subgroup – The subgroup description.

  • data – The dataset as a pandas DataFrame (optional).

Returns:

Contains regression coefficients and subgroup size.

Return type:

beta_tuple

gp_get_null_vector()[source]

Get a null vector for initialization in the GP-Growth algorithm.

Returns:

Zero-initialized array of size 5.

Return type:

numpy.ndarray

gp_get_params(v)[source]

Extract model parameters from the statistics vector.

Parameters:

v (numpy.ndarray) – Statistics vector.

Returns:

Contains regression coefficients and subgroup size.

Return type:

beta_tuple

gp_get_stats(row_index)[source]

Get statistics for a single row (used in GP-Growth algorithm).

Parameters:

row_index (int) – Index of the row in the dataset.

Returns:

Statistics vector for the given row.

Return type:

numpy.ndarray
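
To show why a fixed-size statistics vector is enough for GP-Growth style mining, here is a sketch for the degree-1 case. The layout (n, sum x, sum y, sum x^2, sum x*y) is an assumption chosen because it has five entries and merges by addition; the library's actual layout and merge rule may differ, and all function names are hypothetical.

```python
def null_vector():
    """Neutral element of the merge: an all-zero vector of size 5."""
    return [0.0] * 5


def row_stats(x, y):
    """Statistics of a single row: (n, sum x, sum y, sum x^2, sum x*y)."""
    return [1.0, x, y, x * x, x * y]


def merge(u, v):
    """Combine two statistics vectors; for plain sums this is addition."""
    return [a + b for a, b in zip(u, v)]


def linear_params(stats):
    """Least-squares slope and intercept recovered from the merged sums.

    Needs at least two distinct x values, otherwise the denominator is zero.
    """
    n, sx, sy, sxx, sxy = stats
    denom = n * sxx - sx * sx
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return slope, intercept, n


# Three points on the line y = 2x + 1.
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
stats = null_vector()
for x, y in points:
    stats = merge(stats, row_stats(x, y))
slope, intercept, size_sg = linear_params(stats)
print(slope, intercept, size_sg)
```

Because merging is associative, statistics of tree nodes can be combined in any order without revisiting the underlying rows.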

static gp_merge(u, v)[source]

Merge two statistics vectors for the GP-Growth algorithm.

Parameters:
  • u – The first statistics vector.

  • v – The second statistics vector.

property gp_requires_cover_arr

Indicate whether the GP-Growth algorithm requires a cover array.

Returns:

False, since the cover array is not required.

gp_size_sg(stats)[source]

Get the size of the subgroup from the statistics.

Parameters:

stats (numpy.ndarray) – Statistics vector.

Returns:

Size of the subgroup.

Return type:

float

gp_to_str(stats)[source]

Convert statistics to a string representation.

Parameters:

stats (numpy.ndarray) – Statistics vector.

Returns:

String representation of the statistics.

Return type:

str

likelihood(stats, sg)[source]

Compute the likelihoods for the subgroup instances.

Parameters:
  • stats (beta_tuple) – Regression parameters and subgroup size.

  • sg (numpy.ndarray) – Boolean array indicating subgroup instances.

Returns:

Likelihood values for the subgroup instances.

Return type:

numpy.ndarray

loglikelihood(stats, sg)[source]

Compute the log-likelihoods for the subgroup instances.

Parameters:
  • stats (beta_tuple) – Regression parameters and subgroup size.

  • sg (numpy.ndarray) – Boolean array indicating subgroup instances.

Returns:

Log-likelihood values for the subgroup instances.

Return type:

numpy.ndarray

class pysubgroup.model_target.beta_tuple(beta, size_sg)

Bases: tuple

beta

Alias for field number 0

size_sg

Alias for field number 1

pysubgroup.numeric_target module

This module defines the NumericTarget and associated quality functions for subgroup discovery when the target variable is numeric.

class pysubgroup.numeric_target.GeneralizationAware_StandardQFNumeric(a, invert=False, estimator='default', centroid='mean')[source]

Bases: GeneralizationAwareQF_stats

Generalization-Aware Standard Quality Function for Numeric Targets.

Extends StandardQFNumeric to consider generalizations during subgroup discovery, providing methods for optimistic estimates and aggregate statistics.

aggregate_statistics(stats_subgroup, list_of_pairs)[source]

Aggregate statistics from generalizations.

Parameters:
  • stats_subgroup – Statistics of the current subgroup.

  • list_of_pairs – List of (stats, agg_stats) tuples from generalizations.

Returns:

The aggregated statistics.

evaluate(subgroup, target, data, statistics=None)[source]

Evaluate the quality of the subgroup considering generalizations.

Parameters:
  • subgroup – The subgroup to evaluate.

  • target (NumericTarget) – The target definition.

  • data (pandas.DataFrame) – The dataset.

  • statistics (any, optional) – Previously computed statistics.

Returns:

The computed quality value.

Return type:

float

class pysubgroup.numeric_target.NumericTarget(target_variable)[source]

Bases: object

Target class for numeric variables in subgroup discovery.

Represents a target where the variable of interest is numeric, and computes statistics such as mean, median, standard deviation within subgroups.

calculate_statistics(subgroup, data, cached_statistics=None)[source]

Calculate various statistics for the subgroup and dataset.

Parameters:
  • subgroup – The subgroup for which to calculate statistics.

  • data (pandas.DataFrame) – The dataset.

  • cached_statistics (dict, optional) – Previously computed statistics.

Returns:

A dictionary containing statistical measures.

Return type:

dict

get_attributes()[source]

Get a list of attribute names used by the target.

Returns:

A list containing the target variable name.

Return type:

list

get_base_statistics(subgroup, data)[source]

Compute basic statistics for the subgroup and dataset.

Parameters:
  • subgroup – The subgroup for which to compute statistics.

  • data (pandas.DataFrame) – The dataset.

Returns:

(instances_dataset, mean_dataset, instances_subgroup, mean_sg)

Return type:

tuple

statistic_types = ('size_sg', 'size_dataset', 'mean_sg', 'mean_dataset', 'std_sg', 'std_dataset', 'median_sg', 'median_dataset', 'max_sg', 'max_dataset', 'min_sg', 'min_dataset', 'mean_lift', 'median_lift')
class pysubgroup.numeric_target.StandardQFNumeric(a, invert=False, estimator='default', centroid='mean')[source]

Bases: BoundedInterestingnessMeasure

Standard Quality Function for numeric targets.

This quality function computes interestingness of subgroups based on the difference between subgroup mean (or median) and dataset mean (or median), weighted by the size of the subgroup raised to the power of ‘a’.

a

Exponent to trade off between subgroup size and difference in means.

Type:

float

invert

Whether to invert the quality function (not used currently).

Type:

bool

estimator

Strategy for optimistic estimation ('sum', 'max', 'order').

Type:

str

centroid

Central tendency measure ('mean', 'median', 'sorted_median').

Type:

str

class Max_Estimator(qf)[source]

Bases: object

Estimator for optimistic estimate using maximum value strategy.

This estimator calculates the optimistic estimate based on the maximum value greater than the dataset centroid.

From Florian Lemmerich’s Dissertation [section 4.2.2.1, Theorem 4 (page 82)]:

\[oe(sg) = n_{>\mu_0}^a (T^{\max}(sg) - \mu_0)\]
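
A direct transcription of the formula above, assuming plain Python lists of target values. The function name is hypothetical, and returning 0 when no value exceeds the dataset mean is an assumption.

```python
def max_estimate(values_sg, mean_dataset, a):
    """n_{>mu_0}^a * (max target value in the subgroup - mu_0)."""
    above = [v for v in values_sg if v > mean_dataset]
    if not above:
        return 0.0  # assumed fallback: nothing lies above the dataset mean
    return len(above) ** a * (max(values_sg) - mean_dataset)


# Two values (6 and 9) exceed the dataset mean of 5; the maximum is 9.
oe_max = max_estimate([1, 4, 6, 9], mean_dataset=5, a=0.5)
print(oe_max)
```
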
calculate_constant_statistics(data, target)[source]

Calculate constant statistics needed for estimation.

Parameters:
  • data (pandas.DataFrame) – The dataset.

  • target (NumericTarget) – The target definition.

get_data(data, target)[source]

Prepare data for estimation (no changes for this estimator).

Parameters:
  • data (pandas.DataFrame) – The dataset.

  • target (NumericTarget) – The target definition.

Returns:

The unmodified dataset.

Return type:

pandas.DataFrame

get_estimate(subgroup, sg_size, sg_centroid, cover_arr, _)[source]

Compute the optimistic estimate for the subgroup.

Parameters:
  • subgroup – The subgroup description.

  • sg_size (int) – Size of the subgroup.

  • sg_centroid (float) – Mean or median of the subgroup.

  • cover_arr (numpy.ndarray) – Boolean array indicating subgroup instances.

  • _ – Unused parameter.

Returns:

The optimistic estimate.

Return type:

float

class MeanOrdering_Estimator(qf)[source]

Bases: object

Estimator for optimistic estimate using mean ordering strategy.

This estimator sorts the target values and computes the optimal subgroup by considering prefixes of the sorted list.
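
The prefix idea can be sketched as follows: sort the subgroup's target values in descending order, score every non-empty prefix with size^a * (prefix mean - dataset mean), and keep the best score. The function name and the fallback of 0.0 for an empty subgroup are assumptions.

```python
def mean_ordering_estimate(values_sg, mean_dataset, a):
    """Best quality over all prefixes of the descending-sorted target values."""
    ordered = sorted(values_sg, reverse=True)
    scores = []
    running_sum = 0.0
    for k, value in enumerate(ordered, start=1):
        running_sum += value
        prefix_mean = running_sum / k
        scores.append(k ** a * (prefix_mean - mean_dataset))
    return max(scores, default=0.0)  # assumed: empty subgroup scores 0


# Prefix scores for a = 1 and dataset mean 5: [9] -> 4, [9, 6] -> 5, then lower.
oe_order = mean_ordering_estimate([1, 4, 6, 9], mean_dataset=5, a=1.0)
print(oe_order)
```

Only prefixes need to be checked because, for a fixed size, the subset with the highest mean always consists of the largest values.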

calculate_constant_statistics(data, target)[source]

Set up the estimation function, possibly using Numba for speed.

Parameters:
  • data (pandas.DataFrame) – The dataset.

  • target (NumericTarget) – The target definition.

get_data(data, target)[source]

Prepare data by sorting according to the target variable.

Parameters:
  • data (pandas.DataFrame) – The dataset.

  • target (NumericTarget) – The target definition.

Returns:

The sorted dataset.

Return type:

pandas.DataFrame

get_estimate(subgroup, sg_size, sg_mean, cover_arr, target_values_sg)[source]

Compute the optimistic estimate for the subgroup.

Parameters:
  • subgroup – The subgroup description.

  • sg_size (int) – Size of the subgroup.

  • sg_mean (float) – Mean of the subgroup.

  • cover_arr (numpy.ndarray) – Boolean array indicating subgroup instances.

  • target_values_sg (numpy.ndarray) – Target values in the subgroup.

Returns:

The optimistic estimate.

Return type:

float

get_estimate_numpy(values_sg, _, mean_dataset)[source]

Compute the optimistic estimate using NumPy.

Parameters:
  • values_sg (numpy.ndarray) – Sorted target values in the subgroup.

  • _ – Unused parameter.

  • mean_dataset (float) – Mean of the dataset.

Returns:

The optimistic estimate.

Return type:

float

class Summation_Estimator(qf)[source]

Bases: object

Estimator for optimistic estimate using summation strategy.

This estimator calculates the optimistic estimate as a hypothetical subgroup which contains only instances with value greater than the dataset mean and is of maximal size.

From Florian Lemmerich’s Dissertation [section 4.2.2.1, Theorem 2 (page 81)]:

\[oe(sg) = \sum_{x \in sg,\ T(x) > \mu_0} (T(x) - \mu_0)\]
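
Following the prose description of this estimator, the sketch below sums the excess over the dataset mean for every subgroup instance whose target value lies above that mean. The function name is hypothetical and plain Python lists stand in for the target column.

```python
def summation_estimate(values_sg, mean_dataset):
    """Sum of (value - dataset mean) over subgroup values above the mean."""
    return sum(v - mean_dataset for v in values_sg if v > mean_dataset)


# Only 6 and 9 exceed the dataset mean of 5, contributing 1 and 4.
oe_sum = summation_estimate([1, 4, 6, 9], mean_dataset=5)
print(oe_sum)
```
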
calculate_constant_statistics(data, target)[source]

Calculate constant statistics needed for estimation.

Parameters:
  • data (pandas.DataFrame) – The dataset.

  • target (NumericTarget) – The target definition.

get_data(data, target)[source]

Prepare data for estimation (no changes for this estimator).

Parameters:
  • data (pandas.DataFrame) – The dataset.

  • target (NumericTarget) – The target definition.

Returns:

The unmodified dataset.

Return type:

pandas.DataFrame

get_estimate(subgroup, sg_size, sg_centroid, cover_arr, _)[source]

Compute the optimistic estimate for the subgroup.

Parameters:
  • subgroup – The subgroup description.

  • sg_size (int) – Size of the subgroup.

  • sg_centroid (float) – Mean or median of the subgroup.

  • cover_arr (numpy.ndarray) – Boolean array indicating subgroup instances.

  • _ – Unused parameter.

Returns:

The optimistic estimate.

Return type:

float

calculate_constant_statistics(data, target)[source]

Calculate statistics that remain constant for the dataset.

Parameters:
  • data (pandas.DataFrame) – The dataset.

  • target (NumericTarget) – The target definition.

calculate_statistics(subgroup, target, data, statistics=None)[source]

Calculate statistics specific to the subgroup.

Parameters:
  • subgroup – The subgroup for which to calculate statistics.

  • target (NumericTarget) – The target definition.

  • data (pandas.DataFrame) – The dataset.

  • statistics (any, optional) – Unused in this implementation.

Returns:

Contains size_sg, mean or median, and estimate.

Return type:

namedtuple

evaluate(subgroup, target, data, statistics=None)[source]

Evaluate the quality of the subgroup using the standard quality function.

Parameters:
  • subgroup – The subgroup to evaluate.

  • target (NumericTarget) – The target definition.

  • data (pandas.DataFrame) – The dataset.

  • statistics (any, optional) – Previously computed statistics.

Returns:

The computed quality value.

Return type:

float

mean_tpl

alias of StandardQFNumeric_parameters

median_tpl

alias of StandardQFNumeric_median_parameters

optimistic_estimate(subgroup, target, data, statistics=None)[source]

Compute the optimistic estimate of the quality function.

Parameters:
  • subgroup – The subgroup for which to compute the optimistic estimate.

  • target (NumericTarget) – The target definition.

  • data (pandas.DataFrame) – The dataset.

  • statistics (any, optional) – Previously computed statistics.

Returns:

The optimistic estimate of the quality value.

Return type:

float

static standard_qf_numeric(a, _, mean_dataset, instances_subgroup, mean_sg)[source]

Compute the standard quality function for numeric targets.

Parameters:
  • a (float) – Exponent for weighting the subgroup size.

  • _ – Unused parameter (size of dataset).

  • mean_dataset (float) – Mean of the target variable in the dataset.

  • instances_subgroup (int) – Number of instances in the subgroup.

  • mean_sg (float) – Mean of the target variable in the subgroup.

Returns:

The computed quality value.

Return type:

float
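
Based on the class description (difference of means weighted by the subgroup size raised to the power a), the computation can be sketched like this. Edge cases such as empty subgroups are not handled here, and how the library treats them is not stated in this reference.

```python
def standard_qf_numeric(a, _, mean_dataset, instances_subgroup, mean_sg):
    """Subgroup size to the power a, times the difference of the means."""
    return instances_subgroup ** a * (mean_sg - mean_dataset)


# 16 instances with a = 0.5 give a size factor of 4; the means differ by 2.
quality = standard_qf_numeric(0.5, None, 5.0, 16, 7.0)
print(quality)
```

With a = 0 only the mean difference counts, while a = 1 rewards large subgroups most strongly.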

tpl

alias of StandardQFNumeric_parameters

class pysubgroup.numeric_target.StandardQFNumericMedian[source]

Bases: BoundedInterestingnessMeasure

Quality function for numeric targets using median (deprecated).

Note

This class is no longer supported. Use StandardQFNumeric with centroid='median' instead.

tpl

alias of StandardQFNumericMedian_parameters

class pysubgroup.numeric_target.StandardQFNumericTscore(invert=False)[source]

Bases: BoundedInterestingnessMeasure

Quality function for numeric targets using T-score.

calculate_constant_statistics(data, target)[source]

Calculate statistics that remain constant for the dataset.

Parameters:
  • data (pandas.DataFrame) – The dataset.

  • target (NumericTarget) – The target definition.

calculate_statistics(subgroup, target, data, statistics=None)[source]

Calculate statistics specific to the subgroup.

Parameters:
  • subgroup – The subgroup for which to calculate statistics.

  • target (NumericTarget) – The target definition.

  • data (pandas.DataFrame) – The dataset.

  • statistics (any, optional) – Unused in this implementation.

Returns:

Contains size_sg, mean, std, and estimate.

Return type:

namedtuple

evaluate(subgroup, target, data, statistics=None)[source]

Evaluate the quality of the subgroup using the T-score.

Parameters:
  • subgroup – The subgroup to evaluate.

  • target (NumericTarget) – The target definition.

  • data (pandas.DataFrame) – The dataset.

  • statistics (any, optional) – Previously computed statistics.

Returns:

The computed T-score.

Return type:

float

optimistic_estimate(subgroup, target, data, statistics=None)[source]

Compute the optimistic estimate of the quality function.

Parameters:
  • subgroup – The subgroup for which to compute the optimistic estimate.

  • target – The target definition.

  • data – The dataset.

  • statistics (any, optional) – Previously computed statistics.

Returns:

The optimistic estimate of the quality value.

Return type:

float

static t_score(mean_dataset, instances_subgroup, mean_sg, std_sg)[source]

Compute the T-score for the subgroup.

Parameters:
  • mean_dataset (float) – Mean of the dataset.

  • instances_subgroup (int) – Number of instances in the subgroup.

  • mean_sg (float) – Mean of the subgroup.

  • std_sg (float) – Standard deviation of the subgroup.

Returns:

The computed T-score.

Return type:

float
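
A sketch of the usual one-sample t-statistic form, which matches the four arguments listed above. Returning 0 for a zero standard deviation is an assumption made to avoid a division by zero.

```python
import math


def t_score(mean_dataset, instances_subgroup, mean_sg, std_sg):
    """sqrt(n) * (subgroup mean - dataset mean) / subgroup standard deviation."""
    if std_sg == 0:
        return 0.0  # assumed fallback for a degenerate subgroup
    return math.sqrt(instances_subgroup) * (mean_sg - mean_dataset) / std_sg


# 25 instances, means 6 vs. 5, standard deviation 2: 5 * 1 / 2.
score = t_score(mean_dataset=5.0, instances_subgroup=25, mean_sg=6.0, std_sg=2.0)
print(score)
```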

tpl

alias of StandardQFNumericTscore_parameters

pysubgroup.numeric_target.calc_sorted_median(arr)[source]

Calculate the median of a sorted array.

Parameters:

arr (numpy.ndarray) – A sorted array.

Returns:

The median value.

Return type:

float
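
Because the input is already sorted, the median can be read off by position without sorting again. This sketch assumes a non-empty sequence and averages the two middle values for even lengths.

```python
def calc_sorted_median(arr):
    """Median of an already sorted, non-empty sequence."""
    n = len(arr)
    mid = n // 2
    if n % 2 == 1:
        return float(arr[mid])
    return (arr[mid - 1] + arr[mid]) / 2


print(calc_sorted_median([1, 3, 5]), calc_sorted_median([1, 3, 5, 7]))
```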

pysubgroup.numeric_target.read_mean(tpl)[source]

Extract the mean value from a namedtuple.

Parameters:

tpl (namedtuple) – A namedtuple containing a ‘mean’ field.

Returns:

The mean value.

Return type:

float

pysubgroup.numeric_target.read_median(tpl)[source]

Extract the median value from a namedtuple.

Parameters:

tpl (namedtuple) – A namedtuple containing a ‘median’ field.

Returns:

The median value.

Return type:

float
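
One plausible use of these two readers is to pick the centroid accessor once, based on a `centroid` option, and then apply it to every statistics tuple. The tuple types and the `read_centroid` dispatcher below are hypothetical and serve only as an illustration.

```python
from collections import namedtuple

# Hypothetical statistics tuples; the field names are assumptions.
MeanStats = namedtuple("MeanStats", ["size_sg", "mean", "estimate"])
MedianStats = namedtuple("MedianStats", ["size_sg", "median", "estimate"])


def read_mean(tpl):
    """Return the 'mean' field of a statistics tuple."""
    return tpl.mean


def read_median(tpl):
    """Return the 'median' field of a statistics tuple."""
    return tpl.median


READERS = {"mean": read_mean, "median": read_median}


def read_centroid(tpl, centroid="mean"):
    """Dispatch to the reader that matches the configured centroid."""
    return READERS[centroid](tpl)


print(read_centroid(MeanStats(10, 4.2, 0.0)))
print(read_centroid(MedianStats(10, 3.0, 0.0), centroid="median"))
```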

pysubgroup.permutation_test module

class pysubgroup.permutation_test.NegativeClassCountRandomSelector(size, negative_class_count, np_rng, positive_class_indices, negative_class_indices)[source]

Bases: object

A selector that covers a random subset of the given indices such that both the number of covered instances and the number of covered negative instances always stay the same.
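
The sampling idea can be sketched with the standard library: draw the required number of negatives and fill the rest with positives. This uses `random.Random` instead of a NumPy generator, and the function name is hypothetical.

```python
import random


def sample_cover(size, negative_class_count, positive_class_indices,
                 negative_class_indices, rng):
    """Draw a random cover with fixed total size and fixed negative count."""
    positive_count = size - negative_class_count
    chosen = rng.sample(negative_class_indices, negative_class_count)
    chosen += rng.sample(positive_class_indices, positive_count)
    return sorted(chosen)


rng = random.Random(0)  # seeded for reproducibility
positives = [0, 1, 2, 3, 4]
negatives = [5, 6, 7]
cover = sample_cover(4, 1, positives, negatives, rng)
print(cover)
```

Calling the function again with the same generator yields a new cover with the same class counts, which is what a repeated `select()` would need.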

covers(data_instance)[source]
select()[source]

Randomize the cover of this selector.

property selectors
set_descriptions(size, negative_class_count, *args, **kwargs)[source]
pysubgroup.permutation_test.permutation_test(qf: any, result: any, target: SoftClassifierTarget, data: DataFrame, num_random_samples: int, np_rng: Generator = Generator(PCG64), max_random_sampling_retries: int = 10, alpha: float = 0.05, pos_label: any = 1, neg_label: any = 0, multitest_correction_method: str = 'fdr_by', tqdm: any = None)[source]

Test the subgroups in the result for statistical significance by comparison to qualities of random samples from the data. Random samples are drawn such that the number of instances from each class in the sample is the same as in the tested subgroup.

Only for SoftClassifierTargets.

Parameters:
  • qf – Quality function to use as the test statistic.

  • result – ps.SubgroupDiscoveryResult object holding the subgroups to test.

  • target – Target concept to use in the quality function.

  • data – Dataset to compute all qualities from. The qualities of the given subgroups are also recomputed on this data for the test.

  • num_random_samples – How many random samples to draw. More samples give a finer resolution of the p-values.

  • np_rng – Random generator object to use for drawing the samples. Use this to get reproducible results.

  • max_random_sampling_retries – How often to repeat the drawing process for each sample to get a quality. Repetitions are used when the quality is undefined on a random sample.

  • pos_label – Which value in the dataset to count as a positive class.

  • neg_label – Which value in the dataset to count as a negative class.

  • multitest_correction_method – Method used to correct the p-values for multiple comparisons. Refer to statsmodels.stats.multitest.multipletests for all possible values.

Return p_values_raw:

Uncorrected p-values for each subgroup

Return reject:

Test result after multiple testing correction.

Return p_values_corrected:

P-values after multiple testing correction.

Return qualities:

Subgroup qualities on the testing data.

Return samples:

The full random sample of qualities that was generated for each subgroup.
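The sampling idea described above can be sketched in plain Python. This is an illustrative sketch, not the library's implementation: the helper names, the toy quality function (mean score), and the "+1" smoothing in the p-value estimate are assumptions, and the multiple testing correction step is left out.

```python
import random

def draw_class_preserving_sample(pos_idx, neg_idx, n_pos, n_neg, rng):
    """Draw random indices with a fixed number of positives and negatives."""
    return rng.sample(pos_idx, n_pos) + rng.sample(neg_idx, n_neg)

def empirical_p_value(observed, sampled_qualities):
    """Share of sampled qualities at least as large as the observed one.

    The +1 in numerator and denominator is a common smoothing choice
    (an assumption here) that keeps the estimate above zero.
    """
    at_least = sum(q >= observed for q in sampled_qualities)
    return (at_least + 1) / (len(sampled_qualities) + 1)

# Toy data: scores stand in for a classifier's soft predictions.
scores = [0.9, 0.8, 0.7, 0.4, 0.35, 0.3, 0.2, 0.6, 0.1, 0.5]
labels = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0]
pos_idx = [i for i, y in enumerate(labels) if y == 1]
neg_idx = [i for i, y in enumerate(labels) if y == 0]

def quality(indices):
    # Hypothetical quality function: mean score of the covered instances.
    return sum(scores[i] for i in indices) / len(indices)

subgroup = [0, 1, 3]          # covers two positives and one negative
rng = random.Random(0)        # fixed seed for reproducible draws
samples = [
    quality(draw_class_preserving_sample(pos_idx, neg_idx, 2, 1, rng))
    for _ in range(200)
]
p_raw = empirical_p_value(quality(subgroup), samples)
```

Each random sample has the same class counts as the tested subgroup, so the comparison isolates how unusual the subgroup's quality is for a cover of that composition.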

pysubgroup.refinement_operator module

class pysubgroup.refinement_operator.RefinementOperator[source]

Bases: object

Base class for refinement operators.

class pysubgroup.refinement_operator.StaticGeneralizationOperator(selectors)[source]

Bases: object

Refinement operator for static generalization.

This operator generalizes subgroups by adding selectors from a predefined list, ensuring that each selector is used in a specific order.

refinements(sG)[source]

Generate refinements of the given subgroup.

Parameters:

sG – The subgroup to refine.

Returns:

A generator of refined subgroups.

class pysubgroup.refinement_operator.StaticSpecializationOperator(selectors)[source]

Bases: object

Refinement operator for static specialization.

This operator specializes subgroups by adding selectors in a predefined order, ensuring that each attribute is used only once in a subgroup description.

refinements(subgroup)[source]

Generate refinements of the given subgroup.

Parameters:

subgroup – The subgroup to refine.

Returns:

A generator of refined subgroups.
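A minimal sketch of the "predefined order, each attribute only once" idea, assuming selectors are plain (attribute, value) tuples and a subgroup is a tuple of selectors. The class name and data layout are hypothetical and only illustrate the technique.

```python
from itertools import chain

class SpecializationSketch:
    def __init__(self, selectors):
        # Group selectors by attribute, keeping first-seen attribute order.
        groups = {}
        for sel in selectors:
            groups.setdefault(sel[0], []).append(sel)
        self.search_space = list(groups.values())
        self.index = {attr: i for i, attr in enumerate(groups)}

    def refinements(self, subgroup):
        # Only attributes that come after the last used attribute are allowed,
        # so no attribute repeats and no description is generated twice.
        start = self.index[subgroup[-1][0]] + 1 if subgroup else 0
        for sel in chain.from_iterable(self.search_space[start:]):
            yield subgroup + (sel,)

op = SpecializationSketch([("a", 1), ("a", 2), ("b", 1), ("c", 1)])
level_1 = list(op.refinements(()))
level_2 = list(op.refinements((("a", 1),)))
```

Here `level_2` contains only extensions by attributes "b" and "c", because "a" is already used.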

pysubgroup.representations module

class pysubgroup.representations.BitSetRepresentation(df, selectors_to_patch)[source]

Bases: RepresentationBase

Representation class that uses bitsets for selectors and conjunctions.

Conjunction

alias of BitSet_Conjunction

Disjunction

alias of BitSet_Disjunction

patch_classes()[source]

Patch class-level attributes before entering the context.

patch_selector(sel)[source]

Patch a selector by computing its bitset representation.

Parameters:

sel – Selector to patch.

class pysubgroup.representations.BitSet_Conjunction(*args, **kwargs)[source]

Bases: Conjunction

Conjunction subclass that uses bitsets for representation.

Provides efficient computation of the conjunction using numpy boolean arrays.

append_and(to_append)[source]

Append a selector using logical AND and update the representation.

Parameters:

to_append – Selector to append.

compute_representation()[source]

Compute the bitset representation of the conjunction.

Returns:

Numpy boolean array representing the instances covered by the conjunction.

n_instances = 0
property size_sg

Size of the subgroup represented by the conjunction.
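The bitset idea can be illustrated without the library. The sketch below uses plain Python lists of booleans as a stand-in for numpy boolean arrays; the function name and the example masks are assumptions.

```python
def and_masks(masks, n_instances):
    """Combine per-selector boolean masks with element-wise AND."""
    result = [True] * n_instances      # the empty conjunction covers everything
    for mask in masks:
        result = [a and b for a, b in zip(result, mask)]
    return result

# One boolean per instance: True where the selector covers the instance.
age_lt_30 = [True, True, False, False, True]
smoker = [True, False, True, False, True]

cover = and_masks([age_lt_30, smoker], n_instances=5)
size_sg = sum(cover)                   # number of covered instances
```

With numpy arrays the same computation would typically be a single vectorized AND over the arrays.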

class pysubgroup.representations.BitSet_Disjunction(*args, **kwargs)[source]

Bases: Disjunction

Disjunction subclass that uses bitsets for representation.

Provides efficient computation of the disjunction using numpy boolean arrays.

append_or(to_append)[source]

Append a selector using logical OR and update the representation.

Parameters:

to_append – Selector to append.

compute_representation()[source]

Compute the bitset representation of the disjunction.

Returns:

Numpy boolean array representing the instances covered by the disjunction.

property size_sg

Size of the subgroup represented by the disjunction.

class pysubgroup.representations.NumpySetRepresentation(df, selectors_to_patch)[source]

Bases: RepresentationBase

Representation class that uses numpy arrays for selectors and conjunctions.

Conjunction

alias of NumpySet_Conjunction

patch_classes()[source]

Patch class-level attributes before entering the context.

patch_selector(sel)[source]

Patch a selector by computing its numpy array representation.

Parameters:

sel – Selector to patch.

class pysubgroup.representations.NumpySet_Conjunction(*args, **kwargs)[source]

Bases: Conjunction

Conjunction subclass that uses numpy arrays for set representation.

all_set = None
append_and(to_append)[source]

Append a selector using logical AND and update the representation.

Parameters:

to_append – Selector to append.

compute_representation()[source]

Compute the numpy array representation of the conjunction.

Returns:

Numpy array of indices representing the instances covered by the conjunction.

property size_sg

Size of the subgroup represented by the conjunction.

class pysubgroup.representations.RepresentationBase(new_conjunction, selectors_to_patch)[source]

Bases: object

Base class for different representation strategies.

Provides methods to patch selectors and manage class-level patches. Can be used as a context manager to ensure patches are applied and removed properly.

patch_all_selectors()[source]

Patch all selectors in the selectors_to_patch list.

patch_classes()[source]

Patch the required classes.

Can be overridden by subclasses to patch class-level attributes or methods.

patch_selector(sel)[source]

Patch a single selector.

This method should be implemented by subclasses.

undo_patch_classes()[source]

Undo patches applied to classes.

Can be overridden by subclasses to remove class-level patches.
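The patch-and-undo pattern behind a context manager of this kind can be sketched as follows. The class names and the patched attribute are hypothetical; the sketch shows only the general technique, not how pysubgroup patches its classes.

```python
class Conj:
    def describe(self):
        return "plain"

class PatchingContext:
    """Temporarily replace a class attribute and restore it on exit."""

    def __init__(self, cls, name, replacement):
        self.cls = cls
        self.name = name
        self.replacement = replacement

    def __enter__(self):
        self._original = getattr(self.cls, self.name)   # remember the old value
        setattr(self.cls, self.name, self.replacement)  # apply the patch
        return self

    def __exit__(self, exc_type, exc, tb):
        setattr(self.cls, self.name, self._original)    # undo, even on errors
        return False                                    # do not swallow exceptions

with PatchingContext(Conj, "describe", lambda self: "patched"):
    inside = Conj().describe()
after = Conj().describe()
```

Because the undo step lives in `__exit__`, the patch is removed even if the body of the `with` block raises.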

class pysubgroup.representations.SetRepresentation(df, selectors_to_patch)[source]

Bases: RepresentationBase

Representation class that uses sets for selectors and conjunctions.

Conjunction

alias of Set_Conjunction

patch_classes()[source]

Patch class-level attributes before entering the context.

patch_selector(sel)[source]

Patch a selector by computing its set representation.

Parameters:

sel – Selector to patch.

class pysubgroup.representations.Set_Conjunction(*args, **kwargs)[source]

Bases: Conjunction

Conjunction subclass that uses sets for representation.

all_set = {}
append_and(to_append)[source]

Append a selector using logical AND and update the representation.

Parameters:

to_append – Selector to append.

compute_representation()[source]

Compute the set representation of the conjunction.

Returns:

Set of indices representing the instances covered by the conjunction.

property size_sg

Size of the subgroup represented by the conjunction.

pysubgroup.subgroup_description module

Created on 28.04.2016

@author: lemmerfn

class pysubgroup.subgroup_description.BooleanExpressionBase[source]

Bases: ABC

Base class for boolean expressions (conjunctions and disjunctions).

abstractmethod append_and(to_append)[source]

Append a selector or expression using logical AND.

abstractmethod append_or(to_append)[source]

Append a selector or expression using logical OR.

class pysubgroup.subgroup_description.Conjunction(selectors)[source]

Bases: BooleanExpressionBase

Conjunction of selectors (logical AND).

append_and(to_append)[source]

Append a selector or conjunction using logical AND.

append_or(to_append)[source]

Append a selector or expression using logical OR (not supported).

covers(instance)[source]

Determine which instances are covered by the conjunction.

Parameters:

instance – pandas DataFrame containing the data.

Returns:

A boolean array indicating which instances are covered.

property depth

Return the number of selectors in the conjunction.

static from_str(s)[source]

Create a Conjunction from a string representation.

Parameters:

s – String representation of the conjunction.

Returns:

A Conjunction instance.

pop_and()[source]

Remove and return the last selector added using AND.

pop_or()[source]

Pop operation for OR is not supported in Conjunction.

property selectors

Return the selectors in the conjunction as a tuple.

class pysubgroup.subgroup_description.DNF(selectors=None)[source]

Bases: Disjunction

Disjunctive Normal Form expression.

append_and(to_append)[source]

Append a selector using logical AND to all conjunctions.

append_or(to_append)[source]

Append a selector or conjunction using logical OR.

pop_and()[source]

Remove and return the last selector added using AND from all conjunctions.

class pysubgroup.subgroup_description.Disjunction(selectors=None)[source]

Bases: BooleanExpressionBase

Disjunction of selectors (logical OR).

append_and(to_append)[source]

Append a selector or expression using logical AND (not supported).

append_or(to_append)[source]

Append a selector or disjunction using logical OR.

covers(instance)[source]

Determine which instances are covered by the disjunction.

Parameters:

instance – pandas DataFrame containing the data.

Returns:

A boolean array indicating which instances are covered.

property selectors

Return the selectors in the disjunction as a tuple.

class pysubgroup.subgroup_description.EqualitySelector(*args, **kwargs)[source]

Bases: SelectorBase

Selector that checks for equality with a specific value.

property attribute_name

Name of the attribute.

property attribute_value

Value of the attribute to compare for equality.

classmethod compute_descriptions(attribute_name, attribute_value, selector_name)[source]

Compute the descriptions (hash, query, string) for the selector.

covers(data)[source]

Determine which instances in data are covered by this selector.

Parameters:

data – pandas DataFrame containing the data.

Returns:

A boolean array indicating which instances are covered.

static from_str(s)[source]

Create an EqualitySelector from a string representation.

Parameters:

s – String representation of the selector.

Returns:

An EqualitySelector instance.

property selectors

Return the selector itself as a tuple (for compatibility).

set_descriptions(attribute_name, attribute_value, selector_name=None)[source]

Set the descriptions (query, string, hash) for the selector.

class pysubgroup.subgroup_description.IntervalSelector(*args, **kwargs)[source]

Bases: SelectorBase

Selector that checks if a value is within an interval.

property attribute_name

Name of the attribute.

classmethod compute_descriptions(attribute_name, lower_bound, upper_bound, selector_name=None)[source]

Compute the descriptions (hash, query, string) for the interval selector.

classmethod compute_string(attribute_name, lower_bound, upper_bound, rounding_digits)[source]

Compute the string representation of the interval selector.

covers(data_instance)[source]

Determine which instances are covered by this interval selector.

Parameters:

data_instance – pandas DataFrame containing the data.

Returns:

A boolean array indicating which instances are within the interval.

static from_str(s)[source]

Create an IntervalSelector from a string representation.

Parameters:

s – String representation of the interval selector.

Returns:

An IntervalSelector instance.

property lower_bound

Lower bound of the interval (inclusive).

property selectors

Return the selector itself as a tuple (for compatibility).

set_descriptions(attribute_name, lower_bound, upper_bound, selector_name=None)[source]

Set the descriptions (hash, query, string) for the interval selector.

property upper_bound

Upper bound of the interval (exclusive).
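The bounds documented above describe a half-open interval [lower_bound, upper_bound). A small sketch of that coverage rule on a plain list (the function name is an assumption):

```python
def interval_covers(values, lower_bound, upper_bound):
    """True where lower_bound <= value < upper_bound."""
    return [lower_bound <= v < upper_bound for v in values]

mask = interval_covers([1, 2, 3, 4], lower_bound=2, upper_bound=4)
```

The value 2 is covered (inclusive lower bound) while 4 is not (exclusive upper bound), so adjacent intervals that share a boundary do not overlap.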

class pysubgroup.subgroup_description.NegatedSelector(*args, **kwargs)[source]

Bases: SelectorBase

Selector that negates another selector.

property attribute_name

Name of the attribute.

covers(data_instance)[source]

Determine which instances are not covered by the underlying selector.

Parameters:

data_instance – pandas DataFrame containing the data.

Returns:

A boolean array indicating which instances are not covered.

property selectors

Return the selector itself as a tuple (for compatibility).

set_descriptions(selector)[source]

Set the descriptions (query, hash) for the negated selector.

class pysubgroup.subgroup_description.SelectorBase(*args, **kwargs)[source]

Bases: ABC

Base class for selectors, ensuring each selector instance is unique.

abstractmethod set_descriptions(*args, **kwargs)[source]

Set the descriptions for the selector.
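The text above states only that selector instances are unique. One common way to get that property is to cache instances in `__new__`; the sketch below assumes this approach and uses hypothetical names, so it should not be read as the library's actual mechanism.

```python
class InternedSelector:
    _cache = {}

    def __new__(cls, attribute, value):
        key = (cls.__name__, attribute, value)
        if key not in cls._cache:
            obj = super().__new__(cls)
            obj.attribute = attribute
            obj.value = value
            cls._cache[key] = obj      # first request creates the instance
        return cls._cache[key]         # later requests reuse it

first = InternedSelector("age", 30)
second = InternedSelector("age", 30)
other = InternedSelector("age", 40)
```

With unique instances, equality checks and dictionary lookups on selectors can rely on object identity.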

pysubgroup.subgroup_description.create_nominal_selectors(data, ignore=None)[source]

Create equality selectors for nominal attributes.

Parameters:
  • data – pandas DataFrame containing the data.

  • ignore – List of attribute names to ignore.

Returns:

List of EqualitySelector instances.

pysubgroup.subgroup_description.create_nominal_selectors_for_attribute(data, attribute_name, dtypes=None)[source]

Create equality selectors for a nominal attribute.

Parameters:
  • data – pandas DataFrame containing the data.

  • attribute_name – Name of the attribute.

  • dtypes – Data types of the data columns.

Returns:

List of EqualitySelector instances for the attribute.

pysubgroup.subgroup_description.create_numeric_selectors(data, nbins=5, intervals_only=True, weighting_attribute=None, ignore=None)[source]

Create selectors for numeric attributes.

Parameters:
  • data – pandas DataFrame containing the data.

  • nbins – Number of bins to use for discretization.

  • intervals_only – If True, only create interval selectors.

  • weighting_attribute – Optional attribute for weighting.

  • ignore – List of attribute names to ignore.

Returns:

List of numeric selectors.

pysubgroup.subgroup_description.create_numeric_selectors_for_attribute(data, attr_name, nbins=5, intervals_only=True, weighting_attribute=None)[source]

Create selectors for a numeric attribute.

Parameters:
  • data – pandas DataFrame containing the data.

  • attr_name – Name of the attribute.

  • nbins – Number of bins to use for discretization.

  • intervals_only – If True, only create interval selectors.

  • weighting_attribute – Optional attribute for weighting.

Returns:

List of numeric selectors for the attribute.

pysubgroup.subgroup_description.create_selectors(data, nbins=5, intervals_only=True, ignore=None)[source]

Create a list of selectors for all attributes in the data.

Parameters:
  • data – pandas DataFrame containing the data.

  • nbins – Number of bins to use for numeric attributes.

  • intervals_only – If True, only create interval selectors for numeric attributes.

  • ignore – List of attribute names to ignore.

Returns:

List of selectors.

pysubgroup.subgroup_description.get_cover_array_and_size(subgroup, data_len=None, data=None)[source]

Compute the cover array and its size for a given subgroup.

Parameters:
  • subgroup – The subgroup for which to compute the cover array and size.

  • data_len – Optional length of the data.

  • data – Optional data.

Returns:

Tuple of (cover array, size).

pysubgroup.subgroup_description.get_size(subgroup, data_len=None, data=None)[source]

Compute the size of the cover array for a given subgroup.

Parameters:
  • subgroup – The subgroup for which to compute the size.

  • data_len – Optional length of the data.

  • data – Optional data.

Returns:

Size of the cover array.

pysubgroup.subgroup_description.pandas_sparse_eq(col, value)[source]

Compare a pandas sparse column to a value.

Parameters:
  • col – pandas Series with SparseArray data.

  • value – The value to compare with.

Returns:

A pandas SparseArray of booleans indicating where col equals value.

pysubgroup.subgroup_description.remove_target_attributes(selectors, target)[source]

Remove selectors that are based on target attributes.

Parameters:
  • selectors – List of selectors.

  • target – The target object with get_attributes method.

Returns:

List of selectors not based on target attributes.

pysubgroup.utils module

Created on 02.05.2016

@author: lemmerfn

class pysubgroup.utils.BaseTarget[source]

Bases: object

Base class for defining targets in subgroup discovery.

Provides a method to check if all required statistics are present.

all_statistics_present(cached_statistics)[source]

Checks if all required statistics are present in the cached statistics.

Parameters:

cached_statistics (dict) – The dictionary of cached statistics.

Returns:

True if all required statistics are present, False otherwise.

Return type:

bool

class pysubgroup.utils.SubgroupDiscoveryResult(results, task)[source]

Bases: object

Represents the result of a subgroup discovery task.

Contains methods to convert results to different formats.

to_dataframe(statistics_to_show=None, autoround=False, include_target=False)[source]

Converts the results to a pandas DataFrame.

Parameters:
  • statistics_to_show (list, optional) – The statistics to include in the DataFrame.

  • autoround (bool) – If True, automatically rounds numerical columns.

  • include_target (bool) – If True, includes the target in the DataFrame.

Returns:

A pandas DataFrame representing the results.

Return type:

DataFrame

to_descriptions(include_stats=False)[source]

Converts the results to a list of subgroup descriptions.

Parameters:

include_stats (bool) – If True, includes statistics in the output.

Returns:

A list of subgroup descriptions.

Return type:

list

to_latex(statistics_to_show=None, escape_underscore=True)[source]

Converts the results to a LaTeX-formatted table.

Parameters:
  • statistics_to_show (list, optional) – The statistics to include in the LaTeX table.

  • escape_underscore (bool) – If True, escapes underscores in strings.

Returns:

A string containing the LaTeX-formatted table.

Return type:

str

to_table(statistics_to_show=None, print_header=True, include_target=False)[source]

Converts the results to a table format.

Parameters:
  • statistics_to_show (list, optional) – The statistics to include in the table.

  • print_header (bool) – If True, includes a header row.

  • include_target (bool) – If True, includes the target in the table.

Returns:

A list of rows representing the table.

Return type:

list

pysubgroup.utils.add_if_required(result, sg, quality, task: SubgroupDiscoveryTask, check_for_duplicates=False, statistics=None, explicit_result_set_size=None)[source]

Adds a subgroup to the result set if it meets the required quality and constraints.

Important

Only add/remove subgroups from result by using heappop and heappush to ensure order of subgroups by quality.

Parameters:
  • result (list) – The current list of subgroups (heap).

  • sg – The subgroup to potentially add.

  • quality (float) – The quality of the subgroup.

  • task (SubgroupDiscoveryTask) – The task containing parameters and constraints.

  • check_for_duplicates (bool) – If True, checks for duplicates before adding.

  • statistics (optional) – Precomputed statistics for the subgroup.

  • explicit_result_set_size (int, optional) – Overrides the task’s result_set_size.

Returns:

None
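The heap discipline described in the note can be sketched with the standard heapq module. This is a simplified illustration: the function name, the (quality, subgroup) tuple layout, and the explicit size and threshold arguments are assumptions, and duplicate checks and statistics are omitted.

```python
import heapq

def add_if_required_sketch(result, sg, quality, result_set_size,
                           min_quality=float("-inf")):
    """Keep the best `result_set_size` entries in a min-heap."""
    if quality < min_quality:
        return
    if len(result) < result_set_size:
        heapq.heappush(result, (quality, sg))
    elif quality > result[0][0]:
        # result[0] is always the weakest entry; replace it.
        heapq.heappop(result)
        heapq.heappush(result, (quality, sg))

result = []
for name, q in [("a", 0.3), ("b", 0.9), ("c", 0.1), ("d", 0.5)]:
    add_if_required_sketch(result, name, q, result_set_size=2)
best_first = sorted(result, reverse=True)
```

Since the weakest entry sits at index 0, the threshold a new candidate must beat can be read in constant time once the heap is full.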

pysubgroup.utils.conditional_invert(val, invert)[source]

Conditionally inverts a value based on a boolean flag.

Parameters:
  • val (float) – The value to potentially invert.

  • invert (bool) – If True, the value is inverted.

Returns:

The (possibly inverted) value.

Return type:

float

pysubgroup.utils.count_bits(bitset_as_int)[source]

Counts the number of set bits (1s) in a bitset represented as an integer.

Parameters:

bitset_as_int (int) – The bitset represented as an integer.

Returns:

The number of set bits.

Return type:

int
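A one-line sketch of counting set bits in a Python integer; the function name is an assumption and the library may compute this differently.

```python
def count_bits_sketch(bitset_as_int):
    """Count the 1 bits in the binary form of a non-negative integer."""
    return bin(bitset_as_int).count("1")

n_set = count_bits_sketch(0b10110)
```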

pysubgroup.utils.create_subgroup_with_representation(data, selectors)[source]

Create an object representing the conjunction of the given selectors, including a bitmask indicating which instances in the dataset are covered.

Parameters:
  • data – Dataset on which to evaluate the cover.

  • selectors – List of selectors that form the conjunction.

pysubgroup.utils.derive_effective_sample_size(weights)[source]

Calculates the effective sample size for weighted data.

Parameters:

weights (array-like) – The weights assigned to the samples.

Returns:

The effective sample size.

Return type:

float
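A widely used definition of effective sample size for weighted data is Kish's formula, (sum of weights) squared divided by the sum of squared weights. The sketch below assumes that definition; the source does not state which formula the function uses.

```python
def effective_sample_size_sketch(weights):
    """Kish's effective sample size: (sum w)^2 / sum(w^2)."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq

equal = effective_sample_size_sketch([1, 1, 1, 1])
skewed = effective_sample_size_sketch([1, 0, 0, 0])
```

Equal weights give the plain sample count, while concentrating all weight on one instance gives an effective size of one.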

pysubgroup.utils.equal_frequency_discretization(data, attribute_name, nbins=5, weighting_attribute=None)[source]

Discretizes a numerical attribute into bins with approximately equal frequency.

Parameters:
  • data (DataFrame) – The dataset containing the attribute to discretize.

  • attribute_name (str) – The name of the attribute to discretize.

  • nbins (int) – The number of bins to create.

  • weighting_attribute (str, optional) – An optional attribute to weight the instances.

Returns:

A list of cutpoints defining the bins.

Return type:

list
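A simplified sketch of equal-frequency cutpoints on a plain list of numbers. It ignores weighting and missing values, and the handling of repeated values (dropping duplicate cutpoints) is an assumption rather than the library's documented behavior.

```python
def equal_frequency_cutpoints(values, nbins=5):
    """Pick cutpoints so that each bin holds roughly len(values) / nbins items."""
    sorted_values = sorted(values)
    n = len(sorted_values)
    if n == 0:
        return []
    cutpoints = []
    for i in range(1, nbins):
        val = sorted_values[i * n // nbins]   # value at the i-th quantile position
        if val not in cutpoints:              # skip duplicates from repeated values
            cutpoints.append(val)
    return cutpoints

cuts = equal_frequency_cutpoints(range(1, 11), nbins=5)
```

With heavily repeated values, fewer than nbins - 1 cutpoints can result, so downstream code should not assume a fixed number of bins.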

pysubgroup.utils.find_set_bits(bitset_as_int)[source]

Finds the indices of set bits in a bitset represented as an integer.

Parameters:

bitset_as_int (int) – The bitset represented as an integer.

Yields:

int – The index of each set bit.
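A generator sketch that yields the positions of set bits, starting from the least significant bit. The name and the ascending order are assumptions; the library function may yield indices in a different order.

```python
def find_set_bits_sketch(bitset_as_int):
    """Yield the index of each 1 bit, lowest index first."""
    index = 0
    while bitset_as_int:
        if bitset_as_int & 1:
            yield index
        bitset_as_int >>= 1
        index += 1

indices = list(find_set_bits_sketch(0b10110))
```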

pysubgroup.utils.float_formatter(x, digits=2)[source]

Formats a float to a specified number of decimal places.

Parameters:
  • x (float) – The value to format.

  • digits (int) – The number of decimal places.

Returns:

The formatted string.

Return type:

str

pysubgroup.utils.intersect_of_ordered_list(list_1, list_2)[source]

Computes the intersection of two ordered lists.

Parameters:
  • list_1 (list) – The first ordered list.

  • list_2 (list) – The second ordered list.

Returns:

The intersection of the two lists.

Return type:

list
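For sorted inputs, an intersection can be computed in a single pass with two pointers. The sketch below illustrates that technique under the assumption that both lists are sorted in ascending order; it is not taken from the library source.

```python
def intersect_sorted(list_1, list_2):
    """Intersect two ascending lists in O(len(list_1) + len(list_2))."""
    i, j, result = 0, 0, []
    while i < len(list_1) and j < len(list_2):
        if list_1[i] < list_2[j]:
            i += 1                      # advance the pointer at the smaller value
        elif list_1[i] > list_2[j]:
            j += 1
        else:
            result.append(list_1[i])    # common element found
            i += 1
            j += 1
    return result

common = intersect_sorted([1, 3, 5, 7], [3, 4, 5, 8])
```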

pysubgroup.utils.is_categorical_attribute(data, attribute_name)[source]

Determines if an attribute in the dataset is categorical.

Parameters:
  • data (DataFrame) – The dataset.

  • attribute_name (str) – The name of the attribute.

Returns:

True if the attribute is categorical, False otherwise.

Return type:

bool

pysubgroup.utils.is_numerical_attribute(data, attribute_name)[source]

Determines if an attribute in the dataset is numerical.

Parameters:
  • data (DataFrame) – The dataset.

  • attribute_name (str) – The name of the attribute.

Returns:

True if the attribute is numerical, False otherwise.

Return type:

bool

pysubgroup.utils.minimum_required_quality(result, task)[source]

Determines the minimum quality required for a subgroup to be considered for inclusion in the result set.

Parameters:
  • result (list) – The current list of subgroups (heap).

  • task (SubgroupDiscoveryTask) – The task containing parameters like result_set_size and min_quality.

Returns:

The minimum required quality for a subgroup to be added to the result set.

Return type:

float

pysubgroup.utils.overlap(sg, another_sg, data)[source]

Calculates the Jaccard similarity between two subgroups based on their coverage.

Parameters:
  • sg – The first subgroup.

  • another_sg – The second subgroup.

  • data (DataFrame) – The dataset.

Returns:

The Jaccard similarity between the two subgroups.

Return type:

float
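Jaccard similarity is the size of the intersection of two covers divided by the size of their union. A sketch on sets of covered instance indices follows; returning 0.0 for two empty covers is an assumption made here.

```python
def jaccard(cover_a, cover_b):
    """|A ∩ B| / |A ∪ B| for two sets of covered instance indices."""
    union = cover_a | cover_b
    if not union:
        return 0.0                      # assumption: empty covers count as dissimilar
    return len(cover_a & cover_b) / len(union)

similarity = jaccard({0, 1, 2}, {1, 2, 3})
```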

pysubgroup.utils.perc_formatter(x)[source]

Formats a float as a percentage string with one decimal place.

Parameters:

x (float) – The value to format.

Returns:

The formatted percentage string.

Return type:

str

pysubgroup.utils.powerset(iterable, max_length=None)[source]

Generates the power set (all possible combinations) of an iterable up to a maximum length.

Parameters:
  • iterable (iterable) – The iterable to generate combinations from.

  • max_length (int, optional) – The maximum length of combinations.

Returns:

An iterator over the power set of the iterable.

Return type:

iterator
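A sketch of a length-limited power set built from itertools. Whether the library includes the empty combination, and how it treats a max_length larger than the input, is not stated above; both choices below are assumptions.

```python
from itertools import chain, combinations

def powerset_sketch(iterable, max_length=None):
    """Yield all combinations of size 0..max_length, smallest first."""
    items = list(iterable)
    if max_length is None:
        max_length = len(items)
    max_length = min(max_length, len(items))
    return chain.from_iterable(
        combinations(items, r) for r in range(max_length + 1)
    )

subsets = list(powerset_sketch("abc", max_length=2))
```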

pysubgroup.utils.prepare_subgroup_discovery_result(result, task)[source]

Filters and sorts the result set of subgroups according to the task parameters.

Parameters:
  • result (list) – The list of subgroups (heap).

  • task (SubgroupDiscoveryTask) – The task containing parameters like result_set_size and min_quality.

Returns:

The filtered and sorted list of subgroups.

Return type:

list

pysubgroup.utils.remove_selectors_with_attributes(selector_list, attribute_list)[source]

Removes selectors that are based on specified attributes.

Parameters:
  • selector_list (list) – The list of selectors to filter.

  • attribute_list (list) – The list of attribute names to remove selectors for.

Returns:

The filtered list of selectors.

Return type:

list

pysubgroup.utils.results_df_autoround(df)[source]

Automatically rounds numerical columns in a DataFrame for better readability.

Parameters:

df (DataFrame) – The DataFrame containing the results.

Returns:

The DataFrame with rounded numerical values.

Return type:

DataFrame

pysubgroup.utils.str_to_bool(s)[source]

Converts a string representation of a boolean value to a boolean type.

Parameters:

s (str) – The string to convert (e.g., ‘true’, ‘False’, ‘1’, ‘0’).

Returns:

The boolean value represented by the string.

Return type:

bool

Raises:

ValueError – If the string does not represent a valid boolean value.
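A sketch of the conversion, assuming the accepted spellings are exactly the case-insensitive forms of 'true', 'false', '1' and '0' from the example above. The library may accept more or fewer values.

```python
def str_to_bool_sketch(s):
    """Map 'true'/'1' to True and 'false'/'0' to False, else raise."""
    lowered = s.strip().lower()
    if lowered in ("true", "1"):
        return True
    if lowered in ("false", "0"):
        return False
    raise ValueError(f"'{s}' is not a valid boolean string")

flag = str_to_bool_sketch("False")
```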

pysubgroup.utils.to_bits(list_of_ints)[source]

Converts a list of integers to a bitset represented as an integer.

Parameters:

list_of_ints (list) – The list of integers to convert.

Returns:

The bitset represented as an integer.

Return type:

int
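A sketch of packing bit positions into one integer. The name is an assumption, and using OR (so repeated positions are harmless) is a design choice of this sketch rather than a documented property of the library function.

```python
def to_bits_sketch(list_of_ints):
    """Set bit i for every i in the list and return the resulting integer."""
    value = 0
    for position in list_of_ints:
        value |= 1 << position
    return value

packed = to_bits_sketch([1, 2, 4])     # binary 10110
```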

pysubgroup.visualization module

pysubgroup.visualization.plot_distribution_numeric(sg, target, data, bins, show_dataset=True)[source]
pysubgroup.visualization.plot_npspace(result_df, data, annotate=True, fixed_limits=False)[source]
pysubgroup.visualization.plot_qualities_on_sample_distribution(result, qualities, samples, alpha=0.05, side: Literal['left', 'right'] = 'right', bins=25)[source]

Create plots of the empirical sample distribution as a histogram for each subgroup. Include indicators for the subgroup quality and the quality that the alpha threshold corresponds to.

pysubgroup.visualization.plot_roc(result_df, data, qf=<pysubgroup.binary_target.StandardQF object>, levels=40, annotate=False)[source]
pysubgroup.visualization.plot_sgbars(result_df, *, ylabel='target share', title='Discovered Subgroups', dynamic_widths=False, _suffix='')[source]
pysubgroup.visualization.similarity_dendrogram(result, data)[source]
pysubgroup.visualization.similarity_sgs(sgd_results, data, color=True)[source]
pysubgroup.visualization.supportSetVisualization(result, in_order=True, drop_empty=True)[source]

Module contents