pysubgroup package¶
Submodules¶
pysubgroup.algorithms module¶
Created on 29.04.2016
@author: lemmerfn
- class pysubgroup.algorithms.Apriori(representation_type=None, combination_name='Conjunction', use_numba=True)[source]¶
Bases:
object
Implementation of the Apriori algorithm for subgroup discovery.
This class provides methods to perform level-wise search for subgroups using the Apriori algorithm.
- execute(task)[source]¶
Executes the Apriori algorithm on the given task.
- Parameters:
task – The subgroup discovery task to be executed.
- Returns:
A SubgroupDiscoveryResult containing the discovered subgroups.
- get_next_level(promising_candidates)[source]¶
Generates the next level of candidates based on the current promising candidates.
- Parameters:
promising_candidates – A list of promising candidate selectors.
- Returns:
A list of new candidate selectors for the next level.
- get_next_level_candidates(task, result, next_level_candidates)[source]¶
Evaluates candidates at the current level and filters promising ones for the next level.
- Parameters:
task – The subgroup discovery task.
result – The current list of discovered subgroups.
next_level_candidates – List of subgroups to be evaluated at the current level.
- Returns:
A list of promising candidates (selectors) for the next level.
- get_next_level_candidates_vectorized(task, result, next_level_candidates)[source]¶
Vectorized evaluation of candidates at the current level to filter promising ones for the next level.
- Parameters:
task – The subgroup discovery task.
result – The current list of discovered subgroups.
next_level_candidates – List of subgroups to be evaluated at the current level.
- Returns:
A list of promising candidates (selectors) for the next level.
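The level-wise step described above can be pictured with the classic Apriori join. The following is a generic sketch, not pysubgroup's own code: candidates are modeled as sorted tuples of hypothetical selector names, and `next_level` is an assumed helper name.

```python
from itertools import combinations


def next_level(promising):
    """Join k-item candidates that share a (k-1)-item prefix."""
    promising = sorted(set(promising))
    known = set(promising)
    out = []
    for left, right in combinations(promising, 2):
        if left[:-1] != right[:-1]:
            continue  # only candidates with a common prefix are joined
        candidate = left + (right[-1],)
        # Apriori pruning: every k-item subset must itself be promising
        if all(sub in known for sub in combinations(candidate, len(left))):
            out.append(candidate)
    return out


# hypothetical selector names, for illustration only
level_1 = [("age<30",), ("sex=f",), ("class=1",)]
level_2 = next_level(level_1)
# dropping ("class=1", "sex=f") means the 3-item candidate is pruned
level_3 = next_level([("age<30", "class=1"), ("age<30", "sex=f")])
```

Because candidates are kept as sorted tuples, each conjunction is generated at most once per level.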
- class pysubgroup.algorithms.BeamSearch(beam_width=20, beam_width_adaptive=False)[source]¶
Bases:
object
Implements the Beam Search algorithm for subgroup discovery.
- class pysubgroup.algorithms.BestFirstSearch[source]¶
Bases:
object
Implements the Best-First Search algorithm for subgroup discovery.
- class pysubgroup.algorithms.DFS(apply_representation=None)[source]¶
Bases:
object
Depth-first search with look-ahead for a provided data structure.
- class pysubgroup.algorithms.DFSNumeric[source]¶
Bases:
object
Implements a specialized DFS algorithm for numeric quality functions.
- execute(task)[source]¶
Executes the DFSNumeric algorithm on the given task.
- Parameters:
task – The subgroup discovery task to be executed.
- Returns:
A SubgroupDiscoveryResult containing the discovered subgroups.
- search_internal(task, prefix, modification_set, result, bitset)[source]¶
Recursively searches in a depth-first manner for numeric quality functions.
- Parameters:
task – The subgroup discovery task.
prefix – The current list of selectors in the subgroup description.
modification_set – The remaining selectors to consider.
result – The current list of discovered subgroups.
bitset – The current bitset representing the subgroup.
- Returns:
The updated list of discovered subgroups.
- tpl¶
alias of
size_mean_parameters
- class pysubgroup.algorithms.GeneralisingBFS[source]¶
Bases:
object
Implements a Generalizing Best-First Search algorithm for subgroup discovery.
- class pysubgroup.algorithms.SimpleDFS[source]¶
Bases:
object
Implements a simple Depth-First Search algorithm for subgroup discovery. It is the most elementary (and thus probably slow) algorithm implementation.
- execute(task, use_optimistic_estimates=True)[source]¶
Executes the Simple DFS algorithm on the given task.
- Parameters:
task – The subgroup discovery task to be executed.
use_optimistic_estimates – Whether to use optimistic estimates for pruning.
- Returns:
A SubgroupDiscoveryResult containing the discovered subgroups.
- search_internal(task, prefix, modification_set, result, use_optimistic_estimates)[source]¶
Recursively searches for subgroups in a depth-first manner.
- Parameters:
task – The subgroup discovery task.
prefix – The current list of selectors in the subgroup description.
modification_set – The remaining selectors to consider.
result – The current list of discovered subgroups.
use_optimistic_estimates – Whether to use optimistic estimates for pruning.
- Returns:
The updated list of discovered subgroups.
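The recursion over `prefix` and `modification_set` can be illustrated with a small standalone sketch. It is not pysubgroup's implementation: rows are plain dicts, selectors are hypothetical (name, predicate) pairs, the quality is weighted relative accuracy, and pruning by optimistic estimates is omitted for brevity.

```python
def wracc(rows, cover):
    """Weighted relative accuracy of `cover` with respect to all `rows`."""
    n, n_sg = len(rows), len(cover)
    if n_sg == 0:
        return float("-inf")
    p0 = sum(r["target"] for r in rows) / n
    p_sg = sum(r["target"] for r in cover) / n_sg
    return (n_sg / n) * (p_sg - p0)


def dfs(rows, cover, prefix, remaining, depth, result):
    """Extend `prefix` with each remaining selector, depth-first."""
    if prefix:
        result.append((wracc(rows, cover), [name for name, _ in prefix]))
    if len(prefix) == depth:
        return result
    for i, sel in enumerate(remaining):
        sub_cover = [r for r in cover if sel[1](r)]
        # only later selectors remain, so each conjunction is visited once
        dfs(rows, sub_cover, prefix + [sel], remaining[i + 1:], depth, result)
    return result


# hypothetical toy data
rows = [
    {"sex": "f", "age": 25, "target": 1},
    {"sex": "f", "age": 40, "target": 1},
    {"sex": "m", "age": 25, "target": 0},
    {"sex": "m", "age": 40, "target": 0},
]
selectors = [
    ("sex=f", lambda r: r["sex"] == "f"),
    ("age<30", lambda r: r["age"] < 30),
]
found = sorted(
    dfs(rows, rows, [], selectors, depth=2, result=[]),
    key=lambda item: item[0],
    reverse=True,
)
best_quality, best_description = found[0]
```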
- class pysubgroup.algorithms.SimpleSearch(show_progress=True)[source]¶
Bases:
object
Implements a simple exhaustive search algorithm for subgroup discovery.
- class pysubgroup.algorithms.SubgroupDiscoveryTask(data, target, search_space, qf, result_set_size=10, depth=3, min_quality=-inf, constraints=None)[source]¶
Bases:
object
Encapsulates all parameters required to perform standard subgroup discovery.
- pysubgroup.algorithms.constraints_satisfied(constraints, subgroup, statistics=None, data=None)[source]¶
Checks if all constraints are satisfied for a given subgroup.
- Parameters:
constraints – A list of constraints to check.
subgroup – The subgroup to be evaluated.
statistics – Precomputed statistics for the subgroup (optional).
data – The dataset to be analyzed (optional).
- Returns:
True if all constraints are satisfied, False otherwise.
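As a rough illustration of this check, the sketch below treats a constraint as any object with an `is_satisfied(subgroup, statistics, data)` method, as documented for the classes in the constraints module. `MinSize` and `all_satisfied` are hypothetical names, and the subgroup is modeled as a plain list of row indices.

```python
class MinSize:
    """Hypothetical stand-in for a constraint object."""

    def __init__(self, min_size):
        self.min_size = min_size

    def is_satisfied(self, subgroup, statistics=None, data=None):
        # here a "subgroup" is just the list of covered row indices
        return len(subgroup) >= self.min_size


def all_satisfied(constraints, subgroup, statistics=None, data=None):
    """True only if every constraint accepts the subgroup."""
    return all(c.is_satisfied(subgroup, statistics, data) for c in constraints)


ok = all_satisfied([MinSize(2), MinSize(3)], subgroup=[0, 4, 7])
too_small = all_satisfied([MinSize(2), MinSize(5)], subgroup=[0, 4, 7])
```

With an empty constraint list the check passes trivially, since `all` of an empty sequence is True.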
pysubgroup.binary_target module¶
Created on 29.09.2017
@author: lemmerfn
- class pysubgroup.binary_target.BinaryTarget(target_attribute=None, target_value=None, target_selector=None)[source]¶
Bases:
BaseTarget
Binary target for classic subgroup discovery with boolean targets.
Stores the target attribute and value, and computes various statistics related to the target within a subgroup.
- calculate_statistics(subgroup, data, cached_statistics=None)[source]¶
Calculate various statistics for the subgroup.
- covers(instance)[source]¶
Determine whether the target selector covers the given instance.
- Parameters:
instance (pandas DataFrame) – The data instance to check.
- Returns:
Boolean array indicating coverage.
- Return type:
- get_attributes()[source]¶
Get the attribute names used in the target.
- Returns:
A tuple containing the attribute name.
- Return type:
- get_base_statistics(subgroup, data)[source]¶
Compute basic statistics for the target within the subgroup and dataset.
- Parameters:
subgroup – The subgroup for which to compute statistics.
data (pandas DataFrame) – The dataset.
- Returns:
Contains instances_dataset, positives_dataset, instances_subgroup, positives_subgroup.
- Return type:
- statistic_types = ('size_sg', 'size_dataset', 'positives_sg', 'positives_dataset', 'size_complement', 'relative_size_sg', 'relative_size_complement', 'coverage_sg', 'coverage_complement', 'target_share_sg', 'target_share_complement', 'target_share_dataset', 'lift')¶
- class pysubgroup.binary_target.ChiSquaredQF(direction='both', min_instances=5, stat='chi2')[source]¶
Bases:
SimplePositivesQF
ChiSquaredQF tests for statistical independence of a subgroup against its complement.
Calculates the chi-squared statistic or p-value to measure the significance of the difference between the subgroup and the dataset.
- static chi_squared_qf(instances_dataset, positives_dataset, instances_subgroup, positives_subgroup, min_instances=5, bidirect=True, direction_positive=True, index=0)[source]¶
Perform chi-squared test of statistical independence.
Tests whether a subgroup is statistically independent from its complement (see scipy.stats.chi2_contingency).
- Parameters:
instances_dataset (int) – Total number of instances in the dataset.
positives_dataset (int) – Total number of positive instances in the dataset.
instances_subgroup (int) – Number of instances in the subgroup.
positives_subgroup (int) – Number of positive instances in the subgroup.
min_instances (int, optional) – Minimum number of required instances; returns -inf if there are fewer.
bidirect (bool, optional) – If True, both directions are considered interesting.
direction_positive (bool, optional) – If bidirect is False, specifies the direction.
index (int, optional) – Whether to return the statistic (0) or the p-value (1).
- Returns:
Chi-squared statistic or p-value, depending on the index parameter.
- Return type:
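To make the four count parameters concrete, the sketch below builds the 2x2 contingency table (subgroup vs. complement, positives vs. negatives) and computes the uncorrected Pearson chi-squared statistic by hand. It is only an illustration under that assumption: the documented method delegates to scipy.stats.chi2_contingency, and `chi2_2x2` is a hypothetical helper name.

```python
def chi2_2x2(instances_dataset, positives_dataset,
             instances_subgroup, positives_subgroup):
    """Pearson chi-squared statistic for subgroup vs. complement (no correction)."""
    a = positives_subgroup                       # positives in subgroup
    b = instances_subgroup - positives_subgroup  # negatives in subgroup
    c = positives_dataset - positives_subgroup   # positives in complement
    d = (instances_dataset - instances_subgroup) - c  # negatives in complement
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return float("nan")  # a row or column of the table is empty
    return n * (a * d - b * c) ** 2 / denom


# 100 instances with 50 positives; the subgroup covers 40 instances, 30 positive
stat = chi2_2x2(100, 50, 40, 30)
```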
- static chi_squared_qf_weighted(subgroup, data, weighting_attribute, effective_sample_size=0, min_instances=5)[source]¶
Perform chi-squared test for weighted data.
- Parameters:
- Returns:
The p-value from the chi-squared test.
- Return type:
- evaluate(subgroup, target, data, statistics=None)[source]¶
Evaluate the quality of the subgroup using the chi-squared test.
- Parameters:
subgroup – The subgroup to evaluate.
target (BinaryTarget) – The target definition.
data (pandas DataFrame) – The dataset.
statistics (any, optional) – Unused in this implementation.
- Returns:
The chi-squared statistic or p-value.
- Return type:
- class pysubgroup.binary_target.GeneralizationAware_StandardQF(a, optimistic_estimate_strategy='default')[source]¶
Bases:
GeneralizationAwareQF_stats, BoundedInterestingnessMeasure
Generalization-Aware Standard Quality Function.
Extends the StandardQF to consider generalizations during subgroup discovery, providing methods for optimistic estimates and aggregate statistics.
- difference_based_agg_function(stats_subgroup, list_of_pairs)[source]¶
Aggregate statistics using the difference-based strategy.
- Parameters:
stats_subgroup – Statistics of the current subgroup.
list_of_pairs – List of (stats, agg_tuple) for all generalizations.
- Returns:
Aggregate statistics tuple.
- Return type:
namedtuple
- difference_based_optimistic_estimate(subgroup, target, data, statistics)[source]¶
Compute the optimistic estimate using the difference-based strategy.
- Parameters:
subgroup – The subgroup for which to compute the estimate.
target (BinaryTarget) – The target definition.
data (pandas DataFrame) – The dataset.
statistics (any) – Current statistics.
- Returns:
The optimistic estimate of the quality value.
- Return type:
- difference_based_read_p(agg_tuple)[source]¶
Read p (the share of positives) from the aggregate tuple using the difference-based strategy.
- Parameters:
agg_tuple – The aggregate statistics tuple.
- Returns:
The maximum percentage of positives.
- Return type:
- evaluate(subgroup, target, data, statistics=None)[source]¶
Evaluate the quality of the subgroup considering generalizations.
- Parameters:
subgroup – The subgroup to evaluate.
target (BinaryTarget) – The target definition.
data (pandas DataFrame) – The dataset.
statistics (any, optional) – Unused in this implementation.
- Returns:
The computed quality value.
- Return type:
- class ga_sQF_agg_tuple(max_p, min_delta_negatives, min_negatives)¶
Bases:
tuple
- max_p¶
Alias for field number 0
- min_delta_negatives¶
Alias for field number 1
- min_negatives¶
Alias for field number 2
- max_based_aggregate_statistics(stats_subgroup, list_of_pairs)[source]¶
Aggregate statistics using the maximum-based strategy.
- Parameters:
stats_subgroup – Statistics of the current subgroup.
list_of_pairs – List of (stats, agg_tuple) for all generalizations.
- Returns:
The aggregated statistics.
- max_based_optimistic_estimate(subgroup, target, data, statistics=None)[source]¶
Compute the optimistic estimate using the maximum-based strategy.
- Parameters:
subgroup – The subgroup for which to compute the estimate.
target (BinaryTarget) – The target definition.
data (pandas DataFrame) – The dataset.
statistics (any, optional) – Unused in this implementation.
- Returns:
The optimistic estimate of the quality value.
- Return type:
- class pysubgroup.binary_target.LiftQF[source]¶
Bases:
StandardQF
Lift Quality Function.
LiftQF is a StandardQF with a=0. Thus it treats the difference in ratios as the quality without caring about the relative size of a subgroup.
- class pysubgroup.binary_target.SimpleBinomialQF[source]¶
Bases:
StandardQF
Simple Binomial Quality Function.
SimpleBinomialQF is a StandardQF with a=0.5. It is an order-equivalent approximation of the full binomial test if the subgroup size is much smaller than the size of the entire dataset.
- class pysubgroup.binary_target.SimplePositivesQF[source]¶
Bases:
AbstractInterestingnessMeasure
Quality function for binary targets based on positive instances.
- calculate_constant_statistics(data, target)[source]¶
Calculate statistics that remain constant for the dataset.
- Parameters:
data (pandas DataFrame) – The dataset.
target (BinaryTarget) – The target definition.
- Raises:
AssertionError – If the target is not an instance of BinaryTarget.
- calculate_statistics(subgroup, target, data, statistics=None)[source]¶
Calculate statistics specific to the subgroup.
- Parameters:
subgroup – The subgroup for which to calculate statistics.
target (BinaryTarget) – The target definition.
data (pandas DataFrame) – The dataset.
statistics (any, optional) – Unused in this implementation.
- Returns:
Contains size_sg and positives_count for the subgroup.
- Return type:
namedtuple
- gp_get_null_vector()[source]¶
Get a null vector for initialization in GP-Growth algorithms.
- Returns:
Zero-initialized array of size 2.
- Return type:
- gp_get_params(_cover_arr, v)[source]¶
Extract parameters from the statistics vector.
- Parameters:
_cover_arr – Unused parameter.
v (numpy.ndarray) – Statistics vector.
- Returns:
Contains size_sg and positives_count.
- Return type:
namedtuple
- gp_get_stats(row_index)[source]¶
Get statistics for a single row (used in GP-Growth algorithms).
- Parameters:
row_index (int) – The index of the row.
- Returns:
Array containing [1, positives[row_index]].
- Return type:
- gp_merge(left, right)[source]¶
Merge two statistics vectors by summing them.
- Parameters:
left (numpy.ndarray) – Left statistics vector.
right (numpy.ndarray) – Right statistics vector.
- property gp_requires_cover_arr¶
Indicate whether the GP-Growth algorithm requires a cover array.
- Returns:
False, since cover array is not required.
- Return type:
bool
- gp_size_sg(stats)[source]¶
Get the size of the subgroup from the statistics.
- Parameters:
stats (numpy.ndarray) – Statistics vector.
- Returns:
Size of the subgroup.
- Return type:
- gp_to_str(stats)[source]¶
Convert statistics to a string representation.
- Parameters:
stats (numpy.ndarray) – Statistics vector.
- Returns:
String representation of the statistics.
- Return type:
str
- tpl¶
alias of
PositivesQF_parameters
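The gp_* methods above describe a small protocol: each row contributes a statistics vector [1, positive?], and vectors are merged by summation. A minimal sketch of that idea follows, using plain Python lists instead of numpy arrays; the `positives` data and the helper names are hypothetical.

```python
positives = [1, 0, 1, 1]  # hypothetical binary target, one entry per row

null_vector = [0, 0]  # [size_sg, positives_count] of an empty subgroup


def get_stats(row_index):
    """Statistics vector contributed by a single row."""
    return [1, positives[row_index]]


def merge(left, right):
    """Elementwise sum of two statistics vectors."""
    return [l + r for l, r in zip(left, right)]


# aggregate the rows covered by some subgroup, here rows 0, 2 and 3
total = null_vector
for i in (0, 2, 3):
    total = merge(total, get_stats(i))
size_sg, positives_count = total
```

Because summation is associative, partial aggregates can be combined in any order, which is what lets a tree-based algorithm accumulate statistics per node.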
- class pysubgroup.binary_target.StandardQF(a)[source]¶
Bases:
SimplePositivesQF, BoundedInterestingnessMeasure
StandardQF, which weights the relative size against the difference in averages.
The StandardQF is a general form of quality function which, for different values of ‘a’, is order-equivalent to many popular quality measures.
- evaluate(subgroup, target, data, statistics=None)[source]¶
Evaluate the quality of the subgroup using the standard quality function.
- Parameters:
subgroup – The subgroup to evaluate.
target (BinaryTarget) – The target definition.
data (pandas DataFrame) – The dataset.
statistics (any, optional) – Unused in this implementation.
- Returns:
The computed quality value.
- Return type:
- optimistic_estimate(subgroup, target, data, statistics=None)[source]¶
Compute the optimistic estimate of the quality function.
- Parameters:
subgroup – The subgroup for which to compute the optimistic estimate.
target (BinaryTarget) – The target definition.
data (pandas DataFrame) – The dataset.
statistics (any, optional) – Unused in this implementation.
- Returns:
The optimistic estimate of the quality value.
- Return type:
- optimistic_generalisation(subgroup, target, data, statistics=None)[source]¶
Compute the optimistic generalization of the quality function.
- Parameters:
subgroup – The subgroup for which to compute the optimistic generalization.
target (BinaryTarget) – The target definition.
data (pandas DataFrame) – The dataset.
statistics (any, optional) – Unused in this implementation.
- Returns:
The optimistic generalization of the quality value.
- Return type:
- static standard_qf(a, instances_dataset, positives_dataset, instances_subgroup, positives_subgroup)[source]¶
Compute the standard quality function.
- Parameters:
a (float) – Exponent that trades off the relative size against the difference in means.
instances_dataset (int) – Total number of instances in the dataset.
positives_dataset (int) – Total number of positive instances in the dataset.
instances_subgroup (int) – Number of instances in the subgroup.
positives_subgroup (int) – Number of positive instances in the subgroup.
- Returns:
The computed quality value.
- Return type:
- class pysubgroup.binary_target.WRAccQF[source]¶
Bases:
StandardQF
Weighted Relative Accuracy Quality Function.
WRAccQF is a StandardQF with a=1. It is order-equivalent to the difference in the observed and expected number of positive instances.
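The relationship between StandardQF, LiftQF (a=0), SimpleBinomialQF (a=0.5) and WRAccQF (a=1) can be sketched with one formula, relative size raised to the power a times the difference in target shares. The function below is a standalone sketch consistent with the descriptions above, not a copy of the library code; returning NaN for an empty subgroup is an assumption.

```python
def standard_qf(a, instances_dataset, positives_dataset,
                instances_subgroup, positives_subgroup):
    """(relative subgroup size) ** a * (target share in subgroup - in dataset)."""
    if instances_subgroup == 0:
        return float("nan")  # assumption: quality is undefined for an empty subgroup
    p_subgroup = positives_subgroup / instances_subgroup
    p_dataset = positives_dataset / instances_dataset
    relative_size = instances_subgroup / instances_dataset
    return relative_size ** a * (p_subgroup - p_dataset)


# 1000 instances with 300 positives; the subgroup covers 100 instances, 60 positive
counts = (1000, 300, 100, 60)
lift_like = standard_qf(0.0, *counts)  # a=0: size is ignored
binomial = standard_qf(0.5, *counts)   # a=0.5: square root of the relative size
wracc = standard_qf(1.0, *counts)      # a=1: weighted relative accuracy
```

Multiplying the a=1 value by the dataset size gives the observed minus the expected number of positives in the subgroup (60 observed versus 30 expected here), which is the order-equivalence mentioned for WRAccQF.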
pysubgroup.constraints module¶
- class pysubgroup.constraints.ContainsValueConstraint(attribute_name, value)[source]¶
Bases:
object
A constraint that ensures a subgroup contains in its cover at least one instance that has a specified value for a specified attribute.
- attribute_name¶
The attribute that needs to contain the specified value in at least one instance.
- value¶
The value that needs to be present in the specified attribute in at least one instance.
- property is_monotone¶
Indicates whether the constraint is monotone.
- Returns:
True if the constraint is monotone, False otherwise.
- Return type:
bool
- is_satisfied(subgroup, statistics=None, data=None)[source]¶
Checks if the subgroup satisfies the constraint.
- Parameters:
subgroup – The subgroup to be evaluated.
statistics – Precomputed statistics for the subgroup (optional).
data – The dataset being analyzed (optional).
- Returns:
True if the subgroup’s cover contains at least one instance that has the specified value for the specified attribute (as defined during object construction), False otherwise.
- Return type:
bool
- class pysubgroup.constraints.MinSupportConstraint(min_support)[source]¶
Bases:
object
A constraint that ensures a subgroup has at least a minimum support.
- gp_is_satisfied(node)[source]¶
Checks if a node satisfies the constraint in the GP-Growth algorithm.
- Parameters:
node – The node to be evaluated.
- Returns:
True if the node’s size is at least the minimum support, False otherwise.
- Return type:
bool
- gp_prepare(qf)[source]¶
Prepares the constraint for the GP-Growth algorithm by accessing the size function.
- Parameters:
qf – The quality function used in the GP-Growth algorithm.
- property is_monotone¶
Indicates whether the constraint is monotone.
- Returns:
True if the constraint is monotone, False otherwise.
- Return type:
bool
- is_satisfied(subgroup, statistics=None, data=None)[source]¶
Checks if the subgroup satisfies the minimum support constraint.
- Parameters:
subgroup – The subgroup to be evaluated.
statistics – Precomputed statistics for the subgroup (optional).
data – The dataset being analyzed (optional).
- Returns:
True if the subgroup’s size is at least the minimum support, False otherwise.
- Return type:
bool
- class pysubgroup.constraints.MinUniqueValuesConstraint(attribute_name, min_unique_values)[source]¶
Bases:
object
A constraint that ensures a subgroup contains in its cover a minimum number of unique values for a specified attribute.
- attribute_name¶
The attribute that needs to contain at least the specified number of values.
- min_unique_values¶
The minimum number of unique values that must be present in the attribute in a subgroup cover.
- property is_monotone¶
Indicates whether the constraint is monotone.
- Returns:
True if the constraint is monotone, False otherwise.
- Return type:
bool
- is_satisfied(subgroup, statistics=None, data=None)[source]¶
Checks if the subgroup satisfies the constraint.
- Parameters:
subgroup – The subgroup to be evaluated.
statistics – Precomputed statistics for the subgroup (optional).
data – The dataset being analyzed (optional).
- Returns:
True if the subgroup’s cover contains the minimum number of unique values for the specified attribute (as defined during object construction), False otherwise.
- Return type:
bool
pysubgroup.datasets module¶
This module provides functions to load example datasets for testing and demonstration purposes. The datasets included are the German Credit Data and the Titanic dataset.
- pysubgroup.datasets.get_credit_data()[source]¶
Load the German Credit Data dataset.
The dataset is provided in ARFF format and includes various attributes related to creditworthiness.
- Returns:
A DataFrame containing the credit data.
- Return type:
pandas DataFrame
pysubgroup.fi_target module¶
Created on 29.09.2017
@author: lemmerfn
This module defines the FITarget and related quality functions for frequent itemset mining using the pysubgroup package.
- class pysubgroup.fi_target.AreaQF[source]¶
Bases:
SimpleCountQF
Quality function that evaluates subgroups based on their area.
The area is computed as the size of the subgroup multiplied by the number of contained items.
- evaluate(subgroup, target, data, statistics=None)[source]¶
Evaluate the quality of the subgroup.
- Parameters:
subgroup – The subgroup to evaluate.
target – The target definition.
data – The dataset.
statistics (any, optional) – Previously computed statistics.
- Returns:
The area of the subgroup (size_sg * depth).
- Return type:
- class pysubgroup.fi_target.CountQF[source]¶
Bases:
SimpleCountQF, BoundedInterestingnessMeasure
Quality function that evaluates subgroups based on their size.
Extends SimpleCountQF and BoundedInterestingnessMeasure.
- evaluate(subgroup, target, data, statistics=None)[source]¶
Evaluate the quality of the subgroup.
- Parameters:
subgroup – The subgroup to evaluate.
target – The target definition.
data – The dataset.
statistics (any, optional) – Previously computed statistics.
- Returns:
The size of the subgroup.
- Return type:
- optimistic_estimate(subgroup, target, data, statistics=None)[source]¶
Compute the optimistic estimate of the quality function.
- Parameters:
subgroup – The subgroup for which to compute the optimistic estimate.
target – The target definition.
data – The dataset.
statistics (any, optional) – Previously computed statistics.
- Returns:
The size of the subgroup.
- Return type:
- class pysubgroup.fi_target.FITarget[source]¶
Bases:
BaseTarget
Target class for frequent itemset mining.
Represents the target for mining frequent itemsets, extending the BaseTarget class from pysubgroup.
- calculate_statistics(subgroup_description, data, cached_statistics=None)[source]¶
Calculate statistics for the subgroup.
- get_base_statistics(subgroup, data)[source]¶
Compute the base statistics for the subgroup.
- Parameters:
subgroup – The subgroup for which to compute statistics.
data – The dataset.
- Returns:
The size of the subgroup.
- Return type:
- statistic_types = ('size_sg', 'size_dataset')¶
- class pysubgroup.fi_target.SimpleCountQF[source]¶
Bases:
AbstractInterestingnessMeasure
Quality function that counts the number of instances in a subgroup.
Provides basic counting functionality, useful for frequent itemset mining.
- calculate_constant_statistics(data, target)[source]¶
Calculate statistics that remain constant for the dataset.
- Parameters:
data – The dataset.
target – The target definition (unused in this implementation).
- calculate_statistics(subgroup_description, target, data, statistics=None)[source]¶
Calculate statistics specific to the subgroup.
- Parameters:
subgroup_description – The description of the subgroup.
target – The target definition (unused in this implementation).
data – The dataset.
statistics (any, optional) – Unused in this implementation.
- Returns:
Contains ‘size_sg’ for the subgroup.
- Return type:
namedtuple
- gp_get_null_vector()[source]¶
Get a null vector for initialization in GP-Growth algorithms.
- Returns:
A dictionary with ‘size_sg’ set to 0.
- Return type:
dict
- gp_get_params(_cover_arr, v)[source]¶
Extract parameters from the statistics dictionary.
- Parameters:
_cover_arr – Unused parameter.
v (dict) – Statistics dictionary.
- Returns:
Contains ‘size_sg’ from the statistics.
- Return type:
namedtuple
- gp_get_stats(_)[source]¶
Get statistics for a single instance (used in GP-Growth algorithms).
- Returns:
A dictionary with ‘size_sg’ set to 1.
- Return type:
dict
- gp_requires_cover_arr = False¶
- tpl¶
alias of
CountQF_parameters
pysubgroup.gp_growth module¶
- class pysubgroup.gp_growth.GpGrowth(mode='b_u')[source]¶
Bases:
object
Implementation of the GP-Growth algorithm.
GP-Growth is a generalization of FP-Growth and SD-Map capable of working with different Exceptional Model Mining targets on top of Frequent Itemset Mining and Subgroup Discovery.
This class provides methods to perform pattern mining using GP-Growth, supporting both bottom-up (‘b_u’) and top-down (‘t_d’) modes.
- GP_node¶
Structure representing a node in the GP-tree.
- Type:
namedtuple
- tqdm¶
Function for progress bars (default is identity function).
- Type:
function
- task¶
The subgroup discovery task to execute.
- Type:
SubgroupDiscoveryTask
- add_if_required(prefix, gp_stats)[source]¶
Adds a pattern to the result set if it meets the quality threshold.
- Parameters:
prefix (tuple) – The current pattern (tuple of class indices).
gp_stats – The aggregated statistics for the pattern.
- calculate_quality_function_for_patterns(task, results, arrs)[source]¶
Calculates the quality function for the given patterns.
- Parameters:
task (SubgroupDiscoveryTask) – The task containing the quality function.
results (list) – List of patterns with their aggregated parameters.
arrs (ndarray) – The coverage arrays of the selectors.
- Returns:
A list of tuples containing quality, indices, and statistics.
- Return type:
- check_constraints(node)[source]¶
Checks if a node satisfies all monotonic constraints.
- Parameters:
node – The node to check.
- Returns:
True if the node satisfies all constraints, False otherwise.
- Return type:
bool
- check_tree_is_ordered(root, prefix=None)[source]¶
Verifies that the nodes of a tree are sorted in ascending order.
- convert_results_to_subgroups(results, selectors_sorted)[source]¶
Converts patterns (indices) to actual subgroups.
- create_copy_of_path(nodes, new_nodes, stats)[source]¶
Creates a copy of a path in the tree, updating statistics.
- create_copy_of_tree_top_down(from_root, nodes=None, parent=None, is_valid_class=None)[source]¶
Creates a copy of the tree starting from a specific root in top-down mode.
- create_initial_tree(arrs)[source]¶
Creates the initial FP-tree from the coverage arrays.
- Parameters:
arrs (ndarray) – A 2D NumPy array where each column corresponds to the coverage array of a selector.
- Returns:
- A tuple containing:
root (GP_node): The root node of the tree.
nodes (list): A list of all nodes in the tree.
- Return type:
tuple
- create_new_tree_from_nodes(nodes)[source]¶
Creates a new tree from a list of nodes for recursive mining.
- Parameters:
nodes (list) – List of nodes to build the new tree from.
- Returns:
A dictionary mapping class labels to nodes in the new tree.
- Return type:
defaultdict
- execute(task)[source]¶
Executes the GP-Growth algorithm on the given task.
- Parameters:
task (SubgroupDiscoveryTask) – The subgroup discovery task to execute.
- Returns:
The result of the subgroup discovery.
- Return type:
- get_nodes_upwards(node)[source]¶
Retrieves all nodes from a given node up to the root.
- Parameters:
node (GP_node) – The starting node.
- Returns:
A list of nodes from the given node up to the root.
- Return type:
- get_stats_for_class(cls_nodes)[source]¶
Aggregates statistics for each class label.
- Parameters:
cls_nodes (defaultdict) – Dictionary mapping class labels to nodes.
- Returns:
A dictionary mapping class labels to aggregated statistics.
- Return type:
- get_top_down_tree_for_class(cls_nodes, cls, is_valid_class)[source]¶
Creates a subtree for a specific class in top-down mode.
- Parameters:
- Returns:
- A tuple containing:
base_root (GP_node): The root of the new subtree.
nodes (list): A list of nodes in the new subtree.
- Return type:
tuple
- merge_trees_top_down(nodes, mutable_root, from_root, is_valid_class)[source]¶
Merges two trees in top-down mode.
- nodes_to_cls_nodes(nodes)[source]¶
Groups nodes by their class labels.
- Parameters:
nodes (list) – List of nodes to group.
- Returns:
A dictionary mapping class labels to lists of nodes.
- Return type:
defaultdict
- normal_insert(root, nodes, new_stats, classes)[source]¶
Inserts a transaction into the FP-tree.
- Parameters:
root (GP_node) – The root node of the tree.
nodes (list) – List of all nodes in the tree.
new_stats – The statistics associated with the transaction.
classes (array-like) – The class labels (selectors) present in the transaction.
- Returns:
The leaf node where the transaction ends.
- Return type:
GP_node
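Inserting a transaction into a prefix tree, as described for normal_insert, can be sketched as follows. This is an illustration only: `Node` is a hypothetical stand-in for GP_node, statistics are plain lists, and adding the new statistics only at the final node of the path is an assumption of this sketch (classic FP-Growth instead increments counts along the whole path).

```python
class Node:
    """Hypothetical stand-in for a GP-tree node."""

    def __init__(self, cls, parent=None):
        self.cls = cls
        self.parent = parent
        self.children = {}
        self.stats = [0, 0]


def insert(root, nodes, new_stats, classes):
    """Walk (and extend) the path given by `classes`; return the final node."""
    node = root
    for cls in classes:
        child = node.children.get(cls)
        if child is None:
            child = Node(cls, parent=node)
            node.children[cls] = child
            nodes.append(child)  # keep a flat list of all created nodes
        node = child
    # assumption: statistics are merged only at the node where the path ends
    node.stats = [s + t for s, t in zip(node.stats, new_stats)]
    return node


root, nodes = Node(cls=-1), []
insert(root, nodes, [1, 1], [0, 2])
insert(root, nodes, [1, 0], [0, 1])
leaf = insert(root, nodes, [1, 1], [0, 2])  # same path as the first transaction
```

Transactions that share a prefix of selectors share the corresponding nodes, so the tree stays compact when many rows are covered by the same selectors.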
- prepare_selectors(search_space, data)[source]¶
Prepares the selectors by computing their coverage arrays and filtering based on constraints.
- Parameters:
search_space (list) – The list of selectors to consider.
data (DataFrame) – The dataset to be analyzed.
- Returns:
- A tuple containing:
selectors_sorted (list): The sorted list of selectors after filtering.
arrs (ndarray): A 2D NumPy array where each column corresponds to the coverage array of a selector.
- Return type:
tuple
- recurse(cls_nodes, prefix, is_single_path=False)[source]¶
Recursively mines patterns in bottom-up mode.
- remove_selectors_with_low_optimistic_estimate(s, search_space_size)[source]¶
Removes selectors from the list that have an optimistic estimate below the minimum required quality.
- setup(task)[source]¶
Prepares the algorithm by setting up the task, depth, constraints, and quality function.
- Parameters:
task (SubgroupDiscoveryTask) – The task to execute.
- setup_constraints(constraints, qf)[source]¶
Prepares constraints for use in the algorithm.
- Parameters:
constraints (list) – List of constraints to apply.
qf – The quality function used in the task.
- setup_from_quality_function(qf)[source]¶
Sets up function pointers from the quality function.
- Parameters:
qf – The quality function used in the task.
- to_file(task, path)[source]¶
Writes the tree to a file in a specific format.
- Parameters:
task (SubgroupDiscoveryTask) – The task containing the quality function.
path (str or Path) – The file path to write to.
pysubgroup.measures module¶
Created on 28.04.2016
@author: lemmerfn
- class pysubgroup.measures.CombinedInterestingnessMeasure(measures, weights=None)[source]¶
- class pysubgroup.measures.GeneralizationAwareQF(qf)[source]¶
Bases:
AbstractInterestingnessMeasure
A class that computes the generalization-aware quality function as follows: qf_ga(sg) = qf(sg) - max_{sg' in generalizations(sg)} qf(sg')
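The generalization-aware quality can be illustrated with a standalone sketch: the quality of a description minus the best quality among its generalizations. Descriptions are modeled as tuples of hypothetical selector names, and treating every proper sub-conjunction (including the empty description) as a generalization is an assumption of this sketch.

```python
from itertools import combinations


def generalizations(description):
    """All proper sub-conjunctions of a non-empty description."""
    for k in range(len(description)):
        yield from combinations(description, k)


def generalization_aware(qf, description):
    """qf(description) minus the best quality among its generalizations."""
    return qf(description) - max(qf(g) for g in generalizations(description))


# hypothetical base qualities, keyed by description
base_quality = {(): 0.0, ("a",): 0.20, ("b",): 0.05, ("a", "b"): 0.22}
ga = generalization_aware(lambda d: base_quality[tuple(d)], ("a", "b"))
```

Here ("a", "b") scores 0.22 on its own but only about 0.02 once the quality of its best generalization ("a",) is subtracted, so the added selector contributes little.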
- class pysubgroup.measures.GeneralizationAwareQF_stats(qf)[source]¶
Bases:
AbstractInterestingnessMeasure
An abstract base class that implements aggregation of stats of generalisations.
- ga_tuple¶
alias of
ga_stats_tuple
pysubgroup.model_predictions_target module¶
Created on 16.08.2025
@author: Tom Siegl
- class pysubgroup.model_predictions_target.ARLQF(label_column: str, positive_label_value: any, subgroup_class_balance_weight: float = 0, subgroup_size_weight: float = 0)[source]¶
Bases:
BaseSoftClassifierPerformanceQF
A quality function which scores binary soft classifier performance in a subgroup based on the difference of the classifier’s average ranking loss (ARL) on the subgroup cover vs. the entire dataset. If the classifier performs worse on the subgroup (i.e. it has a greater ARL) compared to the entire dataset, then the quality is positive.
Weighting factors are provided to let the subgroup size and class balance influence the quality.
The overall quality is captured by the formula q = (ARL(subgroup) - ARL(dataset)) * |subgroup|^(size_weight) * class_balance(subgroup)^(class_balance_weight).
Implementation of phi^{rasl}_{alpha, beta} from the paper “SubROC: AUC-Based Discovery of Exceptional Subgroup Performance for Binary Classifiers” (https://doi.org/10.48550/arXiv.2505.11283).
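The ARL quality formula stated above can be written out as a small function. This is a sketch of the formula only, not the library's implementation: how ARL and the class balance are computed is not shown here, so both are passed in as plain numbers, and `weighted_quality` is a hypothetical name.

```python
def weighted_quality(measure_subgroup, measure_dataset, size_subgroup,
                     class_balance_subgroup, size_weight=0.0,
                     class_balance_weight=0.0):
    """(loss on subgroup - loss on dataset), weighted by size and class balance."""
    difference = measure_subgroup - measure_dataset
    return (difference
            * size_subgroup ** size_weight
            * class_balance_subgroup ** class_balance_weight)


# hypothetical values: ARL 0.30 on the subgroup vs. 0.20 on the dataset
q_plain = weighted_quality(0.30, 0.20, size_subgroup=100,
                           class_balance_subgroup=0.5)
q_sized = weighted_quality(0.30, 0.20, size_subgroup=100,
                           class_balance_subgroup=0.5, size_weight=0.5)
```

With both weights at their default of 0 the quality reduces to the plain ARL difference; a positive size weight favors larger subgroups with the same difference.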
- class pysubgroup.model_predictions_target.BaseSoftClassifierPerformanceQF(performance_measure, performance_measure_type: Literal['score', 'loss'], performance_measure_bound=None, performance_measure_constraints: list[any] = [], subgroup_class_balance_weight: float = 0, subgroup_size_weight: float = 0)[source]¶
Bases:
BoundedInterestingnessMeasure
- calculate_constant_statistics(data: DataFrame, target: SoftClassifierTarget)[source]¶
Called once per search execution. It should perform any preparation that is necessary before the search starts.
- calculate_statistics(subgroup, target: SoftClassifierTarget, data: DataFrame, statistics={})[source]¶
Calculates the statistics needed for a subgroup. The resulting statistics object is passed on to the evaluate and optimistic_estimate functions.
- evaluate(subgroup, target: SoftClassifierTarget, data: DataFrame, statistics=None)[source]¶
Returns the quality calculated from the statistics.
- optimistic_estimate(subgroup, target: SoftClassifierTarget, data: DataFrame, statistics=None)[source]¶
Returns an optimistic estimate of the quality if one is available, otherwise infinity.
- class pysubgroup.model_predictions_target.PRAUCQF(label_column: str, positive_label_value: any, subgroup_class_balance_weight: float = 0, subgroup_size_weight: float = 0)[source]¶
Bases:
BaseSoftClassifierPerformanceQF
A quality function which scores binary soft classifier performance in a subgroup based on the difference of the classifier’s Area Under the Precision-Recall Curve (PR AUC) on the subgroup cover vs. the entire dataset. If the classifier performs worse on the subgroup (i.e. it has a lower PR AUC) compared to the entire dataset, then the quality is positive.
Weighting factors are provided to let the subgroup size and class balance influence the quality.
The overall quality is captured by the formula q = (PRAUC(dataset) - PRAUC(subgroup)) * |subgroup|^(size_weight) * class_balance(subgroup)^(class_balance_weight).
Implementation of phi^{rPRAUC}_{alpha, beta} from the paper [“SubROC: AUC-Based Discovery of Exceptional Subgroup Performance for Binary Classifiers”](https://doi.org/10.48550/arXiv.2505.11283).
- class pysubgroup.model_predictions_target.ROCAUCQF(label_column: str, subgroup_class_balance_weight: float = 0, subgroup_size_weight: float = 0)[source]¶
Bases:
BaseSoftClassifierPerformanceQF
A quality function which scores binary soft classifier performance in a subgroup based on the difference of the classifier’s Area Under the Receiver Operating Characteristic Curve (ROC AUC) on the subgroup cover vs. the entire dataset. If the classifier performs worse on the subgroup (i.e. it has a lower ROC AUC) compared to the entire dataset, then the quality is positive.
Weighting factors are provided to let the subgroup size and class balance influence the quality.
The overall quality is captured by the formula q = (ROCAUC(dataset) - ROCAUC(subgroup)) * |subgroup|^(size_weight) * class_balance(subgroup)^(class_balance_weight).
Implementation of phi^{rROCAUC}_{alpha, beta} from the paper [“SubROC: AUC-Based Discovery of Exceptional Subgroup Performance for Binary Classifiers”](https://doi.org/10.48550/arXiv.2505.11283).
- class pysubgroup.model_predictions_target.SoftClassifierTarget(label_column='label', prediction_column='prediction')[source]¶
Bases:
object
Minimal target concept implementation to select label and prediction columns for binary soft classifier performance measures.
- get_target_columns(data: DataFrame)[source]¶
Select the label and prediction columns from object initialization.
- statistic_types = ()¶
- pysubgroup.model_predictions_target.average_ranking_loss(y_true, y_pred)[source]¶
Implementation of the Average Ranking Loss (ARL) performance measure for binary soft classifiers based on the definitions in the paper [“Understanding Where Your Classifier Does (Not) Work – The SCaPE Model Class for EMM”](https://doi.org/10.1109/ICDM.2014.10).
- Parameters:
y_true – Binary Labels, must be ordered to match y_pred.
y_pred – Predicted Scores, must be in ascending order.
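For orientation, the idea behind the average ranking loss can be sketched in plain Python: each positive instance is penalized by the number of negatives that are scored above it, and the penalties are averaged over the positives. This is an illustrative reading of the measure, not the library's implementation. The function name `average_ranking_loss_sketch`, the `positive` argument, and the choice to count ties as one half are assumptions.

```python
def average_ranking_loss_sketch(y_true, y_pred, positive=1):
    """Average, over positives, of the number of negatives scored above them.

    Ties between a positive and a negative count 0.5 (an assumption).
    Hypothetical helper for illustration; not part of pysubgroup.
    """
    positives = [s for t, s in zip(y_true, y_pred) if t == positive]
    negatives = [s for t, s in zip(y_true, y_pred) if t != positive]
    if not positives:
        raise ValueError("ARL is undefined without positive instances")
    penalty = 0.0
    for p in positives:
        for n in negatives:
            if n > p:
                penalty += 1.0  # a negative is ranked above this positive
            elif n == p:
                penalty += 0.5  # tie, counted half in this sketch
    return penalty / len(positives)


# Scores in ascending order, labels aligned with the scores.
y_pred = [0.1, 0.4, 0.6, 0.9]
y_true = [0, 1, 0, 1]
print(average_ranking_loss_sketch(y_true, y_pred))
```

In this example only the positive scored 0.4 has a negative (0.6) above it, so one penalty is shared across two positives. A perfectly ranked dataset has a loss of 0.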
pysubgroup.model_target module¶
- class pysubgroup.model_target.EMM_Likelihood(model)[source]¶
Bases:
AbstractInterestingnessMeasure
Exceptional Model Mining likelihood-based interestingness measure.
This class computes the difference in likelihoods between a subgroup model and the inverse (complement) model, providing a measure of how exceptional the subgroup is with respect to the entire dataset.
- calculate_constant_statistics(data, target)[source]¶
Calculate statistics that remain constant over all subgroups.
- Parameters:
data – The dataset as a pandas DataFrame.
target – The target variable (unused in this context).
- calculate_statistics(subgroup, target, data, statistics=None)[source]¶
Calculate statistics specific to a subgroup.
- Parameters:
subgroup – The subgroup description.
target – The target variable (unused in this context).
data – The dataset as a pandas DataFrame.
statistics – Previously calculated statistics (optional).
- Returns:
An EMM_Likelihood.tpl namedtuple containing model parameters, subgroup likelihood, inverse likelihood, and subgroup size.
- evaluate(subgroup, target, data, statistics=None)[source]¶
Evaluate the interestingness of a subgroup.
- Parameters:
subgroup – The subgroup description.
target – The target variable (unused in this context).
data – The dataset as a pandas DataFrame.
statistics – Previously calculated statistics (optional).
- Returns:
The difference between subgroup likelihood and inverse likelihood.
- get_tuple(sg_size, params, cover_arr)[source]¶
Compute the likelihoods for the subgroup and its complement.
- Parameters:
sg_size – Size of the subgroup.
params – Model parameters obtained from fitting the subgroup.
cover_arr – Boolean array indicating the instances in the subgroup.
- Returns:
An EMM_Likelihood.tpl namedtuple with the computed statistics.
- gp_get_params(cover_arr, v)[source]¶
Get parameters for GP-Growth algorithm.
- Parameters:
cover_arr – Boolean array indicating the instances in the subgroup.
v – Statistics vector from GP-Growth.
- Returns:
An EMM_Likelihood.tpl namedtuple with the computed statistics.
- property gp_requires_cover_arr¶
Indicate whether the GP-Growth algorithm requires a cover array.
- Returns:
True, since the cover array is required.
- tpl¶
alias of
EMM_Likelihood
- class pysubgroup.model_target.PolyRegression_ModelClass(x_name='x', y_name='y', degree=1)[source]¶
Bases:
object
Polynomial Regression Model Class for Exceptional Model Mining.
Provides methods to fit a polynomial regression model to a subgroup and compute likelihoods for Exceptional Model Mining.
- calculate_constant_statistics(data, target)[source]¶
Calculate statistics that remain constant over all subgroups.
- Parameters:
data – The dataset as a pandas DataFrame.
target – The target variable (unused in this context).
- fit(subgroup, data=None)[source]¶
Fit the polynomial regression model to the subgroup data.
- Parameters:
subgroup – The subgroup description.
data – The dataset as a pandas DataFrame (optional).
- Returns:
Contains regression coefficients and subgroup size.
- Return type:
- gp_get_null_vector()[source]¶
Get a null vector for initialization in the GP-Growth algorithm.
- Returns:
Zero-initialized array of size 5.
- Return type:
- gp_get_params(v)[source]¶
Extract model parameters from the statistics vector.
- Parameters:
v (numpy.ndarray) – Statistics vector.
- Returns:
Contains regression coefficients and subgroup size.
- Return type:
- gp_get_stats(row_index)[source]¶
Get statistics for a single row (used in GP-Growth algorithm).
- Parameters:
row_index (int) – Index of the row in the dataset.
- Returns:
Statistics vector for the given row.
- Return type:
- static gp_merge(u, v)[source]¶
Merge two statistics vectors for the GP-Growth algorithm.
- Parameters:
u (numpy.ndarray) – Left statistics vector.
v (numpy.ndarray) – Right statistics vector.
- property gp_requires_cover_arr¶
Indicate whether the GP-Growth algorithm requires a cover array.
- Returns:
False, since the cover array is not required.
- gp_size_sg(stats)[source]¶
Get the size of the subgroup from the statistics.
- Parameters:
stats (numpy.ndarray) – Statistics vector.
- Returns:
Size of the subgroup.
- Return type:
- gp_to_str(stats)[source]¶
Convert statistics to a string representation.
- Parameters:
stats (numpy.ndarray) – Statistics vector.
- Returns:
String representation of the statistics.
- Return type:
- likelihood(stats, sg)[source]¶
Compute the likelihoods for the subgroup instances.
- Parameters:
stats (beta_tuple) – Regression parameters and subgroup size.
sg (numpy.ndarray) – Boolean array indicating subgroup instances.
- Returns:
Likelihood values for the subgroup instances.
- Return type:
- loglikelihood(stats, sg)[source]¶
Compute the log-likelihoods for the subgroup instances.
- Parameters:
stats (beta_tuple) – Regression parameters and subgroup size.
sg (numpy.ndarray) – Boolean array indicating subgroup instances.
- Returns:
Log-likelihood values for the subgroup instances.
- Return type:
pysubgroup.numeric_target module¶
This module defines the NumericTarget and associated quality functions for subgroup discovery when the target variable is numeric.
- class pysubgroup.numeric_target.GeneralizationAware_StandardQFNumeric(a, invert=False, estimator='default', centroid='mean')[source]¶
Bases:
GeneralizationAwareQF_stats
Generalization-Aware Standard Quality Function for Numeric Targets.
Extends StandardQFNumeric to consider generalizations during subgroup discovery, providing methods for optimistic estimates and aggregate statistics.
- aggregate_statistics(stats_subgroup, list_of_pairs)[source]¶
Aggregate statistics from generalizations.
- Parameters:
stats_subgroup – Statistics of the current subgroup.
list_of_pairs – List of (stats, agg_stats) tuples from generalizations.
- Returns:
The aggregated statistics.
- evaluate(subgroup, target, data, statistics=None)[source]¶
Evaluate the quality of the subgroup considering generalizations.
- Parameters:
subgroup – The subgroup to evaluate.
target (NumericTarget) – The target definition.
data (pandas.DataFrame) – The dataset.
statistics (any, optional) – Previously computed statistics.
- Returns:
The computed quality value.
- Return type:
- class pysubgroup.numeric_target.NumericTarget(target_variable)[source]¶
Bases:
object
Target class for numeric variables in subgroup discovery.
Represents a target where the variable of interest is numeric, and computes statistics such as mean, median, standard deviation within subgroups.
- calculate_statistics(subgroup, data, cached_statistics=None)[source]¶
Calculate various statistics for the subgroup and dataset.
- Parameters:
subgroup – The subgroup for which to calculate statistics.
data (pandas.DataFrame) – The dataset.
cached_statistics (dict, optional) – Previously computed statistics.
- Returns:
A dictionary containing statistical measures.
- Return type:
- get_attributes()[source]¶
Get a list of attribute names used by the target.
- Returns:
A list containing the target variable name.
- Return type:
- get_base_statistics(subgroup, data)[source]¶
Compute basic statistics for the subgroup and dataset.
- Parameters:
subgroup – The subgroup for which to compute statistics.
data (pandas.DataFrame) – The dataset.
- Returns:
(instances_dataset, mean_dataset, instances_subgroup, mean_sg)
- Return type:
- statistic_types = ('size_sg', 'size_dataset', 'mean_sg', 'mean_dataset', 'std_sg', 'std_dataset', 'median_sg', 'median_dataset', 'max_sg', 'max_dataset', 'min_sg', 'min_dataset', 'mean_lift', 'median_lift')¶
- class pysubgroup.numeric_target.StandardQFNumeric(a, invert=False, estimator='default', centroid='mean')[source]¶
Bases:
BoundedInterestingnessMeasure
Standard Quality Function for numeric targets.
This quality function computes interestingness of subgroups based on the difference between subgroup mean (or median) and dataset mean (or median), weighted by the size of the subgroup raised to the power of ‘a’.
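The description above corresponds to q = n^a * (centroid_sg - centroid_dataset). A minimal sketch for the mean-based case, with argument names borrowed from the static method standard_qf_numeric documented below; the function name `standard_qf_numeric_sketch` is hypothetical and the code is a stand-alone illustration, not the library source.

```python
def standard_qf_numeric_sketch(a, mean_dataset, instances_subgroup, mean_sg):
    """Size-weighted mean shift: instances_subgroup^a * (mean_sg - mean_dataset).

    Hypothetical helper for illustration; not part of pysubgroup.
    """
    return instances_subgroup ** a * (mean_sg - mean_dataset)


# A subgroup of 16 instances whose mean (12) exceeds the dataset mean (10).
for a in (0.0, 0.5, 1.0):
    print(a, standard_qf_numeric_sketch(a, 10.0, 16, 12.0))
```

The exponent a trades off deviation against size: a = 0 ignores the subgroup size, a = 1 weights the mean shift by the full size, and a = 0.5 sits in between.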
- class Max_Estimator(qf)[source]¶
Bases:
object
Estimator for optimistic estimate using maximum value strategy.
This estimator calculates the optimistic estimate based on the maximum value greater than the dataset centroid.
From Florian Lemmerich’s Dissertation [section 4.2.2.1, Theorem 4 (page 82)]:
\[oe(sg) = n_{>\mu_0}^a (T^{\max}(sg) - \mu_0)\]
- calculate_constant_statistics(data, target)[source]¶
Calculate constant statistics needed for estimation.
- Parameters:
data (pandas.DataFrame) – The dataset.
target (NumericTarget) – The target definition.
- get_data(data, target)[source]¶
Prepare data for estimation (no changes for this estimator).
- Parameters:
data (pandas.DataFrame) – The dataset.
target (NumericTarget) – The target definition.
- Returns:
The unmodified dataset.
- Return type:
- get_estimate(subgroup, sg_size, sg_centroid, cover_arr, _)[source]¶
Compute the optimistic estimate for the subgroup.
- Parameters:
subgroup – The subgroup description.
sg_size (int) – Size of the subgroup.
sg_centroid (float) – Mean or median of the subgroup.
cover_arr (numpy.ndarray) – Boolean array indicating subgroup instances.
_ – Unused parameter.
- Returns:
The optimistic estimate.
- Return type:
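The bound used by Max_Estimator, oe(sg) = n_{>mu_0}^a (T^max(sg) - mu_0), can be written out in a few lines. A minimal sketch on a plain list of target values; the function name `max_estimate_sketch` and the choice to return 0 when no value exceeds the dataset mean are assumptions of this illustration.

```python
def max_estimate_sketch(values_sg, mean_dataset, a):
    """Optimistic estimate: (count of values above mu_0)^a * (max value - mu_0).

    Hypothetical helper for illustration; not part of pysubgroup.
    """
    above = [v for v in values_sg if v > mean_dataset]
    if not above:
        return 0.0  # assumed fallback when nothing lies above the dataset mean
    return len(above) ** a * (max(above) - mean_dataset)


values_sg = [8.0, 11.0, 13.0, 15.0]
print(max_estimate_sketch(values_sg, mean_dataset=10.0, a=1.0))
```

Three values lie above the dataset mean and the largest exceeds it by 5, so no refinement of this subgroup can score higher than 3^a * 5 under the standard quality function.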
- class MeanOrdering_Estimator(qf)[source]¶
Bases:
object
Estimator for optimistic estimate using mean ordering strategy.
This estimator sorts the target values and computes the optimal subgroup by considering prefixes of the sorted list.
- calculate_constant_statistics(data, target)[source]¶
Set up the estimation function, possibly using Numba for speed.
- Parameters:
data (pandas.DataFrame) – The dataset.
target (NumericTarget) – The target definition.
- get_data(data, target)[source]¶
Prepare data by sorting according to the target variable.
- Parameters:
data (pandas.DataFrame) – The dataset.
target (NumericTarget) – The target definition.
- Returns:
The sorted dataset.
- Return type:
- get_estimate(subgroup, sg_size, sg_mean, cover_arr, target_values_sg)[source]¶
Compute the optimistic estimate for the subgroup.
- Parameters:
subgroup – The subgroup description.
sg_size (int) – Size of the subgroup.
sg_mean (float) – Mean of the subgroup.
cover_arr (numpy.ndarray) – Boolean array indicating subgroup instances.
target_values_sg (numpy.ndarray) – Target values in the subgroup.
- Returns:
The optimistic estimate.
- Return type:
- get_estimate_numpy(values_sg, _, mean_dataset)[source]¶
Compute the optimistic estimate using NumPy.
- Parameters:
values_sg (numpy.ndarray) – Sorted target values in the subgroup.
_ – Unused parameter.
mean_dataset (float) – Mean of the dataset.
- Returns:
The optimistic estimate.
- Return type:
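The prefix idea behind MeanOrdering_Estimator can be sketched without NumPy or Numba: sort the subgroup's target values in descending order and take the best quality over all prefixes. A minimal sketch, assuming the mean-based standard quality function k^a * (mean - mu_0); the function name `mean_ordering_estimate_sketch` is hypothetical and this is not the library's vectorized implementation.

```python
def mean_ordering_estimate_sketch(values_sg, mean_dataset, a):
    """Best value of k^a * (mean of top-k values - mu_0) over all k >= 1.

    Hypothetical helper for illustration; not part of pysubgroup.
    """
    if not values_sg:
        raise ValueError("need at least one target value")
    best = float("-inf")
    running_sum = 0.0
    # Walk the values from largest to smallest, growing the prefix by one.
    for k, value in enumerate(sorted(values_sg, reverse=True), start=1):
        running_sum += value
        quality = k ** a * (running_sum / k - mean_dataset)
        best = max(best, quality)
    return best


values_sg = [8.0, 11.0, 13.0, 15.0]
print(mean_ordering_estimate_sketch(values_sg, mean_dataset=10.0, a=1.0))
print(mean_ordering_estimate_sketch(values_sg, mean_dataset=10.0, a=0.0))
```

For a fixed size k, the k largest values give the highest possible mean, so scanning prefixes of the sorted list covers the best hypothetical refinement of each size.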
- class Summation_Estimator(qf)[source]¶
Bases:
object
Estimator for optimistic estimate using summation strategy.
This estimator calculates the optimistic estimate as a hypothetical subgroup which contains only instances with value greater than the dataset mean and is of maximal size.
From Florian Lemmerich’s Dissertation [section 4.2.2.1, Theorem 2 (page 81)]:
\[oe(sg) = \sum_{x \in sg,\, T(x) > \mu_0} (T(x) - \mu_0)\]
- calculate_constant_statistics(data, target)[source]¶
Calculate constant statistics needed for estimation.
- Parameters:
data (pandas.DataFrame) – The dataset.
target (NumericTarget) – The target definition.
- get_data(data, target)[source]¶
Prepare data for estimation (no changes for this estimator).
- Parameters:
data (pandas.DataFrame) – The dataset.
target (NumericTarget) – The target definition.
- Returns:
The unmodified dataset.
- Return type:
- get_estimate(subgroup, sg_size, sg_centroid, cover_arr, _)[source]¶
Compute the optimistic estimate for the subgroup.
- Parameters:
subgroup – The subgroup description.
sg_size (int) – Size of the subgroup.
sg_centroid (float) – Mean or median of the subgroup.
cover_arr (numpy.ndarray) – Boolean array indicating subgroup instances.
_ – Unused parameter.
- Returns:
The optimistic estimate.
- Return type:
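The summation strategy of Summation_Estimator adds up the excess over the dataset mean for every subgroup instance whose target value lies above that mean. A minimal sketch on a plain list; the function name `summation_estimate_sketch` is hypothetical and the code only illustrates the described bound.

```python
def summation_estimate_sketch(values_sg, mean_dataset):
    """Sum of (value - mu_0) over the values that exceed mu_0.

    Hypothetical helper for illustration; not part of pysubgroup.
    """
    return sum(v - mean_dataset for v in values_sg if v > mean_dataset)


values_sg = [8.0, 11.0, 13.0, 15.0]
print(summation_estimate_sketch(values_sg, mean_dataset=10.0))
```

Here the values 11, 13 and 15 contribute 1 + 3 + 5, while 8 is ignored because keeping it could only lower the quality of a refinement.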
- calculate_constant_statistics(data, target)[source]¶
Calculate statistics that remain constant for the dataset.
- Parameters:
data (pandas.DataFrame) – The dataset.
target (NumericTarget) – The target definition.
- calculate_statistics(subgroup, target, data, statistics=None)[source]¶
Calculate statistics specific to the subgroup.
- Parameters:
subgroup – The subgroup for which to calculate statistics.
target (NumericTarget) – The target definition.
data (pandas.DataFrame) – The dataset.
statistics (any, optional) – Unused in this implementation.
- Returns:
Contains size_sg, mean or median, and estimate.
- Return type:
namedtuple
- evaluate(subgroup, target, data, statistics=None)[source]¶
Evaluate the quality of the subgroup using the standard quality function.
- Parameters:
subgroup – The subgroup to evaluate.
target (NumericTarget) – The target definition.
data (pandas.DataFrame) – The dataset.
statistics (any, optional) – Previously computed statistics.
- Returns:
The computed quality value.
- Return type:
- mean_tpl¶
alias of
StandardQFNumeric_parameters
- median_tpl¶
alias of
StandardQFNumeric_median_parameters
- optimistic_estimate(subgroup, target, data, statistics=None)[source]¶
Compute the optimistic estimate of the quality function.
- Parameters:
subgroup – The subgroup for which to compute the optimistic estimate.
target (NumericTarget) – The target definition.
data (pandas.DataFrame) – The dataset.
statistics (any, optional) – Previously computed statistics.
- Returns:
The optimistic estimate of the quality value.
- Return type:
- static standard_qf_numeric(a, _, mean_dataset, instances_subgroup, mean_sg)[source]¶
Compute the standard quality function for numeric targets.
- Parameters:
- Returns:
The computed quality value.
- Return type:
- tpl¶
alias of
StandardQFNumeric_parameters
- class pysubgroup.numeric_target.StandardQFNumericMedian[source]¶
Bases:
BoundedInterestingnessMeasure
Quality function for numeric targets using median (deprecated).
Note
This class is no longer supported. Use StandardQFNumeric with centroid='median' instead.
- tpl¶
alias of
StandardQFNumericMedian_parameters
- class pysubgroup.numeric_target.StandardQFNumericTscore(invert=False)[source]¶
Bases:
BoundedInterestingnessMeasure
Quality function for numeric targets using T-score.
- calculate_constant_statistics(data, target)[source]¶
Calculate statistics that remain constant for the dataset.
- Parameters:
data (pandas.DataFrame) – The dataset.
target (NumericTarget) – The target definition.
- calculate_statistics(subgroup, target, data, statistics=None)[source]¶
Calculate statistics specific to the subgroup.
- Parameters:
subgroup – The subgroup for which to calculate statistics.
target (NumericTarget) – The target definition.
data (pandas.DataFrame) – The dataset.
statistics (any, optional) – Unused in this implementation.
- Returns:
Contains size_sg, mean, std, and estimate.
- Return type:
namedtuple
- evaluate(subgroup, target, data, statistics=None)[source]¶
Evaluate the quality of the subgroup using the T-score.
- Parameters:
subgroup – The subgroup to evaluate.
target (NumericTarget) – The target definition.
data (pandas.DataFrame) – The dataset.
statistics (any, optional) – Previously computed statistics.
- Returns:
The computed T-score.
- Return type:
- optimistic_estimate(subgroup, target, data, statistics=None)[source]¶
Compute the optimistic estimate of the quality function.
- Parameters:
subgroup – The subgroup for which to compute the optimistic estimate.
target – The target definition.
data – The dataset.
statistics (any, optional) – Previously computed statistics.
- Returns:
The optimistic estimate of the quality value.
- Return type:
- static t_score(mean_dataset, instances_subgroup, mean_sg, std_sg)[source]¶
Compute the T-score for the subgroup.
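The arguments of t_score suggest a one-sample t statistic. A minimal sketch, assuming the conventional form sqrt(n) * (mean_sg - mean_dataset) / std_sg and a score of 0 when the subgroup standard deviation is 0. Neither assumption is confirmed by this page, and the function name `t_score_sketch` is hypothetical.

```python
import math


def t_score_sketch(mean_dataset, instances_subgroup, mean_sg, std_sg):
    """Conventional t statistic of the subgroup mean against the dataset mean.

    Hypothetical helper for illustration; not part of pysubgroup.
    """
    if std_sg == 0:
        return 0.0  # assumed convention to avoid a division by zero
    return math.sqrt(instances_subgroup) * (mean_sg - mean_dataset) / std_sg


# 25 instances, subgroup mean 12 vs. dataset mean 10, subgroup std 4.
print(t_score_sketch(10.0, 25, 12.0, 4.0))
```

Unlike the size-weighted mean shift, this score also rewards subgroups whose target values are tightly concentrated.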
- tpl¶
alias of
StandardQFNumericTscore_parameters
- pysubgroup.numeric_target.calc_sorted_median(arr)[source]¶
Calculate the median of a sorted array.
- Parameters:
arr (numpy.ndarray) – A sorted array.
- Returns:
The median value.
- Return type:
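Because the input is already sorted, the median can be read off by index without sorting again. A minimal sketch on a plain list rather than a numpy.ndarray; the function name `sorted_median_sketch` and the error raised for empty input are assumptions of this illustration.

```python
def sorted_median_sketch(arr):
    """Median of an already sorted sequence, read off by index.

    Hypothetical helper for illustration; not part of pysubgroup.
    """
    n = len(arr)
    if n == 0:
        raise ValueError("median of an empty sequence is undefined")
    mid = n // 2
    if n % 2 == 1:
        return arr[mid]  # odd length: the single middle element
    return (arr[mid - 1] + arr[mid]) / 2  # even length: mean of the two middle elements


print(sorted_median_sketch([1, 3, 7]))
print(sorted_median_sketch([1, 3, 7, 9]))
```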
pysubgroup.permutation_test module¶
- class pysubgroup.permutation_test.NegativeClassCountRandomSelector(size, negative_class_count, np_rng, positive_class_indices, negative_class_indices)[source]¶
Bases:
object
A selector that covers a random subset of the given indices, such that the number of covered instances as well as the number of negative instances are always the same.
- property selectors¶
- pysubgroup.permutation_test.permutation_test(qf: any, result: any, target: SoftClassifierTarget, data: DataFrame, num_random_samples: int, np_rng: Generator = Generator(PCG64), max_random_sampling_retries: int = 10, alpha: float = 0.05, pos_label: any = 1, neg_label: any = 0, multitest_correction_method: str = 'fdr_by', tqdm: any = None)[source]¶
Test the subgroups in the result for statistical significance by comparison to qualities of random samples from the data. Random samples are drawn such that the number of instances from each class in the sample is the same as in the tested subgroup.
Only for SoftClassifierTargets.
- Parameters:
qf – Quality function to use as the test statistic.
result – ps.SubgroupDiscoveryResult object holding the subgroups to test.
target – Target concept to use in the quality function.
data – Dataset to compute all qualities from. The qualities of the given subgroups are also recomputed on this data for the test.
num_random_samples – How many random samples to draw. More samples yield finer-grained p-values.
np_rng – Random generator object to use for drawing the samples. Use this to get reproducible results.
max_random_sampling_retries – How often to repeat the drawing process for each sample to get a quality. Repetitions are used when the quality is undefined on a random sample.
pos_label – Which value in the dataset to count as a positive class.
neg_label – Which value in the dataset to count as a negative class.
multitest_correction_method – Method used to correct the p-values for multiple comparisons. Refer to statsmodels.stats.multitest.multipletests for all possible values.
- Return p_values_raw:
Uncorrected p-values for each subgroup
- Return reject:
Test result after multiple testing correction.
- Return p_values_corrected:
P-values after multiple testing correction.
- Return qualities:
Subgroup qualities on the testing data.
- Return samples:
The full random sample of qualities that was generated for each subgroup.
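The two core ideas of the test, class-preserving random samples and a p-value read off the sampled qualities, can be sketched with the standard library. This is an illustration under stated assumptions: the helper names are hypothetical, Python's random.Random stands in for the NumPy generator, and the add-one p-value formula is a common convention that this page does not confirm for the library.

```python
import random


def draw_class_preserving_sample(labels, n_pos, n_neg, rng, pos_label=1, neg_label=0):
    """Draw row indices containing exactly n_pos positives and n_neg negatives.

    Hypothetical helper for illustration; not part of pysubgroup.
    """
    pos_idx = [i for i, y in enumerate(labels) if y == pos_label]
    neg_idx = [i for i, y in enumerate(labels) if y == neg_label]
    return sorted(rng.sample(pos_idx, n_pos) + rng.sample(neg_idx, n_neg))


def empirical_p_value(observed_quality, sample_qualities):
    """Share of random samples at least as good as the observed subgroup.

    Uses the add-one convention, so the smallest possible value is
    1 / (number of samples + 1). This convention is an assumption.
    """
    at_least = sum(1 for q in sample_qualities if q >= observed_quality)
    return (at_least + 1) / (len(sample_qualities) + 1)


rng = random.Random(0)  # fixed seed for reproducible draws
labels = [1, 0, 1, 0, 0, 1, 0, 0]
sample = draw_class_preserving_sample(labels, n_pos=2, n_neg=3, rng=rng)
print(sample)
print(empirical_p_value(0.9, [0.1, 0.5, 0.95, 0.3]))
```

With N random samples the p-value can only take N + 1 distinct values, which is why a larger num_random_samples gives a finer resolution.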
pysubgroup.refinement_operator module¶
- class pysubgroup.refinement_operator.RefinementOperator[source]¶
Bases:
object
Base class for refinement operators.
- class pysubgroup.refinement_operator.StaticGeneralizationOperator(selectors)[source]¶
Bases:
object
Refinement operator for static generalization.
This operator generalizes subgroups by adding selectors from a predefined list, ensuring that each selector is used in a specific order.
pysubgroup.representations module¶
- class pysubgroup.representations.BitSetRepresentation(df, selectors_to_patch)[source]¶
Bases:
RepresentationBase
Representation class that uses bitsets for selectors and conjunctions.
- Conjunction¶
alias of
BitSet_Conjunction
- Disjunction¶
alias of
BitSet_Disjunction
- class pysubgroup.representations.BitSet_Conjunction(*args, **kwargs)[source]¶
Bases:
Conjunction
Conjunction subclass that uses bitsets for representation.
Provides efficient computation of the conjunction using numpy boolean arrays.
- append_and(to_append)[source]¶
Append a selector using logical AND and update the representation.
- Parameters:
to_append – Selector to append.
- compute_representation()[source]¶
Compute the bitset representation of the conjunction.
- Returns:
Numpy boolean array representing the instances covered by the conjunction.
- n_instances = 0¶
- property size_sg¶
Size of the subgroup represented by the conjunction.
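The bitset idea can be shown with plain Python lists standing in for the numpy boolean arrays: each selector contributes one boolean per instance, the conjunction is the element-wise AND, and the subgroup size is the number of True entries. The function name `conjunction_cover` and the example selectors are hypothetical.

```python
def conjunction_cover(selector_covers):
    """Element-wise AND over the per-instance booleans of each selector.

    Hypothetical helper for illustration; not part of pysubgroup.
    """
    return [all(bits) for bits in zip(*selector_covers)]


# One boolean per instance, for two made-up selectors over five instances.
age_below_30 = [True, True, False, True, False]
sex_is_f = [True, False, False, True, True]

cover = conjunction_cover([age_below_30, sex_is_f])
size_sg = sum(cover)  # True counts as 1
print(cover, size_sg)
```

Appending a further selector only requires one more AND with the existing cover, which is what makes this representation cheap to refine.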
- class pysubgroup.representations.BitSet_Disjunction(*args, **kwargs)[source]¶
Bases:
Disjunction
Disjunction subclass that uses bitsets for representation.
Provides efficient computation of the disjunction using numpy boolean arrays.
- append_or(to_append)[source]¶
Append a selector using logical OR and update the representation.
- Parameters:
to_append – Selector to append.
- compute_representation()[source]¶
Compute the bitset representation of the disjunction.
- Returns:
Numpy boolean array representing the instances covered by the disjunction.
- property size_sg¶
Size of the subgroup represented by the disjunction.
- class pysubgroup.representations.NumpySetRepresentation(df, selectors_to_patch)[source]¶
Bases:
RepresentationBase
Representation class that uses numpy arrays for selectors and conjunctions.
- Conjunction¶
alias of
NumpySet_Conjunction
- class pysubgroup.representations.NumpySet_Conjunction(*args, **kwargs)[source]¶
Bases:
Conjunction
Conjunction subclass that uses numpy arrays for set representation.
- all_set = None¶
- append_and(to_append)[source]¶
Append a selector using logical AND and update the representation.
- Parameters:
to_append – Selector to append.
- compute_representation()[source]¶
Compute the numpy array representation of the conjunction.
- Returns:
Numpy array of indices representing the instances covered by the conjunction.
- property size_sg¶
Size of the subgroup represented by the conjunction.
- class pysubgroup.representations.RepresentationBase(new_conjunction, selectors_to_patch)[source]¶
Bases:
object
Base class for different representation strategies.
Provides methods to patch selectors and manage class-level patches. Can be used as a context manager to ensure patches are applied and removed properly.
- patch_classes()[source]¶
Patch the required classes.
Can be overridden by subclasses to patch class-level attributes or methods.
- class pysubgroup.representations.SetRepresentation(df, selectors_to_patch)[source]¶
Bases:
RepresentationBase
Representation class that uses sets for selectors and conjunctions.
- Conjunction¶
alias of
Set_Conjunction
- class pysubgroup.representations.Set_Conjunction(*args, **kwargs)[source]¶
Bases:
Conjunction
Conjunction subclass that uses sets for representation.
- all_set = {}¶
- append_and(to_append)[source]¶
Append a selector using logical AND and update the representation.
- Parameters:
to_append – Selector to append.
- compute_representation()[source]¶
Compute the set representation of the conjunction.
- Returns:
Set of indices representing the instances covered by the conjunction.
- property size_sg¶
Size of the subgroup represented by the conjunction.
pysubgroup.subgroup_description module¶
Created on 28.04.2016
@author: lemmerfn
- class pysubgroup.subgroup_description.BooleanExpressionBase[source]¶
Bases:
ABC
Base class for boolean expressions (conjunctions and disjunctions).
- class pysubgroup.subgroup_description.Conjunction(selectors)[source]¶
Bases:
BooleanExpressionBase
Conjunction of selectors (logical AND).
- covers(instance)[source]¶
Determine which instances are covered by the conjunction.
- Parameters:
instance – pandas DataFrame containing the data.
- Returns:
A boolean array indicating which instances are covered.
- property depth¶
Return the number of selectors in the conjunction.
- static from_str(s)[source]¶
Create a Conjunction from a string representation.
- Parameters:
s – String representation of the conjunction.
- Returns:
A Conjunction instance.
- property selectors¶
Return the selectors in the conjunction as a tuple.
- class pysubgroup.subgroup_description.DNF(selectors=None)[source]¶
Bases:
Disjunction
Disjunctive Normal Form expression.
- class pysubgroup.subgroup_description.Disjunction(selectors=None)[source]¶
Bases:
BooleanExpressionBase
Disjunction of selectors (logical OR).
- covers(instance)[source]¶
Determine which instances are covered by the disjunction.
- Parameters:
instance – pandas DataFrame containing the data.
- Returns:
A boolean array indicating which instances are covered.
- property selectors¶
Return the selectors in the disjunction as a tuple.
- class pysubgroup.subgroup_description.EqualitySelector(*args, **kwargs)[source]¶
Bases:
SelectorBase
Selector that checks for equality with a specific value.
- property attribute_name¶
Name of the attribute.
- property attribute_value¶
Value of the attribute to compare for equality.
- classmethod compute_descriptions(attribute_name, attribute_value, selector_name)[source]¶
Compute the descriptions (hash, query, string) for the selector.
- covers(data)[source]¶
Determine which instances in data are covered by this selector.
- Parameters:
data – pandas DataFrame containing the data.
- Returns:
A boolean array indicating which instances are covered.
- static from_str(s)[source]¶
Create an EqualitySelector from a string representation.
- Parameters:
s – String representation of the selector.
- Returns:
An EqualitySelector instance.
- property selectors¶
Return the selector itself as a tuple (for compatibility).
- class pysubgroup.subgroup_description.IntervalSelector(*args, **kwargs)[source]¶
Bases:
SelectorBase
Selector that checks if a value is within an interval.
- property attribute_name¶
Name of the attribute.
- classmethod compute_descriptions(attribute_name, lower_bound, upper_bound, selector_name=None)[source]¶
Compute the descriptions (hash, query, string) for the interval selector.
- classmethod compute_string(attribute_name, lower_bound, upper_bound, rounding_digits)[source]¶
Compute the string representation of the interval selector.
- covers(data_instance)[source]¶
Determine which instances are covered by this interval selector.
- Parameters:
data_instance – pandas DataFrame containing the data.
- Returns:
A boolean array indicating which instances are within the interval.
- static from_str(s)[source]¶
Create an IntervalSelector from a string representation.
- Parameters:
s – String representation of the interval selector.
- Returns:
An IntervalSelector instance.
- property lower_bound¶
Lower bound of the interval (inclusive).
- property selectors¶
Return the selector itself as a tuple (for compatibility).
- set_descriptions(attribute_name, lower_bound, upper_bound, selector_name=None)[source]¶
Set the descriptions (hash, query, string) for the interval selector.
- property upper_bound¶
Upper bound of the interval (exclusive).
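The bounds documented above describe a half-open interval: the lower bound is included and the upper bound is excluded. A minimal sketch of that membership test on a plain list of values; the function name `interval_covers` is hypothetical and this is not the library's covers method.

```python
def interval_covers(values, lower_bound, upper_bound):
    """True where lower_bound <= value < upper_bound (half-open interval).

    Hypothetical helper for illustration; not part of pysubgroup.
    """
    return [lower_bound <= v < upper_bound for v in values]


ages = [18, 29.9, 30, 40]
print(interval_covers(ages, lower_bound=18, upper_bound=30))
```

Half-open intervals let adjacent bins such as [18, 30) and [30, 45) cover every value exactly once.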
- class pysubgroup.subgroup_description.NegatedSelector(*args, **kwargs)[source]¶
Bases:
SelectorBase
Selector that negates another selector.
- property attribute_name¶
Name of the attribute.
- covers(data_instance)[source]¶
Determine which instances are not covered by the underlying selector.
- Parameters:
data_instance – pandas DataFrame containing the data.
- Returns:
A boolean array indicating which instances are not covered.
- property selectors¶
Return the selector itself as a tuple (for compatibility).
- class pysubgroup.subgroup_description.SelectorBase(*args, **kwargs)[source]¶
Bases:
ABC
Base class for selectors, ensuring each selector instance is unique.
- pysubgroup.subgroup_description.create_nominal_selectors(data, ignore=None)[source]¶
Create equality selectors for nominal attributes.
- Parameters:
data – pandas DataFrame containing the data.
ignore – List of attribute names to ignore.
- Returns:
List of EqualitySelector instances.
- pysubgroup.subgroup_description.create_nominal_selectors_for_attribute(data, attribute_name, dtypes=None)[source]¶
Create equality selectors for a nominal attribute.
- Parameters:
data – pandas DataFrame containing the data.
attribute_name – Name of the attribute.
dtypes – Data types of the data columns.
- Returns:
List of EqualitySelector instances for the attribute.
- pysubgroup.subgroup_description.create_numeric_selectors(data, nbins=5, intervals_only=True, weighting_attribute=None, ignore=None)[source]¶
Create selectors for numeric attributes.
- Parameters:
data – pandas DataFrame containing the data.
nbins – Number of bins to use for discretization.
intervals_only – If True, only create interval selectors.
weighting_attribute – Optional attribute for weighting.
ignore – List of attribute names to ignore.
- Returns:
List of numeric selectors.
- pysubgroup.subgroup_description.create_numeric_selectors_for_attribute(data, attr_name, nbins=5, intervals_only=True, weighting_attribute=None)[source]¶
Create selectors for a numeric attribute.
- Parameters:
data – pandas DataFrame containing the data.
attr_name – Name of the attribute.
nbins – Number of bins to use for discretization.
intervals_only – If True, only create interval selectors.
weighting_attribute – Optional attribute for weighting.
- Returns:
List of numeric selectors for the attribute.
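One way to picture how cutpoints from a discretization turn into interval selectors is sketched below. This is an illustration only: intervals_from_cutpoints and interval_covers are hypothetical helpers, and the half-open convention (lower bound inclusive, upper bound exclusive) as well as the open-ended first and last intervals are assumptions made for the example.

```python
import math


def intervals_from_cutpoints(cutpoints):
    """Return consecutive (lower, upper) pairs spanning the whole number line."""
    bounds = [-math.inf, *sorted(cutpoints), math.inf]
    return list(zip(bounds[:-1], bounds[1:]))


def interval_covers(interval, value):
    """Assumed convention: lower bound inclusive, upper bound exclusive."""
    lower, upper = interval
    return lower <= value < upper


intervals = intervals_from_cutpoints([10, 20])
print(intervals)  # [(-inf, 10), (10, 20), (20, inf)]
```

Under this convention every value falls into exactly one interval, so the resulting selectors partition the attribute's range.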
- pysubgroup.subgroup_description.create_selectors(data, nbins=5, intervals_only=True, ignore=None)[source]¶
Create a list of selectors for all attributes in the data.
- Parameters:
data – pandas DataFrame containing the data.
nbins – Number of bins to use for numeric attributes.
intervals_only – If True, only create interval selectors for numeric attributes.
ignore – List of attribute names to ignore.
- Returns:
List of selectors.
- pysubgroup.subgroup_description.get_cover_array_and_size(subgroup, data_len=None, data=None)[source]¶
Compute the cover array and its size for a given subgroup.
- Parameters:
subgroup – The subgroup for which to compute the cover array and size.
data_len – Optional length of the data.
data – Optional data.
- Returns:
Tuple of (cover array, size).
- pysubgroup.subgroup_description.get_size(subgroup, data_len=None, data=None)[source]¶
Compute the size of the cover array for a given subgroup.
- Parameters:
subgroup – The subgroup for which to compute the size.
data_len – Optional length of the data.
data – Optional data.
- Returns:
Size of the cover array.
pysubgroup.utils module¶
Created on 02.05.2016
@author: lemmerfn
- class pysubgroup.utils.BaseTarget[source]¶
Bases:
object
Base class for defining targets in subgroup discovery.
Provides a method to check if all required statistics are present.
- class pysubgroup.utils.SubgroupDiscoveryResult(results, task)[source]¶
Bases:
object
Represents the result of a subgroup discovery task.
Contains methods to convert results to different formats.
- to_dataframe(statistics_to_show=None, autoround=False, include_target=False)[source]¶
Converts the results to a pandas DataFrame.
- Parameters:
- Returns:
A pandas DataFrame representing the results.
- Return type:
DataFrame
- to_descriptions(include_stats=False)[source]¶
Converts the results to a list of subgroup descriptions.
- to_latex(statistics_to_show=None, escape_underscore=True)[source]¶
Converts the results to a LaTeX-formatted table.
- pysubgroup.utils.add_if_required(result, sg, quality, task: SubgroupDiscoveryTask, check_for_duplicates=False, statistics=None, explicit_result_set_size=None)[source]¶
Adds a subgroup to the result set if it meets the required quality and constraints.
Important
Only add/remove subgroups from result by using heappop and heappush to ensure order of subgroups by quality.
- Parameters:
result (list) – The current list of subgroups (heap).
sg – The subgroup to potentially add.
quality (float) – The quality of the subgroup.
task (SubgroupDiscoveryTask) – The task containing parameters and constraints.
check_for_duplicates (bool) – If True, checks for duplicates before adding.
statistics (optional) – Precomputed statistics for the subgroup.
explicit_result_set_size (int, optional) – Overrides the task’s result_set_size.
- Returns:
None
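The heap discipline mentioned in the note above can be sketched with the standard library's heapq module. This is a simplified illustration rather than the library's code: add_if_good_enough is a hypothetical name, entries are plain (quality, description) tuples, and duplicate checking and statistics are left out.

```python
import heapq


def add_if_good_enough(heap, quality, description, result_set_size, min_quality):
    """Keep at most result_set_size entries; the worst entry sits at heap[0]."""
    if quality <= min_quality:
        return
    entry = (quality, description)
    if len(heap) < result_set_size:
        heapq.heappush(heap, entry)
    elif quality > heap[0][0]:
        # Evict the current worst entry, then insert the better one.
        heapq.heappop(heap)
        heapq.heappush(heap, entry)


heap = []
candidates = [(0.2, "a"), (0.5, "b"), (0.1, "c"), (0.9, "d"), (0.05, "e")]
for quality, description in candidates:
    add_if_good_enough(heap, quality, description, result_set_size=2, min_quality=0.0)

print(sorted(heap, reverse=True))  # [(0.9, 'd'), (0.5, 'b')]
```

Because the list is only ever modified through heappush and heappop, heap[0] always holds the lowest-quality entry, which makes the "is this candidate better than the worst kept result" check a constant-time lookup.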
- pysubgroup.utils.conditional_invert(val, invert)[source]¶
Conditionally inverts a value based on a boolean flag.
- pysubgroup.utils.count_bits(bitset_as_int)[source]¶
Counts the number of set bits (1s) in a bitset represented as an integer.
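As an illustration of counting set bits in an integer bitset, here is a small sketch using the clear-lowest-set-bit trick. The function name popcount is hypothetical, the input is assumed to be non-negative, and the library may compute the count differently.

```python
def popcount(bitset_as_int):
    """Count the 1 bits of a non-negative integer."""
    count = 0
    while bitset_as_int:
        # x & (x - 1) clears the lowest set bit of x.
        bitset_as_int &= bitset_as_int - 1
        count += 1
    return count


print(popcount(0b101101))  # 4
```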
- pysubgroup.utils.create_subgroup_with_representation(data, selectors)[source]¶
Create an object representing the conjunction of the given selectors, including a bitmask indicating which instances in the dataset are covered.
- Parameters:
data – dataset to evaluate the cover on
selectors – list of selectors to form the conjunction
- pysubgroup.utils.derive_effective_sample_size(weights)[source]¶
Calculates the effective sample size for weighted data.
- Parameters:
weights (array-like) – The weights assigned to the samples.
- Returns:
The effective sample size.
- Return type:
float
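A common definition of effective sample size for weighted data is Kish's formula, the squared sum of the weights divided by the sum of the squared weights. The sketch below assumes that definition; kish_effective_sample_size is a hypothetical name, and this page does not state which formula the library uses.

```python
def kish_effective_sample_size(weights):
    """(sum of weights)^2 / (sum of squared weights), assuming Kish's definition."""
    weights = list(weights)
    total = sum(weights)
    sum_of_squares = sum(w * w for w in weights)
    return total * total / sum_of_squares


print(kish_effective_sample_size([1, 1, 1, 1]))  # 4.0
print(kish_effective_sample_size([3, 1]))        # 1.6
```

With equal weights the result equals the number of samples; the more unequal the weights, the smaller the effective sample size becomes.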
- pysubgroup.utils.equal_frequency_discretization(data, attribute_name, nbins=5, weighting_attribute=None)[source]¶
Discretizes a numerical attribute into bins with approximately equal frequency.
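- Parameters:
data (DataFrame) – The dataset.
attribute_name (str) – Name of the numerical attribute to discretize.
nbins (int, optional) – Number of bins to use for discretization.
weighting_attribute (optional) – Optional attribute for weighting.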
- Returns:
A list of cutpoints defining the bins.
- Return type:
list
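The idea of equal-frequency discretization can be sketched as follows. This is a simplified, unweighted illustration operating on a plain list of values: equal_frequency_cutpoints is a hypothetical helper, it ignores the weighting_attribute option entirely, and its handling of repeated values (simply skipping duplicate cutpoints) is an assumption.

```python
def equal_frequency_cutpoints(values, nbins=5):
    """Pick cutpoints so that each bin holds roughly len(values) / nbins items."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        return []
    cutpoints = []
    for i in range(1, nbins):
        # Value at the i-th of nbins equally spaced positions in the sorted data.
        value = ordered[i * n // nbins]
        if value not in cutpoints:
            cutpoints.append(value)
    return cutpoints


print(equal_frequency_cutpoints(range(1, 11), nbins=5))  # [3, 5, 7, 9]
```

Heavily repeated values can yield fewer than nbins - 1 distinct cutpoints, in which case the bins are only approximately equal in frequency.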
- pysubgroup.utils.find_set_bits(bitset_as_int)[source]¶
Finds the indices of set bits in a bitset represented as an integer.
- Parameters:
bitset_as_int (int) – The bitset represented as an integer.
- Yields:
int – The index of each set bit.
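A generator over the indices of set bits might look like the sketch below. The name iter_set_bits is hypothetical, and counting indices from the least significant bit (index 0) is an assumption about the convention.

```python
def iter_set_bits(bitset_as_int):
    """Yield the index of each 1 bit, starting from the least significant bit."""
    index = 0
    while bitset_as_int:
        if bitset_as_int & 1:
            yield index
        bitset_as_int >>= 1
        index += 1


print(list(iter_set_bits(0b10110)))  # [1, 2, 4]
```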
- pysubgroup.utils.float_formatter(x, digits=2)[source]¶
Formats a float to a specified number of decimal places.
- pysubgroup.utils.intersect_of_ordered_list(list_1, list_2)[source]¶
Computes the intersection of two ordered lists.
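When both inputs are sorted in ascending order, their intersection can be computed in a single pass with two indices. The sketch below shows that technique; intersect_sorted is a hypothetical name, and the sketch assumes "ordered" means sorted ascending.

```python
def intersect_sorted(list_1, list_2):
    """Intersection of two ascending lists in O(len(list_1) + len(list_2))."""
    result = []
    i = j = 0
    while i < len(list_1) and j < len(list_2):
        if list_1[i] < list_2[j]:
            i += 1
        elif list_1[i] > list_2[j]:
            j += 1
        else:
            # Equal elements: keep one copy and advance both indices.
            result.append(list_1[i])
            i += 1
            j += 1
    return result


print(intersect_sorted([1, 3, 4, 7, 9], [3, 4, 5, 9, 10]))  # [3, 4, 9]
```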
- pysubgroup.utils.is_categorical_attribute(data, attribute_name)[source]¶
Determines if an attribute in the dataset is categorical.
- pysubgroup.utils.is_numerical_attribute(data, attribute_name)[source]¶
Determines if an attribute in the dataset is numerical.
- pysubgroup.utils.minimum_required_quality(result, task)[source]¶
Determines the minimum quality required for a subgroup to be considered for inclusion in the result set.
- Parameters:
result (list) – The current list of subgroups (heap).
task (SubgroupDiscoveryTask) – The task containing parameters like result_set_size and min_quality.
- Returns:
The minimum required quality for a subgroup to be added to the result set.
- Return type:
float
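A plausible reading of this threshold, given a min-heap of (quality, ...) entries, is sketched below. This is an assumption about the logic rather than a copy of the library's code, and threshold_for_new_entry is a hypothetical name taking the two task parameters as plain arguments.

```python
import heapq


def threshold_for_new_entry(heap, result_set_size, min_quality):
    """Quality a candidate must exceed to enter the result set."""
    if len(heap) < result_set_size:
        # Free slots remain, so only the global minimum quality applies.
        return min_quality
    # The result set is full: the candidate must beat the current worst entry.
    return heap[0][0]


heap = [(0.7, "b"), (0.4, "a")]
heapq.heapify(heap)
print(threshold_for_new_entry(heap, result_set_size=3, min_quality=0.1))  # 0.1
print(threshold_for_new_entry(heap, result_set_size=2, min_quality=0.1))  # 0.4
```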
- pysubgroup.utils.overlap(sg, another_sg, data)[source]¶
Calculates the Jaccard similarity between two subgroups based on their coverage.
- Parameters:
sg – The first subgroup.
another_sg – The second subgroup.
data (DataFrame) – The dataset.
- Returns:
The Jaccard similarity between the two subgroups.
- Return type:
float
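Jaccard similarity is the size of the intersection of two sets divided by the size of their union. For subgroups, the sets are the instances each subgroup covers. The sketch below works on two boolean cover lists instead of subgroup objects; jaccard is a hypothetical helper, and returning 0.0 when neither subgroup covers anything is an assumption of this example.

```python
def jaccard(cover_a, cover_b):
    """|A and B| / |A or B| for two equally long boolean cover lists."""
    both = sum(a and b for a, b in zip(cover_a, cover_b))
    either = sum(a or b for a, b in zip(cover_a, cover_b))
    return both / either if either else 0.0


cover_a = [True, True, False, False]
cover_b = [True, False, True, False]
# One row is covered by both subgroups, three rows by at least one of them.
similarity = jaccard(cover_a, cover_b)
```

Identical covers give 1.0 and disjoint covers give 0.0, so the value can serve as a redundancy measure between subgroups.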
- pysubgroup.utils.perc_formatter(x)[source]¶
Formats a float as a percentage string with one decimal place.
- pysubgroup.utils.powerset(iterable, max_length=None)[source]¶
Generates the power set (all possible combinations) of an iterable up to a maximum length.
- Parameters:
iterable (iterable) – The iterable to generate combinations from.
max_length (int, optional) – The maximum length of combinations.
- Returns:
An iterator over the power set of the iterable.
- Return type:
iterator
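A length-bounded power set can be built from itertools.chain and itertools.combinations. The sketch below is an illustration under stated assumptions: bounded_powerset is a hypothetical name, and whether the library's version includes the empty combination or combinations of exactly max_length elements is not stated on this page, so both are choices made for the example.

```python
from itertools import chain, combinations


def bounded_powerset(iterable, max_length=None):
    """Iterate over all combinations of 0..max_length elements."""
    items = list(iterable)
    if max_length is None or max_length > len(items):
        max_length = len(items)
    return chain.from_iterable(
        combinations(items, r) for r in range(max_length + 1)
    )


print(list(bounded_powerset("abc", max_length=2)))
# [(), ('a',), ('b',), ('c',), ('a', 'b'), ('a', 'c'), ('b', 'c')]
```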
- pysubgroup.utils.prepare_subgroup_discovery_result(result, task)[source]¶
Filters and sorts the result set of subgroups according to the task parameters.
- Parameters:
result (list) – The list of subgroups (heap).
task (SubgroupDiscoveryTask) – The task containing parameters like result_set_size and min_quality.
- Returns:
The filtered and sorted list of subgroups.
- Return type:
list
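The filter-and-sort step can be pictured as follows. This is a sketch under assumptions: filter_and_sort is a hypothetical name, entries are (quality, description) tuples, the comparison against min_quality is taken to be strict, and the best entries are placed first.

```python
def filter_and_sort(result, result_set_size, min_quality):
    """Drop low-quality entries, sort best-first, keep the top entries."""
    kept = [entry for entry in result if entry[0] > min_quality]
    kept.sort(key=lambda entry: entry[0], reverse=True)
    return kept[:result_set_size]


raw = [(0.5, "b"), (0.9, "d"), (0.0, "z"), (0.2, "a")]
print(filter_and_sort(raw, result_set_size=2, min_quality=0.1))
# [(0.9, 'd'), (0.5, 'b')]
```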
- pysubgroup.utils.remove_selectors_with_attributes(selector_list, attribute_list)[source]¶
Removes selectors that are based on specified attributes.
- pysubgroup.utils.results_df_autoround(df)[source]¶
Automatically rounds numerical columns in a DataFrame for better readability.
- Parameters:
df (DataFrame) – The DataFrame containing the results.
- Returns:
The DataFrame with rounded numerical values.
- Return type:
DataFrame
- pysubgroup.utils.str_to_bool(s)[source]¶
Converts a string representation of a boolean value to a boolean type.
- Parameters:
s (str) – The string to convert (e.g., ‘true’, ‘False’, ‘1’, ‘0’).
- Returns:
The boolean value represented by the string.
- Return type:
bool
- Raises:
ValueError – If the string does not represent a valid boolean value.
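A strict string-to-boolean conversion in the spirit of the description above could look like this. The sketch is not the library's code: parse_bool is a hypothetical name, and the accepted spellings (case-insensitive 'true', 'false', '1', '0') are an assumption based on the examples in the parameter description.

```python
def parse_bool(s):
    """Convert a string to bool, raising ValueError for anything unrecognised."""
    normalized = s.strip().lower()
    if normalized in ("true", "1"):
        return True
    if normalized in ("false", "0"):
        return False
    raise ValueError(f"Cannot interpret {s!r} as a boolean")


print(parse_bool("true"))   # True
print(parse_bool("False"))  # False
```

Raising on unknown input, instead of falling back to Python's truthiness rules, avoids the pitfall that bool("False") evaluates to True.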
pysubgroup.visualization module¶
- pysubgroup.visualization.plot_distribution_numeric(sg, target, data, bins, show_dataset=True)[source]¶
- pysubgroup.visualization.plot_qualities_on_sample_distribution(result, qualities, samples, alpha=0.05, side: Literal['left', 'right'] = 'right', bins=25)[source]¶
Create plots of the empirical sample distribution as a histogram for each subgroup. Include indicators for the subgroup quality and the quality that the alpha threshold corresponds to.
- pysubgroup.visualization.plot_roc(result_df, data, qf=<pysubgroup.binary_target.StandardQF object>, levels=40, annotate=False)[source]¶