Targets and Quality Functions

To define the goal of a subgroup discovery task, we use targets and quality functions. A target specifies which attributes are relevant to the task and can provide common statistics for a given subgroup. A quality function assigns a score to each subgroup. All algorithms use these scores to determine the most interesting subgroups.

Frequent Itemset Targets

The simplest target is the FITarget with its associated quality functions CountQF and AreaQF. The CountQF simply counts the number of instances covered by the subgroup in question. The AreaQF multiplies the depth (length) of the subgroup description by the number of instances covered by that description.
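The two frequent-itemset scores can be illustrated in plain Python. This is only a sketch of the arithmetic described above, not pysubgroup's implementation; the helper names `count_quality` and `area_quality`, the dictionary-based description, and the toy rows are assumptions made for illustration.

```python
def count_quality(covered_rows):
    """Number of instances covered by the subgroup."""
    return len(covered_rows)


def area_quality(covered_rows, description):
    """Length of the description times the number of covered instances."""
    return len(description) * len(covered_rows)


# Toy dataset (hypothetical): each row is a dict of attribute values.
rows = [
    {"colour": "red", "size": "L"},
    {"colour": "red", "size": "S"},
    {"colour": "blue", "size": "L"},
    {"colour": "red", "size": "L"},
]

# A description with two selectors: colour == "red" AND size == "L".
description = {"colour": "red", "size": "L"}
covered = [
    row for row in rows
    if all(row[attr] == value for attr, value in description.items())
]

count_score = count_quality(covered)
area_score = area_quality(covered, description)
```

Here the description has length 2 and covers two rows, so the count score is 2 and the area score is 4.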

Binary Targets

For Boolean or binary targets we provide the ChiSquaredQF and the StandardQF quality functions. The StandardQF uses a parameter \(\alpha\) to weight the relative size \(\frac{N_{SG}}{N}\) of a subgroup and multiplies it by the difference between the share of positive instances in the subgroup, \(\frac{p_{SG}}{N_{SG}}\), and the share of positive instances in the entire dataset, \(\frac{p}{N}\):

\[\left ( \frac{N_{SG}}{N} \right ) ^\alpha \left(\frac{p_{SG}}{N_{SG}} - \frac{p}{N} \right)\]

The StandardQF also supports an optimistic estimate.
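To make the StandardQF formula concrete, the following sketch evaluates it directly for a few values of \(\alpha\). The function name `standard_quality` and the example counts are assumptions chosen for illustration; this is not the library's code.

```python
def standard_quality(n_sg, p_sg, n_total, p_total, alpha):
    """(N_SG / N) ** alpha * (p_SG / N_SG - p / N); assumes n_sg > 0."""
    relative_size = n_sg / n_total
    share_in_subgroup = p_sg / n_sg
    share_in_dataset = p_total / n_total
    return relative_size ** alpha * (share_in_subgroup - share_in_dataset)


# Hypothetical counts: 100 instances with 30 positives overall,
# and a subgroup of 20 instances containing 15 positives.
q_alpha_1 = standard_quality(20, 15, 100, 30, alpha=1.0)    # size-weighted
q_alpha_half = standard_quality(20, 15, 100, 30, alpha=0.5)
q_alpha_0 = standard_quality(20, 15, 100, 30, alpha=0.0)    # ignores size
```

With \(\alpha = 0\) only the difference in positive shares counts, while larger values of \(\alpha\) increasingly favour large subgroups.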

The ChiSquaredQF is calculated from the following contingency table, which is then passed to the scipy chi2_contingency function. The lowercase \(n\) denotes the number of negative instances and should not be confused with the capital \(N\), which denotes the total number of instances.

\[\begin{array}{cc} p_{SG} & p - p_{SG} \\ n_{SG} & n - n_{SG} \end{array}\]
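As an illustration of what is computed from this table, the sketch below evaluates the uncorrected Pearson chi-squared statistic by hand in plain Python. The function name `chi_squared_statistic` and the example counts are assumptions. Note that scipy's chi2_contingency applies Yates' continuity correction to 2x2 tables unless it is called with `correction=False`, so its result may differ from this hand calculation depending on how it is invoked.

```python
def chi_squared_statistic(p_sg, n_sg, p, n):
    """Uncorrected Pearson chi-squared statistic of the 2x2 table
    [[p_sg, p - p_sg], [n_sg, n - n_sg]].
    Assumes every row and column sum is positive."""
    table = [[p_sg, p - p_sg], [n_sg, n - n_sg]]
    total = p + n
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if subgroup membership and target were independent.
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2


# Hypothetical counts: 30 positives and 70 negatives overall,
# of which 15 positives and 5 negatives fall into the subgroup.
chi2_deviating = chi_squared_statistic(p_sg=15, n_sg=5, p=30, n=70)

# A subgroup with the same positive share as the dataset scores zero.
chi2_independent = chi_squared_statistic(p_sg=6, n_sg=14, p=30, n=70)
```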

Nominal Targets

Currently pysubgroup supports nominal targets only as binary targets, so you can look for deviations of one nominal value with respect to all other nominal values.

Numeric Targets

For numeric targets pysubgroup offers the StandardQFNumeric, which is defined similarly to the StandardQF:

\[\left ( \frac{N_{SG}}{N} \right ) ^\alpha \left (\mu_{SG} - \mu \right )\]

where \(\mu_{SG}\) and \(\mu\) are the mean values of the subgroup and of the entire dataset, respectively. For the StandardQFNumeric we offer three optimistic estimates: Average, Summation and Ordering. These are described in detail in Florian Lemmerich’s dissertation. You can choose between them with the keyword argument estimator; the available options are 'sum', 'average', and 'order'.
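The numeric formula can be checked with a few lines of plain Python. This is a sketch of the quality value only (the optimistic estimates are not covered); the function name `standard_quality_numeric` and the sample values are assumptions.

```python
from statistics import mean


def standard_quality_numeric(sg_values, all_values, alpha):
    """(N_SG / N) ** alpha * (mu_SG - mu); assumes both lists are non-empty."""
    relative_size = len(sg_values) / len(all_values)
    return relative_size ** alpha * (mean(sg_values) - mean(all_values))


# Hypothetical target values: dataset mean is 5, subgroup mean is 10.
all_values = [1, 2, 3, 4, 5, 15]
sg_values = [5, 15]

q_numeric_alpha_1 = standard_quality_numeric(sg_values, all_values, alpha=1.0)
q_numeric_alpha_0 = standard_quality_numeric(sg_values, all_values, alpha=0.0)
```

The subgroup covers one third of the data and its mean exceeds the dataset mean by 5, so the quality is 5/3 for \(\alpha = 1\) and 5 for \(\alpha = 0\).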

Custom Quality Function

To create a custom quality function that works with all algorithms except gp_growth, implement a class with the following methods:

class MyQualityFunction:
    def calculate_constant_statistics(self, task):
        """ calculate_constant_statistics
            This function is called once for every execution,
            it should do any preparation that is necessary prior to an execution.
        """
        pass

    def calculate_statistics(self, subgroup, data=None):
        """ calculates necessary statistics
            this statistics object is passed on to the evaluate
            and optimistic_estimate functions
        """
        pass

    def evaluate(self, subgroup, statistics_or_data=None):
        """ return the quality calculated from the statistics """
        pass

    def optimistic_estimate(self, subgroup, statistics=None):
        """ returns the optimistic estimate if one is available,
            otherwise infinity
        """
        return float("inf")
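As a rough illustration of how the template might be filled in, the sketch below scores a subgroup by the fraction of rows it covers. Several parts are assumptions and not confirmed by the text above: that the task exposes the dataset as `task.data`, that a subgroup offers a `covers(data)` method returning one flag per row, and the names `RelativeSizeQF`, `SizeStats`, `FakeTask` and `FakeSubgroup`, which are hypothetical stand-ins so that the example runs without pysubgroup installed.

```python
from collections import namedtuple

# Hypothetical statistics record for this sketch (not a pysubgroup type).
SizeStats = namedtuple("SizeStats", ["size_sg", "size_dataset"])


class RelativeSizeQF:
    """Toy quality function: the fraction of rows a subgroup covers."""

    def calculate_constant_statistics(self, task):
        # Assumption: the task exposes the dataset as `task.data`.
        self.size_dataset = len(task.data)

    def calculate_statistics(self, subgroup, data=None):
        # Assumption: `subgroup.covers(data)` yields one flag per row.
        size_sg = sum(1 for flag in subgroup.covers(data) if flag)
        return SizeStats(size_sg, self.size_dataset)

    def evaluate(self, subgroup, statistics_or_data=None):
        stats = self._as_statistics(subgroup, statistics_or_data)
        return stats.size_sg / stats.size_dataset

    def optimistic_estimate(self, subgroup, statistics=None):
        # Refining a description can only drop rows, so the current
        # quality is already an upper bound for every refinement.
        stats = self._as_statistics(subgroup, statistics)
        return stats.size_sg / stats.size_dataset

    def _as_statistics(self, subgroup, statistics_or_data):
        # Accept either precomputed statistics or the raw data.
        if isinstance(statistics_or_data, SizeStats):
            return statistics_or_data
        return self.calculate_statistics(subgroup, statistics_or_data)


# Minimal stand-ins so the sketch runs on its own.
class FakeTask:
    def __init__(self, data):
        self.data = data


class FakeSubgroup:
    def __init__(self, predicate):
        self.predicate = predicate

    def covers(self, data):
        return [self.predicate(row) for row in data]


rows = [{"age": 25}, {"age": 40}, {"age": 61}, {"age": 33}]
qf = RelativeSizeQF()
qf.calculate_constant_statistics(FakeTask(rows))

over_30 = FakeSubgroup(lambda row: row["age"] > 30)
stats = qf.calculate_statistics(over_30, rows)
quality = qf.evaluate(over_30, stats)
```

Computing the statistics once and passing them to both evaluate and optimistic_estimate avoids scanning the data twice for the same subgroup.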