ema workbench

clusterer

Created on Sep 8, 2011

Code author: gyucel <g.yucel (at) tudelft (dot) nl>, jhkwakkel <j.h.kwakkel (at) tudelft (dot) nl>

A reworking of the cluster code. The distance metrics now have their own .py file. The currently available metrics are stored in the distance_functions dictionary.

class clusterer.Cluster(cluster_no, all_ds_indices, sample_ds_index, runLogs, dist_clust)

Contains information about a data-series cluster, as well as some methods that help in analyzing a cluster. The basic attributes of a cluster object (e.g. c) are as follows:

  • c.no : Cluster number/index
  • c.indices : Original indices of the dataseries that are in cluster c
  • c.sample : Original index of the dataseries that is the representative of cluster c (i.e. median element of the cluster)
  • c.size : Number of elements (i.e. dataseries) in the cluster c
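
The attributes listed above can be pictured with a small stand-in class. This is only a sketch of the documented attributes, not the workbench's Cluster class: the name ClusterSketch is hypothetical, and the runLogs and dist_clust constructor arguments are left out because their contents are not described here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterSketch:
    """Hypothetical stand-in that mirrors the documented attributes only."""

    no: int             # cluster number/index
    indices: List[int]  # original indices of the data series in the cluster
    sample: int         # original index of the representative data series

    @property
    def size(self) -> int:
        # number of data series in the cluster
        return len(self.indices)


c = ClusterSketch(no=1, indices=[0, 3, 7], sample=3)
print(c.no, c.indices, c.sample, c.size)
```
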
clusterer.cluster(data, outcome, distance='gonenc', interClusterDistance='complete', cMethod='inconsistent', cValue=2.5, plotDendrogram=True, plotClusters=True, groupPlot=False, **kwargs)

Function that clusters the time-series data of the given outcome according to a selected distance measure.

Parameters:
  • data – The return value of perform_experiments.
  • outcome – Name of outcome/variable whose behavior is being analyzed
  • distance – The distance metric to be used.
  • interClusterDistance – How to calculate the inter-cluster distance; see linkage for details.
  • cMethod – Cutoff method, see fcluster for details.
  • cValue – Cutoff value, see fcluster for details.
  • plotDendrogram – Boolean, if true, plot the dendrogram.
  • plotClusters – Boolean, if true, plot the clusters.
  • groupPlot – Boolean, if true plot clusters in a single window, else the clusters are plotted in separate windows.
Return type:

A tuple containing the list of distances, the list of clusters (a Cluster object for each cluster), and a list of logged distance metrics for each time series.
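
The interClusterDistance, cMethod and cValue parameters refer to linkage and fcluster. Assuming these are the SciPy functions of the same name, the sketch below shows how the three values could be handed to them, starting from a symmetric distance matrix. The helper name cluster_from_distances is hypothetical, and this is not the workbench's own code path.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_from_distances(dist_matrix, inter_cluster_distance="complete",
                           c_method="inconsistent", c_value=2.5):
    """Hypothetical helper: flat cluster labels from an n-by-n distance matrix."""
    # linkage expects the condensed (upper-triangle) form of the matrix
    condensed = squareform(np.asarray(dist_matrix, dtype=float))
    # build the hierarchical tree with the chosen inter-cluster distance
    tree = linkage(condensed, method=inter_cluster_distance)
    # cut the tree with the chosen criterion and cutoff value
    return fcluster(tree, t=c_value, criterion=c_method)


# Five one-dimensional "series" summarised by a single value each
points = np.array([0.0, 0.1, 0.2, 10.0, 10.1])
distances = np.abs(points[:, None] - points[None, :])
labels = cluster_from_distances(distances, c_method="distance", c_value=1.0)
print(labels)
```

The usage lines cut the tree with the 'distance' criterion because it is easy to reason about on toy data; the defaults in the function signature mirror the defaults documented above.
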

The remainder of the arguments are passed on to the specified distance function.

Gonenc Distance:

  • ‘distance’: String that specifies the distance to be used. Options: bmd (default), mse, sse
  • ‘filter?’: Boolean that specifies whether the data series will be filtered (for bmd distance)
  • ‘slope filter’: A float that specifies the filtering threshold for the slope: for every data point, if change_in_the_outcome / average_value_of_the_outcome < threshold, the slope is treated as 0 (for bmd distance)
  • ‘curvature filter’: A float that specifies the filtering threshold for the curvature: for every data point, if change_in_the_slope / average_value_of_the_slope < threshold, the curvature is treated as 0 (for bmd distance)
  • ‘no of sisters’: 50 (for bmd distance)
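
The slope filter rule can be sketched as follows. This is an illustration under stated assumptions, not the workbench's implementation: the function name filtered_slopes is hypothetical, the change is taken between consecutive data points, the average is taken over the whole series, and absolute values are compared with the threshold.

```python
def filtered_slopes(series, threshold):
    """Hypothetical sketch: zero out slopes that are small relative to the mean."""
    mean_value = sum(series) / len(series)
    slopes = []
    for previous, current in zip(series, series[1:]):
        change = current - previous
        # Assumption: a zero mean disables the filter to avoid dividing by zero
        if mean_value != 0 and abs(change) / abs(mean_value) < threshold:
            slopes.append(0.0)
        else:
            slopes.append(change)
    return slopes


slopes = filtered_slopes([10.0, 10.01, 12.0, 12.0], threshold=0.01)
print(slopes)
```

Under the same assumptions, the curvature filter would apply the identical test to changes in the slope rather than changes in the outcome.
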

clusterer.construct_distances(data, distance='gonenc', **kwargs)

Constructs an n-by-n matrix of distances for the n data series in data according to the specified distance.

Distance argument specifies the distance measure to be used. Options, which are defined in clusteringDistances.py, are as follows.

  • gonenc: a distance based on qualitative dynamic pattern features
  • willem: a distance mainly based on the presence of crisis periods and the overall trend of the data series
  • sse: regular sum of squared errors
  • mse: regular mean squared error

SSE and MSE are defined in clusteringDistances.py but do not currently work.

Other distances will be added over time.
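
To make the idea of an n-by-n distance matrix concrete, here is a minimal standalone sketch that uses a plain sum of squared errors. It does not call the workbench (where SSE is reported above as not working); the names sse and pairwise_distances are hypothetical.

```python
def sse(a, b):
    """Sum of squared errors between two equally long data series."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pairwise_distances(series, metric=sse):
    """Hypothetical sketch: symmetric n-by-n matrix of pairwise distances."""
    n = len(series)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # the metric is symmetric, so each pair is computed once
            d = float(metric(series[i], series[j]))
            dist[i][j] = d
            dist[j][i] = d
    return dist


matrix = pairwise_distances([[0, 0, 0], [1, 1, 1], [0, 2, 0]])
for row in matrix:
    print(row)
```
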

clusterer.get_drow_index(i, j, size)

Get the index in the distance row for the distance between i and j.

Parameters:
  • i – result i
  • j – result j
  • size – the number of results

Note: i > j
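
One common way to store pairwise distances in a single row is the condensed, row-major upper-triangle ordering used by SciPy's pdist. Assuming the distance row follows that ordering (the workbench may order it differently), the index for a pair with i > j can be sketched as below; the function name drow_index is hypothetical.

```python
def drow_index(i, j, size):
    """Hypothetical sketch: position of the (i, j) distance in a condensed row."""
    if not 0 <= j < i < size:
        raise ValueError("expected 0 <= j < i < size")
    # number of entries contributed by the rows before row j
    offset = j * size - j * (j + 1) // 2
    # position of column i within row j of the upper triangle
    return offset + (i - j - 1)


# With size = 4 the row holds the pairs (0,1), (0,2), (0,3), (1,2), (1,3), (2,3)
print([drow_index(i, j, 4) for j in range(4) for i in range(j + 1, 4)])
```
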