ema workbench

clusterer

Created on Sep 8, 2011

Code author: gyucel <g.yucel (at) tudelft (dot) nl>, jhkwakkel <j.h.kwakkel (at) tudelft (dot) nl>

A reworking of the cluster code. The distance metrics now each have their own .py file. The metrics currently available are stored in the distance_functions dictionary.

clusterer.cluster(data, outcome, distance='gonenc', interClusterDistance='complete', cMethod='inconsistent', cValue=2.5, plotDendrogram=True, plotClusters=True, groupPlot=False, **kwargs)

Method that clusters time-series data from the specified cpickle file according to a selected distance measure.

Parameters:
  • data – The return value of perform_experiments.
  • outcome – Name of outcome/variable whose behavior is being analyzed
  • distance – The distance metric to be used.
  • interClusterDistance – How to calculate inter cluster distance. see linkage for details.
  • cMethod – Cutoff method, see fcluster for details.
  • cValue – Cutoff value, see fcluster for details.
  • plotDendrogram – Boolean, if true, plot the dendrogram.
  • plotClusters – Boolean, if true, plot the clusters.
  • groupPlot – Boolean, if true, plot the clusters in a single window; otherwise the clusters are plotted in separate windows.
Return type:

A tuple containing the list of distances, the cluster allocation, and a list of logged distance metrics for each time series.

The remainder of the arguments are passed on to the specified distance function. See the distance functions for details on these parameters.
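The linkage and fcluster references above presumably point to the functions of those names in scipy.cluster.hierarchy. As a rough sketch of that pipeline, and not the workbench's actual implementation, the following clusters a toy distance matrix using the documented defaults interClusterDistance='complete', cMethod='inconsistent' and cValue=2.5. The toy series and the SSE stand-in distance are assumptions made purely for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical toy data: four short series forming two visibly different groups.
series = np.array([
    [0.0, 1.0, 2.0, 3.0],
    [0.0, 1.1, 2.1, 3.2],
    [3.0, 2.0, 1.0, 0.0],
    [3.1, 2.0, 0.9, 0.0],
])

# Square n-by-n matrix of pairwise distances.
# Plain SSE is used here as a stand-in for the workbench metrics.
diff = series[:, None, :] - series[None, :, :]
dist_matrix = (diff ** 2).sum(axis=2)

# linkage expects the condensed (upper-triangle) form of the matrix.
condensed = squareform(dist_matrix)
tree = linkage(condensed, method='complete')

# Cut the tree with the documented default cutoff method and value.
labels = fcluster(tree, t=2.5, criterion='inconsistent')
print(labels)
```

Other fcluster criteria, such as 'distance' or 'maxclust', can be passed in the same way through the criterion argument.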

clusterer.construct_distances(data, distance='gonenc', **kwargs)

Constructs an n-by-n matrix of distances for the n data series in data according to the specified distance.

Distance argument specifies the distance measure to be used. Options, which are defined in clusteringDistances.py, are as follows.

  • gonenc: a distance based on qualitative dynamic pattern features

  • willem: a distance mainly based on the presence of crisis periods and the overall trend of the data series

  • sse: regular sum of squared errors

  • mse: regular mean squared error

SSE and MSE are in clusteringDistances.py and do not work right now.

Others will be added over time.
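To make the n-by-n structure concrete, here is a minimal sketch of how such a matrix can be filled from any pairwise distance function. The names pairwise_matrix and abs_diff are hypothetical and are not part of the workbench; the sketch also assumes the distance is symmetric, so only the upper triangle is computed.

```python
def pairwise_matrix(series, pair_distance):
    """Return an n-by-n list of lists with the distance between every pair of series."""
    n = len(series)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Assumes a symmetric distance, so each pair is evaluated once.
            d = pair_distance(series[i], series[j])
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix


def abs_diff(a, b):
    """Illustrative stand-in distance: sum of absolute pointwise differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


demo = pairwise_matrix([[0, 1], [1, 1], [3, 0]], abs_diff)
print(demo)
```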

distance_gonenc()

clusterCode.distance_gonenc.distance_gonenc(data, sisterCount=50, wSlopeError=1, wCurvatureError=1, filterSlope=True, tHoldSlope=0.1, filterCurvature=True, tHoldCurvature=0.1, addMidExtension=True, addEndExtension=True)

The distance measures the proximity of data series in terms of their qualitative pattern features. In other words, it quantifies the proximity between two different dynamic behaviour modes.

It is designed to work mainly on non-stationary data. Its current version does not perform well in catching the proximity of two cyclic/repetitive patterns with different numbers of cycles (e.g. an oscillation with 4 cycles versus an oscillation with 6 cycles).

Parameters:
  • sisterCount – Number of long-versions that will be created for the short vector while comparing two data series with unequal feature vector lengths.
  • wSlopeError – Weight of the error between the 1st dimensions of the two feature vectors (i.e. Slope). (default=1)
  • wCurvatureError – Weight of the error between the 2nd dimensions of the two feature vectors (i.e. Curvature). (default=1)
  • filterSlope – Boolean, indicating whether the slope vectors should be filtered for minor fluctuations, or not. (default=True)
  • tHoldSlope – The threshold value to be used in filtering out fluctuations in the slope. (default=0.1)
  • filterCurvature – Boolean, indicating whether the curvature vectors should be filtered for minor fluctuations, or not. (default=True)
  • tHoldCurvature – The threshold value to be used in filtering out fluctuations in the curvature. (default=0.1)
  • addMidExtension – Boolean, indicating whether the feature vectors should be extended by introducing transition sections along the vector. (default=True)
  • addEndExtension – Boolean, indicating whether the feature vectors should be extended by introducing startup/closing sections at the beginning/end of the vector. (default=True)

distance_mse()

distance_mse.distance_mse(data)

The MSE (mean squared-error) distance is equal to the SSE distance divided by the number of data points in data series.

The SSE distance between two data series is the sum of squared errors between their corresponding data points. Let the data series be of length N; the SSE distance between ds1 and ds2 is then the sum of error_term(i)^2 for i from 1 to N, where error_term(i) equals ds1(i)-ds2(i).

Given that SSE is calculated as given above, MSE equals SSE divided by N.

Like the SSE distance, the MSE distance only works with data series of equal length.
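The definition above can be sketched for a single pair of series as follows. This is an illustration of the formula only, not the workbench code (which takes the whole data set as its argument); the helper names sse_pair and mse_pair are hypothetical.

```python
def sse_pair(ds1, ds2):
    """Sum of squared errors between two equally long series."""
    if len(ds1) != len(ds2):
        raise ValueError("data series must be of equal length")
    return sum((a - b) ** 2 for a, b in zip(ds1, ds2))


def mse_pair(ds1, ds2):
    """SSE divided by the number of data points N."""
    return sse_pair(ds1, ds2) / len(ds1)


ds1 = [1.0, 2.0, 3.0, 4.0]
ds2 = [1.0, 3.0, 5.0, 4.0]
sse_value = sse_pair(ds1, ds2)  # error terms: 0, -1, -2, 0
mse_value = mse_pair(ds1, ds2)
print(sse_value, mse_value)
```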

distance_sse()

distance_sse.distance_sse(data)

The SSE (sum of squared-errors) distance between two data series is the sum of squared errors between their corresponding data points. Let the data series be of length N; the SSE distance between ds1 and ds2 is then the sum of error_term(i)^2 for i from 1 to N, where error_term(i) equals ds1(i)-ds2(i).

Since SSE calculation is based on pairwise comparison of individual data points, the data series should be of equal length.

The SSE distance equals the square of the Euclidean distance, which is a commonly used distance metric in time series comparisons.
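The relation to the Euclidean distance can be checked numerically. The snippet below is a small sketch using the standard library function math.dist (Python 3.8 or later); sse_pair is a hypothetical helper written for this example, not a workbench function.

```python
import math


def sse_pair(ds1, ds2):
    """Sum of squared errors between two equally long series."""
    if len(ds1) != len(ds2):
        raise ValueError("data series must be of equal length")
    return sum((a - b) ** 2 for a, b in zip(ds1, ds2))


ds1 = [0.0, 0.0]
ds2 = [3.0, 4.0]
sse_value = sse_pair(ds1, ds2)
euclidean = math.dist(ds1, ds2)

# SSE should match the squared Euclidean distance up to rounding.
print(sse_value, euclidean ** 2)
```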

distance_triangle()

distance_triangle.distance_triangle(data)

The triangle distance is calculated as follows:

Let ds1(.) and ds2(.) be two data series of length N. Then:

  • A equals the sum of ds1(i)*ds2(i) from i=1 to N
  • B equals the square root of the sum of ds1(i)^2 from i=1 to N
  • C equals the square root of the sum of ds2(i)^2 from i=1 to N

distance_triangle = A/(B*C)

The triangle distance works only with data series of the same length.

In the literature, it is claimed that the triangle distance can deal with noise and amplitude scaling very well, but may yield poor results in cases of offset translation and linear drift.
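The formula and the remark on amplitude scaling versus offset translation can be illustrated with a short sketch. It assumes that C is computed over ds2, which makes the expression symmetric in the two series, and it works on a single pair rather than on the whole data set. The name triangle_measure is hypothetical. Note that the quantity as written equals 1 for series that differ only by a positive scale factor; whether the workbench transforms it further is not stated here.

```python
import math


def triangle_measure(ds1, ds2):
    """A / (B * C) for two equally long series (undefined for all-zero series)."""
    if len(ds1) != len(ds2):
        raise ValueError("data series must have the same length")
    a = sum(x * y for x, y in zip(ds1, ds2))
    b = math.sqrt(sum(x * x for x in ds1))
    c = math.sqrt(sum(y * y for y in ds2))  # assumption: C uses ds2
    return a / (b * c)


base = [1.0, 2.0, 3.0]
scaled = [2.0, 4.0, 6.0]       # amplitude scaling of base
shifted = [11.0, 12.0, 13.0]   # offset translation of base

same_shape = triangle_measure(base, scaled)
offset = triangle_measure(base, shifted)
print(same_shape, offset)
```

In this toy case the scaled series is indistinguishable from the original, while the shifted series is not, which is in line with the claim quoted above.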