codebook_generation
Class SimpleKMeansWithOutput

java.lang.Object
  extended by weka.clusterers.AbstractClusterer
      extended by weka.clusterers.RandomizableClusterer
          extended by codebook_generation.SimpleKMeansWithOutput
All Implemented Interfaces:
java.io.Serializable, java.lang.Cloneable, weka.clusterers.Clusterer, weka.clusterers.NumberOfClustersRequestable, weka.core.CapabilitiesHandler, weka.core.OptionHandler, weka.core.Randomizable, weka.core.RevisionHandler, weka.core.TechnicalInformationHandler, weka.core.WeightedInstancesHandler

public class SimpleKMeansWithOutput
extends weka.clusterers.RandomizableClusterer
implements weka.clusterers.NumberOfClustersRequestable, weka.core.WeightedInstancesHandler, weka.core.TechnicalInformationHandler

Cluster data using the k means algorithm. Can use either the Euclidean distance (default) or the Manhattan distance. If the Manhattan distance is used, then centroids are computed as the component-wise median rather than mean. For more information see:

D. Arthur, S. Vassilvitskii: k-means++: the advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 1027-1035, 2007.

BibTeX:

 @inproceedings{Arthur2007,
    author = {D. Arthur and S. Vassilvitskii},
    booktitle = {Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms},
    pages = {1027-1035},
    title = {k-means++: the advantages of careful seeding},
    year = {2007}
 }
 

Valid options are:

 -N <num>
  number of clusters.
  (default 2).
 
 -P
  Initialize using the k-means++ method.
 
 -V
  Display std. deviations for centroids.
 
 -M
  Don't replace missing values with mean/mode.
 
 -A <classname and options>
  Distance function to use.
  (default: weka.core.EuclideanDistance)
 
 -I <num>
  Maximum number of iterations.
 
 -O
  Preserve order of instances.
 
 -fast
  Enables faster distance calculations, using cut-off values.
  Disables the calculation/output of squared errors/distances.
 
 -num-slots <num>
  Number of execution slots.
  (default 1 - i.e. no parallelism)
 
 -S <num>
  Random number seed.
  (default 10)
 

Version:
$Revision: 9375 $
Author:
Mark Hall (mhall@cs.waikato.ac.nz), Eibe Frank (eibe@cs.waikato.ac.nz)
See Also:
RandomizableClusterer, Serialized Form

Nested Class Summary
private  class SimpleKMeansWithOutput.KMeansClusterTask
           
private  class SimpleKMeansWithOutput.KMeansComputeCentroidTask
           
 
Field Summary
protected  int[] m_Assignments
          Assignments obtained.
private  weka.core.Instances m_ClusterCentroids
          holds the cluster centroids.
private  int[][] m_ClusterMissingCounts
           
private  int[][][] m_ClusterNominalCounts
          For each cluster, holds the frequency counts for the values of each nominal attribute.
private  int[] m_ClusterSizes
          The number of instances in each cluster.
private  weka.core.Instances m_ClusterStdDevs
          Holds the standard deviations of the numeric attributes in each cluster.
protected  int m_completed
           
private  boolean m_displayStdDevs
          Display standard deviations for numeric atts.
protected  weka.core.DistanceFunction m_DistanceFunction
          the distance function used.
private  boolean m_dontReplaceMissing
          If true, missing values are not replaced globally.
protected  int m_executionSlots
           
protected  java.util.concurrent.ExecutorService m_executorPool
          For parallel execution mode
protected  int m_failed
           
protected  boolean m_FastDistanceCalc
          whether to use fast calculation of distances (using a cut-off).
private  double[] m_FullMeansOrMediansOrModes
          Stats on the full data set for comparison purposes.
private  int[] m_FullMissingCounts
           
private  int[][] m_FullNominalCounts
           
private  double[] m_FullStdDevs
           
protected  boolean m_initializeWithKMeansPlusPlus
          Whether to initialize cluster centers using the k-means++ method
private  int m_Iterations
          Keep track of the number of iterations completed before convergence.
private  int m_MaxIterations
          Maximum number of iterations to be executed.
private  int m_NumClusters
          number of clusters to generate.
private  boolean m_PreserveOrder
          Preserve order of instances.
private  weka.filters.unsupervised.attribute.ReplaceMissingValues m_ReplaceMissingFilter
          replace missing values in training instances.
private  double[] m_squaredErrors
          Holds the squared errors for all clusters.
(package private) static long serialVersionUID
          for serialization.
 
Fields inherited from class weka.clusterers.RandomizableClusterer
m_Seed, m_SeedDefault
 
Constructor Summary
SimpleKMeansWithOutput()
          the default constructor.
 
Method Summary
 void buildClusterer(weka.core.Instances data)
          Generates a clusterer.
 int clusterInstance(weka.core.Instance instance)
          Assigns a given instance to a cluster.
private  int clusterProcessedInstance(weka.core.Instance instance, boolean updateErrors, boolean useFastDistCalc)
          clusters an instance that has been through the filters.
 java.lang.String displayStdDevsTipText()
          Returns the tip text for this property.
 java.lang.String distanceFunctionTipText()
          Returns the tip text for this property.
 java.lang.String dontReplaceMissingValuesTipText()
          Returns the tip text for this property.
 java.lang.String fastDistanceCalcTipText()
          Returns the tip text for this property.
 int[] getAssignments()
          Gets the assignments for each instance.
 weka.core.Capabilities getCapabilities()
          Returns default capabilities of the clusterer.
 weka.core.Instances getClusterCentroids()
          Gets the cluster centroids.
 int[][][] getClusterNominalCounts()
          Returns for each cluster the frequency counts for the values of each nominal attribute.
 int[] getClusterSizes()
          Gets the number of instances in each cluster.
 weka.core.Instances getClusterStandardDevs()
          Gets the standard deviations of the numeric attributes in each cluster.
 boolean getDisplayStdDevs()
          Gets whether standard deviations and nominal counts should be displayed in the clustering output.
 weka.core.DistanceFunction getDistanceFunction()
          returns the distance function currently in use.
 boolean getDontReplaceMissingValues()
          Gets whether missing values are to be replaced.
 boolean getFastDistanceCalc()
          Gets whether to use faster distance calculation.
 boolean getInitializeUsingKMeansPlusPlusMethod()
          Get whether to initialize using the probabilistic farthest first like method of the k-means++ algorithm (rather than the standard random selection of initial cluster centers).
 int getMaxIterations()
          gets the number of maximum iterations to be executed.
 int getNumClusters()
          gets the number of clusters to generate.
 int getNumExecutionSlots()
          Get the degree of parallelism to use.
 java.lang.String[] getOptions()
          Gets the current settings of the clusterer.
 boolean getPreserveInstancesOrder()
          Gets whether order of instances must be preserved.
 java.lang.String getRevision()
          Returns the revision string.
 double getSquaredError()
          Gets the squared error for all clusters.
 weka.core.TechnicalInformation getTechnicalInformation()
           
 java.lang.String globalInfo()
          Returns a string describing this clusterer.
 java.lang.String initializeUsingKMeansPlusPlusMethodTipText()
          Returns the tip text for this property.
protected  void kMeansPlusPlusInit(weka.core.Instances data)
           
protected  boolean launchAssignToClusters(weka.core.Instances insts, int[] clusterAssignments)
          Launch the tasks that assign instances to clusters
protected  int launchMoveCentroids(weka.core.Instances[] clusters)
          Launch the move centroids tasks
 java.util.Enumeration listOptions()
          Returns an enumeration describing the available options.
static void main(java.lang.String[] args)
          Main method for executing this class.
 java.lang.String maxIterationsTipText()
          Returns the tip text for this property.
protected  double[] moveCentroid(int centroidIndex, weka.core.Instances members, boolean updateClusterInfo, boolean addToCentroidInstances)
          Move the centroid to its new coordinates.
 int numberOfClusters()
          Returns the number of clusters.
 java.lang.String numClustersTipText()
          Returns the tip text for this property.
 java.lang.String numExecutionSlotsTipText()
          Returns the tip text for this property.
private  java.lang.String pad(java.lang.String source, java.lang.String padChar, int length, boolean leftPad)
           
 java.lang.String preserveInstancesOrderTipText()
          Returns the tip text for this property.
 void setDisplayStdDevs(boolean stdD)
          Sets whether standard deviations and nominal counts should be displayed in the clustering output.
 void setDistanceFunction(weka.core.DistanceFunction df)
          sets the distance function to use for instance comparison.
 void setDontReplaceMissingValues(boolean r)
          Sets whether missing values are to be replaced.
 void setFastDistanceCalc(boolean value)
          Sets whether to use faster distance calculation.
 void setInitializeUsingKMeansPlusPlusMethod(boolean k)
          Set whether to initialize using the probabilistic farthest first like method of the k-means++ algorithm (rather than the standard random selection of initial cluster centers).
 void setMaxIterations(int n)
          set the maximum number of iterations to be executed.
 void setNumClusters(int n)
          set the number of clusters to generate.
 void setNumExecutionSlots(int slots)
          Set the degree of parallelism to use.
 void setOptions(java.lang.String[] options)
          Parses a given list of options.
 void setPreserveInstancesOrder(boolean r)
          Sets whether order of instances must be preserved.
protected  void startExecutorPool()
          Start the pool of execution threads
 java.lang.String toString()
          return a string describing this clusterer.
 
Methods inherited from class weka.clusterers.RandomizableClusterer
getSeed, seedTipText, setSeed
 
Methods inherited from class weka.clusterers.AbstractClusterer
distributionForInstance, forName, makeCopies, makeCopy, runClusterer
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

serialVersionUID

static final long serialVersionUID
for serialization.

See Also:
Constant Field Values

m_ReplaceMissingFilter

private weka.filters.unsupervised.attribute.ReplaceMissingValues m_ReplaceMissingFilter
replace missing values in training instances.


m_NumClusters

private int m_NumClusters
number of clusters to generate.


m_ClusterCentroids

private weka.core.Instances m_ClusterCentroids
holds the cluster centroids.


m_ClusterStdDevs

private weka.core.Instances m_ClusterStdDevs
Holds the standard deviations of the numeric attributes in each cluster.


m_ClusterNominalCounts

private int[][][] m_ClusterNominalCounts
For each cluster, holds the frequency counts for the values of each nominal attribute.


m_ClusterMissingCounts

private int[][] m_ClusterMissingCounts

m_FullMeansOrMediansOrModes

private double[] m_FullMeansOrMediansOrModes
Stats on the full data set for comparison purposes. For a numeric attribute, the value is the mean if the Euclidean distance is used or the median if the Manhattan distance is used; for a nominal attribute, the mode is stored.


m_FullStdDevs

private double[] m_FullStdDevs

m_FullNominalCounts

private int[][] m_FullNominalCounts

m_FullMissingCounts

private int[] m_FullMissingCounts

m_displayStdDevs

private boolean m_displayStdDevs
Display standard deviations for numeric atts.


m_dontReplaceMissing

private boolean m_dontReplaceMissing
If true, missing values are not replaced globally.


m_ClusterSizes

private int[] m_ClusterSizes
The number of instances in each cluster.


m_MaxIterations

private int m_MaxIterations
Maximum number of iterations to be executed.


m_Iterations

private int m_Iterations
Keep track of the number of iterations completed before convergence.


m_squaredErrors

private double[] m_squaredErrors
Holds the squared errors for all clusters.


m_DistanceFunction

protected weka.core.DistanceFunction m_DistanceFunction
the distance function used.


m_PreserveOrder

private boolean m_PreserveOrder
Preserve order of instances.


m_Assignments

protected int[] m_Assignments
Assignments obtained.


m_FastDistanceCalc

protected boolean m_FastDistanceCalc
whether to use fast calculation of distances (using a cut-off).


m_initializeWithKMeansPlusPlus

protected boolean m_initializeWithKMeansPlusPlus
Whether to initialize cluster centers using the k-means++ method


m_executionSlots

protected int m_executionSlots

m_executorPool

protected transient java.util.concurrent.ExecutorService m_executorPool
For parallel execution mode


m_completed

protected int m_completed

m_failed

protected int m_failed
Constructor Detail

SimpleKMeansWithOutput

public SimpleKMeansWithOutput()
the default constructor.

Method Detail

startExecutorPool

protected void startExecutorPool()
Start the pool of execution threads
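
The pool of execution threads serves the parallel execution mode that the number of execution slots controls. As a rough illustration only, and not the code of this class, the sketch below splits one nearest-centroid assignment pass across a fixed-size java.util.concurrent thread pool. The class name ParallelAssignSketch, its methods, and the use of plain double[] arrays in place of weka.core.Instances are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: one assignment pass split across a fixed thread pool. */
final class ParallelAssignSketch {

    /** Index of the centroid with the smallest squared Euclidean distance to p. */
    static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double sum = 0.0;
            for (int j = 0; j < p.length; j++) {
                double d = p[j] - centroids[c][j];
                sum += d * d;
            }
            if (sum < bestDist) {
                bestDist = sum;
                best = c;
            }
        }
        return best;
    }

    /** Assigns all points using the given number of worker threads (slots must be >= 1). */
    static int[] assignInParallel(double[][] points, double[][] centroids, int slots) {
        int n = points.length;
        int[] assignments = new int[n];
        ExecutorService pool = Executors.newFixedThreadPool(slots);
        try {
            List<Future<?>> futures = new ArrayList<>();
            int chunk = Math.max(1, (n + slots - 1) / slots);
            for (int start = 0; start < n; start += chunk) {
                final int lo = start;
                final int hi = Math.min(n, start + chunk);
                // Each task writes to a disjoint index range, so no locking is needed.
                futures.add(pool.submit(() -> {
                    for (int i = lo; i < hi; i++) {
                        assignments[i] = nearest(points[i], centroids);
                    }
                }));
            }
            for (Future<?> f : futures) {
                f.get(); // wait for completion and surface task failures
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("parallel assignment failed", e);
        } finally {
            pool.shutdown();
        }
        return assignments;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {9, 9}, {1, 1}, {10, 10}, {0, 1}};
        double[][] centroids = {{0, 0}, {10, 10}};
        int[] result = assignInParallel(points, centroids, 2);
        System.out.println(java.util.Arrays.toString(result)); // [0, 1, 0, 1, 0]
    }
}
```

With slots set to 1 the sketch degenerates to a single worker, which matches the documented default of no parallelism.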


getTechnicalInformation

public weka.core.TechnicalInformation getTechnicalInformation()
Specified by:
getTechnicalInformation in interface weka.core.TechnicalInformationHandler

globalInfo

public java.lang.String globalInfo()
Returns a string describing this clusterer.

Returns:
a description of the clusterer suitable for displaying in the explorer/experimenter gui

getCapabilities

public weka.core.Capabilities getCapabilities()
Returns default capabilities of the clusterer.

Specified by:
getCapabilities in interface weka.clusterers.Clusterer
Specified by:
getCapabilities in interface weka.core.CapabilitiesHandler
Overrides:
getCapabilities in class weka.clusterers.AbstractClusterer
Returns:
the capabilities of this clusterer

launchMoveCentroids

protected int launchMoveCentroids(weka.core.Instances[] clusters)
Launch the move centroids tasks

Parameters:
clusters - the cluster centroids
Returns:
the number of empty clusters

launchAssignToClusters

protected boolean launchAssignToClusters(weka.core.Instances insts,
                                         int[] clusterAssignments)
                                  throws java.lang.Exception
Launch the tasks that assign instances to clusters

Parameters:
insts - the instances to be clustered
clusterAssignments - the array of cluster assignments
Returns:
true if k means has converged
Throws:
java.lang.Exception - if a problem occurs
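
The assignment step and its convergence flag can be pictured with a minimal sketch. This is not the method's actual implementation: the class AssignmentStepSketch, the plain double[] data layout, and the fixed use of squared Euclidean distance are assumptions for illustration. The pass reports convergence when no instance changes cluster.

```java
import java.util.Arrays;

/** Hypothetical sketch of a k-means assignment pass on plain arrays. */
final class AssignmentStepSketch {

    /** Squared Euclidean distance between two points of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /**
     * Assigns every point to its nearest centroid, updating assignments in place.
     * Returns true when no assignment changed, i.e. the iteration has converged.
     */
    static boolean assignToClusters(double[][] points, double[][] centroids, int[] assignments) {
        boolean converged = true;
        for (int i = 0; i < points.length; i++) {
            int best = 0;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int c = 0; c < centroids.length; c++) {
                double dist = squaredDistance(points[i], centroids[c]);
                if (dist < bestDist) {
                    bestDist = dist;
                    best = c;
                }
            }
            if (assignments[i] != best) {
                assignments[i] = best;
                converged = false;
            }
        }
        return converged;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        double[][] centroids = {{0, 0.5}, {10, 10.5}};
        int[] assignments = new int[points.length]; // all points start in cluster 0
        boolean first = assignToClusters(points, centroids, assignments);
        boolean second = assignToClusters(points, centroids, assignments);
        // prints: [0, 0, 1, 1] false true
        System.out.println(Arrays.toString(assignments) + " " + first + " " + second);
    }
}
```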

buildClusterer

public void buildClusterer(weka.core.Instances data)
                    throws java.lang.Exception
Generates a clusterer. Has to initialize all fields of the clusterer that are not being set via options.

Specified by:
buildClusterer in interface weka.clusterers.Clusterer
Specified by:
buildClusterer in class weka.clusterers.AbstractClusterer
Parameters:
data - set of instances serving as training data
Throws:
java.lang.Exception - if the clusterer has not been generated successfully

kMeansPlusPlusInit

protected void kMeansPlusPlusInit(weka.core.Instances data)
                           throws java.lang.Exception
Throws:
java.lang.Exception

moveCentroid

protected double[] moveCentroid(int centroidIndex,
                                weka.core.Instances members,
                                boolean updateClusterInfo,
                                boolean addToCentroidInstances)
Move the centroid to its new coordinates. Generate the centroid coordinates based on its members (objects assigned to the cluster of the centroid) and the distance function being used.

Parameters:
centroidIndex - index of the centroid whose coordinates will be computed
members - the objects that are assigned to the cluster of this centroid
updateClusterInfo - if the method is supposed to update the m_Cluster arrays
addToCentroidInstances - true if the method is to add the computed coordinates to the Instances holding the centroids
Returns:
the centroid coordinates
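
The class description states that centroids are the component-wise mean under the Euclidean distance and the component-wise median under the Manhattan distance. The sketch below illustrates both on numeric data only. It is not this method's code: CentroidSketch and its method names are hypothetical, and averaging the two middle values for an even number of members is an assumed convention that the source does not specify.

```java
import java.util.Arrays;

/** Hypothetical sketch: mean and median centroids of a cluster's members. */
final class CentroidSketch {

    /** Component-wise mean of the member points (Euclidean case). */
    static double[] meanCentroid(double[][] members) {
        int dims = members[0].length;
        double[] centroid = new double[dims];
        for (double[] m : members) {
            for (int j = 0; j < dims; j++) {
                centroid[j] += m[j];
            }
        }
        for (int j = 0; j < dims; j++) {
            centroid[j] /= members.length;
        }
        return centroid;
    }

    /** Component-wise median of the member points (Manhattan case). */
    static double[] medianCentroid(double[][] members) {
        int dims = members[0].length;
        double[] centroid = new double[dims];
        for (int j = 0; j < dims; j++) {
            double[] column = new double[members.length];
            for (int i = 0; i < members.length; i++) {
                column[i] = members[i][j];
            }
            Arrays.sort(column);
            int mid = column.length / 2;
            // Assumed convention: average the two middle values for even counts.
            centroid[j] = (column.length % 2 == 1)
                    ? column[mid]
                    : (column[mid - 1] + column[mid]) / 2.0;
        }
        return centroid;
    }

    public static void main(String[] args) {
        double[][] members = {{1, 10}, {2, 20}, {9, 60}};
        System.out.println(Arrays.toString(meanCentroid(members)));   // [4.0, 30.0]
        System.out.println(Arrays.toString(medianCentroid(members))); // [2.0, 20.0]
    }
}
```

The example shows why the choice matters: the outlying member (9, 60) pulls the mean but leaves the median at the middle member.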

clusterProcessedInstance

private int clusterProcessedInstance(weka.core.Instance instance,
                                     boolean updateErrors,
                                     boolean useFastDistCalc)
clusters an instance that has been through the filters.

Parameters:
instance - the instance to assign a cluster to
updateErrors - if true, update the within clusters sum of errors
useFastDistCalc - whether to use the fast distance calculation or not
Returns:
a cluster number

clusterInstance

public int clusterInstance(weka.core.Instance instance)
                    throws java.lang.Exception
Assigns a given instance to a cluster.

Specified by:
clusterInstance in interface weka.clusterers.Clusterer
Overrides:
clusterInstance in class weka.clusterers.AbstractClusterer
Parameters:
instance - the instance to be assigned to a cluster
Returns:
the index of the assigned cluster
Throws:
java.lang.Exception - if the instance could not be clustered successfully

numberOfClusters

public int numberOfClusters()
                     throws java.lang.Exception
Returns the number of clusters.

Specified by:
numberOfClusters in interface weka.clusterers.Clusterer
Specified by:
numberOfClusters in class weka.clusterers.AbstractClusterer
Returns:
the number of clusters generated for a training dataset.
Throws:
java.lang.Exception - if number of clusters could not be returned successfully

listOptions

public java.util.Enumeration listOptions()
Returns an enumeration describing the available options.

Specified by:
listOptions in interface weka.core.OptionHandler
Overrides:
listOptions in class weka.clusterers.RandomizableClusterer
Returns:
an enumeration of all the available options.

numClustersTipText

public java.lang.String numClustersTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setNumClusters

public void setNumClusters(int n)
                    throws java.lang.Exception
set the number of clusters to generate.

Specified by:
setNumClusters in interface weka.clusterers.NumberOfClustersRequestable
Parameters:
n - the number of clusters to generate
Throws:
java.lang.Exception - if number of clusters is negative

getNumClusters

public int getNumClusters()
gets the number of clusters to generate.

Returns:
the number of clusters to generate

initializeUsingKMeansPlusPlusMethodTipText

public java.lang.String initializeUsingKMeansPlusPlusMethodTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setInitializeUsingKMeansPlusPlusMethod

public void setInitializeUsingKMeansPlusPlusMethod(boolean k)
Set whether to initialize using the probabilistic farthest first like method of the k-means++ algorithm (rather than the standard random selection of initial cluster centers).

Parameters:
k - true if the k-means++ method is to be used to select initial cluster centers.
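
The "probabilistic farthest first like" selection refers to the seeding of the cited k-means++ paper: the first center is drawn uniformly at random, and each further center is drawn with probability proportional to its squared distance from the nearest center chosen so far. The sketch below is a minimal version under stated assumptions and is not the code of kMeansPlusPlusInit: KMeansPlusPlusSketch is a hypothetical class, data is held in plain double[] arrays, and the distance is fixed to squared Euclidean.

```java
import java.util.Random;

/** Hypothetical sketch of k-means++ seeding (D^2 sampling) on plain arrays. */
final class KMeansPlusPlusSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Returns the indices of k points chosen as initial centers (k <= points.length). */
    static int[] chooseCenters(double[][] points, int k, Random rng) {
        int n = points.length;
        int[] centers = new int[k];
        double[] minDist = new double[n]; // squared distance to the nearest chosen center

        centers[0] = rng.nextInt(n); // first center: uniformly at random
        for (int i = 0; i < n; i++) {
            minDist[i] = squaredDistance(points[i], points[centers[0]]);
        }

        for (int c = 1; c < k; c++) {
            double total = 0.0;
            int farthest = 0;
            for (int i = 0; i < n; i++) {
                total += minDist[i];
                if (minDist[i] > minDist[farthest]) {
                    farthest = i;
                }
            }
            // Sample an index with probability proportional to minDist[i].
            double target = rng.nextDouble() * total;
            int chosen = farthest; // fallback for rounding or all-zero distances
            double cumulative = 0.0;
            for (int i = 0; i < n; i++) {
                cumulative += minDist[i];
                if (cumulative > target) {
                    chosen = i;
                    break;
                }
            }
            centers[c] = chosen;
            for (int i = 0; i < n; i++) {
                minDist[i] = Math.min(minDist[i], squaredDistance(points[i], points[chosen]));
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {10, 10}, {10, 11}, {20, 0}};
        // Seed 10 mirrors the documented default of the -S option.
        int[] centers = chooseCenters(points, 3, new Random(10));
        System.out.println(java.util.Arrays.toString(centers));
    }
}
```

A point that is already a center has distance zero and therefore cannot be drawn again, so distinct input points yield distinct centers.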

getInitializeUsingKMeansPlusPlusMethod

public boolean getInitializeUsingKMeansPlusPlusMethod()
Get whether to initialize using the probabilistic farthest first like method of the k-means++ algorithm (rather than the standard random selection of initial cluster centers).

Returns:
true if the k-means++ method is to be used to select initial cluster centers.

maxIterationsTipText

public java.lang.String maxIterationsTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setMaxIterations

public void setMaxIterations(int n)
                      throws java.lang.Exception
set the maximum number of iterations to be executed.

Parameters:
n - the maximum number of iterations
Throws:
java.lang.Exception - if maximum number of iteration is smaller than 1

getMaxIterations

public int getMaxIterations()
gets the maximum number of iterations to be executed.

Returns:
the maximum number of iterations

displayStdDevsTipText

public java.lang.String displayStdDevsTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setDisplayStdDevs

public void setDisplayStdDevs(boolean stdD)
Sets whether standard deviations and nominal counts should be displayed in the clustering output.

Parameters:
stdD - true if std. devs and counts should be displayed

getDisplayStdDevs

public boolean getDisplayStdDevs()
Gets whether standard deviations and nominal counts should be displayed in the clustering output.

Returns:
true if std. devs and counts should be displayed

dontReplaceMissingValuesTipText

public java.lang.String dontReplaceMissingValuesTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setDontReplaceMissingValues

public void setDontReplaceMissingValues(boolean r)
Sets whether missing values are to be replaced.

Parameters:
r - true if missing values are not to be replaced

getDontReplaceMissingValues

public boolean getDontReplaceMissingValues()
Gets whether missing values are to be replaced.

Returns:
true if missing values are not to be replaced
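
For orientation, the global replacement that this property switches off substitutes the mean for missing numeric values and the mode for missing nominal values. The sketch below covers the numeric half only and is not the ReplaceMissingValues filter itself: MissingValueSketch is a hypothetical class, and using Double.NaN as the missing-value marker in a plain double[][] is an assumption of the example.

```java
import java.util.Arrays;

/** Hypothetical sketch: replace missing numeric values (NaN) by the column mean. */
final class MissingValueSketch {

    /** Returns a copy of data in which every NaN is replaced by its column mean. */
    static double[][] replaceMissingWithMean(double[][] data) {
        int dims = data[0].length;
        double[] means = new double[dims];
        for (int j = 0; j < dims; j++) {
            double sum = 0.0;
            int count = 0;
            for (double[] row : data) {
                if (!Double.isNaN(row[j])) {
                    sum += row[j];
                    count++;
                }
            }
            // A column with no observed values falls back to 0.0 (arbitrary choice).
            means[j] = (count > 0) ? sum / count : 0.0;
        }
        double[][] result = new double[data.length][];
        for (int i = 0; i < data.length; i++) {
            result[i] = data[i].clone();
            for (int j = 0; j < dims; j++) {
                if (Double.isNaN(result[i][j])) {
                    result[i][j] = means[j];
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] data = {{1, Double.NaN}, {3, 4}, {Double.NaN, 8}};
        // prints: [[1.0, 6.0], [3.0, 4.0], [2.0, 8.0]]
        System.out.println(Arrays.deepToString(replaceMissingWithMean(data)));
    }
}
```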

distanceFunctionTipText

public java.lang.String distanceFunctionTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getDistanceFunction

public weka.core.DistanceFunction getDistanceFunction()
returns the distance function currently in use.

Returns:
the distance function

setDistanceFunction

public void setDistanceFunction(weka.core.DistanceFunction df)
                         throws java.lang.Exception
sets the distance function to use for instance comparison.

Parameters:
df - the new distance function to use
Throws:
java.lang.Exception - if instances cannot be processed

preserveInstancesOrderTipText

public java.lang.String preserveInstancesOrderTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setPreserveInstancesOrder

public void setPreserveInstancesOrder(boolean r)
Sets whether order of instances must be preserved.

Parameters:
r - true if the order of instances is to be preserved

getPreserveInstancesOrder

public boolean getPreserveInstancesOrder()
Gets whether order of instances must be preserved.

Returns:
true if the order of instances is to be preserved

fastDistanceCalcTipText

public java.lang.String fastDistanceCalcTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setFastDistanceCalc

public void setFastDistanceCalc(boolean value)
Sets whether to use faster distance calculation.

Parameters:
value - true if faster calculation to be used
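
The option text describes the faster calculation as using cut-off values. One common reading is early abandonment: the distance to a candidate centroid stops accumulating once it exceeds the best distance found so far. The sketch below illustrates that idea only. CutoffDistanceSketch is a hypothetical class, the squared Euclidean distance is an assumption, and the class's actual distance function may apply the cut-off differently.

```java
/** Hypothetical sketch: nearest-centroid search with an early-abandon cut-off. */
final class CutoffDistanceSketch {

    /**
     * Squared Euclidean distance, abandoned as soon as it exceeds cutOff.
     * Returns Double.POSITIVE_INFINITY when the cut-off is exceeded.
     */
    static double squaredDistance(double[] a, double[] b, double cutOff) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
            if (sum > cutOff) {
                return Double.POSITIVE_INFINITY; // cannot beat the current best
            }
        }
        return sum;
    }

    /** Index of the nearest centroid, using the best distance so far as cut-off. */
    static int nearestCentroid(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double dist = squaredDistance(point, centroids[c], bestDist);
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centroids = {{10, 10}, {0, 0}, {1, 2}};
        System.out.println(nearestCentroid(new double[]{1, 1}, centroids)); // 2
        // 9 + 16 = 25 exceeds the cut-off of 10, so the result is Infinity.
        System.out.println(squaredDistance(new double[]{0, 0}, new double[]{3, 4}, 10));
    }
}
```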

getFastDistanceCalc

public boolean getFastDistanceCalc()
Gets whether to use faster distance calculation.

Returns:
true if faster calculation is used

numExecutionSlotsTipText

public java.lang.String numExecutionSlotsTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setNumExecutionSlots

public void setNumExecutionSlots(int slots)
Set the degree of parallelism to use.

Parameters:
slots - the number of execution slots (tasks run in parallel) to use; 1 means no parallelism

getNumExecutionSlots

public int getNumExecutionSlots()
Get the degree of parallelism to use.

Returns:
the number of execution slots (tasks run in parallel) in use; 1 means no parallelism

setOptions

public void setOptions(java.lang.String[] options)
                throws java.lang.Exception
Parses a given list of options.

Valid options are:

 -N <num>
  number of clusters.
  (default 2).
 
 -P
  Initialize using the k-means++ method.
 
 -V
  Display std. deviations for centroids.
 
 -M
  Don't replace missing values with mean/mode.
 
 -A <classname and options>
  Distance function to use.
  (default: weka.core.EuclideanDistance)
 
 -I <num>
  Maximum number of iterations.
 
 -O
  Preserve order of instances.
 
 -fast
  Enables faster distance calculations, using cut-off values.
  Disables the calculation/output of squared errors/distances.
 
 -num-slots <num>
  Number of execution slots.
  (default 1 - i.e. no parallelism)
 
 -S <num>
  Random number seed.
  (default 10)
 

Specified by:
setOptions in interface weka.core.OptionHandler
Overrides:
setOptions in class weka.clusterers.RandomizableClusterer
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported
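
As an illustration of the array that setOptions expects, the sketch below assembles a String[] from a few of the flags in the option list above (-N, -I, -S, -P, -O). The helper class OptionsSketch and its buildOptions method are hypothetical and not part of this API. Only the flag names come from the documented options.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that assembles an options array from typed parameters. */
final class OptionsSketch {

    static String[] buildOptions(int numClusters, int maxIterations, int seed,
                                 boolean kMeansPlusPlus, boolean preserveOrder) {
        List<String> opts = new ArrayList<>();
        opts.add("-N"); // number of clusters
        opts.add(Integer.toString(numClusters));
        opts.add("-I"); // maximum number of iterations
        opts.add(Integer.toString(maxIterations));
        opts.add("-S"); // random number seed
        opts.add(Integer.toString(seed));
        if (kMeansPlusPlus) {
            opts.add("-P"); // initialize using the k-means++ method
        }
        if (preserveOrder) {
            opts.add("-O"); // preserve order of instances
        }
        return opts.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // prints: -N 5 -I 100 -S 10 -P -O
        System.out.println(String.join(" ", buildOptions(5, 100, 10, true, true)));
    }
}
```

With Weka on the classpath, the resulting array would be passed to setOptions(java.lang.String[]) on an instance of this class.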

getOptions

public java.lang.String[] getOptions()
Gets the current settings of the clusterer.

Specified by:
getOptions in interface weka.core.OptionHandler
Overrides:
getOptions in class weka.clusterers.RandomizableClusterer
Returns:
an array of strings suitable for passing to setOptions()

toString

public java.lang.String toString()
return a string describing this clusterer.

Overrides:
toString in class java.lang.Object
Returns:
a description of the clusterer as a string

pad

private java.lang.String pad(java.lang.String source,
                             java.lang.String padChar,
                             int length,
                             boolean leftPad)

getClusterCentroids

public weka.core.Instances getClusterCentroids()
Gets the cluster centroids.

Returns:
the cluster centroids

getClusterStandardDevs

public weka.core.Instances getClusterStandardDevs()
Gets the standard deviations of the numeric attributes in each cluster.

Returns:
the standard deviations of the numeric attributes in each cluster

getClusterNominalCounts

public int[][][] getClusterNominalCounts()
Returns for each cluster the frequency counts for the values of each nominal attribute.

Returns:
the counts

getSquaredError

public double getSquaredError()
Gets the squared error for all clusters.

Returns:
the squared error, NaN if fast distance calculation is used
See Also:
m_FastDistanceCalc
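
A conventional reading of the squared error is the within-cluster sum of squared distances between each instance and its assigned centroid, accumulated per cluster and then summed. The sketch below computes that quantity under stated assumptions: SquaredErrorSketch is a hypothetical class, data is held in plain arrays, and distances are unnormalized squared Euclidean. The value reported by this class may differ, for instance if the configured distance function normalizes attributes.

```java
import java.util.Arrays;

/** Hypothetical sketch: within-cluster sum of squared errors on plain arrays. */
final class SquaredErrorSketch {

    /** Squared error accumulated separately for each cluster. */
    static double[] perClusterSquaredErrors(double[][] points, double[][] centroids,
                                            int[] assignments) {
        double[] errors = new double[centroids.length];
        for (int i = 0; i < points.length; i++) {
            double[] centroid = centroids[assignments[i]];
            for (int j = 0; j < points[i].length; j++) {
                double d = points[i][j] - centroid[j];
                errors[assignments[i]] += d * d;
            }
        }
        return errors;
    }

    /** Total squared error: the sum over all clusters. */
    static double totalSquaredError(double[][] points, double[][] centroids,
                                    int[] assignments) {
        double total = 0.0;
        for (double e : perClusterSquaredErrors(points, centroids, assignments)) {
            total += e;
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 2}, {5, 5}};
        double[][] centroids = {{0, 1}, {5, 5}};
        int[] assignments = {0, 0, 1};
        System.out.println(Arrays.toString(
                perClusterSquaredErrors(points, centroids, assignments))); // [2.0, 0.0]
        System.out.println(totalSquaredError(points, centroids, assignments)); // 2.0
    }
}
```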

getClusterSizes

public int[] getClusterSizes()
Gets the number of instances in each cluster.

Returns:
The number of instances in each cluster

getAssignments

public int[] getAssignments()
                     throws java.lang.Exception
Gets the assignments for each instance.

Returns:
Array of indexes of the centroid assigned to each instance
Throws:
java.lang.Exception - if order of instances wasn't preserved or no assignments were made
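
The returned array holds one centroid index per instance, so it can be post-processed directly. As a small illustration, the sketch below counts the instances per cluster from such an array. ClusterSizeSketch is a hypothetical helper, and the assumption that the indexes run from 0 to the number of clusters minus 1 is inferred from the description rather than stated by the source.

```java
import java.util.Arrays;

/** Hypothetical helper: count instances per cluster from an assignments array. */
final class ClusterSizeSketch {

    /** sizes[c] is the number of entries in assignments equal to c. */
    static int[] clusterSizes(int[] assignments, int numClusters) {
        int[] sizes = new int[numClusters];
        for (int a : assignments) {
            sizes[a]++;
        }
        return sizes;
    }

    public static void main(String[] args) {
        int[] assignments = {0, 2, 2, 1, 2};
        System.out.println(Arrays.toString(clusterSizes(assignments, 3))); // [1, 1, 3]
    }
}
```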

getRevision

public java.lang.String getRevision()
Returns the revision string.

Specified by:
getRevision in interface weka.core.RevisionHandler
Overrides:
getRevision in class weka.clusterers.AbstractClusterer
Returns:
the revision

main

public static void main(java.lang.String[] args)
Main method for executing this class.

Parameters:
args - use -h to list all parameters