java.lang.Object
  weka.clusterers.AbstractClusterer
    weka.clusterers.RandomizableClusterer
      codebook_generation.SimpleKMeansWithOutput
public class SimpleKMeansWithOutput
Cluster data using the k means algorithm. Can use either the Euclidean distance
(default) or the Manhattan distance. If the Manhattan distance is used, then centroids are computed as the
component-wise median rather than mean. For more information see:
D. Arthur, S. Vassilvitskii: k-means++: the advantages of careful seeding. In: Proceedings of the
eighteenth annual ACM-SIAM symposium on Discrete algorithms, 1027-1035, 2007.
@inproceedings{Arthur2007,
  author = {D. Arthur and S. Vassilvitskii},
  booktitle = {Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms},
  pages = {1027-1035},
  title = {k-means++: the advantages of careful seeding},
  year = {2007}
}
Valid options are:
-N <num> number of clusters. (default 2).
-P Initialize using the k-means++ method.
-V Display std. deviations for centroids.
-M Replace missing values with mean/mode.
-A <classname and options> Distance function to use. (default: weka.core.EuclideanDistance)
-I <num> Maximum number of iterations.
-O Preserve order of instances.
-fast Enables faster distance calculations, using cut-off values. Disables the calculation/output of squared errors/distances.
-num-slots <num> Number of execution slots. (default 1 - i.e. no parallelism)
-S <num> Random number seed. (default 10)
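As a rough illustration of how the flags above are laid out in the String[] that setOptions receives, the following sketch reads a few of them back. The class OptionArraySketch and its helpers valueOf and hasSwitch are hypothetical and are not part of Weka or of this class; the actual parsing is done inside setOptions. The fallback values used here are the defaults stated in the list above.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration only; not part of Weka. */
final class OptionArraySketch {

    /** Returns the token that follows flag, or fallback if the flag is absent. */
    static String valueOf(String[] options, String flag, String fallback) {
        for (int i = 0; i < options.length - 1; i++) {
            if (options[i].equals(flag)) {
                return options[i + 1];
            }
        }
        return fallback;
    }

    /** Returns true if a value-less switch such as -P or -O is present. */
    static boolean hasSwitch(String[] options, String flag) {
        return Arrays.asList(options).contains(flag);
    }

    public static void main(String[] args) {
        // Five clusters, k-means++ seeding, four execution slots, seed 42.
        String[] options = {"-N", "5", "-P", "-num-slots", "4", "-S", "42"};

        System.out.println("clusters   = " + valueOf(options, "-N", "2"));
        System.out.println("slots      = " + valueOf(options, "-num-slots", "1"));
        System.out.println("seed       = " + valueOf(options, "-S", "10"));
        System.out.println("k-means++  = " + hasSwitch(options, "-P"));
        System.out.println("keep order = " + hasSwitch(options, "-O"));
    }
}
```

Flags that take a value occupy two consecutive array elements, while switches such as -P, -V, -M, -O and -fast occupy one.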
See Also:
RandomizableClusterer, Serialized Form
Nested Class Summary | |
---|---|
private class |
SimpleKMeansWithOutput.KMeansClusterTask
|
private class |
SimpleKMeansWithOutput.KMeansComputeCentroidTask
|
Field Summary | |
---|---|
protected int[] |
m_Assignments
Assignments obtained. |
private weka.core.Instances |
m_ClusterCentroids
holds the cluster centroids. |
private int[][] |
m_ClusterMissingCounts
|
private int[][][] |
m_ClusterNominalCounts
For each cluster, holds the frequency counts for the values of each nominal attribute. |
private int[] |
m_ClusterSizes
The number of instances in each cluster. |
private weka.core.Instances |
m_ClusterStdDevs
Holds the standard deviations of the numeric attributes in each cluster. |
protected int |
m_completed
|
private boolean |
m_displayStdDevs
Display standard deviations for numeric atts. |
protected weka.core.DistanceFunction |
m_DistanceFunction
the distance function used. |
private boolean |
m_dontReplaceMissing
Replace missing values globally? |
protected int |
m_executionSlots
|
protected java.util.concurrent.ExecutorService |
m_executorPool
For parallel execution mode |
protected int |
m_failed
|
protected boolean |
m_FastDistanceCalc
whether to use fast calculation of distances (using a cut-off). |
private double[] |
m_FullMeansOrMediansOrModes
Stats on the full data set for comparison purposes. |
private int[] |
m_FullMissingCounts
|
private int[][] |
m_FullNominalCounts
|
private double[] |
m_FullStdDevs
|
protected boolean |
m_initializeWithKMeansPlusPlus
Whether to initialize cluster centers using the k-means++ method |
private int |
m_Iterations
Keep track of the number of iterations completed before convergence. |
private int |
m_MaxIterations
Maximum number of iterations to be executed. |
private int |
m_NumClusters
number of clusters to generate. |
private boolean |
m_PreserveOrder
Preserve order of instances. |
private weka.filters.unsupervised.attribute.ReplaceMissingValues |
m_ReplaceMissingFilter
replace missing values in training instances. |
private double[] |
m_squaredErrors
Holds the squared errors for all clusters. |
(package private) static long |
serialVersionUID
for serialization. |
Fields inherited from class weka.clusterers.RandomizableClusterer |
---|
m_Seed, m_SeedDefault |
Constructor Summary | |
---|---|
SimpleKMeansWithOutput()
the default constructor. |
Method Summary | |
---|---|
void |
buildClusterer(weka.core.Instances data)
Generates a clusterer. |
int |
clusterInstance(weka.core.Instance instance)
Classifies a given instance. |
private int |
clusterProcessedInstance(weka.core.Instance instance,
boolean updateErrors,
boolean useFastDistCalc)
clusters an instance that has been through the filters. |
java.lang.String |
displayStdDevsTipText()
Returns the tip text for this property. |
java.lang.String |
distanceFunctionTipText()
Returns the tip text for this property. |
java.lang.String |
dontReplaceMissingValuesTipText()
Returns the tip text for this property. |
java.lang.String |
fastDistanceCalcTipText()
Returns the tip text for this property. |
int[] |
getAssignments()
Gets the assignments for each instance. |
weka.core.Capabilities |
getCapabilities()
Returns default capabilities of the clusterer. |
weka.core.Instances |
getClusterCentroids()
Gets the cluster centroids. |
int[][][] |
getClusterNominalCounts()
Returns for each cluster the frequency counts for the values of each nominal attribute. |
int[] |
getClusterSizes()
Gets the number of instances in each cluster. |
weka.core.Instances |
getClusterStandardDevs()
Gets the standard deviations of the numeric attributes in each cluster. |
boolean |
getDisplayStdDevs()
Gets whether standard deviations and nominal counts are displayed. |
weka.core.DistanceFunction |
getDistanceFunction()
returns the distance function currently in use. |
boolean |
getDontReplaceMissingValues()
Gets whether missing values are to be replaced. |
boolean |
getFastDistanceCalc()
Gets whether to use faster distance calculation. |
boolean |
getInitializeUsingKMeansPlusPlusMethod()
Get whether to initialize using the probabilistic farthest first like method of the k-means++ algorithm (rather than the standard random selection of initial cluster centers). |
int |
getMaxIterations()
Gets the maximum number of iterations to be executed. |
int |
getNumClusters()
gets the number of clusters to generate. |
int |
getNumExecutionSlots()
Get the degree of parallelism to use. |
java.lang.String[] |
getOptions()
Gets the current settings of SimpleKMeans. |
boolean |
getPreserveInstancesOrder()
Gets whether order of instances must be preserved. |
java.lang.String |
getRevision()
Returns the revision string. |
double |
getSquaredError()
Gets the squared error for all clusters. |
weka.core.TechnicalInformation |
getTechnicalInformation()
|
java.lang.String |
globalInfo()
Returns a string describing this clusterer. |
java.lang.String |
initializeUsingKMeansPlusPlusMethodTipText()
Returns the tip text for this property. |
protected void |
kMeansPlusPlusInit(weka.core.Instances data)
|
protected boolean |
launchAssignToClusters(weka.core.Instances insts,
int[] clusterAssignments)
Launch the tasks that assign instances to clusters |
protected int |
launchMoveCentroids(weka.core.Instances[] clusters)
Launch the move centroids tasks |
java.util.Enumeration |
listOptions()
Returns an enumeration describing the available options. |
static void |
main(java.lang.String[] args)
Main method for executing this class. |
java.lang.String |
maxIterationsTipText()
Returns the tip text for this property. |
protected double[] |
moveCentroid(int centroidIndex,
weka.core.Instances members,
boolean updateClusterInfo,
boolean addToCentroidInstances)
Move the centroid to its new coordinates. |
int |
numberOfClusters()
Returns the number of clusters. |
java.lang.String |
numClustersTipText()
Returns the tip text for this property. |
java.lang.String |
numExecutionSlotsTipText()
Returns the tip text for this property |
private java.lang.String |
pad(java.lang.String source,
java.lang.String padChar,
int length,
boolean leftPad)
|
java.lang.String |
preserveInstancesOrderTipText()
Returns the tip text for this property. |
void |
setDisplayStdDevs(boolean stdD)
Sets whether standard deviations and nominal counts are displayed. |
void |
setDistanceFunction(weka.core.DistanceFunction df)
sets the distance function to use for instance comparison. |
void |
setDontReplaceMissingValues(boolean r)
Sets whether missing values are to be replaced. |
void |
setFastDistanceCalc(boolean value)
Sets whether to use faster distance calculation. |
void |
setInitializeUsingKMeansPlusPlusMethod(boolean k)
Set whether to initialize using the probabilistic farthest first like method of the k-means++ algorithm (rather than the standard random selection of initial cluster centers). |
void |
setMaxIterations(int n)
set the maximum number of iterations to be executed. |
void |
setNumClusters(int n)
set the number of clusters to generate. |
void |
setNumExecutionSlots(int slots)
Set the degree of parallelism to use. |
void |
setOptions(java.lang.String[] options)
Parses a given list of options. |
void |
setPreserveInstancesOrder(boolean r)
Sets whether order of instances must be preserved. |
protected void |
startExecutorPool()
Start the pool of execution threads |
java.lang.String |
toString()
return a string describing this clusterer. |
Methods inherited from class weka.clusterers.RandomizableClusterer |
---|
getSeed, seedTipText, setSeed |
Methods inherited from class weka.clusterers.AbstractClusterer |
---|
distributionForInstance, forName, makeCopies, makeCopy, runClusterer |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
Field Detail |
---|
static final long serialVersionUID
private weka.filters.unsupervised.attribute.ReplaceMissingValues m_ReplaceMissingFilter
private int m_NumClusters
private weka.core.Instances m_ClusterCentroids
private weka.core.Instances m_ClusterStdDevs
private int[][][] m_ClusterNominalCounts
private int[][] m_ClusterMissingCounts
private double[] m_FullMeansOrMediansOrModes
private double[] m_FullStdDevs
private int[][] m_FullNominalCounts
private int[] m_FullMissingCounts
private boolean m_displayStdDevs
private boolean m_dontReplaceMissing
private int[] m_ClusterSizes
private int m_MaxIterations
private int m_Iterations
private double[] m_squaredErrors
protected weka.core.DistanceFunction m_DistanceFunction
private boolean m_PreserveOrder
protected int[] m_Assignments
protected boolean m_FastDistanceCalc
protected boolean m_initializeWithKMeansPlusPlus
protected int m_executionSlots
protected transient java.util.concurrent.ExecutorService m_executorPool
protected int m_completed
protected int m_failed
Constructor Detail |
---|
public SimpleKMeansWithOutput()
Method Detail |
---|
protected void startExecutorPool()
public weka.core.TechnicalInformation getTechnicalInformation()
getTechnicalInformation in interface weka.core.TechnicalInformationHandler
public java.lang.String globalInfo()
public weka.core.Capabilities getCapabilities()
getCapabilities in interface weka.clusterers.Clusterer
getCapabilities in interface weka.core.CapabilitiesHandler
getCapabilities in class weka.clusterers.AbstractClusterer
protected int launchMoveCentroids(weka.core.Instances[] clusters)
Parameters:
clusters - the cluster centroids
protected boolean launchAssignToClusters(weka.core.Instances insts, int[] clusterAssignments) throws java.lang.Exception
Parameters:
insts - the instances to be clustered
clusterAssignments - the array of cluster assignments
Throws:
java.lang.Exception - if a problem occurs
public void buildClusterer(weka.core.Instances data) throws java.lang.Exception
buildClusterer in interface weka.clusterers.Clusterer
buildClusterer in class weka.clusterers.AbstractClusterer
Parameters:
data - set of instances serving as training data
Throws:
java.lang.Exception - if the clusterer has not been generated successfully
protected void kMeansPlusPlusInit(weka.core.Instances data) throws java.lang.Exception
Throws:
java.lang.Exception
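The k-means++ seeding cited in the class description (Arthur and Vassilvitskii, 2007) can be sketched as follows. This is a minimal stand-alone version, assuming plain double[] points and squared Euclidean distance; it does not reproduce the body of kMeansPlusPlusInit, which works on weka.core.Instances and the configured distance function. The class and method names are hypothetical.

```java
import java.util.Random;

/** Hypothetical stand-alone sketch of k-means++ seeding; not the Weka code. */
final class KMeansPlusPlusSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Returns the indices of k points chosen as initial cluster centers. */
    static int[] chooseSeeds(double[][] points, int k, Random rng) {
        int n = points.length;
        if (k < 1 || k > n) {
            throw new IllegalArgumentException("k must be between 1 and the number of points");
        }
        int[] seeds = new int[k];
        double[] nearest = new double[n]; // squared distance to the closest seed so far

        // The first center is drawn uniformly at random.
        seeds[0] = rng.nextInt(n);
        for (int i = 0; i < n; i++) {
            nearest[i] = squaredDistance(points[i], points[seeds[0]]);
        }

        // Each further center is drawn with probability proportional to nearest[i].
        for (int c = 1; c < k; c++) {
            double total = 0.0;
            for (double d : nearest) {
                total += d;
            }
            int chosen;
            if (total == 0.0) {
                chosen = rng.nextInt(n); // every point coincides with a seed
            } else {
                double target = rng.nextDouble() * total;
                double cumulative = 0.0;
                chosen = n - 1; // guard against floating-point rounding
                for (int i = 0; i < n; i++) {
                    cumulative += nearest[i];
                    if (cumulative > target) {
                        chosen = i;
                        break;
                    }
                }
            }
            seeds[c] = chosen;
            for (int i = 0; i < n; i++) {
                nearest[i] = Math.min(nearest[i], squaredDistance(points[i], points[chosen]));
            }
        }
        return seeds;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        int[] seeds = chooseSeeds(points, 2, new Random(10));
        System.out.println(seeds[0] + ", " + seeds[1]);
    }
}
```

Because a point that coincides with an existing seed has weight zero, it cannot be drawn again while any other point is farther away, which is what spreads the initial centers out.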
protected double[] moveCentroid(int centroidIndex, weka.core.Instances members, boolean updateClusterInfo, boolean addToCentroidInstances)
Parameters:
centroidIndex - index of the centroid whose coordinates will be computed
members - the objects that are assigned to the cluster of this centroid
updateClusterInfo - if the method is supposed to update the m_Cluster arrays
addToCentroidInstances - true if the method is to add the computed coordinates to the Instances holding the centroids
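The class description states that centroids are the component-wise mean under the Euclidean distance and the component-wise median under the Manhattan distance. The sketch below shows both on plain double[][] data. It is an illustration only: the names are hypothetical, nominal attributes and missing values are ignored, and averaging the two middle values for an even number of members is an assumption rather than something this page specifies.

```java
import java.util.Arrays;

/** Hypothetical illustration of mean versus median centroids; not the Weka code. */
final class CentroidSketch {

    /** Component-wise mean of the cluster members (Euclidean case). */
    static double[] meanCentroid(double[][] members) {
        requireNonEmpty(members);
        double[] centroid = new double[members[0].length];
        for (double[] member : members) {
            for (int j = 0; j < centroid.length; j++) {
                centroid[j] += member[j];
            }
        }
        for (int j = 0; j < centroid.length; j++) {
            centroid[j] /= members.length;
        }
        return centroid;
    }

    /** Component-wise median of the cluster members (Manhattan case). */
    static double[] medianCentroid(double[][] members) {
        requireNonEmpty(members);
        int n = members.length;
        double[] centroid = new double[members[0].length];
        double[] column = new double[n];
        for (int j = 0; j < centroid.length; j++) {
            for (int i = 0; i < n; i++) {
                column[i] = members[i][j];
            }
            Arrays.sort(column);
            // Assumption: for an even count, average the two middle values.
            centroid[j] = (n % 2 == 1)
                    ? column[n / 2]
                    : (column[n / 2 - 1] + column[n / 2]) / 2.0;
        }
        return centroid;
    }

    private static void requireNonEmpty(double[][] members) {
        if (members.length == 0) {
            throw new IllegalArgumentException("a cluster needs at least one member");
        }
    }

    public static void main(String[] args) {
        double[][] members = {{1, 10}, {2, 20}, {9, 30}};
        System.out.println(Arrays.toString(meanCentroid(members)));
        System.out.println(Arrays.toString(medianCentroid(members)));
    }
}
```

The outlier 9 in the first attribute pulls the mean to 4 while the median stays at 2, which is the practical difference between the two distance settings.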
private int clusterProcessedInstance(weka.core.Instance instance, boolean updateErrors, boolean useFastDistCalc)
Parameters:
instance - the instance to assign a cluster to
updateErrors - if true, update the within clusters sum of errors
useFastDistCalc - whether to use the fast distance calculation or not
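The assignment step described by these parameters can be pictured with the following sketch: find the nearest centroid and, when requested, add the squared distance to that cluster's running error. It assumes plain arrays and squared Euclidean distance, leaves out the cut-off based fast distance calculation, and uses hypothetical names throughout; passing null for the error array stands in for updateErrors being false.

```java
/** Hypothetical illustration of the assignment step; not the Weka code. */
final class AssignmentSketch {

    /**
     * Returns the index of the nearest centroid. If squaredErrors is not null,
     * the squared distance is added to the entry of the winning cluster.
     */
    static int assign(double[] point, double[][] centroids, double[] squaredErrors) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double distance = 0.0;
            for (int j = 0; j < point.length; j++) {
                double d = point[j] - centroids[c][j];
                distance += d * d;
            }
            if (distance < bestDistance) {
                bestDistance = distance;
                best = c;
            }
        }
        if (squaredErrors != null && best >= 0) {
            squaredErrors[best] += bestDistance;
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centroids = {{0, 0}, {10, 0}};
        double[] squaredErrors = new double[centroids.length];
        int cluster = assign(new double[] {8, 0}, centroids, squaredErrors);
        System.out.println("cluster " + cluster + ", error " + squaredErrors[cluster]);
    }
}
```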
public int clusterInstance(weka.core.Instance instance) throws java.lang.Exception
clusterInstance in interface weka.clusterers.Clusterer
clusterInstance in class weka.clusterers.AbstractClusterer
Parameters:
instance - the instance to be assigned to a cluster
Throws:
java.lang.Exception - if instance could not be classified successfully
public int numberOfClusters() throws java.lang.Exception
numberOfClusters in interface weka.clusterers.Clusterer
numberOfClusters in class weka.clusterers.AbstractClusterer
Throws:
java.lang.Exception - if number of clusters could not be returned successfully
public java.util.Enumeration listOptions()
listOptions in interface weka.core.OptionHandler
listOptions in class weka.clusterers.RandomizableClusterer
public java.lang.String numClustersTipText()
public void setNumClusters(int n) throws java.lang.Exception
setNumClusters in interface weka.clusterers.NumberOfClustersRequestable
Parameters:
n - the number of clusters to generate
Throws:
java.lang.Exception - if number of clusters is negative
public int getNumClusters()
public java.lang.String initializeUsingKMeansPlusPlusMethodTipText()
public void setInitializeUsingKMeansPlusPlusMethod(boolean k)
Parameters:
k - true if the k-means++ method is to be used to select initial cluster centers.
public boolean getInitializeUsingKMeansPlusPlusMethod()
public java.lang.String maxIterationsTipText()
public void setMaxIterations(int n) throws java.lang.Exception
Parameters:
n - the maximum number of iterations
Throws:
java.lang.Exception - if the maximum number of iterations is smaller than 1
public int getMaxIterations()
public java.lang.String displayStdDevsTipText()
public void setDisplayStdDevs(boolean stdD)
Parameters:
stdD - true if std. devs and counts should be displayed
public boolean getDisplayStdDevs()
public java.lang.String dontReplaceMissingValuesTipText()
public void setDontReplaceMissingValues(boolean r)
Parameters:
r - true if missing values are not to be replaced
public boolean getDontReplaceMissingValues()
public java.lang.String distanceFunctionTipText()
public weka.core.DistanceFunction getDistanceFunction()
public void setDistanceFunction(weka.core.DistanceFunction df) throws java.lang.Exception
Parameters:
df - the new distance function to use
Throws:
java.lang.Exception - if instances cannot be processed
public java.lang.String preserveInstancesOrderTipText()
public void setPreserveInstancesOrder(boolean r)
Parameters:
r - true if the order of instances is to be preserved
public boolean getPreserveInstancesOrder()
public java.lang.String fastDistanceCalcTipText()
public void setFastDistanceCalc(boolean value)
Parameters:
value - true if the faster calculation is to be used
public boolean getFastDistanceCalc()
public java.lang.String numExecutionSlotsTipText()
public void setNumExecutionSlots(int slots)
Parameters:
slots - the number of tasks to run in parallel when computing the nearest neighbors and evaluating different values of k between the lower and upper bounds
public int getNumExecutionSlots()
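To show what a number of execution slots greater than 1 might mean in practice, the sketch below splits the assignment step into one task per slot on a fixed-size java.util.concurrent.ExecutorService. It is a guess at the general shape only, not the implementation behind launchAssignToClusters or startExecutorPool: the names are hypothetical, the data are plain arrays, and returning true when any assignment changed is an assumption about how convergence could be detected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical illustration of slot-based parallel assignment; not the Weka code. */
final class ParallelAssignSketch {

    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double distance = 0.0;
            for (int j = 0; j < point.length; j++) {
                double d = point[j] - centroids[c][j];
                distance += d * d;
            }
            if (distance < bestDistance) {
                bestDistance = distance;
                best = c;
            }
        }
        return best;
    }

    /** Updates assignments in place; returns true if any assignment changed. */
    static boolean assignAll(double[][] points, double[][] centroids,
                             int[] assignments, int slots) {
        if (slots < 1) {
            throw new IllegalArgumentException("slots must be at least 1");
        }
        ExecutorService pool = Executors.newFixedThreadPool(slots);
        try {
            int chunk = (points.length + slots - 1) / slots;
            List<Future<Boolean>> results = new ArrayList<>();
            for (int s = 0; s < slots; s++) {
                // Each task owns a disjoint index range, so no locking is needed.
                final int lo = Math.min(points.length, s * chunk);
                final int hi = Math.min(points.length, lo + chunk);
                Callable<Boolean> task = () -> {
                    boolean changed = false;
                    for (int i = lo; i < hi; i++) {
                        int cluster = nearest(points[i], centroids);
                        if (cluster != assignments[i]) {
                            assignments[i] = cluster;
                            changed = true;
                        }
                    }
                    return changed;
                };
                results.add(pool.submit(task));
            }
            boolean anyChanged = false;
            for (Future<Boolean> result : results) {
                try {
                    if (result.get()) {
                        anyChanged = true;
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting for a slot", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("assignment task failed", e.getCause());
                }
            }
            return anyChanged;
        } finally {
            pool.shutdown();
        }
    }
}
```

With slots set to 1 the same code runs as a single task, which corresponds to the documented default of no parallelism.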
public void setOptions(java.lang.String[] options) throws java.lang.Exception
Valid options are:
-N <num> number of clusters. (default 2).
-P Initialize using the k-means++ method.
-V Display std. deviations for centroids.
-M Replace missing values with mean/mode.
-A <classname and options> Distance function to use. (default: weka.core.EuclideanDistance)
-I <num> Maximum number of iterations.
-O Preserve order of instances.
-fast Enables faster distance calculations, using cut-off values. Disables the calculation/output of squared errors/distances.
-num-slots <num> Number of execution slots. (default 1 - i.e. no parallelism)
-S <num> Random number seed. (default 10)
setOptions in interface weka.core.OptionHandler
setOptions in class weka.clusterers.RandomizableClusterer
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported
public java.lang.String[] getOptions()
getOptions in interface weka.core.OptionHandler
getOptions in class weka.clusterers.RandomizableClusterer
public java.lang.String toString()
toString in class java.lang.Object
private java.lang.String pad(java.lang.String source, java.lang.String padChar, int length, boolean leftPad)
public weka.core.Instances getClusterCentroids()
public weka.core.Instances getClusterStandardDevs()
public int[][][] getClusterNominalCounts()
public double getSquaredError()
See Also:
m_FastDistanceCalc
public int[] getClusterSizes()
public int[] getAssignments() throws java.lang.Exception
Throws:
java.lang.Exception - if order of instances wasn't preserved or no assignments were made
public java.lang.String getRevision()
getRevision in interface weka.core.RevisionHandler
getRevision in class weka.clusterers.AbstractClusterer
public static void main(java.lang.String[] args)
Parameters:
args - use -h to list all parameters