2014-03-18
[speech@h14 ~]$ hdfs dfs -ls demo
Found 10 items
-rw-r--r-- 3 speech supergroup 628 2014-03-17 10:59 demo/text1.txt
-rw-r--r-- 3 speech supergroup 1327 2014-03-17 10:59 demo/text10.txt
-rw-r--r-- 3 speech supergroup 5165 2014-03-17 10:59 demo/text2.txt
-rw-r--r-- 3 speech supergroup 3736 2014-03-17 10:59 demo/text3.txt
-rw-r--r-- 3 speech supergroup 4338 2014-03-17 10:59 demo/text4.txt
-rw-r--r-- 3 speech supergroup 3338 2014-03-17 10:59 demo/text5.txt
-rw-r--r-- 3 speech supergroup 5836 2014-03-17 10:59 demo/text6.txt
-rw-r--r-- 3 speech supergroup 2936 2014-03-17 10:59 demo/text7.txt
-rw-r--r-- 3 speech supergroup 905 2014-03-17 10:59 demo/text8.txt
-rw-r--r-- 3 speech supergroup 1566 2014-03-17 10:59 demo/text9.txt
[speech@h14 ~]$
Mahout has utilities to generate vectors from a directory of text documents. Before creating the vectors, you need to convert the documents to SequenceFile format. SequenceFile is a Hadoop class that lets us write arbitrary (key, value) pairs into a file. The DocumentVectorizer requires the key to be a Text with a unique document id, and the value to be the Text content in UTF-8 format.
The output of seqdirectory will be a SequenceFile<Text, Text> of all documents (/sub-directory-path/documentFileName, documentText).
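To make the (key, value) layout concrete, here is a minimal Python sketch that builds the same kind of pairs from a local directory. It only mimics the logical content: it does not write Hadoop's binary SequenceFile format, and the function name directory_to_pairs is a hypothetical helper, not part of Mahout.

```python
import os
import tempfile


def directory_to_pairs(root, charset="utf-8"):
    """Return sorted (key, value) pairs: key is '/<relative path>', value is the file text."""
    pairs = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Use forward slashes so keys look the same on every platform.
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            with open(full, encoding=charset) as handle:
                pairs.append(("/" + rel, handle.read()))
    return sorted(pairs)


# Tiny demonstration on a throwaway directory with two made-up documents.
with tempfile.TemporaryDirectory() as demo_dir:
    for name, text in [("text1.txt", "tere maailm"), ("text2.txt", "eesti uudis")]:
        with open(os.path.join(demo_dir, name), "w", encoding="utf-8") as handle:
            handle.write(text)
    demo_pairs = directory_to_pairs(demo_dir)

for key, value in demo_pairs:
    print(key, "->", value)
```

Each document becomes one record whose key identifies the file and whose value is the whole text, which is the shape the vectorizer expects.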
[speech@h14 ~]$ mahout seqdirectory -c UTF-8 -i demo -o demo-seqfiles
Check the output:
mahout seqdumper -i /user/margusja/demo-seqfiles/part-m-00000
[speech@h14 ~]$ hdfs dfs -ls demo-seqfiles
Found 2 items
-rw-r--r-- 3 speech supergroup 0 2014-03-18 14:54 demo-seqfiles/_SUCCESS
-rw-r--r-- 3 speech supergroup 15186 2014-03-18 14:54 demo-seqfiles/part-m-00000
[speech@h14 ~]$ mahout seq2sparse -nv -i demo-seqfiles -o demo-vectors -ow -x 10
-x 10 sets maxDFPercent to 10, which prunes very common terms such as stopwords: terms that occur in more than 10 percent of the documents are removed.
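A minimal Python sketch of document-frequency pruning, assuming -x is read as a maximum document-frequency percentage (as the seq2sparse option name maxDFPercent suggests). The function name, the sample tokens and the exact boundary rule (keep a term when its document frequency is at most the limit) are assumptions for illustration, not Mahout's implementation.

```python
def prune_high_df_terms(docs, max_df_percent):
    """Drop terms whose document frequency exceeds max_df_percent of the documents."""
    doc_freq = {}
    for tokens in docs:
        for term in set(tokens):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    max_df = len(docs) * max_df_percent / 100.0
    kept = {term for term, df in doc_freq.items() if df <= max_df}
    return [[term for term in tokens if term in kept] for tokens in docs]


# Four made-up tokenized documents; "ja" occurs in every one of them.
docs = [
    ["ja", "eesti", "apteek"],
    ["ja", "eesti", "lennuk"],
    ["ja", "ukraina", "viisa"],
    ["ja", "tallinn", "laen"],
]
pruned = prune_high_df_terms(docs, 50)
print(pruned)
```

With a 50 percent limit the term "ja" (present in all four documents) is dropped, while "eesti" (present in two) survives.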
[speech@h14 ~]$ hdfs dfs -ls demo-vectors
Found 7 items
drwxr-xr-x - speech supergroup 0 2014-03-18 14:57 demo-vectors/df-count
-rw-r--r-- 3 speech supergroup 10472 2014-03-18 14:57 demo-vectors/dictionary.file-0
-rw-r--r-- 3 speech supergroup 10933 2014-03-18 14:57 demo-vectors/frequency.file-0
drwxr-xr-x - speech supergroup 0 2014-03-18 14:58 demo-vectors/tf-vectors
drwxr-xr-x - speech supergroup 0 2014-03-18 14:58 demo-vectors/tfidf-vectors
drwxr-xr-x - speech supergroup 0 2014-03-18 14:57 demo-vectors/tokenized-documents
drwxr-xr-x - speech supergroup 0 2014-03-18 14:57 demo-vectors/wordcount
[speech@h14 ~]$ mahout kmeans -i demo-vectors/tfidf-vectors -c demo-canopy-centroids -o demo-kmeans-clusters -k 3 -x 10 -cl -ow
[speech@h14 ~]$ hdfs dfs -ls demo-kmeans-clusters
Found 6 items
-rw-r--r-- 3 speech supergroup 194 2014-03-18 15:02 demo-kmeans-clusters/_policy
drwxr-xr-x - speech supergroup 0 2014-03-18 15:02 demo-kmeans-clusters/clusteredPoints
drwxr-xr-x - speech supergroup 0 2014-03-18 15:01 demo-kmeans-clusters/clusters-0
drwxr-xr-x - speech supergroup 0 2014-03-18 15:01 demo-kmeans-clusters/clusters-1
drwxr-xr-x - speech supergroup 0 2014-03-18 15:01 demo-kmeans-clusters/clusters-2
drwxr-xr-x - speech supergroup 0 2014-03-18 15:02 demo-kmeans-clusters/clusters-3-final
[speech@h14 ~]$ mahout clusterdump -dt sequencefile -d demo-vectors/dictionary.file-0 -i demo-kmeans-clusters/clusters-3-final -n 10 -o demo_clusters -p demo-kmeans-clusters/clusteredPoints
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /usr/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB: /home/speech/mahout/examples/target/mahout-examples-1.0-SNAPSHOT-job.jar
14/03/18 15:02:46 INFO common.AbstractJob: Command line arguments: {--dictionary=[demo-vectors/dictionary.file-0], --dictionaryType=[sequencefile], --distanceMeasure=[org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure], --endPhase=[2147483647], --input=[demo-kmeans-clusters/clusters-3-final], --numWords=[10], --output=[demo_clusters], --outputFormat=[TEXT], --pointsDir=[demo-kmeans-clusters/clusteredPoints], --startPhase=[0], --tempDir=[temp]}
14/03/18 15:02:48 INFO clustering.ClusterDumper: Wrote 3 clusters
14/03/18 15:02:48 INFO driver.MahoutDriver: Program took 1880 ms (Minutes: 0.03133333333333333)
[speech@h14 ~]$
This is a simple example of how to create clusters from text articles using Mahout and Hadoop.
- Create 10 text files, each holding one article copied from postimees.ee, and copy the local directory to the Hadoop file system
[hduser@vm38 mahout-0.9]$ hadoop fs -ls demo
Warning: $HADOOP_HOME is deprecated.
Found 10 items
-rw-r--r-- 2 hduser supergroup 1933 2013-11-25 15:15 /user/hduser/demo/uudis1.txt
-rw-r--r-- 2 hduser supergroup 1870 2013-11-25 15:15 /user/hduser/demo/uudis10.txt
-rw-r--r-- 2 hduser supergroup 706 2013-11-25 15:15 /user/hduser/demo/uudis2.txt
-rw-r--r-- 2 hduser supergroup 1812 2013-11-25 15:15 /user/hduser/demo/uudis3.txt
-rw-r--r-- 2 hduser supergroup 1174 2013-11-25 15:15 /user/hduser/demo/uudis4.txt
-rw-r--r-- 2 hduser supergroup 2363 2013-11-25 15:15 /user/hduser/demo/uudis5.txt
-rw-r--r-- 2 hduser supergroup 1708 2013-11-25 15:15 /user/hduser/demo/uudis6.txt
-rw-r--r-- 2 hduser supergroup 2198 2013-11-25 15:15 /user/hduser/demo/uudis7.txt
-rw-r--r-- 2 hduser supergroup 806 2013-11-25 15:15 /user/hduser/demo/uudis8.txt
-rw-r--r-- 2 hduser supergroup 737 2013-11-25 15:15 /user/hduser/demo/uudis9.txt
[hduser@vm38 mahout-0.9]$
- Convert the documents to SequenceFile format in the Hadoop file system
[hduser@vm38 mahout-0.9]$ bin/mahout seqdirectory -c UTF-8 -i demo -o demo-seqfiles
Warning: $HADOOP_HOME is deprecated.
Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /home/hduser/mahout-0.9/examples/target/mahout-examples-0.9-SNAPSHOT-job.jar
Warning: $HADOOP_HOME is deprecated.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [file:/usr/local/hadoop-1.0.4/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop-1.0.4/lib/slf4j-log4j12-1.4.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
13/11/28 11:04:54 INFO common.AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[demo], --keyPrefix=[], --output=[demo-seqfiles], --startPhase=[0], --tempDir=[temp]}
13/11/28 11:04:55 INFO driver.MahoutDriver: Program took 1263 ms (Minutes: 0.02105)
[hduser@vm38 mahout-0.9]$
- Convert the data to vectors (the -nv flag produces named vectors)
[hduser@vm38 mahout-0.9]$ bin/mahout seq2sparse -nv -i demo-seqfiles/ -o demo-vectors -ow
Warning: $HADOOP_HOME is deprecated.
Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /home/hduser/mahout-0.9/examples/target/mahout-examples-0.9-SNAPSHOT-job.jar
Warning: $HADOOP_HOME is deprecated.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [file:/usr/local/hadoop-1.0.4/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop-1.0.4/lib/slf4j-log4j12-1.4.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
13/11/28 11:09:26 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
13/11/28 11:09:26 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
13/11/28 11:09:26 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
13/11/28 11:09:26 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
13/11/28 11:09:26 INFO input.FileInputFormat: Total input paths to process : 1
13/11/28 11:09:27 INFO mapred.JobClient: Running job: job_201310021514_0216
13/11/28 11:09:28 INFO mapred.JobClient: map 0% reduce 0%
13/11/28 11:09:42 INFO mapred.JobClient: map 100% reduce 0%
13/11/28 11:09:47 INFO mapred.JobClient: Job complete: job_201310021514_0216
13/11/28 11:09:47 INFO mapred.JobClient: Counters: 19
13/11/28 11:09:47 INFO mapred.JobClient: Job Counters
13/11/28 11:09:47 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=13764
13/11/28 11:09:47 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/11/28 11:09:47 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/11/28 11:09:47 INFO mapred.JobClient: Rack-local map tasks=1
13/11/28 11:09:47 INFO mapred.JobClient: Launched map tasks=1
13/11/28 11:09:47 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
13/11/28 11:09:47 INFO mapred.JobClient: File Output Format Counters
13/11/28 11:09:47 INFO mapred.JobClient: Bytes Written=15158
13/11/28 11:09:47 INFO mapred.JobClient: FileSystemCounters
13/11/28 11:09:47 INFO mapred.JobClient: HDFS_BYTES_READ=15834
13/11/28 11:09:47 INFO mapred.JobClient: FILE_BYTES_WRITTEN=21686
13/11/28 11:09:47 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=15158
13/11/28 11:09:47 INFO mapred.JobClient: File Input Format Counters
13/11/28 11:09:47 INFO mapred.JobClient: Bytes Read=15716
13/11/28 11:09:47 INFO mapred.JobClient: Map-Reduce Framework
13/11/28 11:09:47 INFO mapred.JobClient: Map input records=10
13/11/28 11:09:47 INFO mapred.JobClient: Physical memory (bytes) snapshot=81567744
13/11/28 11:09:47 INFO mapred.JobClient: Spilled Records=0
13/11/28 11:09:47 INFO mapred.JobClient: CPU time spent (ms)=460
13/11/28 11:09:47 INFO mapred.JobClient: Total committed heap usage (bytes)=76939264
13/11/28 11:09:47 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2875113472
13/11/28 11:09:47 INFO mapred.JobClient: Map output records=10
13/11/28 11:09:47 INFO mapred.JobClient: SPLIT_RAW_BYTES=118
13/11/28 11:09:47 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
…
…
…
13/11/28 11:13:17 INFO mapred.JobClient: Job complete: job_201310021514_0222
13/11/28 11:13:17 INFO mapred.JobClient: Counters: 29
13/11/28 11:13:17 INFO mapred.JobClient: Job Counters
13/11/28 11:13:17 INFO mapred.JobClient: Launched reduce tasks=1
13/11/28 11:13:17 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=13780
13/11/28 11:13:17 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/11/28 11:13:17 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/11/28 11:13:17 INFO mapred.JobClient: Rack-local map tasks=1
13/11/28 11:13:17 INFO mapred.JobClient: Launched map tasks=1
13/11/28 11:13:17 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=10639
13/11/28 11:13:17 INFO mapred.JobClient: File Output Format Counters
13/11/28 11:13:17 INFO mapred.JobClient: Bytes Written=5122
13/11/28 11:13:17 INFO mapred.JobClient: FileSystemCounters
13/11/28 11:13:17 INFO mapred.JobClient: FILE_BYTES_READ=4957
13/11/28 11:13:17 INFO mapred.JobClient: HDFS_BYTES_READ=5262
13/11/28 11:13:17 INFO mapred.JobClient: FILE_BYTES_WRITTEN=54153
13/11/28 11:13:17 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=5122
13/11/28 11:13:17 INFO mapred.JobClient: File Input Format Counters
13/11/28 11:13:17 INFO mapred.JobClient: Bytes Read=5122
13/11/28 11:13:17 INFO mapred.JobClient: Map-Reduce Framework
13/11/28 11:13:17 INFO mapred.JobClient: Map output materialized bytes=4957
13/11/28 11:13:17 INFO mapred.JobClient: Map input records=10
13/11/28 11:13:17 INFO mapred.JobClient: Reduce shuffle bytes=0
13/11/28 11:13:17 INFO mapred.JobClient: Spilled Records=20
13/11/28 11:13:17 INFO mapred.JobClient: Map output bytes=4912
13/11/28 11:13:17 INFO mapred.JobClient: Total committed heap usage (bytes)=217645056
13/11/28 11:13:17 INFO mapred.JobClient: CPU time spent (ms)=1020
13/11/28 11:13:17 INFO mapred.JobClient: Combine input records=0
13/11/28 11:13:17 INFO mapred.JobClient: SPLIT_RAW_BYTES=140
13/11/28 11:13:17 INFO mapred.JobClient: Reduce input records=10
13/11/28 11:13:17 INFO mapred.JobClient: Reduce input groups=10
13/11/28 11:13:17 INFO mapred.JobClient: Combine output records=0
13/11/28 11:13:17 INFO mapred.JobClient: Physical memory (bytes) snapshot=232296448
13/11/28 11:13:17 INFO mapred.JobClient: Reduce output records=10
13/11/28 11:13:17 INFO mapred.JobClient: Virtual memory (bytes) snapshot=5751988224
13/11/28 11:13:17 INFO mapred.JobClient: Map output records=10
13/11/28 11:13:17 INFO common.HadoopUtil: Deleting demo-vectors/partial-vectors-0
13/11/28 11:13:17 INFO driver.MahoutDriver: Program took 231406 ms (Minutes: 3.8567666666666667)
The result:
[hduser@vm38 mahout-0.9]$ hadoop fs -ls demo-vectors
Warning: $HADOOP_HOME is deprecated.
Found 7 items
drwxr-xr-x - hduser supergroup 0 2013-11-28 11:21 /user/hduser/demo-vectors/df-count
-rw-r--r-- 2 hduser supergroup 757 2013-11-28 11:19 /user/hduser/demo-vectors/dictionary.file-0
-rw-r--r-- 2 hduser supergroup 873 2013-11-28 11:21 /user/hduser/demo-vectors/frequency.file-0
drwxr-xr-x - hduser supergroup 0 2013-11-28 11:21 /user/hduser/demo-vectors/tf-vectors
drwxr-xr-x - hduser supergroup 0 2013-11-28 11:22 /user/hduser/demo-vectors/tfidf-vectors
drwxr-xr-x - hduser supergroup 0 2013-11-28 11:18 /user/hduser/demo-vectors/tokenized-documents
drwxr-xr-x - hduser supergroup 0 2013-11-28 11:19 /user/hduser/demo-vectors/wordcount
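The tfidf-vectors directory holds the TF-IDF weighted document vectors that the clustering step reads. The weights that show up later in this post (for example 2.204, 3.690 and 6.904) are consistent with a Lucene-style weighting, sqrt(tf) * (ln(N / (df + 1)) + 1), with N = 10 documents. This is an inference from the printed numbers, not something the tool output states, and the tf/df pairs below are assumptions chosen because they reproduce those numbers. A minimal Python sketch:

```python
import math


def tfidf_weight(tf, df, num_docs):
    """Lucene-style TF-IDF: sqrt(term frequency) times a smoothed log inverse document frequency."""
    return math.sqrt(tf) * (math.log(num_docs / (df + 1)) + 1.0)


# Assumed (tf, df) pairs for a 10-document corpus.
examples = [(1, 2), (2, 1), (7, 1)]
weights = [round(tfidf_weight(tf, df, 10), 3) for tf, df in examples]
for (tf, df), weight in zip(examples, weights):
    print("tf=%d df=%d -> %.3f" % (tf, df, weight))
```

Under this formula a term that is rare across documents (small df) and frequent inside one document (large tf) gets the largest weight.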
- Now a simple clustering with k-means
[hduser@vm38 mahout-0.9]$ bin/mahout kmeans -i demo-vectors/tfidf-vectors -c demo-canopy-centroids -o demo-kmeans-clusters5 -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cd 0.1 -k 4 -x 4 -cl
Warning: $HADOOP_HOME is deprecated.
Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /home/hduser/mahout-0.9/examples/target/mahout-examples-0.9-SNAPSHOT-job.jar
Warning: $HADOOP_HOME is deprecated.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [file:/usr/local/hadoop-1.0.4/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop-1.0.4/lib/slf4j-log4j12-1.4.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
13/11/28 11:26:10 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=[demo-canopy-centroids], --convergenceDelta=[0.1], --distanceMeasure=[org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure], --endPhase=[2147483647], --input=[demo-vectors/tfidf-vectors], --maxIter=[4], --method=[mapreduce], --numClusters=[4], --output=[demo-kmeans-clusters4], --startPhase=[0], --tempDir=[temp]}
13/11/28 11:26:11 INFO common.HadoopUtil: Deleting demo-canopy-centroids
13/11/28 11:26:11 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/11/28 11:26:11 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/11/28 11:26:11 INFO compress.CodecPool: Got brand-new compressor
13/11/28 11:26:11 INFO kmeans.RandomSeedGenerator: Wrote 4 Klusters to demo-canopy-centroids/part-randomSeed
13/11/28 11:26:11 INFO kmeans.KMeansDriver: Input: demo-vectors/tfidf-vectors Clusters In: demo-canopy-centroids/part-randomSeed Out: demo-kmeans-clusters4 Distance: org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
13/11/28 11:26:11 INFO kmeans.KMeansDriver: convergence: 0.1 max Iterations: 4 num Reduce Tasks: org.apache.mahout.math.VectorWritable Input Vectors: {}
13/11/28 11:26:11 INFO compress.CodecPool: Got brand-new decompressor
Cluster Iterator running iteration 1 over priorPath: demo-kmeans-clusters4/clusters-0
13/11/28 11:26:12 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
13/11/28 11:26:12 INFO input.FileInputFormat: Total input paths to process : 1
13/11/28 11:26:12 INFO mapred.JobClient: Running job: job_201310021514_0231
13/11/28 11:26:13 INFO mapred.JobClient: map 0% reduce 0%
13/11/28 11:26:26 INFO mapred.JobClient: map 100% reduce 0%
13/11/28 11:26:38 INFO mapred.JobClient: map 100% reduce 100%
13/11/28 11:26:43 INFO mapred.JobClient: Job complete: job_201310021514_0231
13/11/28 11:26:43 INFO mapred.JobClient: Counters: 29
13/11/28 11:26:43 INFO mapred.JobClient: Job Counters
13/11/28 11:26:43 INFO mapred.JobClient: Launched reduce tasks=1
13/11/28 11:26:43 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=14946
13/11/28 11:26:43 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/11/28 11:26:43 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/11/28 11:26:43 INFO mapred.JobClient: Launched map tasks=1
13/11/28 11:26:43 INFO mapred.JobClient: Data-local map tasks=1
13/11/28 11:26:43 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=11140
13/11/28 11:26:43 INFO mapred.JobClient: File Output Format Counters
13/11/28 11:26:43 INFO mapred.JobClient: Bytes Written=1992
13/11/28 11:26:43 INFO mapred.JobClient: FileSystemCounters
13/11/28 11:26:43 INFO mapred.JobClient: FILE_BYTES_READ=2787
13/11/28 11:26:43 INFO mapred.JobClient: HDFS_BYTES_READ=7996
13/11/28 11:26:43 INFO mapred.JobClient: FILE_BYTES_WRITTEN=49849
13/11/28 11:26:43 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1992
13/11/28 11:26:43 INFO mapred.JobClient: File Input Format Counters
13/11/28 11:26:43 INFO mapred.JobClient: Bytes Read=1694
13/11/28 11:26:43 INFO mapred.JobClient: Map-Reduce Framework
13/11/28 11:26:43 INFO mapred.JobClient: Map output materialized bytes=2787
13/11/28 11:26:43 INFO mapred.JobClient: Map input records=10
13/11/28 11:26:43 INFO mapred.JobClient: Reduce shuffle bytes=2787
13/11/28 11:26:43 INFO mapred.JobClient: Spilled Records=8
13/11/28 11:26:43 INFO mapred.JobClient: Map output bytes=2765
13/11/28 11:26:43 INFO mapred.JobClient: Total committed heap usage (bytes)=219152384
13/11/28 11:26:43 INFO mapred.JobClient: CPU time spent (ms)=1970
13/11/28 11:26:43 INFO mapred.JobClient: Combine input records=0
13/11/28 11:26:43 INFO mapred.JobClient: SPLIT_RAW_BYTES=136
13/11/28 11:26:43 INFO mapred.JobClient: Reduce input records=4
13/11/28 11:26:43 INFO mapred.JobClient: Reduce input groups=4
13/11/28 11:26:43 INFO mapred.JobClient: Combine output records=0
13/11/28 11:26:43 INFO mapred.JobClient: Physical memory (bytes) snapshot=249462784
13/11/28 11:26:43 INFO mapred.JobClient: Reduce output records=4
13/11/28 11:26:43 INFO mapred.JobClient: Virtual memory (bytes) snapshot=5729689600
13/11/28 11:26:43 INFO mapred.JobClient: Map output records=4
Cluster Iterator running iteration 2 over priorPath: demo-kmeans-clusters4/clusters-1
13/11/28 11:26:43 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
13/11/28 11:26:43 INFO input.FileInputFormat: Total input paths to process : 1
13/11/28 11:26:43 INFO mapred.JobClient: Running job: job_201310021514_0232
13/11/28 11:26:44 INFO mapred.JobClient: map 0% reduce 0%
…
…
…
13/11/28 11:27:15 INFO kmeans.KMeansDriver: Clustering data
13/11/28 11:27:15 INFO kmeans.KMeansDriver: Running Clustering
13/11/28 11:27:15 INFO kmeans.KMeansDriver: Input: demo-vectors/tfidf-vectors Clusters In: demo-kmeans-clusters4 Out: demo-kmeans-clusters4 Distance: org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure@5bd31f85
13/11/28 11:27:16 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/11/28 11:27:16 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
13/11/28 11:27:16 INFO input.FileInputFormat: Total input paths to process : 1
13/11/28 11:27:16 INFO mapred.JobClient: Running job: job_201310021514_0233
13/11/28 11:27:17 INFO mapred.JobClient: map 0% reduce 0%
13/11/28 11:27:30 INFO mapred.JobClient: map 100% reduce 0%
13/11/28 11:27:35 INFO mapred.JobClient: Job complete: job_201310021514_0233
13/11/28 11:27:35 INFO mapred.JobClient: Counters: 19
13/11/28 11:27:35 INFO mapred.JobClient: Job Counters
13/11/28 11:27:35 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=14929
13/11/28 11:27:35 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/11/28 11:27:35 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/11/28 11:27:35 INFO mapred.JobClient: Launched map tasks=1
13/11/28 11:27:35 INFO mapred.JobClient: Data-local map tasks=1
13/11/28 11:27:35 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
13/11/28 11:27:35 INFO mapred.JobClient: File Output Format Counters
13/11/28 11:27:35 INFO mapred.JobClient: Bytes Written=1723
13/11/28 11:27:35 INFO mapred.JobClient: FileSystemCounters
13/11/28 11:27:35 INFO mapred.JobClient: HDFS_BYTES_READ=4016
13/11/28 11:27:35 INFO mapred.JobClient: FILE_BYTES_WRITTEN=21720
13/11/28 11:27:35 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1723
13/11/28 11:27:35 INFO mapred.JobClient: File Input Format Counters
13/11/28 11:27:35 INFO mapred.JobClient: Bytes Read=1694
13/11/28 11:27:35 INFO mapred.JobClient: Map-Reduce Framework
13/11/28 11:27:35 INFO mapred.JobClient: Map input records=10
13/11/28 11:27:35 INFO mapred.JobClient: Physical memory (bytes) snapshot=89280512
13/11/28 11:27:35 INFO mapred.JobClient: Spilled Records=0
13/11/28 11:27:35 INFO mapred.JobClient: CPU time spent (ms)=990
13/11/28 11:27:35 INFO mapred.JobClient: Total committed heap usage (bytes)=61341696
13/11/28 11:27:35 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2866900992
13/11/28 11:27:35 INFO mapred.JobClient: Map output records=10
13/11/28 11:27:35 INFO mapred.JobClient: SPLIT_RAW_BYTES=136
13/11/28 11:27:35 INFO driver.MahoutDriver: Program took 84664 ms (Minutes: 1.4110666666666667)
And the result:
[hduser@vm38 mahout-0.9]$ hadoop fs -ls demo-kmeans-clusters4
Warning: $HADOOP_HOME is deprecated.
Found 5 items
-rw-r--r-- 2 hduser supergroup 194 2013-11-28 11:27 /user/hduser/demo-kmeans-clusters4/_policy
drwxr-xr-x - hduser supergroup 0 2013-11-28 11:27 /user/hduser/demo-kmeans-clusters4/clusteredPoints
drwxr-xr-x - hduser supergroup 0 2013-11-28 11:26 /user/hduser/demo-kmeans-clusters4/clusters-0
drwxr-xr-x - hduser supergroup 0 2013-11-28 11:26 /user/hduser/demo-kmeans-clusters4/clusters-1
drwxr-xr-x - hduser supergroup 0 2013-11-28 11:27 /user/hduser/demo-kmeans-clusters4/clusters-2-final
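To show what the kmeans options control, here is a minimal single-machine Python sketch of the same idea: squared Euclidean distance (the -dm measure used above), a maximum number of iterations (-x), a convergence delta (-cd) and a final assignment pass (comparable to -cl). It is not Mahout's MapReduce implementation. The initial centroids are passed in explicitly, whereas the log above shows Mahout writing random seeds, and all names and sample points are hypothetical.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda c: squared_euclidean(p, centroids[c]))
        for p in points
    ]


def kmeans(points, centroids, max_iter, convergence_delta):
    """Lloyd's algorithm; returns (final centroids, cluster index per point)."""
    for _ in range(max_iter):
        assignments = assign(points, centroids)
        new_centroids = []
        for c, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(list(old))  # keep an empty cluster's centroid
        moved = max(squared_euclidean(o, n) for o, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if moved <= convergence_delta:  # every centroid moved less than the delta
            break
    return centroids, assign(points, centroids)  # final assignment pass


# Four made-up 2-D points forming two obvious groups.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
final_centroids, labels = kmeans(points, [[0.0, 0.0], [10.0, 10.0]], max_iter=4, convergence_delta=0.1)
print(final_centroids, labels)
```

Each clusters-N directory in the listing above corresponds to one such iteration, and the directory with the -final suffix holds the centroids from the last one.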
- View clusters
Clustering tasks in Mahout write their output as a SequenceFile (Text, Cluster), where the Text is a cluster identifier string. To analyze this output we need to convert the sequence files to a human-readable format, which is what the clusterdump utility does.
[hduser@vm38 mahout-0.9]$ bin/mahout clusterdump -dt sequencefile -d demo-vectors/dictionary.file-* -i /user/hduser/demo-kmeans-clusters6/clusters-3-final/ -n 10 -o clusters_out
[hduser@vm38 mahout-0.9]$ less clusters_out
VL-5{n=1 c=[anesteesia:5.835, arvestades:2.204, eaiü:6.904, eesmärk:2.204, eest:1.693, eesti:2.710, eestis:1.693, ei:1.511, eriala:3.690, euroopa:2.204, freimann:3.690, freimanni:3.690, ida:1.916, intensiivravi:4.520, ja:4.421, juhatuse:2.710, ka:1.563, kes:1.511, kokku:1.916, linnas:2.204, lisas:2.204, merle:4.520, märkis:2.204, neile:2.204, ning:1.915, nõuded:3.690, oma:1.511, oskusi:5.219, pärast:2.204, pärnu:2.204, põhja:1.916, riiklikul:2.204, see:2.137, selgitas:1.693, sõnul:2.137, ta:2.394, tallinn:4.520, tartu:3.690, tasandil:2.204, teadmisi:4.520, tuleb:1.916, vajab:3.690, võib:2.204, õdede:3.690, õdedele:3.690, õed:3.690, üldõe:3.690] r=[]}
Top Terms:
eaiü => 6.903923511505127
anesteesia => 5.834880828857422
oskusi => 5.218875885009766
intensiivravi => 4.519679069519043
merle => 4.519679069519043
teadmisi => 4.519679069519043
tallinn => 4.519679069519043
ja => 4.421442031860352
freimann => 3.6903023719787598
üldõe => 3.6903023719787598
VL-6{n=1 c=[aasta:2.204, annab:2.204, eesti:4.694, eestis:2.394, elab:3.117, elanikkonnast:3.690, ette:1.916, inimesi:2.204, ja:2.472, juba:2.204, juhatuse:1.916, ka:1.563, korteriühistute:5.219, korteriühistutes:5.219, kui:1.357, ligikaudu:3.690, liidu:3.690, liikme:2.204, liit:3.690, mille:2.204, mis:3.386, nii:1.916, ning:1.563, protsenti:3.690, riiklikul:2.204, saab:1.511, samuti:1.693, sõnul:1.511, tasandil:2.204, tähtis:2.204, täna:1.693, tänavu:1.693, urmas:2.204, välja:1.693, ära:2.204, üle:1.916] r=[]}
Top Terms:
korteriühistutes => 5.218875885009766
korteriühistute => 5.218875885009766
eesti => 4.693934917449951
elanikkonnast => 3.6903023719787598
liit => 3.6903023719787598
liidu => 3.6903023719787598
ligikaudu => 3.6903023719787598
protsenti => 3.6903023719787598
mis => 3.386294364929199
elab => 3.1168882846832275
VL-8{n=6 c=[16:0.639, 1918:0.615, 28:0.870, 400:0.735, 95:0.615, aasta:0.519, aastal:0.958, aga:0.873, aktuaalne:0.735, algust:0.615, all:0.735, andis:0.367, annab:0.519, apteegi:0.615, apteek:0.753, apteeki:0.615, aru:0.615, arvan:0.735, arvestades:0.367, audru:1.065, auks:0.615, avatud:0.735, detsembril:0.367, ees:0.753, eesmärk:0.367, eest:0.771, eesti:0.782, eestis:0.888, ei:1.044, elab:0.367, elavad:0.615, enam:0.958, ette:0.319, euroopa:0.636, euroopaga:0.615, euroopasse:0.615, eurot:0.367, hambaarst:0.753, head:0.958, ida:0.639, ilmaga:0.615, inimesi:0.367, inimest:0.735, insenerid:0.615, ja:1.958, juba:0.519, juhatuse:0.319, juures:1.156, ka:1.150, kaamera:0.735, kaljas:0.753, kalmudel:0.615, kangelaste:0.615, kas:0.615, kaubanduskeskuse:0.615, kaugemale:0.615, kell:0.615, kes:0.860, kohal:0.615, kohta:0.735, kokku:0.319, kui:1.483, kuni:0.367, kõik:0.887, kõrgemal:0.753, küsimus:0.887, laenu:0.636, langenud:0.753, lennujaama:0.615, lennukid:0.753, lennuraja:0.753, libedust:0.615, ligi:0.367, liiduga:0.753, liikme:0.367, linna:0.615, linnas:0.367, lisas:0.367, läve:0.753, lörtsi:0.615, ma:0.735, maailmas:0.615, maanteeinfo:0.753, maja:0.615, majas:0.753, maksis:0.367, me:1.335, meelt:0.735, meetri:0.615, meie:0.972, meil:0.735, meile:1.090, mille:0.367, mis:1.053, mälestusmärkide:0.615, märkis:0.519, narva:0.887, neile:0.367, nende:0.519, nihutada:0.615, nii:0.873, ning:0.813, novembril:0.870, nüüd:0.887, oleks:0.615, olemas:0.735, oleme:0.735, oli:1.256, olid:1.246, olnud:0.735, oma:0.860, palju:0.735, peaks:0.615, perearsti:0.615, politsei:0.615, projekteeritud:0.735, pärast:0.367, pärnu:0.367, põhja:0.319, põhjus:0.735, püstitatud:0.615, rahvale:0.735, rahvas:0.958, raja:0.870, raske:0.735, reaalset:0.367, reinsalu:0.870, riia:0.615, riigi:0.753, riigist:0.735, rääkis:1.090, saab:1.007, saada:0.367, saama:0.615, saame:0.887, saanud:0.367, samas:0.367, samuti:0.282, seda:1.129, see:1.630, selgitas:0.847, selle:1.053, selleks:0.735, selles:0.519, 
selline:0.735, sest:0.873, siin:0.615, siis:1.363, sõdureid:0.615, sõja:0.615, sõnul:0.755, ta:0.771, tagasi:0.615, tallinna:1.278, tegelikult:0.735, teisel:0.615, tema:0.319, temperatuurid:0.615, tsybulenko:0.615, tuleb:0.639, tuli:0.753, tulnud:0.735, tähendab:0.615, tähtis:0.367, täna:1.053, tänavu:0.282, ukraina:1.305, ukrainas:0.615, ukrainlased:0.615, urmas:0.367, vabadussõda:0.615, vabadussõja:0.972, vabadussõjas:0.615, vaevalt:0.735, vahendas:0.735, vaja:0.615, vald:0.615, vastu:1.363, vihma:0.615, viisat:0.615, väga:1.353, välja:0.564, või:1.170, võib:0.367, võidelnute:0.615, võimaluse:0.615, võrra:0.615, võtta:0.887, www.mnt.ee:0.615, ühe:0.519, ühes:0.636, ühinemisraha:0.615, üle:0.319, ülejõe:0.615] r=[16:0.903, 1918:1.375, 28:1.945, 400:1.039, 95:1.375, aasta:1.162, aastal:0.958, aga:1.299, aktuaalne:1.039, algust:1.375, all:1.039, andis:0.821, annab:1.162, apteegi:1.375, apteek:1.684, apteeki:1.375, aru:1.375, arvan:1.039, arvestades:0.821, audru:2.382, auks:1.375, avatud:1.039, detsembril:0.821, ees:1.684, eesmärk:0.821, eest:1.148, eesti:1.749, eestis:1.265, ei:1.092, elab:0.821, elavad:1.375, enam:0.958, ette:0.714, euroopa:1.423, euroopaga:1.375, euroopasse:1.375, eurot:0.821, hambaarst:1.684, head:0.958, ida:0.903, ilmaga:1.375, inimesi:0.821, inimest:1.039, insenerid:1.375, ja:1.509, juba:1.162, juhatuse:0.714, juures:1.647, ka:0.554, kaamera:1.039, kaljas:1.684, kalmudel:1.375, kangelaste:1.375, kas:1.375, kaubanduskeskuse:1.375, kaugemale:1.375, kell:1.375, kes:0.885, kohal:1.375, kohta:1.039, kokku:0.714, kui:0.749, kuni:0.821, kõik:1.282, kõrgemal:1.684, küsimus:1.282, laenu:1.423, langenud:1.684, lennujaama:1.375, lennukid:1.684, lennuraja:1.684, libedust:1.375, ligi:0.821, liiduga:1.684, liikme:0.821, linna:1.375, linnas:0.821, lisas:0.821, läve:1.684, lörtsi:1.375, ma:1.039, maailmas:1.375, maanteeinfo:1.684, maja:1.375, majas:1.684, maksis:0.821, me:1.041, meelt:1.039, meetri:1.375, meie:2.175, meil:1.039, meile:1.122, mille:0.821, 
mis:1.131, mälestusmärkide:1.375, märkis:1.162, narva:1.282, neile:0.821, nende:1.162, nihutada:1.375, nii:1.299, ning:0.597, novembril:1.945, nüüd:1.282, oleks:1.375, olemas:1.039, oleme:1.039, oli:1.499, olid:0.915, olnud:1.039, oma:0.885, palju:1.039, peaks:1.375, perearsti:1.375, politsei:1.375, projekteeritud:1.039, pärast:0.821, pärnu:0.821, põhja:0.714, põhjus:1.039, püstitatud:1.375, rahvale:1.039, rahvas:0.958, raja:1.945, raske:1.039, reaalset:0.821, reinsalu:1.945, riia:1.375, riigi:1.684, riigist:1.039, rääkis:1.122, saab:0.712, saada:0.821, saama:1.375, saame:1.282, saanud:0.821, samas:0.821, samuti:0.631, seda:0.798, see:1.382, selgitas:0.847, selle:1.131, selleks:1.039, selles:1.162, selline:1.039, sest:1.299, siin:1.375, siis:1.005, sõdureid:1.375, sõja:1.375, sõnul:0.755, ta:1.148, tagasi:1.375, tallinna:1.428, tegelikult:1.039, teisel:1.375, tema:0.714, temperatuurid:1.375, tsybulenko:1.375, tuleb:0.903, tuli:1.684, tulnud:1.039, tähendab:1.375, tähtis:0.821, täna:1.131, tänavu:0.631, ukraina:2.917, ukrainas:1.375, ukrainlased:1.375, urmas:0.821, vabadussõda:1.375, vabadussõja:2.175, vabadussõjas:1.375, vaevalt:1.039, vahendas:1.039, vaja:1.375, vald:1.375, vastu:1.005, vihma:1.375, viisat:1.375, väga:1.566, välja:0.798, või:1.224, võib:0.821, võidelnute:1.375, võimaluse:1.375, võrra:1.375, võtta:1.282, www.mnt.ee:1.375, ühe:1.162, ühes:1.423, ühinemisraha:1.375, üle:0.714, ülejõe:1.375]}
Top Terms:
ja => 1.9576633373896282
see => 1.6297114690144856
kui => 1.4834059476852417
vastu => 1.3625396092732747
siis => 1.3625396092732747
väga => 1.3529229958852131
me => 1.3353430827458699
ukraina => 1.3047189712524414
tallinna => 1.2775271733601887
oli => 1.2556068897247314
Those are the top 10 terms of three clusters. There is certainly room for tuning, but this is only a simple how-to.
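The "Top Terms" lists are simply the highest-weighted entries of each cluster centroid, limited by the -n option. A minimal Python sketch of that selection, using a handful of weights copied from the VL-6 centroid above; the function name top_terms is a hypothetical helper, and how clusterdump orders terms with equal weights is not covered here.

```python
def top_terms(centroid, n):
    """Return the n (term, weight) pairs with the largest weights."""
    return sorted(centroid.items(), key=lambda item: item[1], reverse=True)[:n]


# A few term weights copied from the VL-6 centroid.
vl6_sample = {
    "eesti": 4.694,
    "elab": 3.117,
    "ja": 2.472,
    "ka": 1.563,
    "korteriühistute": 5.219,
    "mis": 3.386,
}
vl6_top3 = top_terms(vl6_sample, 3)
for term, weight in vl6_top3:
    print(term, "=>", weight)
```

Common words such as "ja" still rank high in the large VL-8 cluster, which is one place where stricter pruning of frequent terms could help.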
- Map each text document to its cluster
[hduser@vm38 mahout-0.9]$ bin/mahout seqdumper -i /user/hduser/demo-kmeans-clusters6/clusteredPoints/part-m-00000
Warning: $HADOOP_HOME is deprecated.
Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /home/hduser/mahout-0.9/examples/target/mahout-examples-0.9-SNAPSHOT-job.jar
Warning: $HADOOP_HOME is deprecated.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [file:/usr/local/hadoop-1.0.4/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop-1.0.4/lib/slf4j-log4j12-1.4.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
13/11/28 11:48:04 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[/user/hduser/demo-kmeans-clusters6/clusteredPoints/part-m-00000], --startPhase=[0], --tempDir=[temp]}
Input Path: /user/hduser/demo-kmeans-clusters6/clusteredPoints/part-m-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.clustering.classify.WeightedVectorWritable
Key: 8: Value: 1.0: /uudis1.txt = [6:2.204, 8:3.117, 11:1.916, 13:2.204, 16:2.204, 21:3.690, 22:4.520, 23:3.690, 25:2.204, 28:6.392, 30:2.204, 36:1.693, 43:1.916, 52:2.204, 55:4.520, 57:1.916, 65:3.829, 67:3.117, 70:3.117, 72:1.563, 73:2.204, 92:1.919, 95:3.117, 98:3.817, 105:2.204, 116:2.204, 120:3.690, 121:4.520, 122:2.204, 123:2.933, 127:2.204, 128:1.916, 130:2.204, 133:1.693, 141:3.319, 145:2.204, 148:2.204, 149:4.147, 150:1.693, 158:3.690, 159:3.690, 160:3.690, 162:2.204, 165:2.204, 171:2.204, 172:1.916, 174:2.204, 182:2.710, 183:1.511, 185:3.690, 186:2.204, 189:2.204, 191:1.693, 192:3.378, 194:1.693, 195:2.933, 196:2.204, 198:2.204, 199:3.319, 200:3.690, 201:2.394, 208:3.690, 215:3.690, 221:4.520, 227:1.693, 235:2.204, 236:2.204, 239:3.690, 241:1.693, 244:4.285, 246:1.693, 251:2.204, 257:3.117, 258:3.817, 259:3.690, 262:3.690]
Key: 8: Value: 1.0: /uudis10.txt = [3:3.690, 5:5.219, 7:3.690, 9:1.916, 15:3.690, 16:2.204, 17:2.204, 27:2.204, 29:3.690, 36:2.933, 37:4.694, 39:2.137, 65:3.495, 70:3.817, 72:1.563, 75:3.690, 76:3.690, 83:3.690, 85:1.511, 92:1.919, 95:2.204, 100:4.520, 117:3.690, 123:1.693, 124:2.204, 126:5.835, 128:1.916, 133:2.933, 135:3.690, 136:3.117, 137:3.117, 138:2.204, 141:1.916, 142:1.563, 143:5.219, 147:2.204, 148:2.204, 150:1.693, 151:2.204, 152:2.137, 155:2.204, 167:2.204, 169:3.690, 171:2.204, 172:1.916, 175:2.204, 177:5.219, 179:4.520, 180:2.204, 191:1.693, 192:1.511, 198:2.204, 199:1.916, 202:3.690, 203:3.690, 204:1.511, 214:2.204, 222:2.204, 231:2.204, 232:3.690, 233:5.835, 234:3.690, 241:2.394, 248:3.690, 261:1.916]
Key: 8: Value: 1.0: /uudis2.txt = [24:3.690, 39:1.511, 43:1.916, 57:1.916, 61:2.204, 62:2.204, 63:3.690, 74:4.520, 81:3.690, 85:2.137, 87:2.204, 89:1.916, 92:1.357, 97:2.204, 123:1.693, 142:1.105, 149:1.693, 152:1.511, 162:2.204, 178:3.690, 182:1.916, 183:1.511, 192:1.511, 194:1.693, 195:1.693, 201:1.693, 204:1.511, 214:2.204, 216:1.916, 220:1.916, 237:3.690, 241:1.693, 245:1.693]
Key: 8: Value: 1.0: /uudis3.txt = [9:1.916, 11:3.319, 13:2.204, 25:2.204, 30:2.204, 34:4.520, 38:2.394, 39:2.617, 40:2.204, 42:3.690, 43:1.916, 47:1.916, 49:3.817, 50:3.690, 51:3.690, 58:1.916, 62:2.204, 65:2.211, 72:1.105, 73:2.204, 79:3.690, 85:1.511, 92:1.357, 97:3.117, 108:4.520, 116:2.204, 124:2.204, 139:3.117, 142:1.105, 145:3.117, 146:3.690, 147:2.204, 149:1.693, 150:1.693, 151:2.204, 152:1.511, 155:2.204, 164:2.204, 172:1.916, 174:2.204, 180:2.204, 182:1.916, 183:1.511, 184:2.204, 188:2.204, 190:1.693, 191:1.693, 197:3.117, 201:1.693, 207:2.933, 210:1.916, 219:3.690, 220:1.916, 222:2.204, 226:1.693, 228:7.828, 229:3.690, 230:3.690, 235:2.204, 236:2.204, 243:3.690, 244:1.916, 246:2.933]
Key: 8: Value: 1.0: /uudis4.txt = [2:1.916, 6:2.204, 9:1.916, 19:3.117, 31:2.204, 35:2.204, 60:3.690, 68:1.916, 72:1.563, 82:3.690, 86:3.690, 92:2.350, 94:2.204, 96:4.520, 101:3.690, 102:4.520, 103:4.520, 109:2.204, 111:3.690, 112:2.204, 113:2.204, 114:4.520, 123:1.693, 125:3.690, 127:2.204, 128:2.710, 133:1.693, 140:3.690, 142:1.105, 167:2.204, 173:5.219, 186:3.117, 191:1.693, 192:3.378, 194:1.693, 195:1.693, 196:2.204, 201:2.394, 204:1.511, 207:1.693, 210:3.833, 224:3.690, 225:2.204, 226:1.693, 241:2.394, 244:1.916, 245:1.693, 249:3.690, 250:3.690, 251:3.117]
Key: 5: Value: 1.0: /uudis5.txt = [18:5.835, 27:2.204, 32:6.904, 35:2.204, 36:1.693, 37:2.710, 38:1.693, 39:1.511, 45:3.690, 49:2.204, 53:3.690, 54:3.690, 58:1.916, 64:4.520, 65:4.421, 68:2.710, 72:1.563, 85:1.511, 89:1.916, 112:2.204, 113:2.204, 129:4.520, 136:2.204, 138:2.204, 142:1.915, 144:3.690, 152:1.511, 153:5.219, 164:2.204, 165:2.204, 166:1.916, 181:2.204, 192:2.137, 194:1.693, 204:2.137, 207:2.394, 209:4.520, 211:3.690, 212:2.204, 213:4.520, 220:1.916, 238:3.690, 247:2.204, 254:3.690, 255:3.690, 256:3.690, 260:3.690]
Key: 6: Value: 1.0: /uudis6.txt = [8:2.204, 19:2.204, 37:4.694, 38:2.394, 40:3.117, 41:3.690, 47:1.916, 61:2.204, 65:2.472, 67:2.204, 68:1.916, 72:1.563, 90:5.219, 91:5.219, 92:1.357, 106:3.690, 107:3.690, 109:2.204, 110:3.690, 130:2.204, 133:3.386, 141:1.916, 142:1.563, 163:3.690, 181:2.204, 183:1.511, 190:1.693, 204:1.511, 212:2.204, 225:2.204, 226:1.693, 227:1.693, 231:2.204, 245:1.693, 253:2.204, 261:1.916]
Key: 9: Value: 1.0: /uudis7.txt = [0:5.219, 1:3.690, 4:4.520, 10:6.904, 20:2.204, 26:3.690, 36:4.789, 39:3.022, 44:3.690, 48:3.690, 52:7.310, 56:2.204, 59:4.520, 65:1.563, 69:2.204, 71:3.817, 72:1.915, 77:3.690, 78:3.690, 84:3.690, 85:3.701, 88:3.690, 89:1.916, 93:3.690, 94:2.204, 98:5.399, 99:3.690, 105:2.204, 118:2.204, 122:3.817, 131:8.655, 132:3.690, 134:3.817, 139:2.204, 142:3.126, 149:1.693, 152:3.701, 154:3.690, 156:3.690, 157:4.520, 161:4.520, 168:3.690, 170:4.520, 175:2.204, 176:4.520, 184:3.117, 187:4.520, 188:3.817, 189:3.117, 190:1.693, 195:2.394, 205:3.690, 206:3.117, 216:3.319, 218:3.690, 223:3.690, 227:3.386, 240:3.690, 245:2.394, 246:1.693, 253:2.204, 257:2.204, 261:1.916, 263:2.204]
Key: 8: Value: 1.0: /uudis8.txt = [2:1.916, 38:2.933, 57:1.916, 58:1.916, 65:2.211, 72:1.105, 87:2.204, 104:3.690, 115:3.690, 119:4.520, 137:2.204, 150:2.394, 166:1.916, 183:1.511, 210:1.916, 217:3.690, 226:2.933, 242:3.690, 246:2.394, 247:2.204, 252:3.690]
Key: 9: Value: 1.0: /uudis9.txt = [2:1.916, 11:1.916, 12:4.520, 14:3.690, 17:2.204, 20:2.204, 31:2.204, 33:3.690, 47:1.916, 56:2.204, 65:2.211, 66:4.520, 69:2.204, 71:2.204, 80:3.690, 118:2.204, 134:2.204, 142:1.563, 166:1.916, 190:1.693, 193:4.520, 197:2.204, 199:1.916, 206:3.117, 207:1.693, 216:2.710, 227:1.693, 258:2.204, 263:2.204]
Count: 10
13/11/28 11:48:05 INFO driver.MahoutDriver: Program took 667 ms (Minutes: 0.011116666666666667)
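In the dump above, each Key is the id of the cluster the document was assigned to, the Value starts with the point's weight (1.0), and then comes the document name and its term vector. To see at a glance which documents ended up together, the text output can be grouped by key. The snippet below is only a rough sketch and not part of Mahout: the function name group_by_cluster and the regular expression are my own, and the sample lines are copied from the dump above with the vectors shortened to their first two entries.

```python
import re
from collections import defaultdict

# Matches seqdumper lines such as:
#   Key: 8: Value: 1.0: /uudis1.txt = [6:2.204, 8:3.117, ...]
# Group 1 is the cluster id, group 2 is the document name.
LINE_RE = re.compile(r"^Key: (\d+): Value: [\d.]+: (\S+) = \[")


def group_by_cluster(lines):
    """Return {cluster_id: [document names]} from seqdumper text output."""
    clusters = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line)
        if match:  # log lines, warnings and "Count:" do not match and are skipped
            clusters[int(match.group(1))].append(match.group(2))
    return dict(clusters)


# Sample lines taken from the dump above (vectors shortened for readability).
sample = [
    "Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.clustering.classify.WeightedVectorWritable",
    "Key: 8: Value: 1.0: /uudis1.txt = [6:2.204, 8:3.117]",
    "Key: 8: Value: 1.0: /uudis2.txt = [24:3.690, 39:1.511]",
    "Key: 5: Value: 1.0: /uudis5.txt = [18:5.835, 27:2.204]",
    "Count: 10",
]

for cluster_id, docs in sorted(group_by_cluster(sample).items()):
    print(cluster_id, ", ".join(docs))
```

To use it on real output, save the seqdumper output to a local file and pass the open file object to group_by_cluster instead of the sample list.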
Links
https://mahout.apache.org/users/basics/creating-vectors-from-text.html
http://mahout.apache.org/users/clustering/k-means-clustering.html
https://mahout.apache.org/users/clustering/cluster-dumper.html