Let us create a derived RDD that will represent the used memory on the platform.
// used memory = total memory minus free memory
val used: RDD[Double] = total.zip(free).map { case (t, f) => t - f }
What we have done here is create a new RDD representing a metric derived from two other metric streams. The transformation was trivial, but RDDs also provide methods for aggregation.
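For example, since used is an RDD[Double], it picks up Spark's DoubleRDDFunctions, and the basic statistics come for free. A small sketch (the variable names follow the snippet above):
// Basic aggregations available on an RDD[Double]
val avgUsed = used.mean()  // average used memory
val maxUsed = used.max()   // peak used memory
val stats = used.stats()   // count, mean, stdev, min and max in a single pass
println(s"avg=$avgUsed, max=$maxUsed")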
Now, let’s calculate the correlation between the used and free memory. There should be some correlation, right?
import org.apache.spark.mllib.stat.Statistics

val correlation: Double = Statistics.corr(used, free, "pearson")
println(s"Correlation is: $correlation")
This will print Correlation is: -0.9999999999999781, which means the two streams are almost perfectly negatively correlated (the higher the first one, the lower the second one).
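As a side note, when there are more than two streams, MLlib can compute the whole correlation matrix in one go. A small sketch, assuming we first combine the streams into an RDD of observation vectors:
import org.apache.spark.mllib.linalg.{Matrix, Vectors}

// Each observation is one vector of samples taken at the same time
val observations = total.zip(free).map { case (t, f) => Vectors.dense(t, f) }
// Pairwise Pearson correlations of all columns
val corrMatrix: Matrix = Statistics.corr(observations, "pearson")
println(corrMatrix)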
This example was artificial, but in general, one can calculate the correlation between any two metric streams. Of course, correlation doesn't imply causation, so we can't extract any higher-level information like business rules here. But we do learn that the metrics are somehow related, and with further analysis we can, for example, detect that a change in one metric stream always precedes a change in the second one. Again, that's not causation, but it is something stronger than correlation. The Granger causality test is a possible method here; nonetheless, it is out of the scope of this blog post.
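That said, a rough first step in that direction is easy to hand-roll: correlate one stream against a time-shifted copy of the other and see at which lag the correlation peaks. This is only an illustration, not the Granger test, and laggedCorr is a made-up helper, not part of MLlib:
// Correlate stream a against stream b shifted by lag samples,
// i.e. pair a(i) with b(i + lag). A naive sketch for illustration only.
def laggedCorr(a: RDD[Double], b: RDD[Double], lag: Int): Double = {
  val aByIndex = a.zipWithIndex().map { case (v, i) => (i, v) }
  val bShifted = b.zipWithIndex().map { case (v, i) => (i - lag, v) }
  val pairs = aByIndex.join(bShifted).values  // (a(i), b(i + lag))
  Statistics.corr(pairs.map(_._1), pairs.map(_._2), "pearson")
}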
The last thing I wanted to show is k-means clustering.
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

// MLlib's k-means works on vectors, so wrap each sample in a 1-dimensional vector
val usedMemoryVector = used.map(x => Vectors.dense(x))
val numClusters = 3
val numIterations = 20
val clusters: KMeansModel = KMeans.train(usedMemoryVector, numClusters, numIterations)
The model above gives us three cluster representatives, available via clusters.clusterCenters. It is also possible to classify a new data point with clusters.predict(point), which returns the cluster the point is assigned to.
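Putting that together (the sample value 42.0 is made up just for illustration):
// Print the learned centroids
clusters.clusterCenters.foreach(println)
// Assign a new measurement to its nearest cluster
val point = Vectors.dense(42.0)
println(s"Point $point belongs to cluster ${clusters.predict(point)}")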
There is much more in MLlib. One can detect outliers, mine common patterns, do classification, and so on.
All the code examples can be downloaded and run against Cassandra with the metric data.