Monday, August 18, 2014

R


R is a great tool for plotting performance graphs. Every developer who worries about the performance profile of his application should learn it. Here are some miscellaneous scripts I've been using for the last few months.

But first...

A Crash Course in R

R is Modular

You might find that the standard installation does not have all the modules you need. Any R GUI will help you download the ones you want. But to use them, you need something like this:

.libPaths("YOUR_MODULE_DIRECTORY")
require("MODULE_NAME")

Think of this as setting the classpath and then using import in your Java files.

Loading the data

Take this example:

timings_data <- read.table("C:/Users/henphi/db_saves.txt")
timings      <- c(timings_data$V15)
times        <- timings_data$V1

The first line reads the raw data that I saved in a file.

The second line takes the 15th column of data (think of it as the awk command in Unix) and puts it into a vector. Like awk, R treats the first column as index 1, not zero. There is a 0th column but this is the line number. We put the  data into a vector using the c() command.

The third lines is just like the second but takes just the 1st ($V1) column. Here, I chose not to use the c() command but that's OK. In both cases it doesn't matter as they are both vectors already. In the paragraph above, I just wanted to introduce the standard way of making vectors in R.

Coercing the Data

We've already mentioned one data structure - vectors that are created using the c() command. Another is a list that can be created using (unsurprisingly) the list() command.

In this example, I am taking the variables defined in the section above and manipulating the types to make something I can plot.

data <- structure (
    list(
        time = structure(
            as.POSIXct(strptime(times, '%H:%M:%S')),
            class = c("POSIXct", "POSIXt"),
            tzone = ""
        ),
        variable = timings
    ),
    .Names    = c("time", "variable"),
    row.names = c(NA, -10L),
    class     = "data.frame"
)

Here, I have made a data structure of a type called a data frame. "Although a data set of 50 employees with 4 variables per worker has the look and feel of a 50x4 matrix, it does not qualify as such in R because it mixes types. Instead of a matrix we use a data frame. A data frame in R is a list with each component of the list being a vector corresponding to a column in our 'matrix' of data... Typically, data frames are created by reading in a data set from a file". [1]

As part of this structure, we have a list that has two entries that go by the arbitrary names of time and variable. Variable is simple, it's just the timings variable that we loaded from a file earlier.

The element of this list, time, is also a list but we have coerced the string representations of our times to a type that represents a date. This is done by the strptime function that converts the string in our data file to a standard type using regexp. Once in the standard format, it is then converted to a Unix time-from-epoch using as.POSIXct.

(Since the data file only has time of day entries in it, R will assume they belong to the day on which the line runs and give it "today's" date. That's OK as I know my data is only within an hour but I don't care which hour it was.)

The .Names element defines the names of the data objects within this structure. They would be used as default in the graphs we generated but we'll override them later.

A real world example

Let's say we want to plot the Garbage Collection times of our app. A couple of command line switches will help.

  • -XX:+PrintGCApplicationStoppedTime which GC guru, Gil Tene, considers "is probably the only number in GC logs that you should sort-of-semi-believe."
  • -XX:+PrintGCDateStamps which makes the logs much easier to deal with than time-since-epoch.
  • -Xloggc:/YOUR/FILE/HERE to make the JVM output log stats in the first place.

I clean this file up just to grab the application pause times:

$ grep "Total time for which application threads were stopped" jstringserver.gc.log > jstringserver_cleaned.gc.log 

such that the slightly cleaned GC logs look like this:

$ more /home/henryp/Documents/Blogs/jstringserver_cleaned.gc.log
2014-08-18T15:17:29.824+0100: 1.096: Total time for which application threads were stopped: 0.0004190 seconds
2014-08-18T15:17:32.895+0100: 4.166: Total time for which application threads were stopped: 0.0001930 seconds
2014-08-18T15:18:40.902+0100: 72.173: Total time for which application threads were stopped: 0.0001650 seconds
2014-08-18T15:18:43.903+0100: 75.174: Total time for which application threads were stopped: 0.0000450 seconds
.
.

Now, my R script looks like this:

gc_data <- read.table("/home/henryp/Documents/Blogs/jstringserver_cleaned.gc.log")

data <- structure (
  list(
    time = structure(
      as.POSIXct(strptime(gc_data$V1, '%Y-%m-%dT%H:%M:%S')),
      class = c("POSIXct", "POSIXt"),
      tzone = ""
    ),
    variable = gc_data$V11
  ),
  .Names    = c("time", "variable"),
  class     = "data.frame"
)

plot(
  data,
  xaxt="n",
  type="h",
  main="GC times for JStringServer",
  ylab="GC Pause Time (ms)",
  xlim=c(as.POSIXct(strptime("15:20", "%H:%M")),
         as.POSIXct(strptime("15:30", "%H:%M")))
)

axis.POSIXct(side=1, at=cut(data$time, "min"), format="%H:%M")

Where I have zoomed-in on a period of time I am interested in, namely from 15:20 to 15:30 and axis.POSIXct just put's the time ticks on the x-axis after we disabled x-axis marking in the plot() command with xaxt="n".

The generated graph looks like:


Note that if you prefer, you can set log="y" as an argument in the plot command and have the y-axis scaled logarithmically.


[1] The Art of R Programming.

No comments:

Post a Comment