Adrian Cockcroft's Blog: cockcroftplot

Monday, July 14, 2008

Enhanced headroom plot in R

For some reason I seem to find time to write code in R when I'm on an airplane. The last two trips I made resulted in significant enhancements and debugging of the code for my headroom plot. It started off simple but it now has a lot of bells and whistles, including color coding.

Main changes: the quantile used to remove outliers now only removes outliers that exceed the 95th percentile response time by default. It keeps all the throughput values unless you use qx=True.

In each of the throughput bins used to draw the histogram, the maximum response time for that bin is now calculated and displayed as a staircase line unless you set max=False.

The set of data is now split into ranges and color coded. The times series plot is coded so you can see the split, and the scatterplot shows how those points fall. I have been plotting weekly data at one minute intervals with split=7, which looks pretty good.

I read in some data that has been extracted from vxstat into a csv format at a known URL and plotted it three ways.
I plot the first 2880 data points, picking the read data rather than write, two days at one minute intervals.


> stime <- read.csv(url("http://somewhere/vxstat_response"))
> sops <- read.csv(url("http://somewhere/vxstat_throughput"))

> names(sops)
[1] "DateTime"                             "vxstat_dg_operationsRead" 
[3] "vxstat_dg_operationsWrite"

> chp(sops[1:2880,2],stime[1:2880,2])
> chp(sops[1:2880,2],stime[1:2880,2],q=1.0)
> chp(sops[1:2880,2],stime[1:2880,2],q=1.0,splits=8)

Here is the code that generates the plot.


> chp <-function(throughput,response, q=0.95, qx=F, xl="Throughput",yl="Response",tl="Throughput Over Time",
ml="Headroom Plot", fit=T, max=T, splits=0) {
      # remove zero throughput and response values
      nonzer <- (throughput != 0) & (response != 0)  # array of true/false
      y <- response[nonzer]
      x <- throughput[nonzer]
      # remove outliers, keep response time points inside 95% by default
      if (q != 1.0) {
              quant <- (y < quantile(y,q))
              # optionally trim throughput outliers as well
              if (qx) quant <- quant & (x < quantile(x, q))
              x <- x[quant]
              y <- y[quant]
      }
      # make histograms and record end points for scaling
      xhist <- hist(x,plot=FALSE)
      yhist <- hist(y,plot=FALSE)
      xbf <- xhist$breaks[1]                          # first
      ybf <- yhist$breaks[1]                          # first
      xbl <- xhist$breaks[length(xhist$breaks)]       # last
      ybl <- yhist$breaks[length(yhist$breaks)]       # last
      xcl <- length(xhist$counts)                     # count length
      ycl <- length(yhist$counts)                     # count length
      xrange <- c(0.0,xbl)
      yrange <- c(0.0,ybl)
      xlen <- length(x)
      # make a multi-region layout
      nf <- layout(matrix(c(1,3,4,2),2,2,byrow=TRUE), c(3,1), c(1,3), TRUE)
      layout.show(nf)
      # set plot margins for throughput histogram and plot it
      par(mar=c(0,4,3,0))
      barplot(xhist$counts, axes=FALSE,
              xlim=c(xcl*0.00-xbf/((xbl-xbf)/(xcl-0.5)),xcl*1.00),
              ylim=c(0, max(xhist$counts)), space=0, main=ml)
      # set plot margins for response histogram and plot it sideways
      par(mar=c(5,0,0,1))
      barplot(yhist$counts, axes=FALSE, xlim=c(0,max(yhist$counts)),
              ylim=c(ycl*0.00-ybf/((ybl-ybf)/(ycl-0.5)),ycl*1.00),
              space=0, horiz=TRUE)
      # set plot margins for time series plot
      par(mar=c(2.5,1.7,3,1))
      plot(x, main=tl, cex.axis=0.8, cex.main=0.8, type="S")
      if (splits > 0) {
   step <- xlen/splits
   for(n in 0:(splits-1)) {
    lines((1+n*step):min((n+1)*step,xlen), x[(1+n*step):min((n+1)*step,xlen)], col=4+n)
   }
      }
      # set plot margins for main plot area
      par(mar=c(5,4,0,0))
      plot(x, y, xlim=xrange, ylim=yrange, xlab=xl, ylab=yl, pch=20)
      if (max) {
        # max curve
        b <- xhist$breaks
        i <- b[2] - b[1] # interval
        maxl <- list(y[b[1] < x & x <= (b[1]+i)])
        for(n in b[c(-1,-length(b))]) maxl <- c(maxl,list(y[n < x & x <= (n+i)]))
        #print(maxl)
        maxv <- unlist(lapply(maxl,max)) # apply max function to elements of list
        #print(maxv)
        #lines(xhist$mids,maxv,col=2)  # join the dots
        #staircase plot showing the range for each max response
        lines(rep(b,1,each=2)[2:(2*length(maxv)+1)],rep(maxv,1,each=2),col=3)
        
      }
      if (fit) {
        # fit curve, weighted to predict high throughput
        # create persistent chpfit object using <<-
        chpfit <- glm(y ~ x, inverse.gaussian, weights=as.numeric(x))
        # add fitted values to plot, sorted by throughput
        lines(x[order(x)],chpfit$fitted.values[order(x)],col=2)
      }
      if (splits > 0) {
   step <- xlen/splits
   for(n in 0:(splits-1)) {
    Sys.sleep(1)
    points(x[(1+n*step):min((n+1)*step,xlen)],y[(1+n*step):min((n+1)*step,xlen)], xlim=xrange, ylim=yrange, col=4+n)
   }
      }
}

Sunday, November 19, 2006

The Cockcroft Headroom Plot - Part 1 - Introducing R

I've recently written a paper for CMG06 called "Utilization is Virtually Useless as a Metric!". Regular readers of this blog will recognize much of the content in that paper. The follow-on question is what to use instead? The answer I have is to plot response time vs. throughput, and I've been thinking about a very specific way to display this kind of plot. Since I'm feeling quite opinionated about this I'm going to call it a "Cockcroft Headroom Plot" and I'm going to try and construct it using various tools. I will blog my way through the development of this, and I welcome advice and comments along the way.

The starting point is a dataset to work with, and I found an old iostat log file that recorded a fairly busy disk at 15 minute intervals over a few days. This gives me 250 data points, which I fed into the R stats package to look at. I'll also have a go at making a spreadsheet version.

The iostat data file starts like this:

                    extended device statistics              
 r/s  w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
14.8 78.4  183.0 2446.3  1.7  0.6   18.6    6.6   1  21 c1t5d0
 0.0  0.0    0.0    0.0  0.0  0.0    0.0    5.0   0   0 c0t6d0
...

I want the second line as a header, so save it (my command line is actually on OSX, but could be Solaris, Linux or Cygwin on Windows)
% head -2 iostat.txt | tail -1 > header

I want the c1t5d0 disk, but don't want the first line, since its the average since boot, and want to add back the header
% grep c1t5d0 iostat.txt | tail +2 > tailer
% cat header tailer > c1t5.txt

Now I can import into R as a space delimited file with a header line. R doesn't allow "/" or "%" in names, so it rewrites the header to use dots instead. R is a script based tool with a command line and a very powerful vector/object based syntax. A "data frame" is a table of data object like a sheet in a spreadsheet, it has names for the rows and columns, and can be indexed.
> c1t5 <- read.delim("c1t5.txt",header=T,sep="")
> names(c1t5)
[1] "r.s" "w.s" "kr.s" "kw.s" "wait" "actv" "wsvc_t" "asvc_t" "X.w" "X.b" "device"

I only want to work with the first 250 data points so I subset the data frame by indexing the rows with an array (1:250) that selects the rows I want and leaving the column selector blank.
> io250 <- c1t5[1:250,]

The first thing to do is summarize the data, the output is too wide for the blog so I'll do it in chunks by selecting columns.


> summary(io250[,1:4])
      r.s              w.s             kr.s             kw.s        
 Min.   :  1.80   Min.   :  1.8   Min.   :  13.5   Min.   :   38.5  
 1st Qu.: 10.30   1st Qu.: 87.1   1st Qu.: 107.4   1st Qu.: 2191.7  
 Median : 18.90   Median :172.4   Median : 182.8   Median : 4279.4  
 Mean   : 22.85   Mean   :187.5   Mean   : 290.1   Mean   : 4448.5  
 3rd Qu.: 28.88   3rd Qu.:274.6   3rd Qu.: 287.4   3rd Qu.: 6746.6  
 Max.   :130.90   Max.   :508.8   Max.   :4232.3   Max.   :13713.1  
> summary(io250[,5:8])
      wait             actv            wsvc_t           asvc_t      
 Min.   : 0.000   Min.   :0.0000   Min.   : 0.000   Min.   : 1.000  
 1st Qu.: 0.000   1st Qu.:0.3250   1st Qu.: 0.400   1st Qu.: 3.125  
 Median : 0.600   Median :0.8000   Median : 2.550   Median : 4.700  
 Mean   : 1.048   Mean   :0.9604   Mean   : 5.152   Mean   : 4.634  
 3rd Qu.: 1.300   3rd Qu.:1.5000   3rd Qu.: 6.350   3rd Qu.: 5.700  
 Max.   :10.600   Max.   :3.5000   Max.   :88.900   Max.   :15.100  
> summary(io250[,9:10])
      X.w             X.b       
 Min.   :0.000   Min.   : 2.00  
 1st Qu.:0.000   1st Qu.:20.00  
 Median :1.000   Median :39.50  
 Mean   :1.428   Mean   :37.89  
 3rd Qu.:2.000   3rd Qu.:55.00  
 Max.   :9.000   Max.   :92.00

Looks like a nice busy disk, so lets plot everything against everything (pch=20 sets a solid dot plotting character)
> plot(io250[,1:10],pch=20)

The throughput is either reads+writes or KB read+KB written, the response time is wsvc_t+asvc_t since iostat records time taken waiting to send to a disk as well as time spent actively waiting for a disk.

To save typing, I attach to the data frame so that the names are recognized directly.
> attach(io250)
> plot(r.s+w.s, wsvc_t+asvc_t)

This looks a bit scattered, because there is a mixture of average I/O sizes that varies during the time period. Lets look at throughput in KB/s instead.
> plot(kr.s+kw.s,wsvc_t+asvc_t)

That looks promising, but its not clear what the distribution of throughput is over the range. We can look at this using a histogram.
> hist(kr.s+kw.s)

We can also look at the distribution of response times.
> hist(wsvc_t+asvc_t)

The starting point for the thing that I want to call a "Cockcroft Headroom Plot" is all three of these plots superimposed on each other. This means rotating the response time plot 90 degrees so that its axis lines up with the main plot. After looking around in the manual pages I eventually found an example that I could use as the basis for my plot. It needs some more cosmetic work but I defined a new function chp(throughput, response) shown below.


> chp <- function(x,y,xl="Throughput",yl="Response",ml="Cockcroft Headroom Plot") {
   xhist <- hist(x,plot=FALSE)
   yhist <- hist(y, plot=FALSE)
   xrange <- c(0,max(x))
   yrange <- c(0,max(y))
   nf <- layout(matrix(c(2,0,1,3),2,2,byrow=TRUE), c(3,1), c(1,3), TRUE)
   layout.show(nf)
   par(mar=c(3,3,1.5,1.5))
   plot(x, y, xlim=xrange, ylim=yrange, main=xl)   par(mar=c(0,3,3,1))
   barplot(xhist$counts, axes=FALSE, ylim=c(0, max(xhist$counts)), space=0, main=ml)
   par(mar=c(3,0,1,1))
   barplot(yhist$counts, axes=FALSE, xlim=c(0, max(yhist$counts)), space=0, main=yl, horiz=TRUE)
}

The result of running chp(kr.s+kw.s,wsvc_t+asvc_t)is close...

That's enough to get started.

Adrian Cockcroft's Blog

Archive

Monday, July 14, 2008

Enhanced headroom plot in R

Sunday, November 19, 2006

The Cockcroft Headroom Plot - Part 1 - Introducing R