Colmux Tutorial

Quick Refresher

It is assumed the reader has at least started at the colmux home page, but let's first review a couple of basic points:

colmux works best with collectl V3.5 and later
colmux requires passwordless ssh between its node and all target nodes
colmux requires collectl output in one or more identically formatted columns
there are 2 display formats: single line and multi-line
there are 2 run modes: real-time and playback
all nodes must have consistent time for real-time mode to work correctly

Single Line	Output is displayed from all specified nodes for the named columns, based on --cols, and displayed on a single line.
Multi Line	All output is displayed on multiple lines (similar to the top command) and sorted by the column specified with --column. This is the default mode and the default column is 1. The column and sort order can be changed dynamically with the arrow keys (if TermReadKey is installed).
Real-Time	In single line format the output for all nodes is always shown. To ensure all nodes report data, the last seen values are reported, and if any sample is older than --age (the default is 5 seconds), a value of -1 is displayed instead. In multi-line format, all specified nodes must be reachable or they will be dropped from the list of monitored nodes when colmux first initializes. The display will refresh at the same rate as the collectl monitoring interval.
Playback	Data is played back from collectl raw files which must be in the same directory on all nodes or in the same, single directory on the node colmux is being run on.
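The real-time aging rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated, not colmux's actual code: the function name, the (timestamp, value) data layout and the use of plain numeric timestamps are all assumptions made for the example.

```python
def render_row(nodes, last_seen, now, age=5):
    """Build one single-line row of values, one entry per node.

    last_seen maps node name -> (timestamp, value) for the newest sample.
    A node with no sample yet, or whose newest sample is older than `age`
    seconds, is shown as -1 (hypothetical layout, for illustration only).
    """
    row = []
    for node in nodes:
        sample = last_seen.get(node)
        if sample is None or now - sample[0] > age:
            row.append(-1)
        else:
            row.append(sample[1])
    return row


nodes = ["cn5", "cn6"]
last_seen = {"cn5": (100.0, 59816)}  # cn6 has not reported yet

fresh = render_row(nodes, last_seen, now=102.0)  # cn5 is 2 seconds old
stale = render_row(nodes, last_seen, now=106.0)  # cn5 is 6 seconds old
print(fresh, stale)
```

This is why a freshly started single-line display shows -1 for every node until the first samples arrive.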

The basics

The very first thing to keep in mind is that you should always know your column numbers, because they are used for sorting in multi-line mode and for identifying which column(s) to report in single-line mode. The first column that collectl reports is always column 1. This also means that if you include a time option with the collectl command, the column you want to display will be shifted, and this can be confusing. Therefore remember - the -test switch is your friend!

As you can see in the following example using -test, the output of the collectl command is shown in the Headers section with columns 3 and 4 shown as bold characters (in fact they are displayed on the terminal in reverse video for easier identification). The individual column numbering is shown in the second section. This format can be especially helpful when a collectl command can produce dozens of columns, such as with -command "-sD -P", remembering that the only way to get detail data on a single line is to display it in plot format.

Tip: When first using -test, it is often easiest not to specify a column at all, accept the default of 1, and let the output tell you the correct number(s) to choose rather than trying to figure them out yourself.

colmux -command "-sn -oT" -cols 3,4 --test

>>> Headers <<<
#                  <----------Network---------->
#Host    Time        KBIn  PktIn  KBOut  PktOut

>>> Column Numbering <<<
 0 #Host   1 Time    2 KBIn    3 PktIn   4 KBOut   5 PktOut
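To make the numbering rule concrete, here is a small Python sketch that numbers header fields the same way the Column Numbering section does and shows how adding a time option shifts everything by one. It is not how colmux itself is implemented; the helper names are made up, and the header without Time is simply the header above with that field removed.

```python
def number_columns(header):
    """Pair each whitespace-separated header field with its column number.

    '#Host' is column 0, so the first column reported by collectl is
    column 1, matching the -test output shown above.
    """
    return list(enumerate(header.split()))


def column_of(header, name):
    """Return the column number of the first field called `name`."""
    for number, field in number_columns(header):
        if field == name:
            return number
    raise ValueError(f"no column named {name!r}")


without_time = "#Host KBIn PktIn KBOut PktOut"      # e.g. -command "-sn"
with_time = "#Host Time KBIn PktIn KBOut PktOut"    # e.g. -command "-sn -oT"

print(column_of(without_time, "KBIn"))  # 1
print(column_of(with_time, "KBIn"))     # 2
```

Running -test on the real command remains the reliable way to get the numbers; the sketch only shows why they move.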

How to specify the remote nodes

If you're already familiar with pdsh, the parallel distributed shell utility, node names can be specified in the same way. If you're not familiar with it I highly suggest downloading it and trying it out.

With colmux you use -addr to specify one of three things:

one or more node names, comma separated
the name of a file containing one or more node names, one per line
a node list in pdsh -w format

The typical pdsh node specification is very intuitive and looks like this: node[1-10] and simply translates into node1, node2, etc. Further, if you specify a leading zero like this: node[01-10], all node names will have their numeric portion padded as well like this: node01, node02 ... node10.

You can also include individual nodes or multiple expressions like this: node[1-10,13,18,20-30] and include multiple prefixes and even suffixes: node[1-10,13],pre[3-4]fix.
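If the bracket notation is new to you, the following Python sketch expands the forms described above: ranges, comma-separated items, leading-zero padding, multiple prefixes and suffixes. It is an illustration of the notation only, written for this tutorial; it is not pdsh's or colmux's parser, it does not cover every pdsh feature, and the function names are hypothetical.

```python
import re


def split_top_level(spec):
    """Split a node spec on commas that are not inside [...] brackets."""
    parts, depth, current = [], 0, ""
    for ch in spec:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append(current)
            current = ""
        else:
            current += ch
    if current:
        parts.append(current)
    return parts


def expand_term(term):
    """Expand one term such as 'node[01-03,7]' or 'pre[3-4]fix'."""
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", term)
    if match is None:
        return [term]  # plain node name, nothing to expand
    prefix, ranges, suffix = match.groups()
    names = []
    for item in ranges.split(","):
        low, _, high = item.partition("-")
        high = high or low  # a single number is a range of one
        width = len(low) if low.startswith("0") else 0  # keep zero padding
        for n in range(int(low), int(high) + 1):
            names.append(prefix + str(n).zfill(width) + suffix)
    return names


def expand_nodes(spec):
    """Expand a full comma-separated spec into a flat list of node names."""
    return [name for term in split_top_level(spec) for name in expand_term(term)]


print(expand_nodes("node[01-03]"))
print(expand_nodes("node[1-3,13],pre[3-4]fix"))
```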

So now that we've gotten the basics out of the way, let's get started.

Multi-Line Format in Real-Time Mode

Since everyone is familiar with the top command, let's begin with this mode since it is also more straightforward than single-line mode. Just think of top on steroids. You simply specify a collectl command and its output is sorted by any column and displayed up to the number of lines required to fill the display. Don't worry if you choose the wrong column, because you can always change it dynamically with the arrow keys if you've installed TermReadKey, or by typing in the column number followed by the enter key. Since the selected column is highlighted you won't lose track of where you are.

In the following example, we've chosen to look at the slab memory usage on half a dozen nodes. As you can see, like top, the current time is displayed, and colmux also displays how many nodes are reporting data, which in this case is all 6. You can also see that the slab column is highlighted. It's that simple.

colmux -addr cn[5-10] -command "-sm" -column 5

# Thu Nov 17 07:01:41 2011  Connected: 6 of 6
#         <-----------Memory----------->
#Host     Free Buff Cach Inac Slab  Map
cn8       123G    0  42M  26M 103M  87M
cn5       495G    0  58M  41M 102M  89M
cn7       123G    0  41M  27M  99M  88M
cn10      123G    0 158M 121M  97M  47M
cn9       123G    0 158M 121M  97M  47M
cn6       123G    0  41M  27M  92M  88M
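The ordering in this output can be reproduced with a short Python sketch that sorts rows on a chosen column, largest first. How colmux actually compares values such as 103M internally is not described here, so the suffix handling below (K, M and G as powers of 1024) is an assumption made purely for illustration, as are the function names.

```python
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}  # assumed suffix meanings


def to_number(field):
    """Convert a field such as '103M' or '0' to a plain number."""
    if field and field[-1] in UNITS:
        return float(field[:-1]) * UNITS[field[-1]]
    return float(field)


def sort_by_column(rows, column, reverse=True):
    """Sort whitespace-separated rows on `column`; the host name is column 0."""
    return sorted(rows, key=lambda row: to_number(row.split()[column]),
                  reverse=reverse)


# Rows copied from the sample output above, deliberately out of order.
rows = [
    "cn5   495G 0  58M  41M 102M 89M",
    "cn6   123G 0  41M  27M  92M 88M",
    "cn7   123G 0  41M  27M  99M 88M",
    "cn8   123G 0  42M  26M 103M 87M",
    "cn9   123G 0 158M 121M  97M 47M",
    "cn10  123G 0 158M 121M  97M 47M",
]

by_slab = sort_by_column(rows, 5)
for row in by_slab:
    print(row)
```

Changing the column argument mimics what the arrow keys do in the live display.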

Next, I've chosen to see which network interfaces are busiest based on their packet input rates, showing only the first 10 lines below even though the output actually fills the display:

colmux -addr cn[5-10] -command "-sN" -column 4

# NETWORK STATISTICS (/sec) Thu Nov 17 07:08:13 2011  Connected: 6 of 6
#Host     Num    Name   KBIn  PktIn SizeIn  MultI   CmpI  ErrsI  KBOut PktOut  SizeO   CmpO ErrsO
cn6         1   eth0:      0      5     54      0      0      0      1      5    239      0      0
cn10        1   eth0:      0      4     54      0      0      0      1      4    288      0      0
cn9         1   eth0:      0      4     54      0      0      0      1      4    285      0      0
cn8         1   eth0:      0      4     54      0      0      0      1      4    285      0      0
cn7         1   eth0:      0      4     54      0      0      0      1      4    285      0      0
cn5         1   eth0:      0      4     54      0      0      0      1      4    285      0      0
cn6         0     lo:      0      0      0      0      0      0      0      0      0      0      0
cn5         8    ib3:      0      0      0      0      0      0      0      0      0      0      0
cn5         7    ib2:      0      0      0      0      0      0      0      0      0      0      0
cn5         6    ib1:      0      0      0      0      0      0      0      0      0      0      0

So which process is using the most CPU user time? Remember, you can also use subsystem specific switches and perhaps display which processes across the cluster are doing the most disk I/O too:

colmux -addr cn[5-10] -command "-sZ -i:1" -column 11

# PROCESS SUMMARY (counters are /sec) Thu Nov 17 07:14:56 2011  Connected: 6 of 6
#Host     PID   User     PR  PPID THRD S   VSZ   RSS CP  SysT  UsrT Pct  AccuTime  RKB  WKB MajF MinF Command
cn5      36723  mjs      20 36722    0 R  162M   23M  0  0.11  0.47  57   0:01.82    0    0    0   42 /usr/bin/perl
cn9       3075  root     20     1    0 S    9M  660K  1  0.05  0.01   6  19:41.10    0    0    0  213 irqbalance
cn6        221  root     20     2    0 S     0     0 19  0.00  0.00   0   0:00.00    0    0    0    0 kintegrityd/19
cn6        220  root     20     2    0 S     0     0 18  0.00  0.00   0   0:00.00    0    0    0    0 kintegrityd/18
cn6        219  root     20     2    0 S     0     0 17  0.00  0.00   0   0:00.00    0    0    0    0 kintegrityd/17
cn6        218  root     20     2    0 S     0     0 16  0.00  0.00   0   0:00.00    0    0    0    0 kintegrityd/16

Be sure to try out some of your favorite collectl commands, as there are far too many combinations to show here. By now you should have the basics of multi-line mode down.

Single-Line Format in Real-Time Mode

This is a very powerful mechanism, but understanding when best to use it will vary by situation. Do you remember that the basic concept of collectl's brief mode is to display everything on a single line to make it easier to spot change? While colmux in multi-line mode makes it easy to sort the output, like top it can make it very difficult to spot change. Remember, when looking at the top resource consumers, they can often change from cycle to cycle and the output can be difficult to watch.

Tip - if you want to look at multi-line output across a set of nodes and not have the sort field continually changing, simply sort on the hostname field!

Getting back to single line format, the thing to remember is that less is more. In other words to get the most out of this you should probably settle on one or two variables you want to examine. Let's go back to our command that displays slab memory and change -column 5 to -cols 3,5, which will let us watch cache and slab memory at the same time.

In the following example, you can see that when colmux first starts out the values are set to -1. This is because the remote nodes have not yet been connected. Once they are, the values start to show, with column 3 data on the left side of the display and column 5 data on the right.

colmux -addr cn[5-10] -command "-sm" -cols 3,5

    cn5    cn6    cn7    cn8    cn9   cn10 |     cn5    cn6    cn7    cn8    cn9   cn10
     -1     -1     -1     -1     -1     -1 |      -1     -1     -1     -1     -1     -1
     -1     -1     -1     -1     -1     -1 |      -1     -1     -1     -1     -1     -1
  59816     -1     -1     -1     -1     -1 |  108340     -1     -1     -1     -1     -1
  59816  42516  43036  43416 162692 162692 |  108340  96916 105016 109356 102576 101748
  59816  42516  43036  43416 162688 162692 |  108340  96916 105016 109356 102664 101740
  59816  42516  43036  43416 162688 162692 |  108348  96916 105024 109356 102656 101740

However, aren't the numbers kind of hard to interpret because they're so big? At least they are for me and that's where -colk comes in. It tells colmux to divide each value by 1024 before displaying it. Also, sometimes it's useful to display the sample times and that can be accomplished by including your favorite time format option with the collectl command - this includes dates as well as msec. Also note that since we've added a time option we've actually shifted the columns one to the right, which changes the column numbering.

In some cases, you might also be interested in the totals for each column across all the nodes, and that's what -coltot is for. As a side benefit, -coltot includes column names as well.

colmux -addr cn[5-10] -command "-sm -oT" -cols 4,6 -colk -coltot

#Time       cn5    cn6    cn7    cn8    cn9   cn10 |     cn5    cn6    cn7    cn8    cn9   cn10 |      Cach     Slab
08:21:21     -1     -1     -1     -1     -1     -1 |      -1     -1     -1     -1     -1     -1 |         0        0
08:21:22     58     41     42     42    158    158 |     105     94    102    106    100     99 |       499      606
08:21:23     58     41     42     42    158    158 |     105     94    102    106    100     99 |       499      606
08:21:24     58     41     42     42    158    158 |     105     94    102    106    100     99 |       499      606
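The arithmetic behind -colk and -coltot can be checked with a short Python sketch. It uses the raw Cach and Slab values from the earlier single-line example as input. The details are inferred from the sample output rather than from colmux itself: the numbers shown are consistent with dividing by 1024 and dropping the fraction, with totals taken over the divided values, and with -1 placeholders left alone and excluded from the total. Treat all of that, and the function names, as assumptions.

```python
def scale_k(values):
    """Divide each value by 1024, dropping the fraction; leave -1 as is."""
    return [v // 1024 if v >= 0 else v for v in values]


def column_total(values):
    """Sum the displayed values, ignoring -1 placeholders."""
    return sum(v for v in values if v >= 0)


# Raw values for cn5..cn10 taken from the earlier -cols 3,5 example.
cach_raw = [59816, 42516, 43036, 43416, 162692, 162692]
slab_raw = [108340, 96916, 105016, 109356, 102576, 101748]

cach_k = scale_k(cach_raw)
slab_k = scale_k(slab_raw)

print(cach_k, column_total(cach_k))
print(slab_k, column_total(slab_k))
```

With these inputs the scaled values and totals match the Cach and Slab columns in the output above.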

Perhaps the best way to appreciate what colmux can do in single-line format is to consider the following screenshot. Here we are looking at the infiniband traffic between 6 parallel file-serving nodes and 17 clients. The left-hand columns show the KBOut and the right-hand columns show the KBIn.

Even though you may not easily be able to read the numbers, you can still see that this is a read test because of the high output rates (these are being reported in MB) on the servers, with essentially 0 output on the clients. You can also tell servers 1 and 2 are not participating. Similarly, the bulk of the infiniband input is on the clients and minimal on the servers. It is also easy to see something is wrong with clients 1 and 3 since their input rates are all 0. You can also see erratic behavior on the servers since the numbers are not evenly balanced, and this is affecting the clients as well, which do not finish the test together.

Playback Mode

Playback mode works exactly the same way as real-time mode except you include the -p switch with the collectl command to instruct it to play back the data from a previously recorded raw file. Since the contents of -command are actually passed directly to collectl, there are a few things to remember:

all raw files must be in directories of the same name on all machines
you must wild card the hostname portion of the file name
this command only works for data on a specific date and so that much of the filename must be included

Armed with this knowledge, recall that our previous multi-line real-time memory command looked like this:

colmux -addr cn[5-10] -command "-sm" -column 5

So to play back the data in multi-line format from files recorded on Feb 19, 2011 all that needs to be done is to add -p and the file location to the collectl command like this:

colmux -addr cn[5-10] -command "-sm -p /var/log/collectl/*20110219*" -column 5

The result will be to display up to 10 lines of output, sorted by the specified column. Since the display will run as fast as collectl can send the data back over the ssh connections it will probably fly by you and that's where the -delay switch comes in. It allows you to pause the specified number of seconds between each time sample.

Also keep in mind that you can use --from and --thru switches just as you'd do running collectl standalone.
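If you script your playback runs, the file pattern rules listed earlier (wildcard the hostname, include the date) are easy to get wrong by hand. The Python sketch below assembles the playback command shown above from its parts. The helper name and its parameters are hypothetical, and the sketch assumes raw file names contain the date as YYYYMMDD, as in the example pattern.

```python
import shlex
from datetime import date


def playback_command(addr, subsys, log_dir, day, column):
    """Build a colmux argument list for multi-line playback of one day.

    The hostname part of the raw file name is replaced by a wildcard and
    the date is embedded as YYYYMMDD, mirroring the example above.
    """
    pattern = f"{log_dir}/*{day:%Y%m%d}*"
    collectl_args = f"{subsys} -p {pattern}"
    return ["colmux", "-addr", addr, "-command", collectl_args,
            "-column", str(column)]


cmd = playback_command("cn[5-10]", "-sm", "/var/log/collectl",
                       date(2011, 2, 19), 5)

# shlex.join quotes each argument so the wildcard is not expanded locally.
print(shlex.join(cmd))
```

Passing the list to subprocess.run, rather than a single string, avoids local shell expansion of the wildcard altogether.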

The exact same technique also applies to single-line format, both for the playback switch and for --from and --thru.

updated November 21, 2011