As its name implies, colmux is a collectl multiplexor, which allows one to collect data from multiple systems and treat it as a single data stream, essentially extending collectl's functionality to a set of hosts rather than a single one. Colmux has been tested on clusters of over 1000 nodes but one should also take note that this will put a heavier load on the system on which colmux is running.
Colmux runs in 2 distinct modes: Real-Time and Playback. In real-time mode, colmux actually communicates with instances of collectl running on remote systems which in turn are collecting real-time performance metrics. In playback mode colmux also communicates with a remote copy of collectl but in this case collectl is playing back a data file collected some time in the past.
Colmux can also provide its output in 2 distinct formats: single-line and multi-line. In single-line format colmux reports the multiplexed data from all systems on a single line by allowing the user to choose a small number of variables to display, based on both the display width and the number of systems. While it is possibly to handle more than a couple of dozen systems, (see the example at the bottom of this page), one rarely does so because of the screen width or their ability to read 1-point font. However it is also possible to redirect the output to a file for off-line viewing, via a text editor or a spreadsheet.
Colmux has been extensively tested on versions of collectl from V3.3.6 forward and there have been some additional enhancements made to V3.5.0, which is the recommended minimal version. You should first make sure all the systems of interest have the latest versions of collectl installed or at least those at V3.3.6 or newer.
Colmux also provides the ability for dynamic interaction with the keyboard arrow keys if the optional perl module Term::ReadKey has been installed. To see if this is the case and that colmux can find it run with -v and you should see the following:
colmux -v colmux: 3.0 (Term::ReadKey: V2.30)
|Colmux requires passwordless ssh between it and all hosts it is monitoring|
The inclusion of a playback filename in the collectl command instructs colmux to run in playback mode and the use of colmux's -cols switch tells it to produce output in single-line format. By using various combinations of these switches you can get colmux to run in any 4 distinct modes as shown in the following table:
|Single-line||-cols||-command "-p filename" -cols|
|Multi-line||default||-command "-p filename"|
Let's discuss these 4 options separately to give a better feel for what they actually mean and when you might use them. Note that the 2 operational modes have nothing to do with the way the data is displayed and the 2 formats have nothing to do with the way the data is collected - in other words a complete separation between form and function.
If you've ever run collectl before, and you probably have if you're looking at these utilities, you already know the real-time nature of the tool. The difference here is that with colmux you're actually able to look at multiple systems at the same time. By default, colmux runs in real-time mode unless you explicitly instruct it to run in playback mode by including -p in the collectl command.
The way you tell colmux to run in this mode is to simply point the collectl command at a file with the -p switch as you would normally do when you want to play back a file. The main restriction is that the file needs to exist on all the systems you've pointed colmux to and therefore wild cards are required for portion of the filename and that includes the hostname and are often used for the timestamp portion as well.
Typically one simply uses a collectl playback filename in a format something like /var/log/collectl/*20101225*, wildcarding all but the date of interest. If you want, you can put all the collectl files in one directory on the same system colmux is running on, but this method is restricted to only running on the local system.
During playback by colmux, only the data falling in the same time interval will be reported and so the header that reports how many nodes have had their data included becomes more meaningful in case there is missing data.
Once you've decided which systems you want to monitor and what collectl command you want to execute, you need to decide whether you're interested in single- or multi-line output. Most users will probably be interested in the multi-line format, at least at first.
Multi-line format reports all data provided by collectl in its original format. Further, it sorts it by a column of the user's choice and presents as much data as will fit on the screen. The result is a top-like utilility capable of reporting the top consumers of virtually any resource on the cluster be it the more tradional process statistics or something more exotic such as memory or network consumption.
Since this IS native collectl data it can be virtually anything, including that provided by any custom import modules you may have written. You will also see the identical information in playback mode, though this is presented as scrolling text (there is also a -home switch to display the data in top format if you wish, but unless you include -delay it may scroll by too fast to be of use).
The main consideration with multi-line format is that colmux can only deal with collectl commands that themselves only produce single line output OR multiple lines that are all the same format, noting that data provided by a custom import are considered to be a single device themselves. These include:
While the column number to sort on should also be a consideration, and you can manually select it at startup with -column, you can easily change the sort column once colmux is running by either using the arrow keys (if Term::ReadKey has been installed) or simply typing the desired column number followed by the enter key. This works in both real-time and playback mode.
Below is an example of examing the network traffic on 5 nodes and sorting it by column 2. As you can also see, all 5 nodes have reported data for the interval being displayed and colmux highlights the selected column, though not as the bolded text shown in the examples that follow but rather in reverse video:
colmux -addr 'xx1n[1-5]' -command "-sn" -column 2 # Wed Dec 29 05:42:21 2010 Connected: 5 of 5 # <----------Network----------> #Host KBIn PktIn KBOut PktOut xx1n1 9 82 42 326 xx1n2 9 77 41 320 xx1n5 8 75 41 318 xx1n4 8 74 41 317 xx1n3 8 71 40 314
Unlike multi-line format which only displays output for the top systems which can change from interval to interval, in single-line format you always see the selected data for all systems and it is never sorted but rather reported in a fixed format. Therefore you need to tell colmux which data fields you're interested in when you first start it up. To determine the correct column numbers you can either run the desired collectl command manually and start counting columns, noting the first column is always 1, or you can use colmux's -test switch, which you can also use in multi-line format. This switch will display the header line of collectl's output including the hostname as column 0, with the column(s) you have selected highlighted, as well as a list of all the columns and their numbers for quick reference.
colmux -command "-sc" -test -cols 3,4 >>> Headers <<< # <--------CPU--------> #Host cpu sys inter ctxsw >>> Column Numbering <<< 0 #Host 1 cpu 2 sys 3 inter 4 ctxsw
Once you have decided on the column numbers, there are a couple of other optional switches you may choose to use, including timestamps, data type totals and for very wide displays you can even request the columns be narrower and to divide each value by 1000 or 1024. To preface each line with a timestamp, you actually include the appropriate time format switch with the collectl command itself, rather than using a distinct colmux switch.
Here is an example of the same command to look at network data, except in this case colmux has been instructed report data for only columns 3 and 4, to print time stamps at the beginning of each line and to report totals at the far right. As colmux first starts you can see the data being reported as all -1s since those systems have not yet sent any data back:
colmux -addr 'xx1n[1-5]' -command "-sn -oT" -cols 2,4 -coltot #Time xx1n1 xx1n2 xx1n3 xx1n4 xx1n5 | xx1n1 xx1n2 xx1n3 xx1n4 xx1n5 | KBIn KBOut 05:29:48 -1 -1 -1 -1 -1 | -1 -1 -1 -1 -1 | 0 0 05:29:49 2 2 2 2 2 | 0 0 0 0 0 | 10 0 05:29:50 2 2 2 2 2 | 0 0 0 0 0 | 10 0 05:29:51 9 10 4 8 9 | 41 42 3 41 42 | 40 169 05:29:52 2 2 9 2 2 | 0 0 40 0 0 | 17 40 05:29:53 2 2 2 2 2 | 0 0 0 0 0 | 10 0 05:29:54 2 2 2 2 2 | 0 0 0 0 0 | 10 0
The following screenshot is an example of looking at Infiniband traffic between 16 clients writing to 4 lustre servers. The left half of the display shows network received KB and the right half network transmitted KB. The first 4 columns in each section are the lustre servers and the next 16 columns the clients. As expected during a client write test, the lustre servers show high receive traffic and the clients show high transmit traffic. Look how easy it is to see drops in the client transmission rates even if you can't easily read the numbers. Also notice that the second client isn't doing any transmitting at all and since it's not displaynig -1 we know collectl is running correctly.
Here's an even more dense example showing CPU load on a large cluster which is so wide it takes 3 monitors to display it all. Even though you can't read the output you can still see different patterns as some systems start/stop and others sit idle.
|updated Feb 22, 2011|