Snarky Brill's techblog

Saturday, August 27, 2011

Test driven service operation vs. Nagios et al.

I am constantly thinking about how to handle network and server outages. This was my focus a few years ago, when I worked as an admin/op in a small hosting company and was bombarded by SMSes from our home-brewed server/service monitoring system written in Perl. The system had a bunch of drawbacks, so we decided to replace it with Nagios (3 dot something). It has drawbacks as well, and I would say even more serious ones in some cases.

I was responsible for the Nagios migration, but I am not an op anymore, so my experience with it is less current. From my point of view, and being aware of my limited insight, I can describe some drawbacks of Nagios 3 (= Nagios Core, simply the OSS version you get when you apt-get install nagios3 on Debian...).

The first one, and probably the most severe: the configuration is complicated by nature. In addition, Debian forces (or strongly suggests) some ideas about how you should write the config files. And when you try to follow them, you have to read a manual which does not answer all the questions. For example: are multiple parents in the dependency tree of hosts/services in an AND relation, an OR relation, or something else? You run into lots of small questions like this, and you can eventually Google some answers or dig into the documentation, but it takes time. Anyway, writing Nagios configs takes time, and one should ask oneself: why? Of course you can use the Swiss-made NConf to turn writing configs into clicking configs in a web interface, but that is not a real improvement. Why can't it be automatic? Let's say the system can auto-discover hosts and test-run all the service tests it already knows. If a test comes back OK, the system can suggest that test. It should be able to clone hosts, categorize hosts, make exceptions etc., but on the other hand I prefer text config files over some sophisticated database schema...
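To make the "test-run everything and suggest what passes" idea concrete, here is a minimal sketch. It is not how Nagios or NConf work; the check catalogue, the names KNOWN_CHECKS and suggest_checks, and the choice of plain TCP connects as "service tests" are all assumptions made up for illustration.

```python
import socket
from typing import Callable, Dict, List


def tcp_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Hypothetical catalogue of already-known service tests: name -> check(host).
KNOWN_CHECKS: Dict[str, Callable[[str], bool]] = {
    "ssh": lambda host: tcp_port_open(host, 22),
    "http": lambda host: tcp_port_open(host, 80),
    "nrpe": lambda host: tcp_port_open(host, 5666),
}


def suggest_checks(host: str,
                   checks: Dict[str, Callable[[str], bool]] = KNOWN_CHECKS) -> List[str]:
    """Run every known check against host and return the names that passed."""
    return sorted(name for name, check in checks.items() if check(host))
```

The op would then only confirm or reject the suggested list for each discovered host, and the result could still be written out as a plain text config file.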

Another thing is that sending alarms should be smarter than just triggering scripts that send mail to contacts and messages to pagers, or to cell phones these days. Well, I like the idea of a master alarm. I would like to be able to mark some alarms as not crucial for business/system operation, have them listed in the web interface/reports, and have ops alerted about them in a less aggressive way. I would like to be able to set a threshold for sounding the master alarm and then send this master alarm (once or N times, but without flooding ops with hundreds of different and probably correlated errors). And I would like a permissive and easy-to-use system, not one that does not let me acknowledge all reported errors and, while they are unacknowledged, keeps bothering me with SMSes over and over. I would like the system to accept my input and respect what I do or do not want to save, not like Nagios -> Acknowledge -> Error: You have to write a comment. Wtf? I have a major network problem, I want to investigate what is going on, not write stupid comments, especially when I do not know what to say yet. I just want to stop the SMSes from bothering me.
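A rough sketch of the master alarm idea, under my own assumptions: the Alert class, the plan_notifications function and the default threshold of 5 are invented here and are not part of any Nagios API. Non-crucial alerts only go to the report, and once the number of crucial alerts reaches the threshold, a single master alarm replaces the individual pages.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Alert:
    source: str            # host or service that raised the alert
    message: str
    critical: bool = True  # False = listed in reports only, never paged


def plan_notifications(alerts: List[Alert],
                       master_threshold: int = 5) -> Tuple[List[str], List[str]]:
    """Return (pages, report_lines) for one notification round.

    pages are sent to ops (SMS etc.), report_lines only show up in the
    web interface/reports.
    """
    critical = [a for a in alerts if a.critical]
    report = [f"{a.source}: {a.message}" for a in alerts if not a.critical]
    if len(critical) >= master_threshold:
        # Many, probably correlated, errors: send one page instead of hundreds.
        pages = [f"MASTER ALARM: {len(critical)} critical problems"]
    else:
        pages = [f"{a.source}: {a.message}" for a in critical]
    return pages, report
```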

I would like to have a monitoring cluster, able to monitor the network/servers/services from several locations and give me an overall report. I would like to be able to write my own triggers on errors/warnings, to report more complex situations. Let's say that I have a load-balanced cluster of 10 servers and I know that 5 servers would be sufficient. It would make sense not to send an alarm during the night when one of these 10 servers goes down. But it does make sense to send an alarm if only 6 or fewer servers remain operational. Even more complex situations could be described, and it would be nice to be able to set up these triggers easily.
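The 10-server example could be expressed as a trigger like the one below. This is only a sketch: the function name should_page, the night window of 22:00 to 06:00, and the rule that any failure pages during the day are my assumptions, not something the example above prescribes.

```python
from datetime import time


def should_page(servers_up: int, now: time, total: int = 10,
                page_at_or_below: int = 6,
                night_start: time = time(22, 0),
                night_end: time = time(6, 0)) -> bool:
    """Decide whether the cluster state is worth waking somebody up.

    - At or below the threshold: always page, day or night.
    - Above the threshold: page only during the day (assumed policy).
    """
    if servers_up <= page_at_or_below:
        return True
    is_night = now >= night_start or now < night_end
    return servers_up < total and not is_night
```

With the defaults, should_page(9, time(3, 0)) stays quiet at night, while should_page(6, time(3, 0)) pages.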

And I would like to have an overview of my system. I want to see what is going on and what happened in the past, and afterwards write down how I solved the problem, to have an op's log and a tip for next time.

I think that technically it should be relatively easy to run a few thousand tests each minute on a decent Intel server, not to mention parallelization. Then an idea comes up: we have the paradigm/style/philosophy of test-driven development. Why not have test-driven system operation? I think there are two arguments against: the complexity of configuration, and complications with data acquisition and interpretation, i.e. people fear that it would be more complicated to answer the question "what is broken?". But I believe that both "cons" are only drawbacks of current software. Discussion will be appreciated.
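As a sketch of what "test-driven operation" might look like at the lowest level, the snippet below runs a set of operational tests in parallel and returns a name -> passed map. The run_tests name and the worker count are assumptions, and a real system would also need scheduling, timeouts and history.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def run_tests(tests: Dict[str, Callable[[], bool]],
              workers: int = 32) -> Dict[str, bool]:
    """Run all tests concurrently; a test that raises counts as failed."""
    def safe(test: Callable[[], bool]) -> bool:
        try:
            return bool(test())
        except Exception:
            return False

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps the input order, so results line up with the names.
        results = pool.map(safe, tests.values())
        return dict(zip(tests.keys(), results))
```

Answering "what is broken?" then starts with listing the names whose value is False.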

Tuesday, July 12, 2011

Nagios NRPE @OpenIndiana

I tried to install $title and found an excellent howto for (Open?)Solaris 10 here: http://www.utahsysadmin.com/2008/03/14/configuring-nagios-plugins-nrpe-on-solaris-10/

There is a catch, which shows up as a fucking compilation error:

root@spagetka:/usr/src/nrpe-2.12# make
cd ./src/; make ; cd ..
make[1]: Entering directory `/usr/share/src/nrpe-2.12/src'
cc -g -I/usr/include/openssl -I/usr/include -DHAVE_CONFIG_H -o nrpe nrpe.c utils.c -L/usr/lib  -lssl -lcrypto -lnsl -lsocket   
nrpe.c:
"nrpe.c", line 616: invalid source character: <0xffffffe2>
"nrpe.c", line 616: invalid source character: <0xffffff80>
"nrpe.c", line 616: invalid source character: <0xffffff9d>
"nrpe.c", line 616: invalid source character: <0xffffffe2>
"nrpe.c", line 616: invalid source character: <0xffffff80>
"nrpe.c", line 616: invalid source character: <0xffffff9d>
"nrpe.c", line 616: undefined symbol: authpriv
"nrpe.c", line 616: warning: improper pointer/integer combination: arg #2
"nrpe.c", line 618: invalid source character: <0xffffffe2>
"nrpe.c", line 618: invalid source character: <0xffffff80>
"nrpe.c", line 618: invalid source character: <0xffffff9d>
"nrpe.c", line 618: invalid source character: <0xffffffe2>
"nrpe.c", line 618: invalid source character: <0xffffff80>
"nrpe.c", line 618: invalid source character: <0xffffff9d>
"nrpe.c", line 618: undefined symbol: ftp
"nrpe.c", line 618: warning: improper pointer/integer combination: arg #2
"nrpe.c", line 1505: warning: initializer will be sign-extended: -1
"nrpe.c", line 1506: warning: initializer will be sign-extended: -1
"nrpe.c", line 1652: warning: initializer will be sign-extended: -1
"nrpe.c", line 1653: warning: initializer will be sign-extended: -1
cc: acomp failed for nrpe.c
utils.c:
make[1]: *** [nrpe] Error 1
make[1]: Leaving directory `/usr/share/src/nrpe-2.12/src'

*** Compile finished ***

So you have to edit nrpe.c according to the linked howto.
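A note on the "invalid source character" lines: the byte sequence 0xe2 0x80 0x9d is the UTF-8 encoding of the typographic closing double quote (U+201D), so it looks like curly quotes ended up in nrpe.c where the compiler expects plain ASCII ones (that reading of the log is my guess, the howto is the authority on the actual edit). A small sketch for straightening such quotes, with function names of my own choosing:

```python
from pathlib import Path

# Typographic quotes -> their plain ASCII counterparts.
QUOTE_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark (bytes e2 80 9d in UTF-8)
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})


def straighten_quotes(text: str) -> str:
    """Replace curly quotes with ASCII quotes."""
    return text.translate(QUOTE_MAP)


def fix_file(path: str) -> bool:
    """Rewrite the file in place; return True if anything changed."""
    p = Path(path)
    original = p.read_text(encoding="utf-8")
    fixed = straighten_quotes(original)
    if fixed != original:
        p.write_text(fixed, encoding="utf-8")
    return fixed != original
```

Keep a backup of the original file before running something like fix_file("nrpe.c"), since the replacement is applied blindly to the whole file.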

Monday, July 11, 2011

Compiling DBD::mysql @OpenIndiana

I tried to compile the DBD::mysql module from CPAN on oi_148 with the Sun Studio cc, using these commands:

export PATH="$PATH:/usr/mysql/5.1/bin/amd64"
perl -MCPAN -e shell
cpan[1]> install DBD::mysql

I got:

cpan[1]> install DBD::mysql

(...)

I will use the following settings for compiling and testing:

  cflags        (mysql_config) = -I/usr/mysql/5.1/include/mysql  -xprefetch=auto -xprefetch_level=3 -mt -fns=no -fsimple=1 -xbuiltin=%all -xlibmil -xlibmopt -xnorunpath -m64   -DHAVE_RWLOCK_T -DUNIV_SOLARIS
  embedded      (mysql_config) = 
  libs          (mysql_config) = -lrt -L/usr/mysql/5.1/lib/amd64/mysql -R/usr/mysql/5.1/lib/amd64/mysql -lmysqlclient -lz -lsocket -lnsl -lm
  mysql_config  (guessed     ) = mysql_config
  nocatchstderr (default     ) = 0
  nofoundrows   (default     ) = 0
  ssl           (guessed     ) = 0
  testdb        (default     ) = test
  testhost      (default     ) = 
  testpassword  (default     ) = 
  testsocket    (default     ) = 
  testuser      (guessed     ) = root

To change these settings, see 'perl Makefile.PL --help' and
'perldoc INSTALL'.

Checking if your kit is complete...
Looks good
Using DBI 1.616 (for perl 5.008004 on i86pc-solaris-64int) installed in /usr/perl5/site_perl/5.8.4/i86pc-solaris-64int/auto/DBI/
Writing Makefile for DBD::mysql
Writing MYMETA.yml and MYMETA.json
cp lib/DBD/mysql.pm blib/lib/DBD/mysql.pm
cp lib/DBD/mysql/GetInfo.pm blib/lib/DBD/mysql/GetInfo.pm
cp lib/DBD/mysql/INSTALL.pod blib/lib/DBD/mysql/INSTALL.pod
cp lib/Bundle/DBD/mysql.pm blib/lib/Bundle/DBD/mysql.pm
cc -c  -I/usr/perl5/site_perl/5.8.4/i86pc-solaris-64int/auto/DBI -I/usr/mysql/5.1/include/mysql  -xprefetch=auto -xprefetch_level=3 -mt -fns=no -fsimple=1 -xbuiltin=%all -xlibmil -xlibmopt -xnorunpath -m64   -DHAVE_RWLOCK_T -DUNIV_SOLARIS -DDBD_MYSQL_INSERT_ID_IS_GOOD -g  -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D_TS_ERRNO -xO3 -xspace -xildoff   -DVERSION=\"4.019\" -DXS_VERSION=\"4.019\" -KPIC "-I/usr/perl5/5.8.4/lib/i86pc-solaris-64int/CORE"   dbdimp.c
/usr/perl5/5.8.4/bin/perl -p -e "s/~DRIVER~/mysql/g" /usr/perl5/site_perl/5.8.4/i86pc-solaris-64int/auto/DBI/Driver.xst > mysql.xsi
/usr/perl5/5.8.4/bin/perl /usr/perl5/5.8.4/lib/ExtUtils/xsubpp  -typemap /usr/perl5/5.8.4/lib/ExtUtils/typemap  mysql.xs > mysql.xsc && mv mysql.xsc mysql.c
Warning: duplicate function definition 'do' detected in mysql.xs, line 242
Warning: duplicate function definition 'rows' detected in mysql.xs, line 749
cc -c  -I/usr/perl5/site_perl/5.8.4/i86pc-solaris-64int/auto/DBI -I/usr/mysql/5.1/include/mysql  -xprefetch=auto -xprefetch_level=3 -mt -fns=no -fsimple=1 -xbuiltin=%all -xlibmil -xlibmopt -xnorunpath -m64   -DHAVE_RWLOCK_T -DUNIV_SOLARIS -DDBD_MYSQL_INSERT_ID_IS_GOOD -g  -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D_TS_ERRNO -xO3 -xspace -xildoff   -DVERSION=\"4.019\" -DXS_VERSION=\"4.019\" -KPIC "-I/usr/perl5/5.8.4/lib/i86pc-solaris-64int/CORE"   mysql.c
"mysql.xs", line 687: warning: implicit function declaration: mysql_st_next_results
"mysql.xs", line 897: warning: implicit function declaration: is_prefix
Running Mkbootstrap for DBD::mysql ()
chmod 644 mysql.bs
rm -f blib/arch/auto/DBD/mysql/mysql.so
LD_RUN_PATH="/lib:/usr/mysql/5.1/lib/amd64/mysql" /usr/perl5/5.8.4/bin/perl myld cc  -G dbdimp.o mysql.o  -o blib/arch/auto/DBD/mysql/mysql.so  \
           -lrt -L/usr/mysql/5.1/lib/amd64/mysql -R/usr/mysql/5.1/lib/amd64/mysql -lmysqlclient -lz -lsocket -lnsl -lm          \
          
ld: fatal: file dbdimp.o: wrong ELF class: ELFCLASS64
ld: fatal: file processing errors. No output written to blib/arch/auto/DBD/mysql/mysql.so
make: *** [blib/arch/auto/DBD/mysql/mysql.so] Error 1
  CAPTTOFU/DBD-mysql-4.019.tar.gz
  /usr/gnu/bin/make -- NOT OK
'YAML' not installed, will not store persistent state
Running make test
  Can't test without successful make
Running make install
  Make had returned bad status, install seems impossible
Failed during this command:
 CAPTTOFU/DBD-mysql-4.019.tar.gz              : make NO

The problem was that I mixed the 64-bit (amd64) MySQL libraries with the stock 32-bit Perl binaries. The solution was to omit amd64 from the path, so that the 32-bit mysql_config is picked up:

export PATH="$PATH:/usr/mysql/5.1/bin"
perl -MCPAN -e shell
cpan[1]> install DBD::mysql
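If you are not sure which flavour a given object file, library or binary is, the ELF header tells you: the fifth byte of the file (EI_CLASS) is 1 for 32-bit and 2 for 64-bit. Here is a small sketch of such a check; the helper name elf_class is mine, and the file(1) command gives the same answer if you have it at hand.

```python
def elf_class(path: str) -> str:
    """Return 'ELFCLASS32' or 'ELFCLASS64' for an ELF file.

    Reads only the first five bytes: the magic number 0x7f 'E' 'L' 'F'
    followed by the EI_CLASS byte.
    """
    with open(path, "rb") as f:
        ident = f.read(5)
    if len(ident) < 5 or ident[:4] != b"\x7fELF":
        raise ValueError(f"{path} is not an ELF file")
    return {1: "ELFCLASS32", 2: "ELFCLASS64"}.get(ident[4], "unknown")
```

Comparing the result for the perl binary with the one for dbdimp.o would have shown the mismatch before the linker did.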

Wednesday, July 6, 2011

Telling OpenIndiana firewall to accept my rules

Let's say that I have my own rules :-) For the firewall, of course. And I want them loaded into the kernel instead of the rules automatically generated by services, FEA, or whatever.

Then I have to put the rules into files: /etc/ipf/ipf.conf (rules for IPv4) and /etc/ipf/ipf6.conf (rules for IPv6).

For example, they can look something like this:

/etc/ipf/ipf.conf

#################### top section #####################
block in all
pass in quick on lo0 all
#################### end of top section #####################

# special rules here

########################## default policy ################################

pass out all keep state
pass out proto icmp all

pass in proto tcp from any to any port = ssh keep state

# Munin & Nagios
pass in proto tcp from any to any port = 4949 keep state
pass in proto tcp from any to any port = 5666 keep state

pass in proto icmp all

# Traceroute
pass in proto udp from any to any port 33433 >< 33626 keep state

/etc/ipf/ipf6.conf

#################### top section #####################
block in log all
pass in quick on lo0 all
pass out all keep state
pass out proto ipv6-icmp all
#################### end of top section #####################

pass in proto tcp from any to any port = ssh keep state
pass in proto ipv6-icmp all
pass in proto udp from any to any port 33433 >< 33626 keep state

Then you have to change the service parameters to accept the files:

svccfg -s network/ipfilter:default setprop firewall_config_default/policy = astring: custom
svccfg -s network/ipfilter:default setprop firewall_config_default/custom_policy_file = astring: "/etc/ipf/ipf.conf" 
svcadm refresh network/ipfilter

svcadm enable network/ipfilter

And finally verify that the rules are present:

ipfstat -nio
ipfstat -nio6

Saturday, June 25, 2011

OpenIndiana Build 148 @ VMWare ESXi 4.1

I have been trying to run an OpenIndiana server in a virtualized environment (sort of a development and test machine). And I ran into problems, actually into one huge fucking problem at the very beginning: the OI installer CD image does not boot inside a VMWare virtual server when you have more than 4 virtual CPUs set in the virtual machine profile.

Easy solution: set only 2 CPUs and everything works fine. Well, not everything... The fucking OpenIndiana installer freezes at the end instead of restarting the system, but it worked for me after all.

My next article: Why do we virtualize? What the fuck does virtualization bring to our lives, and what does it take? And why do we suffer so miserably with commodity servers...? :-)

Sunday, December 12, 2010

ZFS pools surveillance using Nagios and NRPE

Just a short post: I have found this useful HOWTO, which works almost out of the box. Thanks.

(One small issue was that the path to the zpool binary on FreeBSD is /sbin/zpool, not /usr/sbin/zpool. A trivial change.)

Wednesday, November 17, 2010

Hibernation on Debian unstable (sid) sucks badly

As the header says... It sucks completely. Basically, what do you need to make hibernation work on a pretty decent laptop, for instance a ThinkPad X301 as in my case? First you need a running Debian, of course, a swap partition of sufficient size (= greater than your RAM :-)) and the packages hibernate and uswsusp installed. Then try it... For instance, run s2disk or click some button on your Gnome/KDE/... desktop. It should hibernate (some percentage growing, the disk working, and then it is off), hopefully.

When you turn it on, you may see the message:

Invalidating stale software suspend images

and then the system boots from scratch as if it had been rebooted... What the fuck? Well, in my case the

resume=swap:/dev/sda2

line was missing in /boot/grub/grub.cfg.

Well, you can add it there by hand. (Why is it not there by default? Well, I am not a Debian guy, so I do not even know where to file the bug, but I am providing a hack. Stone me.) Just add something like this line:

GRUB_CMDLINE_LINUX_DEFAULT="quiet resume=swap:/dev/sda2"

to the file /etc/default/grub and run update-grub, which regenerates /boot/grub/grub.cfg from it.

That's it. It worked for me.
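To check after a reboot that the parameter really reached the kernel, you can look at /proc/cmdline. A tiny sketch of that check follows; the helper name resume_device is made up for this post.

```python
from typing import Optional


def resume_device(cmdline: str) -> Optional[str]:
    """Return the value of the resume= kernel parameter, or None if absent."""
    for token in cmdline.split():
        if token.startswith("resume="):
            return token[len("resume="):]
    return None


if __name__ == "__main__":
    # /proc/cmdline holds the command line the running kernel was booted with.
    with open("/proc/cmdline") as f:
        print(resume_device(f.read()))
```

If it prints None, the line from /etc/default/grub did not make it into the generated grub.cfg.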