Massimo Bollati

It's never too late!

The main X4600 specs:

  • 8 dual-core AMD Opteron(tm) Processor 8222 CPUs at 3000 MHz
  • 32 GB of RAM
  • 2 x 76 GB + 1 x 156 GB SAS disks
  • LSI Logic / Symbios Logic SAS1064 PCI-X Fusion-MPT SAS
  • 4 Gigabit Ethernet ports
  • …and a nice ILOM to play with!

Old server but…

  • 7 minutes and a few seconds to compile the whole 4.2 kernel

NUMA testing

numactl -s
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
cpubind: 0 1 2 3 4 5 6 7
nodebind: 0 1 2 3 4 5 6 7
membind: 0 1 2 3 4 5 6 7

numactl -H
available: 8 nodes (0-7)
node 0 cpus: 0 1
node 0 size: 3962 MB
node 0 free: 3794 MB
node 1 cpus: 2 3
node 1 size: 4039 MB
node 1 free: 3932 MB
node 2 cpus: 4 5
node 2 size: 4039 MB
node 2 free: 3849 MB
node 3 cpus: 6 7
node 3 size: 4039 MB
node 3 free: 3820 MB
node 4 cpus: 8 9
node 4 size: 4039 MB
node 4 free: 3928 MB
node 5 cpus: 10 11
node 5 size: 4039 MB
node 5 free: 3786 MB
node 6 cpus: 12 13
node 6 size: 4039 MB
node 6 free: 3882 MB
node 7 cpus: 14 15
node 7 size: 4038 MB
node 7 free: 3969 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  12  12  14  14  14  14  16
  1:  12  10  14  12  14  14  12  14
  2:  12  14  10  14  12  12  14  14
  3:  14  12  14  10  12  12  14  14
  4:  14  14  12  12  10  14  12  14
  5:  14  14  12  12  14  10  14  12
  6:  14  12  14  14  12  14  10  12
  7:  16  14  14  14  14  12  12  10
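
The distance table shows which nodes are the most expensive to reach from each other: here node 0 and node 7, at distance 16. A minimal sketch for pulling that pair out of the numactl -H output with awk. The function name farthest_nodes is hypothetical, and the sketch assumes the nodes are numbered contiguously from 0, as they are on this machine.

```shell
# Read `numactl -H` output on stdin and print the first pair of NUMA nodes
# with the largest distance, followed by that distance.
# Assumption: nodes are numbered 0..N-1, so column position equals node number.
farthest_nodes() {
    awk '
        /^node distances:/ { in_table = 1; next }
        # Table rows look like "  0:  10  12 ..."; the header row starts with "node".
        in_table && $1 ~ /^[0-9]+:$/ {
            row = $1
            sub(/:$/, "", row)
            for (i = 2; i <= NF; i++) {
                if ($i + 0 > max) { max = $i + 0; a = row; b = i - 2 }
            }
        }
        END { if (max) print a, b, max }
    '
}

# Usage on a NUMA machine:
#   numactl -H | farthest_nodes
```

Only the first pair with the maximum distance is reported, which is enough to pick two nodes for a worst-case test.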

stress-ng is a small tool to load and stress a computer system. Now we launch it on node 0 and node 7 to see how the load is spread between the CPUs:

numactl --cpunodebind=0 stress-ng --cpu 2 &
numactl --cpunodebind=7 stress-ng --cpu 2 &

NUMA testing

ILOM

The Integrated Lights Out Manager provides advanced service processor hardware and software that you can use to manage and monitor your Oracle Sun servers.
We can check many hardware parameters, such as the CPU/MB temperatures, PSU presence/faults, hard disk presence/faults, etc.
On the Oracle site there is plenty of very detailed information about their ILOMs.
On the X4600 the ILOM has a dedicated Ethernet port, so we can access it remotely even when the server is off. To keep the ILOM running we only need one PSU switched on, and this consumes only 15 W.
The ILOM shell is very intuitive:

  • help for the online help
  • show for checking
  • set for settings

To check the current power consumption and the available power:

-> show /SP/powermgmt

Properties:  
actual_power = 763  
permitted_power = 1726  
available_power = 3800  
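
These properties are easy to post-process once the output has been captured, for example from a saved ILOM session. A minimal sketch, assuming the key = value layout shown above. The helper names ilom_prop and power_headroom are hypothetical, and reading permitted_power minus actual_power as the remaining headroom is my own interpretation of the figures, not something the ILOM reports.

```shell
# Print the value of one property from ILOM `show` output read on stdin.
# $1 = property name, e.g. actual_power
ilom_prop() {
    awk -F ' = ' -v key="$1" '
        { gsub(/^[ \t]+|[ \t\r]+$/, "") }   # trim the padding around each line
        $1 == key { print $2; exit }
    '
}

# Print permitted_power - actual_power in watts (assumed meaning: headroom).
power_headroom() {
    out=$(cat)
    actual=$(printf '%s\n' "$out" | ilom_prop actual_power)
    permitted=$(printf '%s\n' "$out" | ilom_prop permitted_power)
    echo $((permitted - actual))
}
```

With the values above, power_headroom prints 963.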

To check the estimated maximum power usage of a CPU board ( entry 1 is CPU board 0 ):

-> show /SP/powermgmt/advanced/1

Properties:  
name = CPU board 0 Estimated Maximum Power Usage  
unit = Watts  
value = 143  

Most commonly we will be checking for faults, for example whether an LED is on because of a faulty PSU:

-> show /SYS/PSU_FAULT

Properties:  
type = Indicator  
ipmi_name = SYS/PSFAIL/LED  
value = Off  

We can go deeper and check if the PSU3 is ON:

-> show /SYS/PS3/PWROK

Properties:  
type = Power Supply  
ipmi_name = PS3/PWROK  
class = Discrete Sensor  
value = State Asserted  
alarm_status = cleared  

PSU3 is on and there are no alerts.

  • Now I switch PSU3 OFF ( afterwards we will see how to do this remotely… )

-> show /SYS/PS3/PWROK

Properties:  
type = Power Supply  
ipmi_name = PS3/PWROK  
class = Discrete Sensor  
value = State Deasserted  
alarm_status = major  

The PSU3 is OFF and there is a “major” alert!

Jumping here and there through the ILOM is very interesting but not very practical!
Luckily it is possible to send the alerts through SMTP ( email ), SNMP and/or IPMI… the latter is the best!
On the X4600 there is no SMTPS client, only SMTP without SSL. I don’t like SNMP probes, so IPMI is my preference!
The Intelligent Platform Management Interface provides management and monitoring capabilities independently of the host system’s CPU, firmware ( BIOS or UEFI ) and operating system.
We can trigger alerts with different criteria and send them to a system with an ipmievd daemon.
Here’s an alert received in a remote system after the PSU3 has been switched OFF ( see above ):

Oct 11 19:54:05 MRS ipmievd: 192.168.1.88: Power Supply sensor PS3/VINOK - State Deasserted Asserted

We can tune the alert manager to receive different kinds of alerts.
We can also set the ILOM to send the log to a remote system. The above example ( PSU3 OFF ) generated the following alert logged in the remote system:

Oct 11 19:54:01 192.168.1.88 logmgr[404]: ID = 2937 : Sun Oct 11 19:54:01 2015 : IPMI : Log : critical : ID = 1b2 : 10/11/2015 : 19:54:01 : Power Supply : PS3/PWROK : State Deasserted

This seems almost perfect, but what if we receive an alert and we are away? When we come back we may see that our server has died because of a faulty fan, CPU… not so consoling!

Messing with the ILOM

I wanted the ILOM to send the alerts to my mini server Pokini ( see Vocal Commands project for the Pokini’s details ) because it is always online and does all I need in my “home-datacenter”:
firewall, samba/ssh/ftp server, ownCloud, etc… with just 6 W!
Now there is another task for my Pokini:

  • catch the alerts through ipmievd and syslogd
  • filter the information and if there is something critical send an inquiry to the ILOM using ipmitool
  • gather the information from the ILOM and format it in a way that can be emailed and read by my smartphone!

All of this is done by a 2.1kB script run by cron. The script also checks the core temperature of the Pokini and a few months ago it saved it from certain death.
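
A cron job like this has to avoid reporting the same events on every run. The following is not the 2.1kB script itself, only a minimal sketch of one way to do that step: it remembers how many log lines it has already seen in a state file. The function name new_ilom_events and the state-file approach are assumptions of mine.

```shell
# Print the logmgr/ipmievd lines (except "minor" ones) that were added to
# the log since the previous call.
# $1 = log file, $2 = state file holding the line count seen last time
new_ilom_events() {
    log=$1 state=$2
    seen=0
    [ -f "$state" ] && seen=$(cat "$state")
    total=$(wc -l < "$log" | tr -d ' ')
    # If the log shrank (rotation), start again from the first line.
    [ "$total" -lt "$seen" ] && seen=0
    # No new events is not an error, hence the "|| true".
    tail -n "+$((seen + 1))" "$log" | grep -E 'logmgr|ipmievd' | grep -v minor || true
    echo "$total" > "$state"
}

# Possible use from a cron-driven script (state path is a placeholder):
#   events=$(new_ilom_events /var/log/syslog /var/tmp/ilom.state)
#   [ -n "$events" ] && printf '%s\n' "$events" | mailx -s "ILOM events" events@bollati.info
```

Counting lines is crude but needs no extra tools; a log rotated between two runs is simply read again from the top.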

Here’s the email I received after I switched the PSU3 OFF:
From: ILOM@bollati.info
To: events@bollati.info
Subject: ILOM events
Date: Sun, 11 Oct 2015 19:56:20 +0100
User-Agent: Heirloom mailx 12.5 6/20/10
—————————————

  • 19:53:56 Power Supply PS3/VINOK State Deasserted
  • 19:54:01 Power Supply PS3/PWROK State Deasserted
  • System Status:
  • System Power on
  • Power Overload no
  • Main Power Fault yes
  • Power Control Fault no
  • Drive Fault no
  • Cooling/Fan Fault no

How it works

Filtering the log to produce text that can be easily pronounced by the smartphone TTS ( Text To Speech ):

grep -E 'logmgr|ipmievd' /var/log/syslog | grep -v minor | cut -d " " -f 28-90 | sed -e 's/: IPMI : Log : critical : ID =//g' -e 's/^://' -e 's/^ //' -e 's/ : / /g' -e 's/  */ /g' | grep -v '^$'
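
To see what the filter does to a single line, here is the cut and sed stage wrapped in a function and meant to be fed the logmgr line shown earlier. The name tts_clean is hypothetical, and the sketch assumes the log fields are separated by single spaces, as in that line, so that field 28 is the event time.

```shell
# Reduce a logmgr syslog line to "time sensor-type sensor state",
# which a text-to-speech engine can read without stumbling on the colons.
tts_clean() {
    cut -d ' ' -f 28-90 |
        sed -e 's/: IPMI : Log : critical : ID =//g' \
            -e 's/^://' -e 's/^ //' \
            -e 's/ : / /g' -e 's/  */ /g' |
        grep -v '^$'
}

# Usage:
#   grep logmgr /var/log/syslog | tts_clean
```

For the PS3/PWROK line above this yields "19:54:01 Power Supply PS3/PWROK State Deasserted", the same text that appears in the email.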
Here’s the query to add to the log:

ipmitool -I lanplus -H 192.169.4.88 -U my_user -P my_password chassis status | grep -E -v 'Interlock|Policy|Last|Intrusion|Lockout' | awk -F ":" '{gsub(/false/,"no"); gsub(/true/,"yes"); print $1" "$2}'
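
The same chassis status output can also drive a decision instead of only being emailed. A minimal sketch that lists just the faults that are set, assuming the "Name : value" layout with true/false values that the gsub calls above imply. The function name chassis_faults is hypothetical.

```shell
# Read `ipmitool ... chassis status` output on stdin and print the name of
# every "... Fault" line whose value is true.
chassis_faults() {
    awk -F ':' '
        $1 ~ /Fault/ && $2 ~ /true/ {
            name = $1
            sub(/[ \t]+$/, "", name)   # drop the column padding
            print name
        }
    '
}

# Usage (credentials as in the query above):
#   ipmitool -I lanplus -H <ilom-address> -U my_user -P my_password chassis status | chassis_faults
```

An empty result means no fault flag is raised, so a script can stay silent in that case.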
This is almost perfect, but what if we receive an email alert about overheating while we are away from home or the office?
It would only be useful for a “post mortem” diagnosis of the server!

Controlling the ILOM vocally

We need to talk with our server and if necessary take action. Let’s imagine the worst scenario:
we are far from our “datacentre/home/office” and from a PC.
We receive an email from our system with an alert for a Cooling/Fan fault…
My smartphone reads the subject and the sender of the emails, so I’m aware that there is a problem.
This is how I enquire about the status of the ILOM:

  • I use Klets to create the vocal command ( Klets is a free Android application which allows you to create your own vocal commands ).
  • The vocal command launches a Tasker task ( Tasker is a well-known Android application ).
  • Tasker uses JuiceSSH with an SSH plugin.
  • The Tasker task connects to the Pokini through SSH and executes the script to gather the required information.

The Pokini’s script sends an enquiry to the ILOM gathering information about temperatures, faults, etc.
This information is then formatted to be easily pronounceable by my telephone.
The telephone shows a full screen message and a notification.
At the same time the TTS reads the message and auto-sends a text message with all the info.
The text message can be read by the TTS anytime even without data connection.
We can check as often as we want and if the alert is still valid we can take an action:
We can shut down the server nicely or nastily with a single word like ilomstop or ilomforcestop and get some feedback.
I have only three vocal commands for my server: ilomstart ( start the host ), ilomstop ( stop the host) and ilomstatus ( gather information about the status ).
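
On the Pokini side, each vocal command only has to be translated into an ipmitool request. A minimal sketch of such a dispatcher; the mapping to ipmitool subcommands is my assumption ( soft for a clean ACPI shutdown, off for a forced one ), not necessarily what the real script does, and ilom_args is a hypothetical name.

```shell
# Translate a vocal command name into ipmitool chassis arguments.
# Prints the arguments; returns 1 for an unknown command.
ilom_args() {
    case $1 in
        ilomstart)     echo "chassis power on" ;;
        ilomstop)      echo "chassis power soft" ;;   # ask the OS to shut down
        ilomforcestop) echo "chassis power off" ;;    # cut power immediately
        ilomstatus)    echo "chassis status" ;;
        *)             echo "unknown command: $1" >&2; return 1 ;;
    esac
}

# Usage in the script called over SSH (word splitting of $args is intended):
#   args=$(ilom_args "$1") && ipmitool -I lanplus -H <ilom-address> -U my_user -P my_password $args
```

Keeping the mapping in one function means the SSH side only ever accepts these few words and nothing else reaches ipmitool.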
Full screen feedback

“Down” is not really POWER OFF…

The X4600’s power supplies don’t actually power off when the server is shut down.
I need only one of them ON to keep the ILOM powered to receive my commands.
The other three PSUs consume about 50W, doing nothing other than making a lot of noise!

Smartphone Notification

OFF should be OFF

I bought a Power Management System and Surge Protector with a USB interface; a LAN one would be better, but it would cost almost what I paid for my server!

The sequence is simple:

  • The managed socket is connected to the Pokini USB port
  • On the Pokini a small program, sispmctl, can switch each plug on the managed socket ON/OFF, set a timer and report its status
  • I made a very simple web interface ( big buttons for my smartphone screen ) for sispmctl and now I can switch ON/OFF each of the four PSUs remotely!
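
Behind each big button there only needs to be one sispmctl call. A minimal sketch of the translation from a button press to sispmctl options, using its -o ( on ), -f ( off ) and -g ( get status ) flags. The function name sispm_args and the one-outlet-per-PSU numbering from 1 to 4 are assumptions.

```shell
# Translate "action outlet" into sispmctl options.
# $1 = on | off | status, $2 = outlet number 1-4 (assumed: one outlet per PSU)
sispm_args() {
    case $2 in
        1|2|3|4) ;;
        *) echo "outlet must be 1-4" >&2; return 1 ;;
    esac
    case $1 in
        on)     echo "-o $2" ;;
        off)    echo "-f $2" ;;
        status) echo "-g $2" ;;
        *)      echo "action must be on, off or status" >&2; return 1 ;;
    esac
}

# Usage from the web interface backend (word splitting of $opts is intended):
#   opts=$(sispm_args off 3) && sispmctl $opts
```

Validating both values before calling sispmctl keeps anything typed into the web form from reaching the command line unchecked.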

sispmctl web gui
Managed Sockets

