SaltTemperatureLoggingAndShutdown

For more information about SALT

https://cs.uwaterloo.ca/twiki/view/CF/SaltStackCSCF

Determine if the target system is a SALT minion system

On the master machine:
  $ ssh linux.cscf
  ubuntu1404-202:~$ sudo -s
  ubuntu1404-202: # ssh salt-rsg-1604
  salt-rsg-1604:# salt "our_minion_machine_FQDN" test.ping
   - our minion machine
  True
If the "test.ping" returns true skip to the section "To begin temperature logging"

SSH is installed on yo"> Make sure SSH is installed on your minion

   apt-get install openssh-server

Put the linux.cscf root key on the minion system

  $ ssh linux.cscf.uwaterloo.ca
  ubuntu1404-202:$ sudo -s
  root@ubuntu1404-202:# cd /root
  root@ubuntu1404-202:/root# cd .ssh
  root@ubuntu1404-202:/root/.ssh# scp id_dsa.pub cscf-adm@our-minion-machine:
  root@ubuntu1404-202:/root/.ssh# scp id_ed25519.pub cscf-adm@our-minion-machine:
On the minion machine:
  # cd /root/.ssh
  # cat /home/cscf-adm/id_dsa.pub >> authorized_keys2
  # cat /home/cscf-adm/id_ed25519.pub >> authorized_keys2

Make sure curl is installed on the minion.

   dpkg -l | grep curl
   apt-get install curl

Install the SALT minion script

See https://cs.uwaterloo.ca/twiki/view/CF/SaltStackCSCF
  # curl -L https://bootstrap.saltstack.com -o install_salt.sh
  # sh install_salt.sh
  # ls /etc/salt
  minion minion.d minion_id pki proxy proxy.d
  # ls -l /etc/salt/minion
  -rw-r--r-- 1 root root 35305 Oct 4 12:02 /etc/salt/minion
  # cd /etc/salt
  /etc/salt# vi minion
  # head -2 minion
  master: salt-rsg-1604.cscf.uwaterloo.ca
  hash_type: sha256

Start the communication with the master

On the minion machine:
  # systemctl restart salt-minion
  # /etc/init.d/salt-minion restart
  # systemctl status salt-minion
  salt-minion.service - The Salt Minion
  Loaded: loaded (/lib/systemd/system/salt-minion.service; enabled; vendor preset: enabled)
  Active: active (running) since Wed 2017-11-15 12:12:13 EST; 1min 42s ago
  Docs: man:salt-minion(1)
  file:///usr/share/doc/salt/html/contents.html
  https://docs.saltstack.com/en/latest/contents.html
  Main PID: 26985 (salt-minion)
  CGroup: /system.slice/salt-minion.service
  ├─26985 /usr/bin/python /usr/bin/salt-minion
  ├─26990 /usr/bin/python /usr/bin/salt-minion
  └─26994 /usr/bin/python /usr/bin/salt-minion
  Nov 15 12:12:24 salt-minion[26985]: [ERROR ] The Salt Master has cached the public key for this node, this salt minion will wait for 10 seconds before attempting to re-authenticate

On the master machine:
  $ ssh linux.cscf
  ubuntu1404-202:~$ sudo -s
  ubuntu1404-202: # ssh salt-rsg-1604
  salt-rsg-1604:~# salt-key
  Accepted Keys:
   - lists all machines already connected to the master
  Denied Keys:
  Unaccepted Keys:
   - our minion machine will be listed here
  Rejected Keys:

Have the master accept the key.

  salt-rsg-1604:# salt-key -a "our_minion_machine_FQDN"
  The following keys are going to be accepted:
  Unaccepted Keys:
   - our minion machine
  Proceed? [n/Y] Y
  Key for minion  accepted.
  salt-rsg-1604:# salt-key
  Accepted Keys:
   - our minion machine
   - lists all machines already connected to the master
  Denied Keys:
  Unaccepted Keys:
  Rejected Keys:

  salt-rsg-1604:# salt "our_minion_machine_FQDN" test.ping
   - our minion machine
  True

Note: The above "our_minion_machine_FQDN" may not work. Just use the machine name and add the FQDN name to salt-rsg-1604 /etc/hosts file.

To begin temperature logging

If your minion is a Supermicro server

Create a path to a file named "max" from the /srv/saltstack/pillar/file_tree/hosts directory:
  root@salt-rsg-1604:/srv/saltstack/pillar/file_tree/hosts# mkdir -p "our_minion_machine_FQDN"/temperature
  root@salt-rsg-1604:/srv/saltstack/pillar/file_tree/hosts# echo '99' > "our_minion_machine_FQDN"/temperature/max
Run the next two commands:
  salt-rsg-1604:/srv/saltstack/pillar# salt 'our_minion_machine_FQDN' state.apply common.monitoring.temperature state=True
  salt-rsg-1604:/srv/saltstack/pillar# salt 'our_minion_machine_FQDN' state.apply common.monitoring.temperature
If the Supermicro minion uses an IPMI temperature sensor named System or Sys then SALT will automatically set its maximum temperature to 99 degrees.

Have the master create the maximum temperature before shutdown

To set a maximum temperature shutdown value first view the temperature logs after a few hours of logging.
The maximum temperature is that temperature at which the minion will auto-shutdown.
View the file /var/log/temperature-current-week
This will show the normal internal temperature of the minion.
In the example below a maximum temperature of 45 was chosen which should be higher than any temperature found in the log file /var/log/temperature-current-week

  salt-rsg-1604:/srv/saltstack/pillar/file_tree/hosts# mkdir "our_minion_machine_FQDN"
  salt-rsg-1604:/srv/saltstack/pillar/file_tree/hosts# cd "our_minion_machine_FQDN"
  salt-rsg-1604:/srv/saltstack/pillar/file_tree/hosts/"our_minion_machine_FQDN"# mkdir temperature
  salt-rsg-1604:/srv/saltstack/pillar/file_tree/hosts/"our_minion_machine_FQDN"# cd temperature
  salt-rsg-1604:/srv/saltstack/pillar/file_tree/hosts/"our_minion_machine_FQDN"/temperature# echo '45' > max         Note 45 is an example temperature. Change it to your desired maximum.
  salt-rsg-1604:/srv/saltstack/pillar/file_tree/hosts/"our_minion_machine_FQDN"/temperature# cd ../../..
  salt-rsg-1604:/srv/saltstack/pillar# cd common/temperature
  salt-rsg-1604:/srv/saltstack/pillar/common/temperature# ls
  ambient.sls default.sls pci.sls sys.sls system.sls inlet.sls

If the minion uses an IPMI temperature sensor named "System" or just "Sys" then SALT will automatically set its maximum temperature to 99 degrees.
This suggests that if only temperature logging is desired don't create the /srv/saltstack/pillar/file_tree/hosts/"our_minion_machine_FQDN"/temperature/max file.

If your minion is not a Supermicro server

Install the ipmitool package on the minion.
   apt-get install ipmitool
   apt-get install freeipmi-tools
   modprobe ipmi_si type=kcs ports=0xCA2 regspacings=1
   modprobe ipmi_devintf
   modprobe ipmi_msghandler
   modprobe ipmi_poweroff
   modprobe ipmi_watchdog

Find the sensor output of ipmitool.

   ipmitool sdr list

Make sure "ipmitool sdr list" reported some data.
From the output determine the system or ambient temperature of the machine.

   ipmitool sdr list
   CPU1 Temp        | 53 degrees C      | ok
   CPU2 Temp        | 54 degrees C      | ok
   PCH Temp         | 44 degrees C      | ok
   System Temp      | 36 degrees C      | ok
   Peripheral Temp  | 43 degrees C      | ok
   MB_10G Temp      | 58 degrees C      | ok
   ...

  • Here the "System" temperature is what will be used for retrieving normal "room" temperature operation. Other temperature sensor names include:
    ambient for some Sun machines
    PCI Area Temp for Celestica servers
    Inlet for Huawei machines
  • Of course since the system itself will generate heat this temperature will not be the actual room temperature, but a steady temperature that fluctuates with room temperature

Install the logging script via SALT

SALT is able to track several temperature sensors names. The list is found at:
salt-rsg-1604:/srv/saltstack/pillar/common/temperature# ls
ambient.sls  default.sls  pci.sls  sys.sls  system.sls inlet.sls
If the minion's temperature sensor name found by ipmitool is one of "ambient","PCI", "Sys", "System", or "Inlet" then proceed to the next step at "Add the minion to the top.sls file".
Otherwise create a file naming the sensor found in "ipmitool sdr list" like the example given below:
salt-rsg-1604:/srv/saltstack/pillar/common/temperature# cat pci.sls
temperature:
  sensor: 'PCI'

Add the minion to the top.sls file in salt-rsg-1604 at /srv/saltstack/pillar/

   salt-rsg-1604:/srv/saltstack/pillar# cat top.sls
      base:
        '*':
          - common
          - salt
          - temperature.default

        'our_minion_machin_FQDN':
          - temperature.pci                   <- Note this is the temperature sensor-name "pci".  Determine that name from the ipmitool output "ipmitool sdr list"

Run this SALT command to enable the temperature logging script.

  salt-rsg-1604:/srv/saltstack/pillar/common/temperature# salt 'our_minion_machine_FQDN' state.apply common.monitoring.temperature state=True
  salt-rsg-1604:/srv/saltstack/pillar/common/temperature# salt 'our_minion_machine_FQDN' state.apply common.monitoring.temperature

SALT will

  • Put the script in root's folder /root/ipmitool-log-temperature
  • The script is owned and executable by root
  • Create files /var/log/temperature-week{1,2,3,4} and temperature-current-week
  • Add the script to the root crontab
   */2 * * * *  /root/ipmitool-log-temperature sensor-name
where the sensor-name was previously determined as the machine's temperature sensor

/root/ipmitool-log-temperature

#!/bin/bash
# Gordon Boerke 2017-11-21 (RT#672618)
#
# sensor is the ipmitool sensor reporting the room temperature.
# The script is executed as:     ipmitool-log-temperature sensor-name
#  One must determine the "sensor-name"
# Test first the command to extract the temperature:
#  ipmitool sdr list | grep "the-sensor-name"|head -1|sed 's/[^0-9]*//'|sed -r 's/(.{2}).*/\1/'
#
sensor=$1
getroomtemp=$(ipmitool sdr list | grep $sensor|head -1|sed 's/[^0-9]*//'|sed -r 's/(.{2}).*/\1/')

fileOldTime=`stat -c %Y /var/log/temperature-week4`
echo $(date +"%Y-%m-%d %H:%M") $getroomtemp "degrees" >> /var/log/temperature-current-week
fileNewTime=`stat -c %Y /var/log/temperature-current-week`

declare -i difference
difference=fileNewTime-fileOldTime

if [ $difference -gt 604800 ];then
  cp /var/log/temperature-week2 /var/log/temperature-week1
  cp /var/log/temperature-week3 /var/log/temperature-week2
  cp /var/log/temperature-week4 /var/log/temperature-week3
  cp /var/log/temperature-current-week /var/log/temperature-week4
  true > /var/log/temperature-current-week
fi

High temperature shutdown and alerts

After reviewing the temperature logs determine normal temperature operating range. From that information select a maximum system temperature to shut down the minion. This is a qualitative decision. The system temperature will follow the room temperature. A high temperature can be set to shutdown the system or alert users of excessive temperature.

SALT will

  • Put the script in root's folder /root/ipmitool-max-temperature
  • The script is owned and executable by root
  • Add the script to the root crontab
   */2 * * * *  /root/ipmitool-max-temperature sensor-name maximum-temperature-allowed
where the sensor-name was previously determined as the System or ambient temperature sensor and maximum-temperature-allowed is determined to be the highest allowed temperature before shutdown
Note a warning will be sent out via "wall" to users once the temperature is within 5 degrees of the max.
ipmitool-max-temperature will call another script named "external-shutdown".
The external-shutdown script will send shutdown signals to other machines. See explanation below.

/root/ipmitool-max-temperature

#!/bin/bash
# Gordon Boerke 2017-11-21 (RT#672618)
#
# sensor is the ipmitool sensor reporting the room temperature.
# temp_max is the temperature at which to shutdown the machine.
# warn_temp is the temperature to report a warning of excessive heat build-up.
# The script is executed as:
#        ipmitool-max-temperature sensor-name maximum-temperature-allowed
#   One must determine the "sensor-name" and a "maximum-temperature-allowed" value
# Test with the following command to extract the temperature:
#   ipmitool sdr list | grep "the-sensor-name"|head -1|sed 's/[^0-9]*//'|sed -r 's/(.{2}).*/\1/')
#
sensor=$1
declare -i temp_max
temp_max=$2
declare -i warn_temp

getroomtemp=$(ipmitool sdr list | grep $sensor|head -1|sed 's/[^0-9]*//'|sed -r 's/(.{2}).*/\1/')

if [ "$getroomtemp" -gt $temp_max ];then
  echo "System temperature is at "$getroomtemp" which is above the max "$temp_max | wall -n 2>&1 > /dev/null
  echo "Shutting the system down now" | wall -n 2>&1 > /dev/null
   [ -f /root/external-shutdown ] && /root/external-shutdown
   /sbin/shutdown -h now
else
  warn_temp=$temp_max-5
  if [ "$getroomtemp" -gt $warn_temp ];then
   echo "System temperature is at "$getroomtemp" which is close to the max "$temp_max | wall -n 2>&1 > /dev/null
   echo "The machine will shutdown once the temperature exceeds "$temp_max | wall -n 2>&1 > /dev/null
   [ -f /root/external-shutdown2 ] && /root/external-shutdown2
  fi
fi

The external-shutdown script will send a shutdown command via SSH to other machines. Typically these machines cannot be configured with SALT.
The script will be placed in the same location as the ipmitool-max-temperature and ipmitool-log-temperature scripts under /root.
SSH keys must be added to the machines.

/root/external-shutdown

#!/bin/bash
# Gordon Boerke 2017-12-18 (RT#672618)
#
# This script is called from another script ipmitool-max-temperature.
# This script will send shutdown commands to other machines.
# Other machines to be determined. They must have this machines SSH key.
#
# Report a machine shutdown first:
#
# echo "Machine xxx.cs.uwaterloo.ca is shutting down now" | wall -n 2>&1 > /dev/null
# ssh xxx.cs.uwaterloo.ca 'shutdown -h now'
#  or whatever the shutdown command may be for the system
#
# In the "echo" above include the "| wall -n 2>&1 > /dev/null" as the ipmitool-max-temperature script is run from cron.

/root/external-shutdown2
This will be the same as /root/external-shutdown. This second script will run at 5 degrees below the maximum level.

Include email notification

Emailing a notification that the temperature is at or above the warning temperature requires a flag set so as not to continuously email the notification.

New flag file /root/ipmitool-email-temperature Use a value of 0 in the file to represent the machine running at normal temperature. Set the value to 1 when the machine temperature hits or exceeds the warning temperature and has sent an email.

/root/ipmitool-max-temperature

#!/bin/bash
# Gordon Boerke 2017-11-21 (RT#672618)
#
# sensor is the ipmitool sensor reporting the room temperature.
# temp_max is the temperature at which to shutdown the machine.
# warn_temp is the temperature to report a warning of excessive heat build-up.
# The script is executed as:
#        ipmitool-max-temperature sensor-name maximum-temperature-allowed
#   One must determine the "sensor-name" and a "maximum-temperature-allowed" value
# Test with the following command to extract the temperature:
#   ipmitool sdr list | grep "the-sensor-name"|head -1|sed 's/[^0-9]*//'|sed -r 's/(.{2}).*/\1/'
#
sensor=$1
declare -i temp_max
temp_max=$2
declare -i warn_temp

getroomtemp=$(ipmitool sdr list | grep $sensor|head -1|sed 's/[^0-9]*//'|sed -r 's/(.{2}).*/\1/')

if [ "$getroomtemp" -gt $temp_max ];then
  echo "System temperature is at "$getroomtemp" which is above the max "$temp_max | wall -n 2>&1 > /dev/null
  echo "Shutting the system down now" | wall -n 2>&1 > /dev/null
  # Have the email flag set back to "false" = 0, ready for when the machine is turned back on.
  echo "0" > /root/ipmitool-email-temperature
  # Send the shutdown to other machines by executing the external-shutdown script.
  [ -f /root/external-shutdown ] && /root/external-shutdown
  /sbin/shutdown -h now
else
  warn_temp=$temp_max-5
  if [ "$getroomtemp" -gt $warn_temp ];then
    echo "System temperature is at "$getroomtemp" which is close to the max "$temp_max | wall -n 2>&1 > /dev/null
    echo "The machine will shutdown once the temperature exceeds "$temp_max | wall -n 2>&1 > /dev/null
    # Send the shutdown to other machines by executing the externa-shutdown2 script.
    [ -f /root/external-shutdown2 ] && /root/external-shutdown2
    # Send out an email warning. The email flag is in the file /root/ipmitool-email-temperature
    input="/root/ipmitool-email-temperature"
    while read line
      do
        email="$line"
      done < "$input"
    # If an email hasn't already been sent then send it out now. Set the email flag to 1.
    if [ $email -eq 0 ];then
      echo "1" > /root/ipmitool-email-temperature
      emaildomain="@cs.uwaterloo.ca"
      host="$(hostname)"
      emailaddress=$host$emaildomain
      /usr/bin/sendemail -u "System is overheating" -m "Warning that $host is reaching its shutdown temperature." -f $emailaddress -t gboerke@uwaterloo.ca -s connect.uwaterloo.ca:25
    fi
  # If we are no longer in the warning temperature range, then reset the email flag. 
  else
    input="/root/ipmitool-email-temperature"
    while read line
      do
        email="$line"
      done < "$input"
    # If an email has already been sent then send out an "alls clear now". Reset the email flag to 0.
    if [ $email -eq 1 ];then
      emaildomain="@cs.uwaterloo.ca"
      host="$(hostname)"
      emailaddress=$host$emaildomain
      /usr/bin/sendemail -u "System is within acceptable temperature range" -m "$host is operating below its shutdown temperature." -f $emailaddress -t gboerke@uwaterloo.ca -s connect.uwaterloo.ca:25
      # Reset the email flag.
      echo "0" > /root/ipmitool-email-temperature
    fi
  fi
fi

Problems Encountered

No temperatures are being logged

Make sure that the system has an IPMI interface.
Make sure that ipmitool was installed. If not then try installing it manually.
   apt-get install ipmitool
   apt-get install freeipmi-tools
   modprobe ipmi_si type=kcs ports=0xCA2 regspacings=1
   modprobe ipmi_devintf
   modprobe ipmi_msghandler
   modprobe ipmi_poweroff
   modprobe ipmi_watchdog

Find the sensor output of ipmitool.

https://cs.uwaterloo.ca/twiki/view/CF/IPMI https://cs.uwaterloo.ca/twiki/view/CF/ClusterToolsIPMITOOL
   ipmitool sdr list
Make sure "ipmitool sdr list" reported some data.

Determine the system or ambient temperature of the machine.

   ipmitool sdr list
   CPU1 Temp        | 53 degrees C      | ok
   CPU2 Temp        | 54 degrees C      | ok
   PCH Temp         | 44 degrees C      | ok
   System Temp      | 36 degrees C      | ok
   Peripheral Temp  | 43 degrees C      | ok
   MB_10G Temp      | 58 degrees C      | ok
   ...

  • Here the "System" temperature is what will be used for retrieving normal "room" temperature operation. "System" is the default used by our SALT temperature logging.
  • Of course since the system itself will generate heat this temperature will not be the actual room temperature, but a steady temperature that fluctuates with room temperature

Typically Supermicro motherboards report System temperature, which is the default for our SALT configuration.
Other accepted temperature sensor key words are:

*ambient* for some Sun machines
*PCI* Area Temp.
*Sys* for some older Supermicro machines.
*Inlet* for Huawei machines.
  Choose one that fluctuates with room temperature fluctuation.

-- GordBoerke - 2017-12-04

Edit | Attach | Watch | Print version | History: r25 < r24 < r23 < r22 < r21 | Backlinks | Raw View | WYSIWYG | More topic actions
Topic revision: r25 - 2021-01-19 - GordBoerke
 
This site is powered by the TWiki collaboration platform Powered by PerlCopyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback