2.9. Hardware Monitoring

Overview

When cOS Stream is running on dedicated Clavister hardware (and not in a virtual environment), various sensors in the hardware will provide cOS Stream with information about the status of the hardware components. This information ranges from temperatures and fan speeds to whether components, such as power supply units, are operational.

The Sensor Polling Interval

By default, cOS Stream polls all hardware sensors every 5 seconds and stores the last retrieved sensor information until the next poll overwrites it. This 5 second interval can be changed globally by the administrator to another value between 1 and 30 seconds. For example, if the polling interval is to be changed to 20 seconds, the following CLI command would be used:
System:/> set Settings HWMONSettings SensorRefreshInterval=20

Types of Hardware Monitoring

The Clavister NetShield Firewall provides two types of hardware monitoring that make use of the information collected by sensor polling:

  • User Hardware Monitoring

  • System Hardware Monitoring

These types are explained in the sections that follow.

2.9.1. User Hardware Monitoring

In contrast to System Hardware Monitoring, the feature called User Hardware Monitoring is used to generate log event messages for specific sensors and conditions that are defined by the administrator.

Enabling User Hardware Monitoring

To enable user hardware monitoring for a sensor, an HWMONMonitor object must be created in the configuration and associated with the sensor. Each HWMONMonitor object can be associated with only one sensor.

Assume that monitoring of the CPU temperature is required. If the sensor name is CPU_Temp and the new object is to be called mon1, user monitoring for this sensor can be enabled with the following CLI command:

System:/> add HWMONMonitor mon1 SensorID=CPU_Temp

The HWMONMonitor object created will generate an event message when its sensor has a value that moves outside the thresholds defined for the object (in the case of temperatures and fan speeds) or when the status changes (in the case of a failed power supply).

Another log message is generated as the sensor value moves back to a normal value and becomes equal to the threshold value. In the above command, no minimum or maximum threshold values are specified so cOS Stream will use the default thresholds for this sensor. All sensors have default thresholds already defined. Note that the SensorID property must always be specified when creating an HWMONMonitor object.

By default, a severity of Warning is used for all log messages generated. The severity can be explicitly set by changing the Severity property of an HWMONMonitor object. For example, to set the severity to Emergency for the mon1 object, the CLI command would be:

System:/> set HWMONMonitor mon1 Severity=Emergency

Sensor Types

Sensors are of two types:
  • Numeric sensors - These provide numeric values. For example, temperature or fan speed.

  • Binary sensors - These provide a value that is either zero or one. For example, a PSU being operational or not operational.

Understanding the thresholds of a numeric sensor is straightforward but binary sensors require some additional explanation. Binary sensors have a value which is either 1 or 0. However, it depends on the sensor which value indicates a "normal" condition. For example, one sensor could indicate that a PSU is installed and its normal value is 1 (the PSU is present). Another sensor might indicate that a PSU has failed and its normal value is 0 (the PSU is operational and has not failed).

The following should be noted about HWMONMonitor object thresholds for binary sensors:

  • A binary sensor that has a normal value of 1 will have low and high thresholds that are both 1. When the sensor's value becomes 0, this will indicate an abnormal condition.

  • A binary sensor that has a normal value of 0 will have low and high thresholds that are both 0. When the sensor's value becomes 1, this will indicate an abnormal condition.
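The threshold behavior for both sensor types can be illustrated with a minimal Python sketch. This is not cOS Stream code. The function name classify is hypothetical, and the sketch assumes that a value is abnormal only when it lies strictly outside the thresholds, so a value equal to a threshold counts as normal.

```python
def classify(value, low, high):
    """Compare a sensor value with a monitor's low and high thresholds."""
    if value < low:
        return "ALARM:LOW"
    if value > high:
        return "ALARM:HIGH"
    return "NORMAL"


# Numeric sensor: a temperature of 58 with thresholds of 0 and 98.
print(classify(58, 0, 98))
# Binary sensor with a normal value of 1: both thresholds are 1.
print(classify(0, 1, 1))
# Binary sensor with a normal value of 0: both thresholds are 0.
print(classify(1, 0, 0))
```

Because both thresholds of a binary sensor equal its normal value, the only other possible value always falls outside them. This is why the default thresholds of binary sensors are sufficient to detect a change of state.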

Changing Threshold Values

The lower and upper thresholds for an HWMONMonitor object associated with a numeric sensor can be changed from the default, as shown in the following example:
System:/> set HWMONMonitor mon1 LowThresh=-5 HighThresh=100
Note that negative values can be used for temperatures.

[Important] Important: Do not change thresholds of binary sensors

The thresholds associated with a binary sensor, such as PSU failure, should not be changed by the administrator. The default threshold values should always be used.

Changing the Monitoring Interval for the HWMONMonitor Object

The Interval property of an HWMONMonitor object determines how often, in seconds, the data from the object's associated sensor is examined. A higher value for this property will cause fewer log messages to be sent. For example, to examine the CPU temperature every 30 seconds, the CLI command would be as follows:
System:/> set HWMONMonitor mon1 Interval=30
Note that this interval does not affect the rate at which the sensors are polled, which is controlled by the global setting SensorRefreshInterval (described at the beginning of this topic). The Interval property only affects the frequency with which the currently stored sensor values are examined by the HWMONMonitor object to determine if a log message should be generated.
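The relationship between the two intervals can be pictured with the following Python sketch. It is only an illustration under stated assumptions: the function examined_values and the read_sensor callback are hypothetical, time advances in whole seconds, and both timers are assumed to start together at time zero, which is not something this section specifies.

```python
def examined_values(read_sensor, duration, refresh_interval=5, monitor_interval=30):
    """Return the (time, value) pairs that a monitor would examine."""
    stored = None
    seen = []
    for t in range(duration + 1):
        if t % refresh_interval == 0:
            # The global poll overwrites the stored sensor value.
            stored = read_sensor(t)
        if t % monitor_interval == 0:
            # The monitor only looks at the currently stored value.
            seen.append((t, stored))
    return seen


# A made-up sensor whose reading rises by one unit per second.
print(examined_values(lambda t: 50 + t, 60, refresh_interval=20, monitor_interval=30))
```

With a polling interval of 20 seconds and a monitor interval of 30 seconds, the examination at 30 seconds sees the value that was stored by the poll at 20 seconds, since no newer value has been retrieved.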

Log Messages

A message is generated when the sensor value passes outside the thresholds specified. A second log message is generated when the sensor value passes back inside the thresholds specified. Similarly, for a binary sensor, a log message is generated when the state of the sensor changes in either direction.

The log messages generated by HWMONMonitor objects all belong to the HWMON log message category. As described previously, the severity of the message is Warning by default but this can be changed by setting the Severity property of the HWMONMonitor object generating the message. Below are some examples of the log messages created by user hardware monitoring.

A sensor value above a maximum threshold:

SYSTEM,HWMON: prio=warning id=1081
event=sensor_value_above_monitor_threshold sensorid="CPU_Temp"
description="CPU Temperature" name="n2" value=81
threshold=80 action=none

A sensor value below a minimum threshold:

SYSTEM,HWMON: prio=warning id=1079
event=sensor_value_below_monitor_threshold sensorid="CPU_Temp"
description="CPU Temperature" name="n2" value=29
threshold=30 action=none

A sensor value crossing a threshold back to a normal value:

SYSTEM,HWMON: prio=warning id=1082
event=sensor_returned_to_normal sensorid="CPU_Temp"
description="CPU Temperature" name="n2" value=80
action=none
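If these messages are sent to an external log receiver, their key=value fields can be extracted with a few lines of code. The following Python sketch is an illustration only. The function name parse_hwmon_log is hypothetical, and it assumes that the message arrives as a single string in which quoted values use plain double quotes.

```python
import re

# Matches key=value pairs where the value is either quoted or a bare token.
PAIR = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')


def parse_hwmon_log(message):
    """Return the key=value fields of a HWMON log message as a dict."""
    fields = {}
    for key, quoted, bare in PAIR.findall(message):
        # Only one of the two value groups is non-empty for each match.
        fields[key] = quoted or bare
    return fields


sample = (
    'SYSTEM,HWMON: prio=warning id=1081 '
    'event=sensor_value_above_monitor_threshold sensorid="CPU_Temp" '
    'description="CPU Temperature" name="n2" value=81 '
    'threshold=80 action=none'
)
fields = parse_hwmon_log(sample)
print(fields["event"], fields["sensorid"], fields["value"], fields["threshold"])
```

All values are returned as strings, so numeric fields such as value and threshold need to be converted before they are compared.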

The Log Repeat Interval

While the sensor value remains outside its thresholds, an HWMONMonitor object will regenerate the log message, by default every 6 hours, so that the administrator continues to be reminded that the abnormal condition exists.

The default log repeat interval value of 6 hours can be changed by assigning a new value to the LogRepeatInterval property of the HWMONMonitor object. The value is specified as an integer number of seconds. For example, the following command would change the log repeat interval for the monitor called mon1 to become 12 hours:

System:/> set HWMONMonitor mon1 LogRepeatInterval=43200

Note that the minimum allowed value for the LogRepeatInterval property is 30 seconds and the maximum is 86,400 seconds (24 hours).
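Since the property is set in seconds, a value in hours must be converted first. The following Python sketch shows the arithmetic and the range check. The helper name log_repeat_command is hypothetical. It only builds the command text shown above and does not communicate with cOS Stream.

```python
MIN_REPEAT_SECONDS = 30       # minimum stated in this section
MAX_REPEAT_SECONDS = 86_400   # maximum stated in this section (24 hours)


def log_repeat_command(monitor, hours):
    """Build the CLI command that sets LogRepeatInterval from a value in hours."""
    seconds = round(hours * 3600)
    if not MIN_REPEAT_SECONDS <= seconds <= MAX_REPEAT_SECONDS:
        raise ValueError(f"{seconds} seconds is outside the range 30 to 86400")
    return f"set HWMONMonitor {monitor} LogRepeatInterval={seconds}"


print(log_repeat_command("mon1", 12))
```

For 12 hours, the calculation is 12 x 3600 = 43200 seconds, which matches the value used in the command above.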

Displaying the Available Sensors

To see all the currently available sensors, the hwmon -sensorlist command can be used. The following shows some typical output (the last two right-hand columns, showing lowest and highest values, have been truncated to fit the page width):
System:/> hwmon -sensorlist
		
Sensor ID       Description               Unit  Value  Monitor Min    Max
--------------  ------------------------  ----  -----  ------- -----  -----
CPU_Temp        CPU Temperature           C     58     yes     0      98    
System_Temp     System Temperature        C     24     no      0      55    
System_Power    System Power Consumption  W     224    no      0      500   
System_12V      System Internal Power     mV    12126  no      11500  13000 
FAN1_RPM        System FAN1 Speed         RPM   4200   no      1500   6800   
FAN2_RPM        System FAN2 Speed         RPM   4000   no      1500   6800  
FAN3_RPM        System FAN3 Speed         RPM   4100   no      1500   6800  
PSU1_Avail      PSU1 Available            -     1      no      1      1     
PSU1_Fail       PSU1 Failure Detected     -     0      no      0      0    
PSU1_Input_Lost PSU1 Power Input Lost     -     0      no      0      0     
PSU2_Avail      PSU2 Available            -     1      no      1      1   
PSU2_Fail       PSU2 Failure Detected     -     0      no      0      0   
PSU2_Input_Lost PSU2 Power Input Lost     -     0      no      0      0

The following should be noted about the above output:

  • The Value column indicates the value that the sensor returned when it was last polled. Here, the CPU temperature was 58 degrees when that sensor was last polled.

  • The Min and Max columns show the default thresholds that will be used if none are explicitly specified when an HWMONMonitor object is created. These default thresholds are fixed and cannot be changed by the administrator.

  • The value of no under the Monitor column means that no HWMONMonitor object is currently associated with that sensor. If there is at least one HWMONMonitor object associated with the sensor, the value in the column becomes yes. Disabling an HWMONMonitor object will not affect its associated yes value in this column.
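Output captured from this command can be processed by a script, for example to list the sensors that have no monitor. The following Python sketch is one possible approach and is not part of cOS Stream. The function name parse_sensorlist is hypothetical, and the sketch assumes that the output has the columns shown above, with whole-number thresholds and with the last five columns containing no spaces.

```python
def parse_sensorlist(output):
    """Parse 'hwmon -sensorlist' text into a list of dicts, one per sensor."""
    sensors = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines, the echoed command, the header and the dashed rule.
        if not line or line.startswith(("System:/>", "Sensor ID", "-")):
            continue
        # The last five columns never contain spaces, so split from the right.
        head, unit, value, monitor, low, high = line.rsplit(None, 5)
        sensor_id, description = head.split(None, 1)
        sensors.append({
            "id": sensor_id,
            "description": description,
            "unit": unit,
            "value": int(value),
            "monitored": monitor.lower() == "yes",
            "min": int(low),
            "max": int(high),
        })
    return sensors


SAMPLE = """\
Sensor ID       Description               Unit  Value  Monitor Min    Max
--------------  ------------------------  ----  -----  ------- -----  -----
CPU_Temp        CPU Temperature           C     58     yes     0      98
FAN1_RPM        System FAN1 Speed         RPM   4200   no      1500   6800
PSU1_Input_Lost PSU1 Power Input Lost     -     0      no      0      0
"""

sensors = parse_sensorlist(SAMPLE)
print([s["id"] for s in sensors if not s["monitored"]])
```

Splitting from the right avoids relying on exact column positions, which matters because a long sensor ID such as PSU1_Input_Lost is separated from its description by only a single space.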

Displaying the Current HWMONMonitor List

To see all the current HWMONMonitor objects, the hwmon command is used with no parameters. The following shows some typical output for two configured HWMONMonitor objects called mon1 and mon2 that are monitoring CPU temperature:
System:/> hwmon
		
Name  Sensor    Description       Value  Low  High  Status  #Low  #High
----  --------  ----------------  -----  ---  ----  ------  ----  -----
mon1  CPU_Temp  CPU Temperature   59     0    98    NORMAL  0     0
mon2  CPU_Temp  CPU Temperature   59     20   70    NORMAL  0     0
Here, the Low and High columns are the currently defined thresholds for these HWMONMonitor objects. The #Low and #High columns are the total number of alarms triggered when the low and high thresholds are passed. Alarms are explained later in this section.

Note that in the above list, there are two HWMONMonitor objects defined for the same sensor. This is permissible and means that separate log messages can be generated for different sensor ranges.

The Status column in the above output indicates the status of the sensor value when it was last examined by the HWMONMonitor object and can show the following values:

  • NORMAL - The value is within the thresholds.

  • ALARM:LOW - The value has fallen below the minimum threshold and has not yet returned to normal.

  • ALARM:HIGH - The value has exceeded the maximum threshold and has not yet returned to normal.

  • WARNING: RETURN LOW - The value has just returned from being outside the minimum threshold to the normal range. The column will show NORMAL after the next reading of sensor values.

  • WARNING: RETURN HIGH - The value has just returned from being outside the maximum threshold to the normal range. The column will show NORMAL after the next reading of sensor values.

  • UNKNOWN STATUS - The possible reasons for this column entry are the following:

    1. cOS Stream is in the process of restarting and no values are yet available.

    2. The SensorID property of the HWMONMonitor object does not correspond to any sensor in the hardware.

    3. The sensor has a fault and is not returning values.

Displaying HWMONMonitor Properties

To see all the properties for an HWMONMonitor object, the hwmon <monitor-name> command can be used. For example, to see all the values for the HWMONMonitor called mon1, the command would be:
System:/> hwmon mon1
Name                       : mon1
Sensor id                  : CPU_Temp
Log severity setting       : warning
Description                : CPU Temperature
Value                      : 58
Low Threshold              : 0
Recommended low threshold  : 0
High Threshold             : 98
Recommended high threshold : 98
Action interval (seconds)  : 5
ALARM:LOW counter          : 1
ALARM:HIGH counter         : 3
Last status                : NORMAL

Note that the Recommended low threshold and the Recommended high threshold in the above output are the default threshold values for the associated sensor.

Sensor Alarms

As shown in the output above, each HWMONMonitor object has two alarm counters associated with it, the ALARM:LOW counter and the ALARM:HIGH counter. These properties of the object can only be read and cannot be manipulated by the administrator.

The following should be noted about the alarm counters:

  • Alarm counters start at zero after starting cOS Stream and are incremented each time the sensor value passes one of the thresholds.

  • For binary sensors, such as a PSU fail/not fail sensor, the alarm is incremented when the sensor value changes from 1 to 0, or 0 to 1. It is a change of state from the normal that increments the counter. However, only ALARM:HIGH is incremented if a normal value of 0 changes to 1 and only ALARM:LOW is incremented if a normal value of 1 changes to 0.

  • Alarm counters are not decremented. This allows the administrator to be able to see the total number of times the alarm has been triggered. The counters are only zeroed when cOS Stream is restarted or when the hardware monitoring is disabled globally and then re-enabled.
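The interaction between the status values and the alarm counters can be summarized with a small Python model. This is a sketch only and does not reflect how cOS Stream is implemented. The class name MonitorModel is hypothetical, and the sketch assumes that a counter is incremented once each time the value moves outside a threshold, not on every examination while it stays outside.

```python
class MonitorModel:
    """Toy model of one HWMONMonitor object's status and alarm counters."""

    def __init__(self, low, high):
        self.low, self.high = low, high
        self.status = "NORMAL"
        self.low_count = 0    # models the ALARM:LOW counter
        self.high_count = 0   # models the ALARM:HIGH counter

    def examine(self, value):
        """Update the status and counters from one examined sensor value."""
        if value < self.low:
            if self.status != "ALARM:LOW":
                self.low_count += 1
            self.status = "ALARM:LOW"
        elif value > self.high:
            if self.status != "ALARM:HIGH":
                self.high_count += 1
            self.status = "ALARM:HIGH"
        elif self.status == "ALARM:LOW":
            self.status = "WARNING: RETURN LOW"
        elif self.status == "ALARM:HIGH":
            self.status = "WARNING: RETURN HIGH"
        else:
            self.status = "NORMAL"
        return self.status


mon = MonitorModel(low=0, high=98)
for value in (58, 99, 101, 98, 60):
    print(value, mon.examine(value))
print(mon.low_count, mon.high_count)
```

In this run, the value passes the high threshold once, so the modeled ALARM:HIGH counter ends at 1 and the ALARM:LOW counter stays at 0. A binary sensor with a normal value of 1 can be modeled in the same way with MonitorModel(low=1, high=1), in which case a value of 0 increments only the ALARM:LOW counter.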

2.9.2. System Hardware Monitoring

The sensor information that is gathered by cOS Stream is available to an SNMP client. This type of monitoring is known as System Monitoring.

System monitoring has only one option that can be set by the administrator, which is whether it is enabled or disabled. All hardware monitoring is enabled by default, but this can be disabled by the following CLI command:

System:/> set Settings HWMONSettings MonitorEnable=No

This will disable both system and user monitoring. Both can be re-enabled with the following command:

System:/> set Settings HWMONSettings MonitorEnable=Yes

This setting affects both system monitoring and user monitoring.

2.9.3. Sensors for Clavister Products

The following are the available sensors for the current range of Clavister hardware products running cOS Stream. All fan speeds are given in RPM and temperatures are in degrees centigrade.

  • NetShield S10
Sensor Name  Sensor Type  Sensor Number  Minimum Limit  Maximum Limit
-----------  -----------  -------------  -------------  -------------
CPUTemp      TEMP         0              0              80
  • NetShield P40
Sensor Name  Sensor Type  Sensor Number  Minimum Limit  Maximum Limit
-----------  -----------  -------------  -------------  -------------
Left_PSU     GPIO         0              1
Right_PSU    GPIO         0              1
SysTemp1     TEMP         256            0              70
SysTemp2     TEMP         257            0              70
SysFan1      FANRPM       256            1500           12800
SysFan2      FANRPM       258            1500           12800
SysFan3      FANRPM       260            1500           12800
SysFan4      FANRPM       260            1500           12800
CPUTemp1     TEMP         512            0              80

  • NetShield 300 Series

Sensor Name            Sensor Type  Sensor Number  Minimum Limit  Maximum Limit
---------------------  -----------  -------------  -------------  -------------
System Vcore Internal  VOLT         0              0.494          1.744
System 12V Internal    VOLT         1              11.4           13.9
System 5V Internal     VOLT         2              4.8            5.8
System 3.3V Internal   VOLT         3              2.976          3.632
System CMOS Battery    VOLT         4              2.704          3.632
System FAN Speed       FANRPM       5              1400           14000
System Temperature 1   TEMP         6              0              71
System Temperature 2   TEMP         7              0              75
System Temperature 3   TEMP         8              0              85
CPU Core Temperature   TEMP         9              0              85
  • NetShield 500 Series
Sensor Name             Sensor Type  Sensor Number  Minimum Limit  Maximum Limit
----------------------  -----------  -------------  -------------  -------------
System Vcore Internal   VOLT         0              0.494          1.302
System 3.3V Internal    VOLT         1              3.135          3.465
System 12V Internal     VOLT         2              11.4           12.6
System CMOS Battery     VOLT         3              1.9            3.465
System 5V Internal      VOLT         4              4.75           5.25
System FAN Speed        FANRPM       5              1800           14000
System Temperature 1    TEMP         6              0              80
System Temperature 2    TEMP         7              0              80
CPU Socket Temperature  TEMP         8              0              95
  • NetShield 6000 Series

    Note that where a 6000 Series sensor has maximum and minimum range values which are both zero, this means that no limits are applied and the sensor cannot trigger an alarm. An example of this is the PSU1 Input Power sensor.

Sensor Name            Sensor Type  Sensor Number  Minimum Limit  Maximum Limit
---------------------  -----------  -------------  -------------  -------------
PSU1 Installed         GPIO         0              0              1
PSU1 Power OK          GPIO         256            0              1
PSU1 Temperature       TEMP         512            0              0
PSU1 Fan Speed         FANRPM       768            5000           14000
PSU1 Output Voltage    VOLT         1024           0              0
PSU1 Output Current    CURR         1280           0              0
PSU1 Input Power       POWER        1536           0              0
PSU1 Output Power      POWER        1792           0              0
PSU2 Installed         GPIO         2048           0              1
PSU2 Power OK          GPIO         2304           0              1
PSU2 Temperature       TEMP         2560           0              0
PSU2 Fan Speed         FANRPM       2816           5000           14000
PSU2 Output Voltage    VOLT         3072           0              0
PSU2 Output Current    CURR         3328           0              0
PSU2 Input Power       POWER        3584           0              0
PSU2 Output Power      POWER        3840           0              0
System Vcore Internal  VOLT         4096           0              1.744
System 3.3V Internal   VOLT         4097           0              0
System 12V Internal    VOLT         4098           11.5           13.0
System CMOS Battery    VOLT         4099           2.9            3.2
System 5V Internal     VOLT         4100           0              0
System FAN1 Speed      FANRPM       4101           6000           14000
System FAN2 Speed      FANRPM       4102           6000           14000
System FAN3 Speed      FANRPM       4103           6000           14000
System Temperature 1   TEMP         4104           0              80
System Temperature 2   TEMP         4105           0              80
Air Intake Temp        TEMP         4105           0              50
CPU Temperature        TEMP         4352           0              95