Incorrect fan status reported #36

Open
Blackqpid opened this issue Aug 20, 2024 · 1 comment

I'm using check_hpasm v4.9 via SNMP against a ProLiant DL360p Gen8 with iLO firmware 2.82.

The iLO web interface is reporting:

Fan Block 1    System    OK          94%
Fan Block 2    System    OK          94%
Fan Block 3    System    OK          94%
Fan Block 4    System    OK          94%
Fan Block 5    System    OK          94%
Fan Block 6    System    OK          94%
Fan Block 7    System    Degraded    94%
Fan Block 8    System    OK          94%

The check_hpasm plugin is reporting:

WARNING - system fan  overall status is degraded, fan 1 (system) is not redundant, fan 2  (system) is not redundant, fan 3 (system) is not redundant, fan 4  (system) is not redundant, fan 5 (system) is not redundant, fan 6  (system) is not redundant, fan 7 (system) degraded, fan 7 (system) is  not redundant, fan 8 (system) is not redundant
--
Performance Data: | pc_1=101;460;460  pc_2=108;460;460 fan_1=50% fan_2=50% fan_3=50% fan_4=50% fan_5=50%  fan_6=50% fan_7=50% fan_8=50% temp_1_ambient=18;42;42  temp_2_cpu=40;70;70 temp_3_cpu=40;70;70 temp_4_memory=25;87;87  temp_5_memory=27;87;87 temp_6_memory=24;87;87 temp_7_memory=25;87;87  temp_8_memory=22;70;70 temp_9_memory=25;70;70 temp_10_memory=23;70;70  temp_11_memory=22;70;70 temp_12_system=35;60;60  temp_13_system=44;105;105 temp_14_system=23;70;70 temp_15_powerSupply=19  temp_16_powerSupply=23;70;70 temp_17_powerSupply=20  temp_18_powerSupply=23;65;65 temp_21_system=26;115;115  temp_22_system=28;115;115 temp_23_system=20;115;115  temp_24_system=21;115;115 temp_25_system=23;115;115  temp_26_system=22;115;115 temp_27_system=20;70;70  temp_28_system=20;70;70 temp_29_system=22;70;70 temp_30_system=19;70;70  temp_31_system=40;105;105 temp_32_system=27;65;65  temp_33_system=24;70;70 temp_34_system=24;66;66 temp_36_system=25;65;65  temp_37_system=27;70;70 temp_38_system=24;70;70 temp_39_system=23;70;70  temp_40_system=24;70;70 temp_41_system=24;64;64

As you can see, fan 7 is reported twice, and all the other fans are reported as not redundant.
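One possible reading of the plugin output is that fan 7 appears twice because it carries two separate findings (degraded and not redundant), not because the fan itself is listed twice. A minimal Perl sketch that groups the findings per fan; the helper name findings_per_fan is made up for illustration, and the status line is shortened from the one quoted above:

```perl
use strict;
use warnings;

# Status line shortened to three fans from the plugin output quoted above.
my $status = 'WARNING - system fan  overall status is degraded, '
    . 'fan 6  (system) is not redundant, fan 7 (system) degraded, '
    . 'fan 7 (system) is  not redundant, fan 8 (system) is not redundant';

# Hypothetical helper: collect every finding per fan number.
sub findings_per_fan {
  my ($line) = @_;
  my %findings;
  while ($line =~ /fan\s+(\d+)\s+\(\w+\)\s+(is\s+not\s+redundant|degraded)/g) {
    my ($fan, $what) = ($1, $2);
    $what =~ s/\s+/ /g;    # the real output contains doubled spaces
    push @{$findings{$fan}}, $what;
  }
  return \%findings;
}

my $findings = findings_per_fan($status);
foreach my $fan (sort { $a <=> $b } keys %$findings) {
  printf "fan %d: %s\n", $fan, join(', ', @{$findings->{$fan}});
}
```

For fan 7 this lists both "degraded" and "is not redundant", while fans 6 and 8 each have a single finding.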

codeautopilot bot commented Aug 20, 2024

Potential solution

The plan to solve the bug involves addressing two main issues: the duplicate reporting of Fan 7 and the incorrect redundancy status of other fans. The root cause appears to be in the SNMP data parsing and merging logic within the HP::Proliant::Component::FanSubsystem::SNMP module. Specifically, the unite method needs to be updated to handle both he_fans and th_fans arrays, ensuring unique fan entries and correct redundancy status.

What is causing this bug?

The likely cause is the unite method in the HP::Proliant::Component::FanSubsystem::SNMP module, which processes only the he_fans array and ignores the th_fans array. The merged fan list can therefore be incomplete, and entries that exist in both arrays are never reconciled. The redundancy status might also be parsed or stored incorrectly, which would explain why every fan is reported as not redundant.

Code

To fix the bug, we need to modify the unite method to process both he_fans and th_fans arrays and ensure unique fan entries. We also need to verify the correct parsing and storage of the redundancy status.

Modified unite Method

sub unite {
  my $self = shift;
  my $tmpfans = {};

  # Process both he_fans and th_fans; "|| []" tolerates a missing array
  my @all = (@{$self->{he_fans} || []}, @{$self->{th_fans} || []});

  # Index fans by cpqHeFltTolFanIndex; a later duplicate replaces an earlier one
  foreach my $fan (@all) {
    next unless defined $fan->{cpqHeFltTolFanIndex};
    $tmpfans->{$fan->{cpqHeFltTolFanIndex}} = $fan;
  }

  # Link redundant partners
  foreach my $fan (@all) {
    my $partner = $fan->{cpqHeFltTolFanRedundantPartner};
    if (defined $partner && exists $tmpfans->{$partner}) {
      $fan->{partner} = $tmpfans->{$partner};
    } else {
      $fan->{partner} = undef;
    }
  }

  # Ensure unique fan entries, in a stable order sorted by fan index
  @{$self->{fans}} = map { $tmpfans->{$_} } sort { $a <=> $b } keys %$tmpfans;
}
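To see what merging by index does, here is a self-contained sketch that runs the same idea against mock data. MockFanSubsystem is a hypothetical stand-in, not the real module, and the fan hashes are invented. It also assumes that th_fans entries carry a cpqHeFltTolFanIndex key, which nothing in this issue confirms; if they are keyed differently, they would need their own index field.

```perl
use strict;
use warnings;

package MockFanSubsystem;    # hypothetical stand-in for the real SNMP module

sub new {
  my ($class, %args) = @_;
  my $self = {
    he_fans => $args{he_fans} || [],
    th_fans => $args{th_fans} || [],
    fans    => [],
  };
  return bless $self, $class;
}

sub unite {
  my $self = shift;
  my %by_index;
  my @all = (@{$self->{he_fans}}, @{$self->{th_fans}});
  # Later entries with the same index replace earlier ones.
  foreach my $fan (@all) {
    next unless defined $fan->{cpqHeFltTolFanIndex};
    $by_index{$fan->{cpqHeFltTolFanIndex}} = $fan;
  }
  # Resolve each fan's partner index to the partner's hash, if present.
  foreach my $fan (@all) {
    my $partner = $fan->{cpqHeFltTolFanRedundantPartner};
    $fan->{partner} = (defined $partner && exists $by_index{$partner})
        ? $by_index{$partner}
        : undef;
  }
  @{$self->{fans}} = map { $by_index{$_} } sort { $a <=> $b } keys %by_index;
  return;
}

package main;

my $subsystem = MockFanSubsystem->new(
  he_fans => [
    { cpqHeFltTolFanIndex => 1, cpqHeFltTolFanRedundantPartner => 2 },
    { cpqHeFltTolFanIndex => 2, cpqHeFltTolFanRedundantPartner => 1 },
  ],
  th_fans => [
    # Invented duplicate of fan 2, to exercise the de-duplication.
    { cpqHeFltTolFanIndex => 2, cpqHeFltTolFanRedundantPartner => 1 },
  ],
);
$subsystem->unite();
printf "%d unique fans\n", scalar @{$subsystem->{fans}};
```

Three input entries collapse to two fans here. Note that the partner links make the fan hashes reference each other, which is harmless in a short-lived plugin run.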

Verify Redundancy Status Parsing

Ensure that the cpqHeFltTolFanRedundant OID is correctly parsed and stored in the he_init and te_init methods.
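As a starting point for that check, the raw value can be decoded explicitly so that unexpected values stand out. This is only a sketch: the enumeration below is how I recall cpqHeFltTolFanRedundant from CPQHLTH-MIB and should be verified against the MIB file, and decode_redundant is a made-up helper name.

```perl
use strict;
use warnings;

# Assumed enumeration (recalled from CPQHLTH-MIB, please verify).
my %REDUNDANT = (1 => 'other', 2 => 'notRedundant', 3 => 'redundant');

# Hypothetical helper: turn a raw SNMP value into a readable state.
sub decode_redundant {
  my ($raw) = @_;
  return 'missing' unless defined $raw;
  return $raw if $raw =~ /^[A-Za-z]+$/;    # already a symbolic name
  return $REDUNDANT{$raw} // "unknown($raw)";
}

print decode_redundant($_), "\n" for (3, 2, 'redundant', 7);
```

A fan whose redundancy value decodes to 'missing' or 'unknown(...)' would point at a parsing problem, while a clean 'notRedundant' would mean the agent itself reports that state.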

Example Debugging Statements

Add debugging statements to log the parsed fan data:

sub he_init {
  my $self = shift;
  # Existing code to parse he_fans...

  # Debugging statement; STDERR keeps the plugin's STDOUT status line intact
  foreach my $fan (@{$self->{he_fans}}) {
    printf STDERR "HE Fan Index: %s, Redundant: %s\n",
        $fan->{cpqHeFltTolFanIndex} // 'undef',
        $fan->{cpqHeFltTolFanRedundant} // 'undef';
  }
}

sub te_init {
  my $self = shift;
  # Existing code to parse th_fans...

  # Debugging statement; missing keys print as 'undef' instead of warning
  foreach my $fan (@{$self->{th_fans}}) {
    printf STDERR "TH Fan Index: %s, Redundant: %s\n",
        $fan->{cpqHeFltTolFanIndex} // 'undef',
        $fan->{cpqHeFltTolFanRedundant} // 'undef';
  }
}

How to replicate the bug

  1. Use the check_hpasm plugin version 4.9 via SNMP against a ProLiant DL360p Gen8 with iLO firmware 2.82.
  2. Compare the fan status reported by the iLO web interface and the check_hpasm plugin.
  3. Observe that Fan 7 is reported twice and all other fans are reported as not redundant.

By implementing the above changes, the check_hpasm plugin should correctly report the fan status without duplication and with accurate redundancy information.

Files used for this task:

Changes on plugins-scripts/check_hpasm.pl

Analysis Report

Overview

The check_hpasm.pl script is a Nagios plugin designed to check the hardware status of HP/Compaq ProLiant servers. It uses SNMP to gather data and reports the status of various components, including fans. The issue reported involves incorrect fan status reporting, specifically:

  1. Fan 7 is reported twice.
  2. All other fans are reported as not redundant.

Key Areas to Investigate

  1. SNMP Data Parsing: Ensure that the SNMP data is correctly parsed and mapped to the appropriate fan status.
  2. Redundancy Status: Verify that the redundancy status of each fan is correctly identified and reported.
  3. Output Formatting: Ensure that the output format does not duplicate fan entries and accurately reflects the status of each fan.

Code Review

The check_hpasm.pl script primarily initializes the plugin, sets up arguments, and processes the SNMP data through the HP::Server object. The actual SNMP data parsing and fan status determination are likely handled in the HP::Server module, which is not included in this file.

Key Sections:

  1. Argument Parsing:

    • The script sets up various arguments, including --ignore-fan-redundancy, which might affect how fan redundancy is reported.
  2. Plugin Initialization:

    • The plugin is initialized with various options, including ignore_fan_redundancy, which is passed to the HP::Server object.
  3. Server Initialization and Data Processing:

    • The HP::Server object is created and initialized with runtime options.
    • The script checks messages and adds appropriate messages based on the hardware status.
  4. Output Generation:

    • The final output message is generated and sent to Nagios.

Potential Issues

  1. Redundancy Status Handling:

    • The ignore_fan_redundancy option might be affecting the redundancy status reporting. Ensure that this option is correctly handled and does not inadvertently mark all fans as non-redundant.
  2. Duplicate Fan Reporting:

    • The duplication of Fan 7 might be due to incorrect indexing or mapping in the SNMP data parsing logic. This needs to be verified in the HP::Server module.
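To illustrate what an option such as --ignore-fan-redundancy could be expected to do, here is a sketch that drops "not redundant" findings when the flag is given. This is not the plugin's actual implementation; filter_fan_messages is a hypothetical helper, and the only real API used is Getopt::Long's GetOptionsFromArray.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper: filter fan findings based on the command line.
sub filter_fan_messages {
  my ($argv, @messages) = @_;
  my $ignore = 0;
  my @args = @$argv;    # work on a copy, GetOptionsFromArray consumes options
  GetOptionsFromArray(\@args, 'ignore-fan-redundancy' => \$ignore)
      or die "could not parse options\n";
  return @messages unless $ignore;
  return grep { !/is\s+not\s+redundant/ } @messages;
}

my @findings = (
  'fan 6 (system) is not redundant',
  'fan 7 (system) degraded',
  'fan 7 (system) is not redundant',
);
print "$_\n" for filter_fan_messages(['--ignore-fan-redundancy'], @findings);
```

With the flag, only the degraded finding for fan 7 remains; without it, all three findings are kept. Comparing the real plugin's output with and without the option would show whether the redundancy messages are the only discrepancy.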

Recommendations

  1. Review SNMP Data Parsing:

    • Investigate the HP::Server module to ensure that SNMP data is correctly parsed and mapped to the appropriate fan status.
    • Verify that the indexing logic correctly identifies each fan and does not duplicate entries.
  2. Check Redundancy Status Logic:

    • Ensure that the redundancy status is correctly determined and reported. Pay special attention to the ignore_fan_redundancy option and its impact on the output.
  3. Output Formatting:

    • Ensure that the final output format accurately reflects the status of each fan without duplication.

Conclusion

The primary logic for parsing SNMP data and determining fan status appears to reside in the HP::Server module, which is not included in this file. The check_hpasm.pl script mainly handles argument parsing, plugin initialization, and output generation. The reported issues likely stem from the SNMP data parsing and redundancy status logic in the HP::Server module. Further investigation is needed in that module to identify and resolve the root cause of the incorrect fan status reporting.

Changes on plugins-scripts/HP/Proliant/Component/FanSubsystem/SNMP.pm

Bug Analysis Report

File: plugins-scripts/HP/Proliant/Component/FanSubsystem/SNMP.pm

Overview

This file is responsible for parsing SNMP data related to the fan subsystem of HP ProLiant servers. It initializes and processes the fan data, including their status and redundancy, and unites the data from different sources.

Potential Issues Identified

  1. Duplicate Reporting of Fan 7:

    • The issue of fan 7 being reported twice could be due to the way the unite method processes and merges the fan data from he_fans and th_fans.
    • The unite method currently only processes he_fans and does not seem to handle th_fans. This could lead to incomplete or duplicate data if th_fans also contains entries for fan 7.
  2. Incorrect Redundancy Status:

    • The redundancy status of other fans being reported as "not redundant" might be due to incorrect or missing data processing in the he_init and te_init methods.
    • The cpqHeFltTolFanRedundant OID is used to determine redundancy, but there might be an issue with how this data is being parsed or stored.

Detailed Analysis

unite Method

sub unite {
  my $self = shift;
  my $tmpfans = {};
  foreach (@{$self->{he_fans}}) {
    $tmpfans->{$_->{cpqHeFltTolFanIndex}} = $_;
  }
  foreach (@{$self->{he_fans}}) {
    if (exists $tmpfans->{$_->{cpqHeFltTolFanRedundantPartner}}) {
      $_->{partner} = $tmpfans->{$_->{cpqHeFltTolFanRedundantPartner}};
    } else {
      $_->{partner} = undef;
    }
  }
  @{$self->{fans}} = @{$self->{he_fans}};
}
  • The unite method only processes he_fans and does not consider th_fans. This could lead to incomplete data if th_fans contains relevant entries.
  • The method creates a temporary hash to store fans by their index and then attempts to link redundant partners. However, if there are duplicate entries or missing data, this could lead to incorrect reporting.

he_init and te_init Methods

  • These methods initialize the he_fans and th_fans arrays by parsing SNMP data.
  • The he_init method processes the cpqHeFltTolFanTable and stores fan data in he_fans.
  • The te_init method processes the cpqHeThermalFanTable and stores fan data in th_fans.
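For orientation, this is roughly how a flat SNMP walk of cpqHeFltTolFanTable can be grouped into one hash per fan. It is a sketch under assumptions, not the module's code: the entry OID and column numbers are as I recall them from CPQHLTH-MIB and need to be checked, rows_from_walk is a made-up helper, and the sample values are invented rather than taken from the affected server. The cpqHeThermalFanTable side is not covered.

```perl
use strict;
use warnings;

# Assumed from CPQHLTH-MIB (verify before use): entry OID and column numbers.
my $ENTRY  = '1.3.6.1.4.1.232.6.2.6.7.1';    # cpqHeFltTolFanEntry
my %COLUMN = (
  2 => 'cpqHeFltTolFanIndex',
  7 => 'cpqHeFltTolFanRedundant',
  8 => 'cpqHeFltTolFanRedundantPartner',
  9 => 'cpqHeFltTolFanCondition',
);

# Hypothetical helper: group "entry.column.chassis.index" OIDs into rows.
sub rows_from_walk {
  my ($walk) = @_;    # hashref: numeric OID => value
  my %rows;
  foreach my $oid (keys %$walk) {
    next unless $oid =~ /^\Q$ENTRY\E\.(\d+)\.(\d+)\.(\d+)$/;
    my ($column, $chassis, $index) = ($1, $2, $3);
    my $name = $COLUMN{$column} or next;    # skip columns we do not track
    $rows{$chassis}{$index}{$name} = $walk->{$oid};
  }
  my @result;
  foreach my $chassis (sort { $a <=> $b } keys %rows) {
    foreach my $index (sort { $a <=> $b } keys %{$rows{$chassis}}) {
      push @result, $rows{$chassis}{$index};
    }
  }
  return \@result;
}

# Invented sample values for two fans in chassis 0.
my $walk = {
  "$ENTRY.2.0.7" => 7, "$ENTRY.7.0.7" => 2, "$ENTRY.8.0.7" => 8, "$ENTRY.9.0.7" => 3,
  "$ENTRY.2.0.8" => 8, "$ENTRY.7.0.8" => 2, "$ENTRY.8.0.8" => 7, "$ENTRY.9.0.8" => 2,
};
foreach my $row (@{ rows_from_walk($walk) }) {
  printf "fan %d: redundant=%d partner=%d condition=%d\n",
      @{$row}{qw(cpqHeFltTolFanIndex cpqHeFltTolFanRedundant
                 cpqHeFltTolFanRedundantPartner cpqHeFltTolFanCondition)};
}
```

Because each row is keyed by chassis and index, a fan can only appear once in the result, which makes this grouping a useful cross-check against whatever he_init produces.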

Recommendations for Fixes

  1. Ensure Unique Fan Entries:

    • Modify the unite method to also process th_fans and ensure that each fan is uniquely identified and stored.
    • Example modification:
      sub unite {
        my $self = shift;
        my $tmpfans = {};
        # "|| []" tolerates a missing array
        my @all = (@{$self->{he_fans} || []}, @{$self->{th_fans} || []});
        foreach my $fan (@all) {
          next unless defined $fan->{cpqHeFltTolFanIndex};
          $tmpfans->{$fan->{cpqHeFltTolFanIndex}} = $fan;
        }
        foreach my $fan (@all) {
          my $partner = $fan->{cpqHeFltTolFanRedundantPartner};
          if (defined $partner && exists $tmpfans->{$partner}) {
            $fan->{partner} = $tmpfans->{$partner};
          } else {
            $fan->{partner} = undef;
          }
        }
        # Unique fans in a stable order, sorted by fan index
        @{$self->{fans}} = map { $tmpfans->{$_} } sort { $a <=> $b } keys %$tmpfans;
      }
  2. Correct Redundancy Status Parsing:

    • Ensure that the cpqHeFltTolFanRedundant OID is correctly parsed and stored.
    • Verify that the redundancy status is correctly interpreted and reported.
  3. Debug and Validate Data:

    • Add debugging statements to log the parsed fan data and ensure that it matches the expected values.
    • Validate the final fan data structure to ensure no duplicates and correct redundancy status.
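The validation step could look something like the following sketch. validate_fans is a hypothetical helper, the fan hashes are plain mock data, and treating a partner index of 0 as "no partner" is an assumption about the MIB that should be checked.

```perl
use strict;
use warnings;

# Hypothetical helper: return a list of problems found in the fan list.
sub validate_fans {
  my ($fans) = @_;
  my (%seen, @problems);
  foreach my $fan (@$fans) {
    my $index = $fan->{cpqHeFltTolFanIndex};
    if (!defined $index) {
      push @problems, 'fan without index';
      next;
    }
    push @problems, "fan $index listed more than once" if $seen{$index}++;
  }
  foreach my $fan (@$fans) {
    my $index   = $fan->{cpqHeFltTolFanIndex};
    my $partner = $fan->{cpqHeFltTolFanRedundantPartner};
    # Assumption: partner index 0 means the fan has no redundant partner.
    next unless defined $index && defined $partner && $partner != 0;
    push @problems, "fan $index names missing partner $partner"
        unless $seen{$partner};
  }
  return @problems;
}

my @problems = validate_fans([
  { cpqHeFltTolFanIndex => 7, cpqHeFltTolFanRedundantPartner => 8 },
  { cpqHeFltTolFanIndex => 7, cpqHeFltTolFanRedundantPartner => 8 },
]);
print "$_\n" for @problems;
```

An empty result means every fan index is unique and every named partner exists in the list; anything else narrows down whether the duplication or the partner lookup is at fault.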

Conclusion

The identified issues in the unite method and potential parsing errors in he_init and te_init methods are likely causing the incorrect fan status reporting. By ensuring unique fan entries and correct redundancy status parsing, the reported issues should be resolved.

Disclaimer: This comment was entirely generated using AI. Be aware that the information provided may be incorrect.
