Add UEFI Common Platform Error Record (CPER) support
CPER is the format that various ACPI tables, such as ERST, BERT and HEST,
use to describe platform hardware errors.

The event severity message is printed here:
https://github.com/torvalds/linux/blob/v6.7/drivers/firmware/efi/cper.c#L639

Example kernel log output for each severity:

Corrected error:
kernel: {37}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 162
kernel: {37}[Hardware Error]: It has been corrected by h/w and requires no further action
kernel: {37}[Hardware Error]: event severity: corrected
kernel: {37}[Hardware Error]:  Error 0, type: corrected
kernel: {37}[Hardware Error]:   section_type: memory error
kernel: {37}[Hardware Error]:   error_status: 0x0000000000000400
kernel: {37}[Hardware Error]:   physical_address: 0x000000b50c68ce80
kernel: {37}[Hardware Error]:   node: 1 card: 4 module: 0 rank: 0 bank: 1 device: 14 row: 58165 column: 816
kernel: {37}[Hardware Error]:   error_type: 2, single-bit ECC
kernel: {37}[Hardware Error]:   DIMM location: CPU 2 DIMM 30

Recoverable error:
kernel: {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
kernel: {3}[Hardware Error]: event severity: recoverable
kernel: {3}[Hardware Error]:  Error 0, type: recoverable
kernel: {3}[Hardware Error]:  fru_text: B1
kernel: {3}[Hardware Error]:   section_type: memory error
kernel: {3}[Hardware Error]:   error_status: 0x0000000000000400
kernel: {3}[Hardware Error]:   physical_address: 0x000000393cfe5040
kernel: {3}[Hardware Error]:   node: 2 card: 0 module: 0 rank: 0 bank: 3 device: 0 row: 34719 column: 320
kernel: {3}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000

Fatal error:
kernel: BERT: Error records from previous boot:
kernel: [Hardware Error]: event severity: fatal
kernel: [Hardware Error]:  Error 0, type: fatal
kernel: [Hardware Error]:  fru_text: DIMM B5
kernel: [Hardware Error]:   section_type: memory error
kernel: [Hardware Error]:   error_status: 0x0000000000000400
kernel: [Hardware Error]:   physical_address: 0x000000393d7e4040
kernel: [Hardware Error]:   node: 2 card: 4 module: 0 rank: 0 bank: 3 device: 0 row: 34743 column: 256

Steps to test the new metrics (run as root):
# echo "{1}[Hardware Error]: event severity: fatal" >  /dev/kmsg
# echo "{1}[Hardware Error]: event severity: recoverable" >  /dev/kmsg
# echo "{1}[Hardware Error]: event severity: corrected" >  /dev/kmsg

Expected metrics are as below:
$ curl localhost:20257/metrics
# HELP problem_counter Number of times a specific type of problem have occurred.
# TYPE problem_counter counter
problem_counter{reason="HardwareErrorCorrected"} 1
problem_counter{reason="HardwareErrorFatal"} 1
problem_counter{reason="HardwareErrorRecoverable"} 1
# HELP problem_gauge Whether a specific type of problem is affecting the node or not.
# TYPE problem_gauge gauge
problem_gauge{reason="HardwareErrorFatal",type="HardwareErrorFatal"} 1
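
The expected output above can also be checked mechanically. The following is a minimal sketch, not part of this commit, that parses the labelled samples and confirms that each new reason was counted once. The helper name `parse_samples` and the embedded sample text are illustrative assumptions; in practice the text would be the body returned by the `curl` call.

```python
import re

# Sample text copied from the expected output above; in a real check this
# would be the body returned by the metrics endpoint.
METRICS_TEXT = '''\
# HELP problem_counter Number of times a specific type of problem have occurred.
# TYPE problem_counter counter
problem_counter{reason="HardwareErrorCorrected"} 1
problem_counter{reason="HardwareErrorFatal"} 1
problem_counter{reason="HardwareErrorRecoverable"} 1
# HELP problem_gauge Whether a specific type of problem is affecting the node or not.
# TYPE problem_gauge gauge
problem_gauge{reason="HardwareErrorFatal",type="HardwareErrorFatal"} 1
'''

# Matches lines of the form: name{label="value",...} number
SAMPLE_RE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Hypothetical helper: return {(metric, reason): value} for labelled samples."""
    samples = {}
    for line in text.splitlines():
        if not line or line.startswith('#'):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # unlabelled or unrecognised lines are ignored in this sketch
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels))
        samples[(name, labels.get('reason'))] = float(value)
    return samples


samples = parse_samples(METRICS_TEXT)
for reason in ('HardwareErrorCorrected', 'HardwareErrorRecoverable', 'HardwareErrorFatal'):
    print(reason, samples[('problem_counter', reason)])
```

Only the fatal severity sets `problem_gauge`, because it is the only rule of type "permanent" and therefore the only one tied to a condition.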

Signed-off-by: Jian Wen <[email protected]>
wenjianhn committed Jan 22, 2024
1 parent e9eddcc commit ec831d2
Showing 1 changed file with 21 additions and 0 deletions.
21 changes: 21 additions & 0 deletions config/kernel-monitor.json
@@ -15,6 +15,11 @@
"type": "ReadonlyFilesystem",
"reason": "FilesystemIsNotReadOnly",
"message": "Filesystem is not read-only"
},
{
"type": "HardwareErrorFatal",
"reason": "HardwareHasNoFatalError",
"message": "Hardware has no fatal error"
}
],
"rules": [
@@ -63,6 +68,22 @@
"reason": "MemoryReadError",
"pattern": "CE memory read error .*"
},
{
"type": "temporary",
"reason": "HardwareErrorCorrected",
"pattern": ".*\\[Hardware Error\\]: event severity: corrected$"
},
{
"type": "temporary",
"reason": "HardwareErrorRecoverable",
"pattern": ".*\\[Hardware Error\\]: event severity: recoverable$"
},
{
"type": "permanent",
"condition": "HardwareErrorFatal",
"reason": "HardwareErrorFatal",
"pattern": ".*\\[Hardware Error\\]: event severity: fatal$"
},
{
"type": "permanent",
"condition": "KernelDeadlock",

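The three new patterns can be sanity-checked against the sample log lines from the commit message. The sketch below is an illustration under stated assumptions, not the monitor's own matching code: it uses Python's `re` module in place of the monitor's regular-expression engine (the patterns use only syntax common to both), it assumes the matcher sees the message text without the `kernel: ` prefix, and `classify` is a hypothetical helper.

```python
import json
import re

# The three rules added by this commit, as they appear in the JSON config.
# The raw string keeps the doubled backslashes so json.loads decodes them
# into the single-backslash regex escapes.
RULES_JSON = r'''
[
  {"type": "temporary", "reason": "HardwareErrorCorrected",
   "pattern": ".*\\[Hardware Error\\]: event severity: corrected$"},
  {"type": "temporary", "reason": "HardwareErrorRecoverable",
   "pattern": ".*\\[Hardware Error\\]: event severity: recoverable$"},
  {"type": "permanent", "condition": "HardwareErrorFatal",
   "reason": "HardwareErrorFatal",
   "pattern": ".*\\[Hardware Error\\]: event severity: fatal$"}
]
'''

RULES = [(rule["reason"], re.compile(rule["pattern"]))
         for rule in json.loads(RULES_JSON)]


def classify(message):
    """Hypothetical helper: return the reason of the first matching rule, or None."""
    for reason, pattern in RULES:
        if pattern.match(message):
            return reason
    return None


# GHES-style line with a "{N}" sequence prefix.
print(classify("{37}[Hardware Error]: event severity: corrected"))
# BERT-style line with no prefix; the leading ".*" matches the empty string.
print(classify("[Hardware Error]: event severity: fatal"))
# A per-section line that mentions the severity but is not the event line.
print(classify("{37}[Hardware Error]:  Error 0, type: corrected"))
```

The last call returns `None`: the patterns require the literal text `event severity:`, so the per-section `Error 0, type: ...` lines do not count a second time for the same event.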