Add UEFI Common Platform Error Record (CPER) support
CPER is the format used by various tables, such as ERST, BERT and HEST,
to describe platform hardware errors.

The event severity message is printed here:
https://github.com/torvalds/linux/blob/v6.7/drivers/firmware/efi/cper.c#L639

Example log output for each severity:

Corrected error:
kernel: {37}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 162
kernel: {37}[Hardware Error]: It has been corrected by h/w and requires no further action
kernel: {37}[Hardware Error]: event severity: corrected
kernel: {37}[Hardware Error]:  Error 0, type: corrected
kernel: {37}[Hardware Error]:   section_type: memory error
kernel: {37}[Hardware Error]:   error_status: 0x0000000000000400
kernel: {37}[Hardware Error]:   physical_address: 0x000000b50c68ce80
kernel: {37}[Hardware Error]:   node: 1 card: 4 module: 0 rank: 0 bank: 1 device: 14 row: 58165 column: 816
kernel: {37}[Hardware Error]:   error_type: 2, single-bit ECC
kernel: {37}[Hardware Error]:   DIMM location: CPU 2 DIMM 30

Recoverable error:
kernel: {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
kernel: {3}[Hardware Error]: event severity: recoverable
kernel: {3}[Hardware Error]:  Error 0, type: recoverable
kernel: {3}[Hardware Error]:  fru_text: B1
kernel: {3}[Hardware Error]:   section_type: memory error
kernel: {3}[Hardware Error]:   error_status: 0x0000000000000400
kernel: {3}[Hardware Error]:   physical_address: 0x000000393cfe5040
kernel: {3}[Hardware Error]:   node: 2 card: 0 module: 0 rank: 0 bank: 3 device: 0 row: 34719 column: 320
kernel: {3}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000

Fatal error:
kernel: BERT: Error records from previous boot:
kernel: [Hardware Error]: event severity: fatal
kernel: [Hardware Error]:  Error 0, type: fatal
kernel: [Hardware Error]:  fru_text: DIMM B5
kernel: [Hardware Error]:   section_type: memory error
kernel: [Hardware Error]:   error_status: 0x0000000000000400
kernel: [Hardware Error]:   physical_address: 0x000000393d7e4040
kernel: [Hardware Error]:   node: 2 card: 4 module: 0 rank: 0 bank: 3 device: 0 row: 34743 column: 256

Signed-off-by: Jian Wen <[email protected]>
wenjianhn committed Jan 11, 2024
1 parent e9eddcc commit c224938
Showing 1 changed file with 15 additions and 0 deletions.
15 changes: 15 additions & 0 deletions config/kernel-monitor.json
@@ -63,6 +63,21 @@
"reason": "MemoryReadError",
"pattern": "CE memory read error .*"
},
{
"type": "temporary",
"reason": "HardwareErrorCorrected",
"pattern": ".*\\[Hardware Error\\]: event severity: corrected$"
},
{
"type": "temporary",
"reason": "HardwareErrorRecoverable",
"pattern": ".*\\[Hardware Error\\]: event severity: recoverable$"
},
{
"type": "permanent",
"reason": "HardwareErrorFatal",
"pattern": ".*\\[Hardware Error\\]: event severity: fatal$"
},
{
"type": "permanent",
"condition": "KernelDeadlock",
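
As a rough sanity check, the three new rules can be exercised against the "event severity" lines quoted in the commit message. The sketch below is not part of the commit: it uses Python's `re` module, which is an assumption (the monitor may use a different regex engine), and the `classify` helper and `SAMPLES` list are hypothetical names introduced only for illustration.

```python
import re

# Rules copied from the diff above. Matching with Python's `re` is an
# assumption; the monitor itself may evaluate patterns differently.
RULES = [
    ("temporary", "HardwareErrorCorrected",
     r".*\[Hardware Error\]: event severity: corrected$"),
    ("temporary", "HardwareErrorRecoverable",
     r".*\[Hardware Error\]: event severity: recoverable$"),
    ("permanent", "HardwareErrorFatal",
     r".*\[Hardware Error\]: event severity: fatal$"),
]


def classify(line):
    """Return (type, reason) for the first rule matching the line, else None."""
    for rule_type, reason, pattern in RULES:
        if re.search(pattern, line):
            return rule_type, reason
    return None


# Log lines taken from the examples in the commit message.
SAMPLES = [
    "kernel: {37}[Hardware Error]: event severity: corrected",
    "kernel: {3}[Hardware Error]: event severity: recoverable",
    "kernel: [Hardware Error]: event severity: fatal",
    # Same record, different line: should not match any rule.
    "kernel: {37}[Hardware Error]:  Error 0, type: corrected",
]

for line in SAMPLES:
    print(classify(line), "<-", line)
```

Because each pattern is anchored on the full "event severity: ..." text and ends with `$`, only one line per error record should trigger a rule; the per-section lines such as "Error 0, type: corrected" are left unmatched.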