AArch64: Add equivalence tester
By default, SLOTHY includes a 'selfcheck' verifying that the DFGs
of input and output are isomorphic with respect to the rescheduling
found by SLOTHY. This selfcheck is very powerful and should catch most
modelling errors in SLOTHY.

However, some aspects of SLOTHY remain outside of the scope of the
DFG-based selfcheck:

- Address offset fixup

  This feature has historically been rather error-prone, and the selfcheck
  is blind to it (since it relies on making the DFG blind to address increments).

- Loop boundaries

  If SW pipelining is enabled, the selfcheck checks equivalence of a
  _bounded unrolling_ of the input and output loops.

  Therefore, it would not catch issues with (a) the loop boundary, or
  (b) exceptional iterations for low loop counts.
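
The gap can be illustrated with a toy Python model (plain functions standing in for assembly; all names are illustrative):

```python
def unrolled_body(n):
    """Bounded unrolling: the loop's compare+branch logic is gone."""
    acc = 0
    for _ in range(n):  # fixed iteration count, no exit condition under test
        acc += 1
    return acc

def loop_form(n):
    """Rolled loop with a (deliberately) broken exit condition."""
    acc, i = 0, 0
    while i != n + 1:  # bug: should be i != n
        acc += 1
        i += 1
    return acc
```

Comparing bounded unrollings of input and output passes regardless of the broken branch; only running the rolled loop with its real exit condition exposes the extra iteration.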

While it is not SLOTHY's goal to provide formal guarantees of correctness
of the optimizations (this should be gauged/confirmed through extensive testing
and ideally formal verification), development would be faster if bugs
in the aforementioned areas were caught before the developer integrates
and tests the SLOTHY output manually.

As a step towards catching more potential pitfalls early in the SLOTHY
workflow, this commit starts the development of an _equivalence tester_.

The idea of the equivalence tester is simple: Just run original and optimized
assembly with various random inputs, and check that they indeed lead to
equivalent outputs.
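
In spirit (a minimal stdlib-only sketch, with plain Python callables standing in for the assembled kernels):

```python
import random

def equivalence_test(original, optimized, num_tests=100, seed=0):
    """Run both versions on the same random 64-bit inputs and
    report whether their outputs always agree."""
    rng = random.Random(seed)
    for _ in range(num_tests):
        x = rng.getrandbits(64)
        if original(x) != optimized(x):
            return False
    return True
```

For example, `equivalence_test(lambda x: 2 * x, lambda x: x << 1)` passes, while any pair of functions differing on some sampled input fails.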

This commit implements this idea for AArch64, leveraging the keystone assembler
and the unicorn emulation framework.

Extending the checker to other architectures is left for future commits.

Another initial limitation concerns loops: For now, the equivalence
checker tests bounded unrollings of loops, just like the selfcheck. This runs
counter to the point of the equivalence check, but it will be improved at a later
point, after some refactoring of the loop generation code.

The checker added in this commit is, however, already able to detect issues with
the address offset fixup which would previously have gone unnoticed by the selfcheck.

The new equivalence test is configured via config.selftest, which is enabled
by default, but silently skipped for unsupported platforms or target architectures.
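
Based on the options added in this commit, enabling and tuning the selftest might look like this (a sketch; `slothy` stands for a configured SLOTHY instance):

```python
slothy.config.selftest = True                       # on by default
slothy.config.selftest_address_gprs = {"x0": 2048}  # x0 points to a 2048-byte buffer
slothy.config.selftest_default_memory_size = 1024   # fallback for inferred pointers
```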
hanno-becker committed Dec 7, 2024
1 parent 13f38d0 commit b1e7899
Showing 11 changed files with 1,460 additions and 2 deletions.
52 changes: 52 additions & 0 deletions slothy/core/config.py
@@ -100,6 +100,44 @@ def reserved_regs_are_locked(self):
to eliminate uses of particular registers from some piece of assembly."""
return self._reserved_regs_are_locked

@property
def selftest(self):
"""Indicates whether SLOTHY performs an empirical equivalence-test on the
optimization results.
When this is set, and if the target architecture and host platform support it,
this will run an empirical equivalence checker trying to confirm that the
input and output of SLOTHY are likely functionally equivalent.
The primary purpose of this checker is to detect issues that would presently
be overlooked by the selfcheck:
- The selfcheck is currently blind to address offset fixup. If something goes
wrong, the input and output will not be functionally equivalent, but we would
only notice once we actually compile and run the code. The selftest will
likely catch such issues.
- When using software pipelining, the selfcheck reduces to a straightline check
for a bounded unrolling of the loop. An unbounded selfcheck is currently not
implemented.
With the selftest, you still need to fix a loop bound, but at least you can
equivalence-check the loop-form (including the compare+branch instructions
at the loop boundary) rather than the unrolled code.
NOTE: This is so far implemented as a repeated randomized test -- nothing clever.
"""
return self._selftest

@property
def selftest_address_gprs(self):
"""Dictionary of (reg, sz) items indicating which registers are assumed to be
pointers to memory, and if so, of what size."""
return self._selftest_address_gprs

@property
def selftest_default_memory_size(self):
"""Default buffer size to use for registers which are automatically inferred to be
used as pointers and for which no memory size has been configured via `selftest_address_gprs`."""
return self._selftest_default_memory_size

@property
def selfcheck(self):
"""Indicates whether SLOTHY performs a self-check on the optimization result.
@@ -1143,6 +1181,9 @@ def __init__(self, Arch, Target):
self._reserved_regs = None
self._reserved_regs_are_locked = True

self._selftest = True
self._selftest_address_gprs = None
self._selftest_default_memory_size = 1024
self._selfcheck = True
self._selfcheck_failure_logfile = None
self._allow_useless_instructions = False
@@ -1260,6 +1301,17 @@ def reserved_regs_are_locked(self,val):
@variable_size.setter
def variable_size(self,val):
self._variable_size = val
@selftest.setter
def selftest(self,val):
if val is True and not hasattr(self.arch, "Checker"):
raise InvalidConfig("Trying to enable the selftest, but the architecture model does not seem to support it")
self._selftest = val
@selftest_address_gprs.setter
def selftest_address_gprs(self,val):
self._selftest_address_gprs = val
@selftest_default_memory_size.setter
def selftest_default_memory_size(self,val):
self._selftest_default_memory_size = val
@selfcheck.setter
def selfcheck(self,val):
self._selfcheck = val
36 changes: 35 additions & 1 deletion slothy/core/core.py
@@ -818,6 +818,38 @@ def selfcheck(self, log):
raise SlothySelfCheckException("Isomorphism between computation flow graphs: FAIL!")
return res

def selftest(self, log):
"""Run empirical self test, if it exists for the target architecture"""
if self._config.selftest is False:
return

if hasattr(self._config.arch, "Checker") is False:
log.info(f"Architecture model {self._config.arch.arch_name} does not offer a checker -- skipping")
return

address_gprs = self._config.selftest_address_gprs
if address_gprs is None:
# Try to infer which registers need to be pointers
log_addresses = log.getChild("infer_address_gprs")
tree = DFG(self._orig_code, log_addresses, DFGConfig(self.config, outputs=self.outputs))
# Look for load/store instructions and remember addresses
addresses = set()
for t in tree.nodes:
addr = getattr(t.inst, "addr", None)
if addr is None:
continue
addresses.add(addr)

# For now, we don't look into increments and immediate offsets
# to gauge the amount of memory we actually need. Instead, we
# just allocate a buffer of a configurable default size.
log.info(f"Inferred that the following registers seem to act as pointers: {addresses}")
log.info(f"Using default buffer size of {self._config.selftest_default_memory_size} bytes. "
"If you want different buffer sizes, set selftest_address_gprs manually.")
address_gprs = { a: self._config.selftest_default_memory_size for a in addresses }

self._config.arch.Checker.check(self, address_gprs, log)
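
The inference step above can be sketched in isolation (a toy model; `Inst` is a stand-in for SLOTHY's instruction objects):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Inst:
    txt: str
    addr: Optional[str] = None  # address register, if this is a load/store

nodes = [
    Inst("ldr x1, [x0]", addr="x0"),
    Inst("add x2, x1, x1"),
    Inst("str x2, [x3]", addr="x3"),
]

# Collect the registers used as addresses, as in selftest() above,
# then assign each one a default-sized buffer
addresses = {t.addr for t in nodes if getattr(t, "addr", None) is not None}
address_gprs = {a: 1024 for a in addresses}
```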

def selfcheck_with_fixup(self, log):
"""Do selfcheck, and consider preamble/postamble fixup in case of SW pipelining
@@ -1877,6 +1909,8 @@ def _extract_result(self):
self._result.selfcheck_with_fixup(self.logger.getChild("selfcheck"))
self._result.offset_fixup(self.logger.getChild("fixup"))

self._result.selftest(self.logger.getChild("selftest"))

def _extract_positions(self, get_value):

if self.config.variable_size:
@@ -2954,7 +2988,7 @@ def _add_constraints_latencies(self):
if isinstance(latency, int):
self.logger.debug("General latency constraint: [%s] >= [%s] + %d",
t, i.src, latency)
# Some microarchitectures have instructions with 0-cycle latency, i.e., they can
# forward the result to an instruction in the same cycle (e.g., X+str on Cortex-M7)
# If that is the case we need to make sure that the consumer is after the producer
# in the output.
189 changes: 188 additions & 1 deletion slothy/targets/aarch64/aarch64_neon.py
@@ -40,15 +40,202 @@ class which generates instruction parsers and writers from instruction templates
import logging
import inspect
import re
import os
import math
import platform
import subprocess
from enum import Enum
from functools import cache

from sympy import simplify
from unicorn import *
from unicorn.arm64_const import *

arch_name = "Arm_AArch64"
llvm_mca_arch = "aarch64"

class AArch64CheckerFailure(Exception):
"""Exception thrown when something goes wrong during equivalence testing"""

class AArch64CheckerNotAvailable(Exception):
"""Exception thrown when the checker cannot run (because of an unsupported
platform or configuration, for example)"""

class Checker:
"""Empirical equivalence checker for AArch64 kernels
It takes as argument an instance of the `Result` class and tests
equivalence of input and output by assembling and emulating it, checking
that registers marked as output are indeed always the same.
The checker also tries to deal with pointers and memory contents, but
needs to be told how much memory is assumed to be present,
and which GPRs reference it.
"""

@staticmethod
def unicorn_reg_by_name(reg):
d = {
"x0": UC_ARM64_REG_X0,
"x1": UC_ARM64_REG_X1,
"x2": UC_ARM64_REG_X2,
"x3": UC_ARM64_REG_X3,
"x4": UC_ARM64_REG_X4,
"x5": UC_ARM64_REG_X5,
"x6": UC_ARM64_REG_X6,
"x7": UC_ARM64_REG_X7,
"x8": UC_ARM64_REG_X8,
"x9": UC_ARM64_REG_X9,
"x10": UC_ARM64_REG_X10,
"x11": UC_ARM64_REG_X11,
"x12": UC_ARM64_REG_X12,
"x13": UC_ARM64_REG_X13,
"x14": UC_ARM64_REG_X14,
"x15": UC_ARM64_REG_X15,
"x16": UC_ARM64_REG_X16,
"x17": UC_ARM64_REG_X17,
"x18": UC_ARM64_REG_X18,
"x19": UC_ARM64_REG_X19,
"x20": UC_ARM64_REG_X20,
"x21": UC_ARM64_REG_X21,
"x22": UC_ARM64_REG_X22,
"x23": UC_ARM64_REG_X23,
"x24": UC_ARM64_REG_X24,
"x25": UC_ARM64_REG_X25,
"x26": UC_ARM64_REG_X26,
"x27": UC_ARM64_REG_X27,
"x28": UC_ARM64_REG_X28,
"x29": UC_ARM64_REG_X29,
"x30": UC_ARM64_REG_X30,
"v0": UC_ARM64_REG_V0,
"v1": UC_ARM64_REG_V1,
"v2": UC_ARM64_REG_V2,
"v3": UC_ARM64_REG_V3,
"v4": UC_ARM64_REG_V4,
"v5": UC_ARM64_REG_V5,
"v6": UC_ARM64_REG_V6,
"v7": UC_ARM64_REG_V7,
"v8": UC_ARM64_REG_V8,
"v9": UC_ARM64_REG_V9,
"v10": UC_ARM64_REG_V10,
"v11": UC_ARM64_REG_V11,
"v12": UC_ARM64_REG_V12,
"v13": UC_ARM64_REG_V13,
"v14": UC_ARM64_REG_V14,
"v15": UC_ARM64_REG_V15,
"v16": UC_ARM64_REG_V16,
"v17": UC_ARM64_REG_V17,
"v18": UC_ARM64_REG_V18,
"v19": UC_ARM64_REG_V19,
"v20": UC_ARM64_REG_V20,
"v21": UC_ARM64_REG_V21,
"v22": UC_ARM64_REG_V22,
"v23": UC_ARM64_REG_V23,
"v24": UC_ARM64_REG_V24,
"v25": UC_ARM64_REG_V25,
"v26": UC_ARM64_REG_V26,
"v27": UC_ARM64_REG_V27,
"v28": UC_ARM64_REG_V28,
"v29": UC_ARM64_REG_V29,
"v30": UC_ARM64_REG_V30,
"v31": UC_ARM64_REG_V31,
}
return d.get(reg, None)

@staticmethod
def assemble(code, txt=None):
code = '\n'.join(map(lambda l: l.to_string(), code))
p = subprocess.run(["kstool", "-b", "arm64"], input=code.encode(), capture_output=True)
if p.returncode != 0:
    raise AArch64CheckerFailure(f"kstool failed to assemble code: {p.stderr.decode()}")
return p.stdout

@staticmethod
def check(result, address_gprs, log):
"""Run an equivalence test on an instance of the Result class
Arguments:
- result: Instance of Result class representing the SLOTHY run
to be checked. This includes both the optimized and the
original source code.
- address_gprs: GPR-indexed dictionary indicating which registers
contain pointers to memory. The value indicates the size
of the memory. For example, { "x0" : 1024 } would lead to
X0 being set to the base of a 1024 byte buffer."""

log.info("Running AArch64 selftest...")

# This produces _unrolled_ code, the same that is checked in the selfcheck.
# The selftest should instead use the rolled form of the loop.
iterations = 7
if result.config.sw_pipelining.enabled is True:
old_source, new_source = result.get_fully_unrolled_loop(iterations)
else:
old_source, new_source = result.orig_code, result.code

CODE_BASE = 0x010000
CODE_SZ = 0x010000
RAM_BASE = 0x020000
RAM_SZ = 0x010000
initial_memory = os.urandom(RAM_SZ)

cur_ram = RAM_BASE
# Set initial register contents arbitrarily, except for registers
# which must hold valid memory addresses.
initial_register_contents = {}
for r in RegisterType.list_registers(RegisterType.GPR):
initial_register_contents[r] = int.from_bytes(os.urandom(8))
for v in RegisterType.list_registers(RegisterType.NEON):
initial_register_contents[v] = int.from_bytes(os.urandom(16))
for (reg, sz) in address_gprs.items():
initial_register_contents[reg] = cur_ram
cur_ram += sz
def run_code(code, txt=None):
objcode = Checker.assemble(code, txt=txt)

# Setup emulator
mu = Uc(UC_ARCH_ARM64, UC_MODE_ARM)
# Copy initial register contents into emulator
for r,v in initial_register_contents.items():
ur = Checker.unicorn_reg_by_name(r)
if ur is None:
continue
mu.reg_write(ur, v)
# Copy code into emulator
mu.mem_map(CODE_BASE, CODE_SZ)
mu.mem_write(CODE_BASE, objcode)
# Copy initial memory contents into emulator
mu.mem_map(RAM_BASE, RAM_SZ)
mu.mem_write(RAM_BASE, initial_memory)
# Run emulator
mu.emu_start(CODE_BASE, CODE_BASE + len(objcode))

final_register_contents = {}
for r in RegisterType.list_registers(RegisterType.GPR) + \
RegisterType.list_registers(RegisterType.NEON):
r_unicorn = Checker.unicorn_reg_by_name(r)
if r_unicorn is None:
continue
final_register_contents[r] = mu.reg_read(r_unicorn)
final_memory_contents = mu.mem_read(RAM_BASE, RAM_SZ)

return final_register_contents, final_memory_contents

final_regs_old, final_mem_old = run_code(old_source, txt="old")
final_regs_new, final_mem_new = run_code(new_source, txt="new")

# Check if memory contents are the same
if final_mem_old != final_mem_new:
raise AArch64CheckerFailure("AArch64 checker failed: Memory mismatch")

# Check if register contents are the same
regs_expected = set(result.config.outputs).union(result.config.reserved_regs)
# Ignore hint registers, flags and sp for now
regs_expected = set(filter(lambda t: t.startswith("t") is False and
t != "sp" and t != "flags", regs_expected))
for r in regs_expected:
if final_regs_old[r] != final_regs_new[r]:
raise AArch64CheckerFailure(f"AArch64 checker failed: Register mismatch for {r}: {hex(final_regs_old[r])} != {hex(final_regs_new[r])}")

log.info("Selftest: OK")

class RegisterType(Enum):
GPR = 1
NEON = 2
5 changes: 5 additions & 0 deletions tests/aarch64/checker/Makefile
@@ -0,0 +1,5 @@
checker: kernel0.i kernel1.i harness.S checker.c harness_config.i
gcc harness.S checker.c -o $@

run: checker
./checker
33 changes: 33 additions & 0 deletions tests/aarch64/checker/README.md
@@ -0,0 +1,33 @@
# AArch64 equivalence tester

This directory contains an empirical equivalence tester for AArch64 assembly snippets.

## Usage

So far, the checker can only be used on AArch64 hosts; in principle, however, you can cross-compile it and run it
under emulation on an x86_64 system too.

* Copy the kernels to compare into `kernel0.i` and `kernel1.i`.
* Configure the checker (see below)
* Build the checker: `make checker`
* Run the checker: `./checker`

The checker will repeatedly run the assembly snippets with randomly chosen register inputs (GPR and FPR) and compare the final
register states. The output is the set of registers that have been found to always agree across all tests.

The checker also supports kernels accessing memory buffers, provided the base of the buffers is passed through GPRs. In
this case, the checker will randomly generate initial memory contents, and compare the final memory contents. When
memory contents differ, the checker fails.

### Configuration

You can configure the checker by adjusting `harness_config.i`:

* With `#define KERNEL_STACK_SIZE XXX` you can indicate that the kernels assume memory of size `XXX` starting from
`sp`. Note that you do not need to specify anything for the kernel(s) to use additional stack below `sp`. However,
it is assumed that the kernels restore the original value of `sp` (otherwise, the checker will likely segfault).
* With `#define GPR_FIX_VALUE_X{0-30} VALUE` you can omit random assignment to a GPR and instead force a given
  value. This is intended to force loop counters to sensible values.
* With `#define KERNEL_MEMORY_IN_X{0-30} XXX` you can indicate that a GPR is assumed to hold the base address of a
  buffer of size `XXX`. The checker will then allocate and randomly initialize such a buffer, and set the GPR to
  point to it.
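
A hypothetical `harness_config.i` combining these options might read (values are illustrative):

```c
/* Example configuration -- sizes and register choices are assumptions */
#define KERNEL_STACK_SIZE 512      /* kernels may use 512 bytes at sp */
#define KERNEL_MEMORY_IN_X0 1024   /* x0 points to a 1024-byte random buffer */
```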
