diff --git a/README.md b/README.md
index c30d7f84..325adea9 100644
--- a/README.md
+++ b/README.md
@@ -1,3 +1,7 @@
+<p align="center">
+  <img src="./docs/slothy_logo.png" width="160">
+</p>
+
 **SLOTHY** - **S**uper (**L**azy) **O**ptimization of **T**ricky **H**andwritten assembl**Y** -
 is an assembly-level superoptimizer for:
 1. Instruction scheduling
@@ -6,7 +10,7 @@ for:
 
 SLOTHY is generic in the target architecture and microarchitecture. This repository provides
 instantiations for the Cortex-M55 and Cortex-M85 CPUs implementing Armv8.1-M + Helium, and the Cortex-A55 and Cortex-A72
-CPUs implementing Armv8-A + Neon. There is an experimental model for Cortex-X/Neoverse-V cores.
+CPUs implementing Armv8-A + Neon. There is an experimental model for Cortex-X/Neoverse-V cores.
 
 SLOTHY is discussed in [Fast and Clean: Auditable high-performance assembly via constraint solving](https://eprint.iacr.org/2022/1303).
@@ -16,10 +20,10 @@ SLOTHY enables a development workflow where developers write 'clean' assembly by
 
 ### How it works
 
-SLOTHY is essentially a constraint solver frontend: It converts the input source into a data flow graph and
+SLOTHY is essentially a constraint solver frontend: It converts the input source into a data flow graph and
 builds a constraint model capturing valid instruction schedulings, register renamings, and periodic loop
-interleavings. The model is passed to an external constraint solver and, upon success,
-a satisfying assignment converted back into the final code. Currently, SLOTHY uses
+interleavings. The model is passed to an external constraint solver and, upon success,
+a satisfying assignment is converted back into the final code. Currently, SLOTHY uses
 [Google OR-Tools](https://developers.google.com/optimization) as its constraint solver backend.
 
 ### Performance
@@ -51,9 +55,11 @@ and build from scratch, e.g. as follows (also available as [submodules/setup-ort
 for convenience):
 
 ```
+% apt install -y git build-essential python3-pip cmake swig
 % git submodule init
 % git submodule update
 % cd submodules/or-tools
+% git apply ../0001-Pin-pybind11_protobuf-commit-in-cmake-files.patch
 % mkdir build
 % cmake -S. -Bbuild -DBUILD_PYTHON:BOOL=ON
 % make -C build -j8
@@ -270,4 +276,4 @@ The [examples](examples/naive) directory contains numerous exemplary assembly sn
 `python3 example.py --examples={YOUR_EXAMPLE}`. See `python3 examples.py --help` for the list of all available examples.
 
 The use of SLOTHY from the command line is illustrated in [scripts/](scripts/) supporting the real-world optimizations
-for the NTT, FFT and X25519 discussed in [Fast and Clean: Auditable high-performance assembly via constraint solving](https://eprint.iacr.org/2022/1303).
\ No newline at end of file
+for the NTT, FFT and X25519 discussed in [Fast and Clean: Auditable high-performance assembly via constraint solving](https://eprint.iacr.org/2022/1303).
diff --git a/docs/faq.md b/docs/faq.md
new file mode 100644
index 00000000..d8bc21f9
--- /dev/null
+++ b/docs/faq.md
@@ -0,0 +1,49 @@
+---
+layout: default
+---
+
+## Frequently asked questions
+
+[back](index.md)
+
+#### Is SLOTHY a peephole optimizer?
+
+No. SLOTHY is a _fixed-instruction_ superoptimizer: It keeps instructions and optimizes
+register allocation, instruction scheduling, and software pipelining. It is the developer's or another tool's
+responsibility to map the workload at hand to the target architecture.
+
+
+
+#### Is SLOTHY better than {name your favourite superoptimizer}?
+
+Most likely, they serve different purposes. SLOTHY aims to do one thing well: Optimization _after_ instruction selection.
+It is thus independent of and potentially combinable with superoptimizers operating at earlier stages of the code-generation process, such as [souper](https://github.com/google/souper) and [CryptOpt](https://github.com/0xADE1A1DE/CryptOpt).
+
+#### Does SLOTHY support x86?
+
+The core of SLOTHY is architecture- and microarchitecture-agnostic and can accommodate x86. As it stands, however,
+there is no model of the x86 architecture. Feel free to build one!
+
+#### Does SLOTHY support RISC-V?
+
+As for x86.
+
+#### Is SLOTHY formally verified?
+
+No. Arguably, that wouldn't be a good use of time. The more relevant question is the following:
+
+#### Is SLOTHY-generated code formally verified to be equivalent to the input code?
+
+Not yet. SLOTHY runs a self-check confirming that input and output have isomorphic data flow graphs,
+but pitfalls remain, such as bad user configurations allowing SLOTHY to clobber a register that's not
+meant to be reserved. More work is needed for formal verification of the equivalence of input
+and output.
+
+#### Why is my question not here?
+
+Ping us! ([GitHub](https://github.com/slothy-optimizer/slothy/issues), or see the [paper](https://eprint.iacr.org/2022/1303.pdf) for
+contact information).
\ No newline at end of file
diff --git a/docs/index.md b/docs/index.md
index 2d7d04a3..7c3d8df5 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -12,12 +12,14 @@ super-optimizes:
 `SLOTHY` enables a development workflow where developers write 'clean' assembly by hand, emphasizing the logic of the
 computation, while `SLOTHY` automates microarchitecture-specific micro-optimizations. Since `SLOTHY` does not change
 instructions, and scheduling/allocation optimizations are tightly controlled through configurable and extensible
-constraints, the developer keeps close control over the final assembly, while being freed from the most tedious and
-readability- and verifiability-impeding micro-optimizations.
+constraints, the developer keeps close control over the final assembly, while being freed from tedious
+micro-optimizations.
+
+See also the [FAQ](faq.md).
 
 #### Architecture/Microarchitecture support
 
-`SLOTHY` is generic in the target architecture and microarchitecture. So far, it supports Cortex-M55 and Cortex-M85
+`SLOTHY` is generic in the target architecture and microarchitecture. It currently supports Cortex-M55 and Cortex-M85
 implementing Armv8.1-M + Helium, and Cortex-A55 and Cortex-A72 implementing Armv8-A + Neon. Moreover, there is an
 experimental model for Cortex-X/Neoverse-V cores.
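The change set below re-packages SLOTHY as a proper Python package (`slothy/__init__.py` now re-exports `Slothy`, `SlothyException`, `Config`, and `Archery`), so the workflow described above can be scripted directly. The following is a minimal driver sketch in the style of `example.py` and `slothy-cli`; the input file `ntt.s`, the loop label `layer123_start`, and the output file name are hypothetical placeholders:

```
# Minimal SLOTHY driver sketch using the package layout introduced in this change.
# Assumptions: "ntt.s" exists and contains a loop starting at label "layer123_start".
import logging

from slothy import Slothy

import slothy.targets.arm_v81m.arch_v81m as Arch_Armv81M
import slothy.targets.arm_v81m.cortex_m55r1 as Target_CortexM55r1

slothy = Slothy(Arch_Armv81M, Target_CortexM55r1,
                logger=logging.getLogger("slothy"))

slothy.load_source_from_file("ntt.s")        # read the 'clean' input assembly
slothy.config.sw_pipelining.enabled = True   # allow periodic loop interleaving
slothy.optimize_loop("layer123_start")       # optimize the loop at this label
slothy.write_source_to_file("ntt_opt.s")     # emit the micro-optimized assembly
```

`slothy-cli` (further down in this diff) wires up essentially this flow from command-line arguments.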
diff --git a/docs/slothy_logo.png b/docs/slothy_logo.png
index 27f8bb74..00e3102e 100644
Binary files a/docs/slothy_logo.png and b/docs/slothy_logo.png differ
diff --git a/example.py b/example.py
index 1feb4800..3f48cd6e 100644
--- a/example.py
+++ b/example.py
@@ -25,31 +25,35 @@
 # Author: Hanno Becker
 #
 
-import argparse, logging, sys
-from io import StringIO
+import argparse
+import logging
+import sys
 
-from slothy.slothy import Slothy
-from slothy.core import Config
+from slothy import Slothy, Config
 
-import targets.arm_v81m.arch_v81m as Arch_Armv81M
-import targets.arm_v81m.cortex_m55r1 as Target_CortexM55r1
-import targets.arm_v81m.cortex_m85r1 as Target_CortexM85r1
+import slothy.targets.arm_v81m.arch_v81m as Arch_Armv81M
+import slothy.targets.arm_v81m.cortex_m55r1 as Target_CortexM55r1
+import slothy.targets.arm_v81m.cortex_m85r1 as Target_CortexM85r1
 
-import targets.aarch64.aarch64_neon as AArch64_Neon
-import targets.aarch64.cortex_a55 as Target_CortexA55
-import targets.aarch64.cortex_a72_frontend as Target_CortexA72
+import slothy.targets.aarch64.aarch64_neon as AArch64_Neon
+import slothy.targets.aarch64.cortex_a55 as Target_CortexA55
+import slothy.targets.aarch64.cortex_a72_frontend as Target_CortexA72
 
 target_label_dict = {Target_CortexA55: "a55",
                      Target_CortexA72: "a72",
                      Target_CortexM55r1: "m55",
                      Target_CortexM85r1: "m85"}
 
+class ExampleException(Exception):
+    """Exception thrown when an example goes wrong"""
 
 class Example():
+    """Common boilerplate for SLOTHY examples"""
+
     def __init__(self, infile, name=None, funcname=None, suffix="opt",
                  rename=False, outfile="", arch=Arch_Armv81M, target=Target_CortexM55r1,
                  **kwargs):
-        if name == None:
+        if name is None:
             name = infile
 
         self.arch = arch
@@ -61,7 +65,7 @@ def __init__(self, infile, name=None, funcname=None, suffix="opt",
             self.outfile = f"{infile}_{self.suffix}_{target_label_dict[self.target]}"
         else:
             self.outfile = f"{outfile}_{self.suffix}_{target_label_dict[self.target]}"
-        if funcname == None:
+        if funcname is None:
             self.funcname = self.infile
         subfolder = ""
         if self.arch == AArch64_Neon:
@@ -1127,8 +1131,8 @@ def run_example(name, debug=False):
         if e.name == name:
             ex = e
             break
-    if ex == None:
-        raise Exception(f"Could not find example {name}")
+    if ex is None:
+        raise ExampleException(f"Could not find example {name}")
     ex.run(debug=debug)
 
 for e in todo:
diff --git a/examples/misc/gen_roots.py b/examples/misc/gen_roots.py
index eb41a87b..ebf63201 100644
--- a/examples/misc/gen_roots.py
+++ b/examples/misc/gen_roots.py
@@ -21,28 +21,37 @@
 # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 # SOFTWARE.
 
-import math, sys
+"""Helper script for the generation of twiddle factors for various NTTs"""
+
+import math
+
+class NttRootGenInvalidParameters(Exception):
+    """Invalid parameters for NTT root generation"""
 
 class NttRootGen():
+    """Helper class for the generation of NTT twiddle factors"""
 
     def __init__(self,*,
                  size,
                  modulus,
                  root,
                  layers,
-                 print_label=False,
-                 pad = [],
-                 bitsize = 16,
-                 inverse = False,
+                 print_label = False,
+                 pad = None,
+                 bitsize = 16,
+                 inverse = False,
                  vector_length = 128,
-                 word_offset_mod_4=None,
+                 word_offset_mod_4 = None,
                  incomplete_root = True,
-                 widen_single_twiddles_to_words=True,
-                 block_strided_twiddles=True,
-                 negacyclic = True,
+                 widen_single_twiddles_to_words = True,
+                 block_strided_twiddles = True,
+                 negacyclic = True,
                  iters = None):
 
+        if pad is None:
+            pad = []
 
-        assert bitsize in [16,32]
+        if bitsize not in [16, 32]:
+            raise NttRootGenInvalidParameters("Invalid bit width")
 
         self.pad = pad
         self.print_label=print_label
@@ -76,7 +85,7 @@ def __init__(self,*,
 
         # Need an odd prime modulus
         if self.modulus % 2 == 0:
-            raise Exception("Modulus must be odd")
+            raise NttRootGenInvalidParameters("Modulus must be odd")
         self._inv_mod = pow(self.modulus,-1,2**self.bitsize)
 
         # Check that we've indeed been given a root of unity of the correct order
@@ -87,12 +96,12 @@ def __init__(self,*,
 
         self.log2size = int(math.log(size,2))
         if size != pow(2,self.log2size):
-            raise Exception(f"Size {size} not a power of 2")
+            raise NttRootGenInvalidParameters(f"Size {size} not a power of 2")
 
         self.layers = layers
         self.incompleteness_factor = 2**(self.log2size - self.layers)
 
-        if iters == None:
+        if iters is None:
             if self.layers % 2 == 0:
                 self.iters = [(x,2) for x in range(0,self.layers,2)]
             else:
@@ -106,18 +115,22 @@ def __init__(self,*,
         if ( pow(root, real_root_order, modulus) != 1 or
              pow(root, real_root_order // 2, modulus) == 1 ):
-            raise Exception(f"{root} is not a primitive {real_root_order}-th root of unity modulo {modulus}")
+            raise NttRootGenInvalidParameters(f"{root} is not a primitive {real_root_order}-th "
+                                              f"root of unity modulo {modulus}")
 
         self.radixes = [2] * self.log2size
 
     def get_root_pow(self, exp):
+        """Returns specific power of base root of unity"""
+
         if not exp % self.incompleteness_factor == 0:
-            raise Exception(f"Invalid exponent {exp} for incompleteness factor {self.incompleteness_factor}")
+            raise NttRootGenInvalidParameters(f"Invalid exponent {exp} for incompleteness "
+                                              f"factor {self.incompleteness_factor}")
         if self.incomplete_root:
             exp = exp // self.incompleteness_factor
         return pow(self.root,exp,self.modulus)
 
-    def _prepare_root(self,root,layer=None):
+    def _prepare_root(self,root):
         # Force _signed_ representation of root?
         if root > self.modulus // 2:
@@ -143,6 +156,8 @@ def _bitrev_list(self,num,radix_list):
         return result
 
     def root_of_unity_for_block(self,layer,block):
+        """Returns the twiddle factor to be used for a specific layer and block"""
+
         actual_layer = layer
         if self.negacyclic:
             block += pow(2,layer)
@@ -157,10 +172,14 @@ def root_of_unity_for_block(self,layer,block):
         if self.inverse:
             log = (self.root_order - log) % self.root_order
         root = self.get_root_pow(log)
-        root, root_twisted = self._prepare_root(root,layer)
+        root, root_twisted = self._prepare_root(root)
         return root, root_twisted
 
-    def roots_of_unity_for_layer_core(self, layer, merged):
+    def _roots_of_unity_for_layer_core(self, layer, merged):
+
+        if merged not in [1,2,3,4]:
+            raise NttRootGenInvalidParameters("Invalid layer merge")
+
         for cur_block in range(0,2**layer):
             if merged == 1:
                 root, root_twisted = self.root_of_unity_for_block(layer, cur_block)
@@ -204,11 +223,13 @@ def roots_of_unity_for_layer_core(self, layer, merged):
                 if layer in self.pad:
                     yield ([root0, root1, root2, root3, root4, root5, root6, 0],
-                           [root0_tw, root1_tw, root2_tw, root3_tw, root4_tw, root5_tw, root6_tw, 0])
+                           [root0_tw, root1_tw, root2_tw, root3_tw, root4_tw,
+                            root5_tw, root6_tw, 0])
                 else:
                     yield ([root0, root1, root2, root3, root4, root5, root6],
                            [root0_tw, root1_tw, root2_tw, root3_tw, root4_tw, root5_tw, root6_tw])
-            elif merged == 4:
+            else:
+                assert merged == 4
                 # Compute the roots of unity that we need at this stage
                 fst_layer = layer + 0
                 snd_layer = layer + 1
@@ -261,10 +282,10 @@ def roots_of_unity_for_layer_core(self, layer, merged):
                         root5_tw, root6_tw, root7_tw, root8_tw, root9_tw,
                         root10_tw, root11_tw, root12_tw, root13_tw, root14_tw])
-            else:
-                raise Exception("Something went wrong")
 
     def roots_of_unity_for_layer(self, layer, merged):
+        """Generator yielding the twiddle factors for a number of merged layers"""
+
         num_blocks = 2 ** layer
         block_size = self.size // num_blocks
         butterfly_size = block_size // 2 ** merged
@@ -273,7 +294,7 @@ def roots_of_unity_for_layer(self, layer, merged):
         if butterfly_size < self.vector_length // self.bitsize:
             stride = (self.vector_length // self.bitsize) // butterfly_size
 
-        all_root_pairs = list(self.roots_of_unity_for_layer_core(layer, merged))
+        all_root_pairs = list(self._roots_of_unity_for_layer_core(layer, merged))
         all_roots = [ x[0] for x in all_root_pairs ]
         all_roots_twisted = [ x[1] for x in all_root_pairs ]
         num_pairs = len(all_root_pairs)
@@ -288,7 +309,8 @@ def roots_of_unity_for_layer(self, layer, merged):
             res = [(z,stride) for x in roots for y in x for z in y]
             yield from res
 
-    def get_roots_of_unity_core(self):
+    def get_roots_of_unity(self):
+        """Yields roots of unity for NTT"""
         iters = self.iters.copy()
         if self.inverse:
             iters.reverse()
@@ -297,17 +319,17 @@ def get_roots_of_unity_core(self):
                 yield f"roots_l{''.join([str(i) for i in range(cur_iter,cur_iter+merged)])}:"
             yield from self.roots_of_unity_for_layer(cur_iter,merged)
 
-    def get_roots_of_unity_real(self):
+    def _get_roots_of_unity_asm(self):
+        """Yields the twiddle factors as assembly source lines"""
         if self.bitsize == 16:
             twiddlesize = "short"
-        elif self.bitsize == 32:
-            twiddlesize = "word"
         else:
-            raise Exception("Should not happen")
+            assert self.bitsize == 32
+            twiddlesize = "word"
 
         count = 0
         last_stride = None
-        for x in self.get_roots_of_unity_core():
+        for x in self.get_roots_of_unity():
             if isinstance(x,str):
                 yield x
                 continue
@@ -320,7 +342,7 @@ def _get_roots_of_unity_asm(self):
             count += 1
             if stride > 1:
                 if last_stride == 1:
-                    if self.word_offset_mod_4 != None:
+                    if self.word_offset_mod_4 is not None:
                         yield f"// Word count until here: {count}"
                         cc4 = count % 4
                         diff = self.word_offset_mod_4 - cc4
@@ -338,10 +360,13 @@ def _get_roots_of_unity_asm(self):
             last_stride = stride
 
     def export(self, filename):
-        license = """
+        """Export twiddle factors as file"""
+
+        license_text = """
 ///
 /// Copyright (c) 2022 Arm Limited
 /// Copyright (c) 2022 Hanno Becker
+/// Copyright (c) 2023 Amin Abdulrahman, Matthias Kannwischer
 /// SPDX-License-Identifier: MIT
 ///
 /// Permission is hereby granted, free of charge, to any person obtaining a copy
@@ -364,18 +389,19 @@ def export(self, filename):
 ///
 """
 
-        f = open(filename,"w")
-        f.write(license)
-        f.write('\n'.join(self.get_roots_of_unity_real()))
-        f.close()
+        with open(filename, "w", encoding="utf-8") as f:
+            f.write(license_text)
+            f.write('\n'.join(self._get_roots_of_unity_asm()))
 
-def main():
+def _main():
 
-    ntt_kyber_l345 = NttRootGen(size=256,modulus=3329,root=17,layers=7,iters=[(0,2),(2,3),(5,2)], word_offset_mod_4=2)
+    ntt_kyber_l345 = NttRootGen(size=256,modulus=3329,root=17,layers=7,iters=[(0,2),(2,3),(5,2)],
+                                word_offset_mod_4=2)
     ntt_kyber_l345.export("../naive/ntt_kyber_12_345_67_twiddles.s")
     ntt_kyber_l345.export("../opt/ntt_kyber_12_345_67_twiddles.s")
 
-    ntt_kyber_l123 = NttRootGen(size=256,modulus=3329,root=17,layers=7,iters=[(0,3),(3,2),(5,2)], pad=[0,3], print_label=True, widen_single_twiddles_to_words=False)
+    ntt_kyber_l123 = NttRootGen(size=256,modulus=3329,root=17,layers=7,iters=[(0,3),(3,2),(5,2)],
+                                pad=[0,3], print_label=True, widen_single_twiddles_to_words=False)
     ntt_kyber_l123.export("../naive/ntt_kyber_123_45_67_twiddles.s")
     ntt_kyber_l123.export("../opt/ntt_kyber_123_45_67_twiddles.s")
 
@@ -387,11 +413,13 @@ def main():
     intt_kyber.export("../naive/intt_kyber_1_23_45_67_twiddles.s")
     intt_kyber.export("../opt/intt_kyber_1_23_45_67_twiddles.s")
 
-    ntt_dilithium = NttRootGen(size=256,bitsize=32,modulus=8380417,root=1753,layers=8, word_offset_mod_4=2)
+    ntt_dilithium = NttRootGen(size=256,bitsize=32,modulus=8380417,root=1753,layers=8,
+                               word_offset_mod_4=2)
     ntt_dilithium.export("../naive/ntt_dilithium_12_34_56_78_twiddles.s")
     ntt_dilithium.export("../opt/ntt_dilithium_12_34_56_78_twiddles.s")
 
-    ntt_dilithium_l1234 = NttRootGen(size=256, bitsize=32, modulus=8380417, root=1753, layers=8, iters=[(0,4),(4,2),(6,2)], pad=[0], print_label=True)
+    ntt_dilithium_l1234 = NttRootGen(size=256, bitsize=32, modulus=8380417, root=1753,
+                                     layers=8, iters=[(0,4),(4,2),(6,2)], pad=[0], print_label=True)
     ntt_dilithium_l1234.export("../naive/aarch64/ntt_dilithium_1234_5678_twiddles.s")
     ntt_dilithium_l1234.export("../opt/aarch64/ntt_dilithium_1234_5678_twiddles.s")
 
@@ -400,8 +428,8 @@ def main():
     ntt_dilithium_l123.export("../naive/ntt_dilithium_123_456_78_twiddles.s")
     ntt_dilithium_l123.export("../opt/ntt_dilithium_123_456_78_twiddles.s")
 
-    ntt_dilithium_l123 = NttRootGen(size=256,bitsize=32,modulus=8380417,root=1753,layers=8, print_label=True, pad=[0,3],
-                                    iters=[(0,3),(3,3),(6,2)])
+    ntt_dilithium_l123 = NttRootGen(size=256,bitsize=32,modulus=8380417,root=1753,layers=8,
+                                    print_label=True, pad=[0,3], iters=[(0,3),(3,3),(6,2)])
     ntt_dilithium_l123.export("../naive/aarch64/ntt_dilithium_123_456_78_twiddles.s")
     ntt_dilithium_l123.export("../opt/aarch64/ntt_dilithium_123_456_78_twiddles.s")
 
@@ -448,4 +476,4 @@ def main():
     intt_n256_s32_l8_test.export("../opt/intt_n256_l8_s32_twiddles.s")
 
 if __name__ == "__main__":
-    main()
+    _main()
diff --git a/paper/artifact/0001-Pin-pybind11_protobuf-commit-in-cmake-files.patch b/paper/artifact/0001-Pin-pybind11_protobuf-commit-in-cmake-files.patch
new file mode 100644
index 00000000..526860d5
--- /dev/null
+++ b/paper/artifact/0001-Pin-pybind11_protobuf-commit-in-cmake-files.patch
@@ -0,0 +1,25 @@
+From 3b6f6999c042322268eb3ba84e829097014b7428 Mon Sep 17 00:00:00 2001
+From: Hanno Becker
+Date: Tue, 19 Dec 2023 21:24:57 +0000
+Subject: [PATCH] Pin pybind11_protobuf commit in cmake files
+
+---
+ cmake/dependencies/CMakeLists.txt | 2 +-
+ 1 file changed, 1 insertion(+), 1 deletion(-)
+
+diff --git a/cmake/dependencies/CMakeLists.txt b/cmake/dependencies/CMakeLists.txt
+index c39a44fb89..27923ccedb 100644
+--- a/cmake/dependencies/CMakeLists.txt
++++ b/cmake/dependencies/CMakeLists.txt
+@@ -177,7 +177,7 @@ if(BUILD_PYTHON AND BUILD_pybind11_protobuf)
+   FetchContent_Declare(
+     pybind11_protobuf
+     GIT_REPOSITORY "https://github.com/pybind/pybind11_protobuf.git"
+-    GIT_TAG "main"
++    GIT_TAG "5baa2dc9d93e3b608cde86dfa4b8c63aeab4ac78"
+     PATCH_COMMAND git apply --ignore-whitespace "${CMAKE_CURRENT_LIST_DIR}/../../patches/pybind11_protobuf.patch"
+   )
+   FetchContent_MakeAvailable(pybind11_protobuf)
+--
+2.39.3 (Apple Git-145)
+
diff --git a/paper/artifact/slothy.Dockerfile b/paper/artifact/slothy.Dockerfile
index bc93a1ff..ae50ccc1 100644
--- a/paper/artifact/slothy.Dockerfile
+++ b/paper/artifact/slothy.Dockerfile
@@ -33,6 +33,8 @@ RUN unzip or-tools.zip
 RUN rm or-tools.zip
 RUN mv or-tools-9.7 or-tools
 WORKDIR /home/ubuntu/slothy/submodules/or-tools
+COPY 0001-Pin-pybind11_protobuf-commit-in-cmake-files.patch .
+RUN git apply 0001-Pin-pybind11_protobuf-commit-in-cmake-files.patch
 RUN mkdir /home/ubuntu/slothy/submodules/or-tools/build
 RUN cmake -S. -Bbuild -DBUILD_PYTHON:BOOL=ON -DBUILD_SAMPLES:BOOL=OFF -DBUILD_EXAMPLES:BOOL=OFF
 WORKDIR /home/ubuntu/slothy/submodules/or-tools/build
@@ -45,4 +47,4 @@ RUN ln -s /home/ubuntu/slothy /home/ubuntu/pqax/slothy
 RUN rm -rf /home/ubuntu/pqmx/slothy
 RUN ln -s /home/ubuntu/slothy /home/ubuntu/pqmx/slothy
 WORKDIR /home/ubuntu
-RUN ln -s /home/ubuntu/slothy/paper/README.md /home/ubuntu/README.md
\ No newline at end of file
+RUN ln -s /home/ubuntu/slothy/paper/README.md /home/ubuntu/README.md
diff --git a/paper/clean/neon/X25519-AArch64-simple.s b/paper/clean/neon/X25519-AArch64-simple.s
index d7e9d6b8..a43f2841 100644
--- a/paper/clean/neon/X25519-AArch64-simple.s
+++ b/paper/clean/neon/X25519-AArch64-simple.s
@@ -113,7 +113,7 @@
 .endm
 
 # TODO: also unwrap
-.macro fcsel_dform out, in0, in1, cond // slothy:no-unfold
+.macro fcsel_dform out, in0, in1, cond // @slothy:no-unfold
     fcsel dform_\out, dform_\in0, dform_\in1, \cond
 .endm
@@ -416,10 +416,10 @@ sZ48 .req x22
     stack_vstr_dform \offset\()_32, \vA\()8
 .endm
 
-.macro vector_load_lane vA, offset, lane
 // TODO: eliminate this explicit register assignment by converting stack_vld2_lane to AArch64Instruction
 xvector_load_lane_tmp .req x26
+.macro vector_load_lane vA, offset, lane
     add xvector_load_lane_tmp, sp, #\offset\()_0
     stack_vld2_lane \vA\()0, \vA\()1, xvector_load_lane_tmp, \offset\()_0, \lane, 8
     stack_vld2_lane \vA\()2, \vA\()3, xvector_load_lane_tmp, \offset\()_8, \lane, 8
@@ -591,8 +591,6 @@ sZ48 .req x22
     scalar_decompress_inner \sA\()0, \sA\()1, \sA\()2, \sA\()3, \sA\()4, \sA\()5, \sA\()6, \sA\()7, \sA\()8, \sA\()9
 .endm
 
-.macro vector_addsub_repack_inner vA0, vA1, vA2, vA3, vA4, vA5, vA6, vA7, vA8, vA9, \
-                                  vC0, vC1, vC2, vC3, vC4, vC5, vC6, vC7, vC8, vC9
 // TODO: eliminate those. should be easy
 vR_l4h4l5h5 .req vADBC4
 vR_l6h6l7h7 .req vADBC5
@@ -620,6 +618,8 @@ sZ48 .req x22
 vrepack_inner_tmp .req v19
 vrepack_inner_tmp2 .req v0
 
+.macro vector_addsub_repack_inner vA0, vA1, vA2, vA3, vA4, vA5, vA6, vA7, vA8, vA9, \
+                                  vC0, vC1, vC2, vC3, vC4, vC5, vC6, vC7, vC8, vC9
     vuzp1 vR_l4h4l5h5, \vC4, \vC5
     vuzp1 vR_l6h6l7h7, \vC6, \vC7
     stack_vld1r vrepack_inner_tmp, STACK_MASK1
@@ -949,6 +949,8 @@ scalar_mul_inner \
         \sB\()0, \sB\()1, \sB\()2, \sB\()3, \sB\()4, \sB\()5, \sB\()6, \sB\()7, \sB\()8, \sB\()9
 .endm
 
+xtmp_scalar_sub_0 .req x21
+
 // sC0 .. sC4 output C = A + 4p - B (registers may be the same as A)
 // sA0 .. sA4 first operand A
 // sB0 .. sB4 second operand B
@@ -957,8 +959,6 @@ scalar_mul_inner \
         sA0, sA1, sA2, sA3, sA4, \
         sB0, sB1, sB2, sB3, sB4
 
-    xtmp_scalar_sub_0 .req x21
-
     ldr xtmp_scalar_sub_0, #=0x07fffffe07fffffc
     add \sC1, \sA1, xtmp_scalar_sub_0
     add \sC2, \sA2, xtmp_scalar_sub_0
diff --git a/paper/scripts/slothy_ntt_helium.py b/paper/scripts/slothy_ntt_helium.py
index 202054f6..14a34a0e 100644
--- a/paper/scripts/slothy_ntt_helium.py
+++ b/paper/scripts/slothy_ntt_helium.py
@@ -28,12 +28,11 @@
 import argparse, logging, sys, os, time
 from io import StringIO
 
-from slothy.slothy import Slothy
-from slothy.core import Config
+from slothy import Slothy, Config
 
-import targets.arm_v81m.arch_v81m as Arch_Armv81M
-import targets.arm_v81m.cortex_m55r1 as Target_CortexM55r1
-import targets.arm_v81m.cortex_m85r1 as Target_CortexM85r1
+import slothy.targets.arm_v81m.arch_v81m as Arch_Armv81M
+import slothy.targets.arm_v81m.cortex_m55r1 as Target_CortexM55r1
+import slothy.targets.arm_v81m.cortex_m85r1 as Target_CortexM85r1
 
 target_label_dict = {Target_CortexM55r1: "m55",
                      Target_CortexM85r1: "m85"}
@@ -42,7 +41,7 @@ class Example():
     def __init__(self, infile, name=None, funcname=None, suffix="opt",
                  rename=False, outfile="", arch=Arch_Armv81M, target=Target_CortexM55r1,
                  **kwargs):
-        if name == None:
+        if name is None:
             name = infile
 
         self.arch = arch
@@ -54,7 +53,7 @@ def __init__(self, infile, name=None, funcname=None, suffix="opt",
             self.outfile = f"{infile}_{self.suffix}_{target_label_dict[self.target]}"
         else:
             self.outfile = f"{outfile}_{self.suffix}_{target_label_dict[self.target]}"
-        if funcname == None:
+        if funcname is None:
             self.funcname = self.infile
         self.infile_full = f"../clean/helium/ntt/{self.infile}.s"
         self.outfile_full = f"../opt/helium/ntt/{self.outfile}.s"
diff --git a/slothy-cli b/slothy-cli
index b6408c8f..ff8ea9e7 100755
--- a/slothy-cli
+++ b/slothy-cli
@@ -28,47 +28,52 @@
 import logging
 import time
 import os
 
-from slothy.slothy import Slothy
-from slothy.config import Config as SlothyConfig
-from targets.query import Archery
+from slothy import Slothy, Archery
 
-def main(argv):
+class CmdLineException(Exception):
+    """Exception thrown when a problem is encountered with the command line parameters"""
+
+def _main():
     parser = argparse.ArgumentParser(
         formatter_class=argparse.ArgumentDefaultsHelpFormatter)
-    parser.add_argument("arch", type=str,
-                        choices=Archery.list_archs(), help="The target architecture")
-    parser.add_argument("target", type=str,
-                        choices=Archery.list_targets(), help="The target microarchitecture")
+    parser.add_argument("arch", type=str, choices=Archery.list_archs(),
+                        help="The target architecture")
+    parser.add_argument("target", type=str, choices=Archery.list_targets(),
+                        help="The target microarchitecture")
     parser.add_argument("input", type=str,
-                        help="The name of the assembly source file.")
+                        help="The name of the assembly source file.")
     parser.add_argument("-d", "--debug", default=False, action='store_true',
-                       help="Show debug output")
+                        help="Show debug output")
     parser.add_argument("-o", "--output", type=str, default=None,
-                       help="The name of the file to write the generated assembly to. "
-                       "If unspecified, the assembly will be printed on the standard output.")
+                        help="The name of the file to write the generated assembly to. "
+                        "If unspecified, the assembly will be printed on the standard output.")
-    parser.add_argument("-c", "--config", default=[], action="append", nargs='*', metavar="OPTION=VALUE",
-                        help="""A (potentially empty) list of modifications to the default configuration of Slothy.""")
+    parser.add_argument("-c", "--config", default=[], action="append", nargs='*',
+                        metavar="OPTION=VALUE",
+                        help="""A (potentially empty) list of modifications
+                        to the default configuration of Slothy.""")
     parser.add_argument("-l", "--loop", default=[], action='append', type=str,
-                       help="""The starting label for the loop to optimize. This is mutually
-                       exclusive with -s/--start and -e/--end, which allowv you to specify
-                       the code to optimize via start/end separately.""")
+                        help="""The starting label for the loop to optimize. This is mutually
+                        exclusive with -s/--start and -e/--end, which allow you to specify
+                        the code to optimize via start/end separately.""")
+    parser.add_argument("--fusion", default=False, action='store_true')
+    parser.add_argument("--fusion-only", default=False, action='store_true')
     parser.add_argument("-s", "--start", default=None, type=str,
-                       help="""The label or line at which the to code to optimize begins.
-                       This is mutually exclusive with -l/--loop.""")
+                        help="""The label or line at which the code to optimize begins.
+                        This is mutually exclusive with -l/--loop.""")
     parser.add_argument("-e", "--end", default=None, type=str,
-                       help="""The label or line at which the to code to optimize ends
-                       This is mutually exclusive with -l/--loop.""")
+                        help="""The label or line at which the code to optimize ends.
+                        This is mutually exclusive with -l/--loop.""")
     parser.add_argument("-r", "--rename-function", default=None, type=str,
-                       help="""Perform function renaming. Format: 'old_func_name,new_func_name'""")
+                        help="""Perform function renaming. Format: 'old_func_name,new_func_name'""")
     parser.add_argument("--silent", default=False, action='store_true',
-                       help="""Silent mode: Only print warnings and errors""")
+                        help="""Silent mode: Only print warnings and errors""")
     parser.add_argument("--log", default=False, action='store_true',
-                       help="""Write logging output to file""")
+                        help="""Write logging output to file""")
     parser.add_argument("--logdir", default=".", type=str,
-                       help="""Directory to store log output to""")
+                        help="""Directory to store log output to""")
     parser.add_argument("--logfile", default=None, type=str,
-                       help="""File to write logging output to. Can be omitted, in which case a generic name with timestamp is used""")
+                        help="""File to write logging output to. Can be omitted,
+                        in which case a generic name with timestamp is used""")
 
     args = parser.parse_args()
@@ -116,51 +121,51 @@ def main(argv):
 
     logger = logging.getLogger("slothy-cli")
 
-    Arch = Archery.get_arch(args.arch)
-    Target = Archery.get_target(args.target)
-    slothy = Slothy(Arch,Target,logger=logger)
+    arch = Archery.get_arch(args.arch)
+    target = Archery.get_target(args.target)
+    slothy = Slothy(arch,target,logger=logger)
 
     def parse_config_value_as(val, ty):
         def parse_as_float(val):
             try:
                 res = float(val)
                 return res
-            except:
+            except ValueError:
                 return None
         def check_ty(ty_real):
-            if ty == None or ty == type(None) or ty == ty_real:
+            if ty is None or ty == type(None) or ty == ty_real:
                 return
-            raise Exception(f"Configuration value {val} isn't correctly typed -- " \
-                            f"expected {ty}, but got {ty_real}")
+            raise CmdLineException(f"Configuration value {val} isn't correctly typed -- " \
+                                   f"expected {ty}, but got {ty_real}")
         if val == "":
-            raise Exception("Invalid configuration value")
-        logger.debug(f"Parsing configuration value {val} with expected type {ty}")
+            raise CmdLineException("Invalid configuration value")
+        logger.debug("Parsing configuration value %s with expected type %s", val, ty)
         if val.isdigit():
             check_ty(int)
-            logger.debug(f"Value {val} parsed as integer")
+            logger.debug("Value %s parsed as integer", val)
             return int(val)
         if val.lower() == "true":
             check_ty(bool)
-            logger.debug(f"Value {val} parsed as Boolean")
+            logger.debug("Value %s parsed as Boolean", val)
             return True
         if val.lower() == "false":
             check_ty(bool)
-            logger.debug(f"Value {val} parsed as Boolean")
+            logger.debug("Value %s parsed as Boolean", val)
             return False
         # Try to parse as RegisterType
-        ty = Arch.RegisterType.from_string(val)
-        if ty != None:
-            logger.debug(f"Value {val} parsed as RegisterType")
+        ty = arch.RegisterType.from_string(val)
+        if ty is not None:
+            logger.debug("Value %s parsed as RegisterType", val)
             return ty
         f = parse_as_float(val)
-        if f != None:
+        if f is not None:
             check_ty(float)
-            logger.debug(f"Value {val} parsed as float")
+            logger.debug("Value %s parsed as float", val)
             return f
         if val[0] == '[' and val[-1] == ']':
             check_ty(list)
             val = val[1:-1].split(',')
-            logger.debug(f"Parsing {val} is a list -- parse recursively")
+            logger.debug("Parsing %s is a list -- parse recursively", val)
             return [ parse_config_value_as(v,None) for v in val ]
         if val[0] == '{' and val[-1] == '}':
             check_ty(dict)
@@ -168,11 +173,11 @@ def main(argv):
             kvs = [ kv.split(':') for kv in kvs ]
             for kv in kvs:
                 if not len(kv) == 2:
-                    raise Exception("Invalid dictionary entry")
-            logger.debug(f"Parsing {val} is a dictionary -- parse recursively")
+                    raise CmdLineException("Invalid dictionary entry")
+            logger.debug("Parsing %s is a dictionary -- parse recursively", val)
             return { parse_config_value_as(k, None) : parse_config_value_as(v, None)
                      for k,v in kvs }
-        logger.debug(f"Parsing {val} as string")
+        logger.debug("Parsing %s as string", val)
         return val
 
     # A plain '-c' without arguments should list all available configuration options
@@ -196,14 +201,14 @@ def main(argv):
             obj = getattr(obj,attrs.pop(0))
         attr = attrs.pop(0)
         val = parse_config_value_as(val, type(getattr(obj,attr)))
-        logger.info(f"- Setting configuration option {attr} to value {val}")
+        logger.info("Setting configuration option %s to value %s", attr, val)
         setattr(obj,attr,val)
 
-    def check_list_of_fixed_len_list(lst, fixlen):
+    def check_list_of_fixed_len_list(lst):
         invalid = next(filter(lambda o: len(o) != 1, lst), None)
-        if invalid != None:
-            raise Exception(f"Invalid configuration argument {invalid} in {lst}")
-    check_list_of_fixed_len_list(args.config,1)
+        if invalid is not None:
+            raise CmdLineException(f"Invalid configuration argument {invalid} in {lst}")
+    check_list_of_fixed_len_list(args.config)
     config_kv_pairs = [ c[0].split('=') for c in args.config ]
     for kv in config_kv_pairs:
         # We allow shorthands for boolean configurations
@@ -222,17 +227,23 @@ def main(argv):
         elif len(kv) == 2:
             setattr_recursive(slothy.config, kv[0], kv[1])
         else:
-            raise Exception(f"Invalid configuration {kv}")
+            raise CmdLineException(f"Invalid configuration {kv}")
 
     # Read input
     slothy.load_source_from_file(args.input)
 
     # Optimize
-    if len(args.loop) > 0:
-        for l in args.loop:
-            slothy.optimize_loop(l)
-    else:
-        slothy.optimize(start=args.start, end=args.end)
+    if args.fusion is True:
+        if len(args.loop) > 0:
+            for l in args.loop:
+                slothy.fusion_loop(l)
+
+    if not (args.fusion is True and args.fusion_only is True):
+        if len(args.loop) > 0:
+            for l in args.loop:
+                slothy.optimize_loop(l)
+        else:
+            slothy.optimize(start=args.start, end=args.end)
 
     # Rename
     if args.rename_function:
@@ -245,7 +256,7 @@ def main(argv):
     if args.output is not None:
         slothy.write_source_to_file(args.output)
     else:
-        slothy.print_code()
+        print(slothy.get_source_as_string())
 
 if __name__ == "__main__":
-    main(sys.argv[1:])
+    _main()
diff --git a/slothy/__init__.py b/slothy/__init__.py
index e69de29b..51547d11 100644
--- a/slothy/__init__.py
+++ b/slothy/__init__.py
@@ -0,0 +1,4 @@
+from slothy.core.slothy import Slothy
+from slothy.core.core import SlothyException
+from slothy.core.config import Config
+from slothy.targets.query import Archery
diff --git a/targets/__init__.py b/slothy/core/__init__.py
similarity index 100%
rename from targets/__init__.py
rename to slothy/core/__init__.py
diff --git a/slothy/config.py b/slothy/core/config.py
similarity index 78%
rename from slothy/config.py
rename to slothy/core/config.py
index d7df3294..e0b32da3 100644
--- a/slothy/config.py
+++ b/slothy/core/config.py
@@ -25,6 +25,12 @@
 # Author: Hanno Becker
 #
 
+"""
+SLOTHY configuration
+"""
+
+# pylint:disable=too-many-lines
+
 from copy import deepcopy
 import os
 
@@ -39,31 +45,6 @@ class Config(NestedPrint, LockAttributes):
     This configuration object is used both for one-shot optimizations using
     SlothyBase, as well as stateful multi-pass optimizations using Slothy."""
 
-    _default_split_heuristic = False
-    _default_split_heuristic_visualize_stalls = False
-    _default_split_heuristic_visualize_units = False
-    _default_split_heuristic_region = [0.0,1.0]
-    _default_split_heuristic_chunks = False
-    _default_split_heuristic_optimize_seam = 0
-    _default_split_heuristic_bottom_to_top = False
-    _default_split_heuristic_factor = 2
-    _default_split_heuristic_abort_cycle_at = None
-    _default_split_heuristic_stepsize = None
-    _default_split_heuristic_repeat = 1
-    _default_split_heuristic_preprocess_naive_interleaving = False
-    _default_split_heuristic_preprocess_naive_interleaving_by_latency = False
-
-    _default_compiler_binary = "gcc"
-
-    _default_unsafe_skip_address_fixup = False
-
-    _default_with_preprocessor = False
-    _default_max_solutions = 64
-    _default_timeout = None
-    _default_retry_timeout = None
-    _default_ignore_objective = False
-    _default_objective_precision = 0
-
     @property
     def arch(self):
         """The module defining the underlying architecture used by Slothy.
@@ -98,6 +79,97 @@ def reserved_regs(self):
             return self._reserved_regs
         return self._arch.RegisterType.default_reserved()
 
+    @property
+    def selfcheck(self):
+        """Indicates whether SLOTHY performs a self-check on the optimization result.
+
+        The selfcheck confirms that the scheduling permutation found by SLOTHY yields
+        an isomorphism between the data flow graphs of the original and optimized code.
+
+        WARNING: Do not unset this option unless you know what you are doing.
+        It is vital in catching bugs in the model generation early.
+
+        WARNING: The selfcheck is not a formal verification of SLOTHY's output!
+        There are at least two classes of bugs uncaught by the selfcheck:
+
+        - User configuration issues: The selfcheck validates SLOTHY's optimization
+          in the context of the provided configuration. Validation of the configuration
+          is the user's responsibility. Two common pitfalls include missing reserved
+          registers (allowing SLOTHY to clobber more registers than intended), or
+          missing output registers (allowing SLOTHY to overwrite an output register
+          in subsequent instructions).
+
+          This is the most common source of issues for code passing the selfcheck
+          but remaining functionally incorrect.
+
+        - Bugs in address offset fixup: SLOTHY's modelling of post-load/store address
+          increments is deliberately inaccurate to allow for reordering of such instructions
+          leveraging commutativity relations such as
+
+          ```
+          LDR X,[A],#imm; STR Y,[A]    ===    STR Y,[A, #imm]; LDR X,[A],#imm
+          ```
+
+          (See also section "Address offset rewrites" in the SLOTHY paper).
+
+          Bugs in SLOTHY's address fixup logic would not be caught by the selfcheck.
+          If your code doesn't work and you are sure to have configured SLOTHY correctly,
+          you may therefore want to double-check that address offsets have been adjusted
+          correctly by SLOTHY.
+        """
+        return self._selfcheck
+
+    @property
+    def allow_useless_instructions(self):
+        """Indicates whether SLOTHY should tolerate unused instructions.
+
+        SLOTHY requires explicit knowledge of the intended output registers of its
+        input assembly. Unless this option is set, SLOTHY will flag and abort upon
+        encountering an instruction which writes to a register which (a) is not an
+        output register and (b) is not used by any later instruction.
+
+        The reason for this behaviour is that such unused instructions are usually
+        a sign of a buggy configuration, which would likely lead to intended output
+        registers being clobbered by later instructions.
+
+        WARNING: Don't enable this option unless you know what you are doing!
+        Enabling it makes it much easier to overlook configuration
+        issues in SLOTHY and can lead to hard-to-debug optimization failures.
+        """
+        return self._allow_useless_instructions
+
+    @property
+    def variable_size(self):
+        """Model number of stalls as a parameter in the constraint model.
+
+        If this is set, one-shot SLOTHY optimization will make the number of stalls
+        flexible in the model and, by default, task the underlying constraint solver
+        to minimize it.
+
+        If this is not set, one-shot SLOTHY optimizations will search for solutions
+        with a fixed number of stalls, and an external binary search will be used to
+        find the minimum number of stalls.
+
+        For small-to-medium sized assembly input, this option should be set, and will
+        lead to faster optimization. For large assembly input, the user should experiment
+        and consider unsetting it to reduce model complexity.
+ """ + return self._variable_size + + @property + def keep_tags(self): + """Indicates whether tags in the input source should be kept or removed. + + Tags include pre/core/post or ordering annotations that usually become meaningless + post-optimization. However, for preprocessing runs that do not reorder code, it makes + sense to keep them.""" + return self._keep_tags + + @property + def ignore_tags(self): + """Indicates whether tags in the input source should be ignored.""" + return self._ignore_tags + @property def register_aliases(self): """Dictionary mapping symbolic register names to architectural register names. @@ -110,6 +182,7 @@ def register_aliases(self): return { **self._register_aliases, **self._arch.RegisterType.default_aliases() } def add_aliases(self, new_aliases): + """Add further register aliases to the configuration""" self._register_aliases = { **self._register_aliases, **new_aliases } @property @@ -222,7 +295,7 @@ def compiler_binary(self): """The compiler binary to be used. This is only relevant of `with_preprocessor` is set.""" - return self._default_compiler_binary + return self._compiler_binary @property def timeout(self): @@ -238,11 +311,31 @@ def retry_timeout(self): return self._retry_timeout @property - def unsafe_skip_address_fixup(self): - """Warn but not fail if post-optimization address fixup failed. - - (See 4.13, Address offset rewrites, in https://eprint.iacr.org/2022/1303.pdf)""" - return self._unsafe_skip_address_fixup + def do_address_fixup(self): + """Indicates whether post-optimization address fixup should be conducted. + + SLOTHY's modelling of post-load/store address increments is deliberately + inaccurate to allow for reordering of such instructions leveraging commutativity + relations such as: + + ``` + LDR X,[A],#imm; STR Y,[A] === STR Y,[A, #imm]; LDR X,[A],#imm + ``` + + When such reordering happens, a "post-optimization address fixup" of immediate + load/store offsets is necessary. See also section "Address offset rewrites" in + the SLOTHY paper. + + Disabling this option will skip post-optimization address fixup and put the + burden of post-optimization address fixup on the user. + Disabling this option does NOT tighten the constraint model to forbid reorderings + such as the above. + + WARNING: Don't disable this option unless you know what you are doing! + Disabling this will likely lead to optimized code that is functionally incorrect + and needing manual address offset fixup! + """ + return self._do_address_fixup @property def ignore_objective(self): @@ -321,6 +414,9 @@ def split_heuristic_stepsize(self): @property def split_heuristic_optimize_seam(self): + """If the split heuristic is used, the number of instructions above and beyond + the current sliding window that should be fixed but taken into account during + optimization.""" if not self.split_heuristic: raise InvalidConfig("Did you forget to set config.split_heuristic=True? "\ "Shouldn't read config.split_heuristic_optimize_seam otherwise.") @@ -337,27 +433,13 @@ def split_heuristic_chunks(self): @property def split_heuristic_bottom_to_top(self): + """If the split heuristic is used, move the sliding window from bottom to top + rather than from top to bottom.""" if not self.split_heuristic: raise InvalidConfig("Did you forget to set config.split_heuristic=True? 
"\ "Shouldn't read config.split_heuristic_bottom_to_top otherwise.") return self._split_heuristic_bottom_to_top - @property - def split_heuristic_visualize_stalls(self): - """Attempt to visualize the stalls after application of the split heuristic""" - if not self.split_heuristic: - raise InvalidConfig("Did you forget to set config.split_heuristic=True? "\ - "Shouldn't read config.split_heuristic_visualize_stalls otherwise.") - return self._split_heuristic_visualize_stalls - - @property - def split_heuristic_visualize_units(self): - """Attempt to visualize the functional units after application of the split heuristic""" - if not self.split_heuristic: - raise InvalidConfig("Did you forget to set config.split_heuristic=True? "\ - "Shouldn't read config.split_heuristic_visualize_units otherwise.") - return self._split_heuristic_visualize_units - @property def split_heuristic_region(self): """Restrict the split heuristic to a sub-region of the code. @@ -369,7 +451,7 @@ def split_heuristic_region(self): if the split region is set fo [0.25, 0.75] and the split factor is 5, then optimization windows of size .1 will be considered within [0.25, 0.75]. - Note that even if this option is used, the specification of inputs and outputs is still + Note that even if this option is used, the specification of inputs and outputs is still with respect to the entire code; SLOTHY will automatically derive the outputs of the subregion configured here.""" if not self.split_heuristic: @@ -388,21 +470,22 @@ def split_heuristic_preprocess_naive_interleaving(self): optimization.""" if not self.split_heuristic: raise InvalidConfig("Did you forget to set config.split_heuristic=True? "\ - "Shouldn't read config.split_heuristic_preprocess_naive_interleaving otherwise.") + "Shouldn't read config.split_heuristic_preprocess_naive_interleaving otherwise.") return self._split_heuristic_preprocess_naive_interleaving @property def split_heuristic_preprocess_naive_interleaving_by_latency(self): - """If split heuristic with naive preprocessing is used, this option causes the naive interleaving - to be by latency-depth rather than latency.""" + """If split heuristic with naive preprocessing is used, this option causes + the naive interleaving to be by latency-depth rather than latency.""" if not self.split_heuristic: - raise InvalidConfig("Did you forget to set config.split_heuristic=True? "\ - "Shouldn't read config.split_heuristic_preprocess_naive_interleaving_by_latency otherwise.") + raise InvalidConfig("Did you forget to set config.split_heuristic=True? 
Shouldn't" \ + "read config.split_heuristic_preprocess_naive_interleaving_by_latency otherwise.") return self._split_heuristic_preprocess_naive_interleaving_by_latency - # TODO: Consider setting this to True unconditionally @property def flexible_lifetime_start(self): + """Internal property indicating whether the lifetime interval of a register + should be allowed to extend _before_ the instructions which uses it.""" return \ self.constraints.maximize_register_lifetimes or \ (self.sw_pipelining.enabled and self.sw_pipelining.allow_post) @@ -436,22 +519,6 @@ def copy(self): class SoftwarePipelining(NestedPrint, LockAttributes): """Subconfiguration for software pipelining""" - _default_enabled = False - _default_unroll = 1 - _default_pre_before_post = False - _default_allow_pre = True - _default_allow_post = False - _default_unknown_iteration_count = False - _default_minimize_overlapping = True - _default_optimize_preamble = True - _default_optimize_postamble = True - _default_max_overlapping = None - _default_min_overlapping = None - _default_halving_heuristic = False - _default_halving_heuristic_periodic = False - _default_halving_heuristic_split_only = False - _default_max_pre = 1.0 - @property def enabled(self): """Determines whether software pipelining should be enabled.""" @@ -464,14 +531,14 @@ def unroll(self): @property def pre_before_post(self): - """If both early and late instructions are allowed, force late instructions of iteration N - to come _before_ early instructions of iteration N+2.""" + """If both early and late instructions are allowed, force late instructions + of iteration N to come _before_ early instructions of iteration N+2.""" return self._pre_before_post @property def allow_pre(self): - """Allow 'early' instructions, that is, instructions that are pulled forward from iteration N+1 - to iteration N. A typical example would be an early load.""" + """Allow 'early' instructions, that is, instructions that are pulled forward + from iteration N+1 to iteration N. 
A typical example would be an early load.""" return self._allow_pre @property @@ -553,36 +620,21 @@ def max_pre(self): def __init__(self): super().__init__() - self._enabled = \ - Config.SoftwarePipelining._default_enabled - self._unroll = \ - Config.SoftwarePipelining._default_unroll - self._pre_before_post = \ - Config.SoftwarePipelining._default_pre_before_post - self._allow_pre = \ - Config.SoftwarePipelining._default_allow_pre - self._allow_post = \ - Config.SoftwarePipelining._default_allow_post - self._unknown_iteration_count = \ - Config.SoftwarePipelining._default_unknown_iteration_count - self._minimize_overlapping = \ - Config.SoftwarePipelining._default_minimize_overlapping - self._optimize_preamble = \ - Config.SoftwarePipelining._default_optimize_preamble - self._optimize_postamble = \ - Config.SoftwarePipelining._default_optimize_postamble - self._max_overlapping = \ - Config.SoftwarePipelining._default_max_overlapping - self._min_overlapping = \ - Config.SoftwarePipelining._default_min_overlapping - self._halving_heuristic = \ - Config.SoftwarePipelining._default_halving_heuristic - self._halving_heuristic_periodic = \ - Config.SoftwarePipelining._default_halving_heuristic_periodic - self._halving_heuristic_split_only = \ - Config.SoftwarePipelining._default_halving_heuristic_split_only - self._max_pre = \ - Config.SoftwarePipelining._default_max_pre + self.enabled = False + self.unroll = 1 + self.pre_before_post = False + self.allow_pre = True + self.allow_post = False + self.unknown_iteration_count = False + self.minimize_overlapping = True + self.optimize_preamble = True + self.optimize_postamble = True + self.max_overlapping = None + self.min_overlapping = None + self.halving_heuristic = False + self.halving_heuristic_periodic = False + self.halving_heuristic_split_only = False + self.max_pre = 1.0 self.lock() @@ -635,19 +687,6 @@ def max_pre(self,val): class Constraints(NestedPrint, LockAttributes): """Subconfiguration for performance constraints""" - _default_stalls_allowed = 0 - _default_stalls_maximum_attempt = 512 - _default_stalls_minimum_attempt = 0 - _default_stalls_precision = 0 - _default_stalls_timeout_below_precision = None - _default_stalls_first_attempt = 0 - - _default_model_latencies = True - _default_model_functional_units = True - _default_allow_reordering = True - _default_allow_renaming = True - _default_restricted_renaming = None - @property def stalls_allowed(self): """The number of stalls allowed. Internally, this is the number of NOP @@ -697,7 +736,7 @@ def stalls_first_attempt(self): def stalls_precision(self): """The precision of the binary search for the minimum number of stalls - Slothy will stop searching if it can narrow down the minimum number + SLOTHY will stop searching if it can narrow down the minimum number of stalls to an interval of the length provided by this variable. 
             In particular, a value of 1 means the true minimum is searched for."""
             if self.functional_only:
@@ -706,6 +745,9 @@ def stalls_precision(self):
             return self._stalls_precision
 
         @property
         def stalls_timeout_below_precision(self):
+            """If this variable is set to a non-None value, SLOTHY does not abort
+            optimization once binary search is operating on an interval smaller than
+            the stall precision, but instead sets a different (typically smaller) timeout."""
             return self._stalls_timeout_below_precision
 
         @property
@@ -746,10 +788,6 @@ def allow_renaming(self):
             in order to find the number of model violations in a piece of code."""
             return self._allow_renaming
 
-        @property
-        def restricted_renaming(self):
-            return self._restricted_renaming
-
         def __init__(self):
             super().__init__()
 
@@ -767,18 +805,17 @@ def __init__(self):
             self.minimize_use_of_extra_registers = None
             self.allow_extra_registers = {}
 
-            self._model_latencies = Config.Constraints._default_model_latencies
-            self._model_functional_units = Config.Constraints._default_model_functional_units
-            self._allow_reordering = Config.Constraints._default_allow_reordering
-            self._allow_renaming = Config.Constraints._default_allow_renaming
-            self._restricted_renaming = Config.Constraints._default_restricted_renaming
+            self._stalls_allowed = 0
+            self._stalls_maximum_attempt = 512
+            self._stalls_minimum_attempt = 0
+            self._stalls_precision = 0
+            self._stalls_timeout_below_precision = None
+            self._stalls_first_attempt = 0
 
-            self._stalls_allowed = Config.Constraints._default_stalls_allowed
-            self._stalls_maximum_attempt = Config.Constraints._default_stalls_maximum_attempt
-            self._stalls_minimum_attempt = Config.Constraints._default_stalls_minimum_attempt
-            self._stalls_first_attempt = Config.Constraints._default_stalls_first_attempt
-            self._stalls_precision = Config.Constraints._default_stalls_precision
-            self._stalls_timeout_below_precision = Config.Constraints._default_stalls_timeout_below_precision
+            self._model_latencies = True
+            self._model_functional_units = True
+            self._allow_reordering = True
+            self._allow_renaming = True
 
             self.lock()
 
@@ -812,9 +849,6 @@ def allow_reordering(self,val):
         @allow_renaming.setter
         def allow_renaming(self,val):
             self._allow_renaming = val
-        @restricted_renaming.setter
-        def restricted_renaming(self,val):
-            self._restricted_renaming = val
         @functional_only.setter
         def functional_only(self,val):
             if not val:
@@ -825,11 +859,6 @@ def functional_only(self,val):
     class Hints(NestedPrint, LockAttributes):
         """Subconfiguration for solver hints"""
 
-        _default_all_core = True
-        _default_order_hint_orig_order = False
-        _default_rename_hint_orig_rename = False
-        _default_ext_bsearch_remember_successes = False
-
         @property
         def all_core(self):
             """When SW pipelining is used, hint that all instructions
@@ -850,15 +879,20 @@ def rename_hint_orig_rename(self):
 
         @property
         def ext_bsearch_remember_successes(self):
+            """When using an external binary search, hint the previously successful
+            optimization.
+
+            See also Config.variable_size."""
             return self._ext_bsearch_remember_successes
 
         def __init__(self):
             super().__init__()
 
-            self._all_core = Config.Hints._default_all_core
-            self._order_hint_orig_order = Config.Hints._default_order_hint_orig_order
-            self._rename_hint_orig_rename = Config.Hints._default_rename_hint_orig_rename
-            self._ext_bsearch_remember_successes = Config.Hints._default_ext_bsearch_remember_successes
+            self._all_core = True
+            self._order_hint_orig_order = False
+            self._rename_hint_orig_rename = False
+            self._ext_bsearch_remember_successes = False
+
             self.lock()
 
         @all_core.setter
@@ -881,14 +915,7 @@ def __init__(self, Arch, Target):
         self._constraints = Config.Constraints()
         self._hints = Config.Hints()
 
-        # NOTE: - This saves us from having to do a binary search for the minimum
-        #         number of stalls ourselves, but it seems to slow down the tool
-        #         significantly!
-        #       - It also disables the minimization of instruction overlapping
-        #         in loop mode.
-        #
-        # Rather keep it off for now...
-        self.variable_size = False
+        self._variable_size = False
 
         self._register_aliases = {}
         self._outputs = set()
@@ -900,40 +927,38 @@ def __init__(self, Arch, Target):
         self._locked_registers = []
         self._reserved_regs = None
 
-        self.selfcheck = True # Check that that resulting code reordering constitutes an isomorphism of computation flow graphs
-
-        self.allow_useless_instructions = False
-
-        self._split_heuristic = Config._default_split_heuristic
-        self._split_heuristic_region = Config._default_split_heuristic_region
-        self._split_heuristic_factor = Config._default_split_heuristic_factor
-        self._split_heuristic_abort_cycle_at = Config._default_split_heuristic_abort_cycle_at
-        self._split_heuristic_stepsize = Config._default_split_heuristic_stepsize
-        self._split_heuristic_optimize_seam = Config._default_split_heuristic_optimize_seam
-        self._split_heuristic_chunks = Config._default_split_heuristic_chunks
-        self._split_heuristic_bottom_to_top = Config._default_split_heuristic_bottom_to_top
-        self._split_heuristic_repeat = Config._default_split_heuristic_repeat
-        self._split_heuristic_preprocess_naive_interleaving = \
-            Config._default_split_heuristic_preprocess_naive_interleaving
-        self._split_heuristic_preprocess_naive_interleaving_by_latency = \
-            Config._default_split_heuristic_preprocess_naive_interleaving_by_latency
-        self._split_heuristic_optimize_seam = Config._default_split_heuristic_optimize_seam
-
-        self._unsafe_skip_address_fixup = Config._default_unsafe_skip_address_fixup
-
-        self._with_preprocessor = Config._default_with_preprocessor
-        self._compiler_binary = Config._default_compiler_binary
-        self._max_solutions = Config._default_max_solutions
-        self._timeout = Config._default_timeout
-        self._retry_timeout = Config._default_retry_timeout
-        self._ignore_objective = Config._default_ignore_objective
-        self._objective_precision = Config._default_objective_precision
+        self._selfcheck = True
+        self._allow_useless_instructions = False
+
+        self._split_heuristic = False
+        self._split_heuristic_region = [0.0,1.0]
+        self._split_heuristic_chunks = False
+        self._split_heuristic_optimize_seam = 0
+        self._split_heuristic_bottom_to_top = False
+        self._split_heuristic_factor = 2
+        self._split_heuristic_abort_cycle_at = None
+        self._split_heuristic_stepsize = None
+        self._split_heuristic_repeat = 1
+        self._split_heuristic_preprocess_naive_interleaving = False
+        self._split_heuristic_preprocess_naive_interleaving_by_latency = False
+
+        self._compiler_binary = "gcc"
+
+        self.keep_tags = True
+        self.ignore_tags = False
+
+        self._do_address_fixup = True
+
+        self._with_preprocessor = False
+        self._max_solutions = 64
+        self._timeout = None
+        self._retry_timeout = None
+        self._ignore_objective = False
+        self._objective_precision = 0
 
         # Visualization
         self.indentation = 8
         self.visualize_reordering = True
-        self._split_heuristic_visualize_stalls = False
-        self._split_heuristic_visualize_units = False
 
         self.placeholder_char = '.'
         self.early_char = 'e'
@@ -984,6 +1009,15 @@ def _check_rename_config(self, lst):
     @reserved_regs.setter
     def reserved_regs(self,val):
         self._reserved_regs = val
+    @variable_size.setter
+    def variable_size(self,val):
+        self._variable_size = val
+    @selfcheck.setter
+    def selfcheck(self,val):
+        self._selfcheck = val
+    @allow_useless_instructions.setter
+    def allow_useless_instructions(self,val):
+        self._allow_useless_instructions = val
     @locked_registers.setter
     def locked_registers(self,val):
         self._locked_registers = val
@@ -1002,9 +1036,15 @@ def timeout(self, val):
     @retry_timeout.setter
     def retry_timeout(self, val):
         self._retry_timeout = val
-    @unsafe_skip_address_fixup.setter
-    def unsafe_skip_address_fixup(self, val):
-        self._unsafe_skip_address_fixup = val
+    @keep_tags.setter
+    def keep_tags(self, val):
+        self._keep_tags = val
+    @ignore_tags.setter
+    def ignore_tags(self, val):
+        self._ignore_tags = val
+    @do_address_fixup.setter
+    def do_address_fixup(self, val):
+        self._do_address_fixup = val
     @ignore_objective.setter
     def ignore_objective(self, val):
         self._ignore_objective = val
@@ -1032,12 +1072,6 @@ def split_heuristic_optimize_seam(self, val):
     @split_heuristic_bottom_to_top.setter
     def split_heuristic_bottom_to_top(self, val):
         self._split_heuristic_bottom_to_top = val
-    @split_heuristic_visualize_stalls.setter
-    def split_heuristic_visualize_stalls(self, val):
-        self._split_heuristic_visualize_stalls = val
-    @split_heuristic_visualize_units.setter
-    def split_heuristic_visualize_units(self, val):
-        self._split_heuristic_visualize_units = val
     @split_heuristic_region.setter
     def split_heuristic_region(self, val):
         self._split_heuristic_region = val
diff --git a/slothy/core.py b/slothy/core/core.py
similarity index 81%
rename from slothy/core.py
rename to slothy/core/core.py
index 7a9101e7..4a41844f 100644
--- a/slothy/core.py
+++ b/slothy/core/core.py
@@ -25,22 +25,26 @@
 # Author: Hanno Becker
 #
 
-import logging, ortools, math
-
+import logging
+import math
 from types import SimpleNamespace
 from copy import deepcopy
+from functools import cached_property
 
 from sympy import simplify
 
+import ortools
 from ortools.sat.python import cp_model
 
-from functools import cached_property
-
-from slothy.config import Config
-from slothy.helper import LockAttributes, AsmHelper, Permutation, DeferHandler
+from slothy.core.config import Config
+from slothy.helper import LockAttributes, Permutation, DeferHandler, SourceLine
 
-from slothy.dataflow import DataFlowGraph as DFG
-from slothy.dataflow import Config as DFGConfig
-from slothy.dataflow import InstructionOutput, InstructionInOut, ComputationNode
-from slothy.dataflow import SlothyUselessInstructionException
+from slothy.core.dataflow import DataFlowGraph as DFG
+from slothy.core.dataflow import Config as DFGConfig
+from slothy.core.dataflow import InstructionOutput, InstructionInOut, ComputationNode
+from slothy.core.dataflow import SlothyUselessInstructionException
+
+class SlothyException(Exception):
+    """Generic exception thrown by SLOTHY"""
 
 class Result(LockAttributes):
     """The results of a one-shot SLOTHY optimization run"""
 
@@ -67,16 +71,13 @@ def _gen_orig_code_visualized(self):
         def arr_width(arr):
             mi = min(arr)
-            ma = max(0,max(arr))
+            ma = max(0, max(arr)) # pylint:disable=nested-min-max
             return mi, ma-mi
 
         min_pos, width = arr_width(self.reordering.values())
-        if not self.config.constraints.functional_only:
-            min_pos_cycle, width_cycle = \
-                arr_width(self.cycle_position_with_bubbles.values())
 
-        yield ""
-        yield "// original source code"
+        yield SourceLine("")
+        yield SourceLine("").set_comment("original source code")
         for i in range(self.codesize):
             pos = self.reordering[i] - min_pos
             c = core_char
@@ -101,16 +102,11 @@ def arr_width(arr):
                 c_pos += self.codesize
             t_comment = ''.join(t_comment)
 
-            if not self.config.constraints.functional_only and \
-               self.config.target.issue_rate > 1:
-                cycle_pos = self.cycle_position_with_bubbles[i] - min_pos_cycle
-                t_comment_cycle = "|| " + (d * cycle_pos + c + d * (width_cycle - cycle_pos))
-            else:
-                t_comment_cycle = ""
-
-            yield f"// {self.orig_code[i]:{fixlen-3}s} // {t_comment} {t_comment_cycle}"
+            yield SourceLine("") \
+                .set_comment(f"{str(self.orig_code[i]):{fixlen-3}s}") \
+                .add_comment(t_comment)
 
-        yield ""
+        yield SourceLine("")
 
     @property
     def orig_code_visualized(self):
@@ -128,10 +124,19 @@ def orig_outputs(self):
 
     @property
     def codesize(self):
+        """The number of instructions in the (original and optimized) source code."""
         return len(self.orig_code)
 
     @property
     def codesize_with_bubbles(self):
+        """Performance measure for the optimized source code.
+
+        This is the number of issue slots used by the optimized code.
+        Equivalently, after division by the target's issue width, it is
+        SLOTHY's expectation of the performance of the code in cycles.
+
+        It is also the codomain of the xxx_with_bubbles dictionaries.
+        """
         return self._codesize_with_bubbles
     @codesize_with_bubbles.setter
     def codesize_with_bubbles(self, v):
@@ -140,6 +145,21 @@ def codesize_with_bubbles(self, v):
 
     @property
     def pre_core_post_dict(self):
+        """Dictionary indicating interleaving of iterations.
+
+        This dictionary consists of items (i, (pre, core, post)), where
+        i is the original program order position of an instruction, and
+        pre, core, post indicate whether that instruction is an early,
+        core or late instruction in the optimized source code.
+
+        An early instruction is one which is pulled into the previous iteration.
+        A late instruction is one which is deferred until the next iteration.
+        A core instruction is one which is left in its original iteration.
+
+        This property is only meaningful when software pipelining is enabled.
+
+        See also is_pre, is_core, is_post.
+ """ self._require_sw_pipelining() return self._pre_core_post_dict @pre_core_post_dict.setter @@ -290,7 +310,7 @@ def get_periodic_reordering(self, copies): vals = list(t.values()) vals.sort() res = { i : vals.index(v) for (i,v) in t.items() } - assert (Permutation.is_permutation(res, copies * self.codesize)) + assert Permutation.is_permutation(res, copies * self.codesize) return res def get_periodic_reordering_inv(self, copies): @@ -323,15 +343,15 @@ def get_fully_unrolled_loop(self, iterations): self._require_sw_pipelining() assert iterations > self.num_exceptional_iterations kernel_copies = iterations - self.num_exceptional_iterations - new_source = '\n'.join(self._preamble + - ( self._code * kernel_copies ) + - self._postamble ) - old_source = '\n'.join(self._orig_code * iterations) + new_source = (self._preamble + + (self._code * kernel_copies) + + self._postamble ) + old_source = self._orig_code * iterations return old_source, new_source def get_unrolled_kernel(self, iterations): self._require_sw_pipelining() - return '\n'.join(self._code * iterations) + return self._code * iterations @cached_property def reordering(self): @@ -344,6 +364,7 @@ def periodic_reordering_with_bubbles(self): @cached_property def periodic_reordering_with_bubbles_inv(self): + """The inverse dictionary to periodic_reordering_with_bubbles""" return self.get_periodic_reordering_with_bubbles_inv(1) @cached_property @@ -352,6 +373,7 @@ def periodic_reordering(self): @cached_property def periodic_reordering_inv(self): + """The inverse permutation to periodic_reordering""" res = self.get_periodic_reordering_inv(1) assert Permutation.is_permutation(res, self.codesize) return res @@ -362,9 +384,14 @@ def reordering_inv(self): return { v : k for k,v in self.reordering.items() } @property + def code_raw(self): + """Optimized code, without annotations""" + return self._code + @property def code(self): """The optimized source code""" code = self._code + assert SourceLine.is_source(code) ri = self.periodic_reordering_with_bubbles_inv if not self.config.visualize_reordering: return code @@ -380,8 +407,10 @@ def _gen_visualized_code(): for i in range(self.codesize_with_bubbles): p = ri.get(i, None) if p is None: - gapstr = "// gap" - yield f"{gapstr:{fixlen}s} // {d * self.codesize}" + gap_str = "gap" + yield SourceLine("") \ + .set_comment(f"{gap_str:{fixlen-3}s}") \ + .add_comment(d * self.codesize) continue s = code[self.periodic_reordering[p]] c = core_char @@ -389,8 +418,8 @@ def _gen_visualized_code(): c = early_char elif self.is_post(p): c = late_char - comment = d * p + c + d * (self.codesize - p - 1) - yield f"{s:{fixlen}s} // {comment}" + vis = d * p + c + d * (self.codesize - p - 1) + yield s.copy().set_length(fixlen).set_comment(vis) res = list(_gen_visualized_code()) res += self.orig_code_visualized @@ -398,17 +427,18 @@ def _gen_visualized_code(): return res @code.setter def code(self, val): + assert SourceLine.is_source(val) self._code = val - def get_full_code(self, log): + def _get_full_code(self, log): if self.config.sw_pipelining.enabled: # Unroll the loop a fixed number of times iterations = 5 old_source, new_source = self.get_fully_unrolled_loop(iterations) reordering = self.get_reordering(iterations, no_gaps=True) else: - old_source = '\n'.join(self.orig_code) - new_source = '\n'.join(self.code) + old_source = self.orig_code + new_source = self.code reordering = self.reordering.copy() iterations = 1 @@ -417,8 +447,8 @@ def get_full_code(self, log): dfg_old_log = log.getChild("dfg_old") dfg_new_log = 
log.getChild("dfg_new") - SlothyBase.dump(f"Old code ({iterations} copies)", old_source, dfg_old_log) - SlothyBase.dump(f"New code ({iterations} copies)", new_source, dfg_new_log) + SourceLine.log(f"Old code ({iterations} copies)", old_source, dfg_old_log) + SourceLine.log(f"New code ({iterations} copies)", new_source, dfg_new_log) tree_old = DFG(old_source, dfg_old_log, DFGConfig(self.config, outputs=self.orig_outputs)) @@ -437,14 +467,59 @@ def selfcheck(self, log): try: res = self._selfcheck_core(log) except SlothyUselessInstructionException as exc: - raise SlothySelfCheckException("Useless instruction detected during selfcheck: FAIL!") from exc + raise SlothySelfCheckException("Useless instruction detected during selfcheck: FAIL!")\ + from exc if self.config.selfcheck and not res: raise SlothySelfCheckException("Isomorphism between computation flow graphs: FAIL!") return res + def selfcheck_with_fixup(self, log): + """Do selfcheck, and consider preamble/postamble fixup in case of SW pipelining + + In the presence of cross iteration dependencies, the preamble and postamble + may be functionally incorrect and need fixup.""" + + # We gather the log output of the initial selfcheck and only release + # it (a) on success, or (b) when even the selfcheck after fixup fails. + + defer_handler = DeferHandler() + log.propagate = False + log.addHandler(defer_handler) + + try: + retry = not self.selfcheck(log) + exception = None + except SlothySelfCheckException as e: + exception = e + + log.propagate = True + log.removeHandler(defer_handler) + + if exception and self.config.sw_pipelining.enabled: + retry = True + elif exception: + # We don't expect a failure if there are no cross-iteration dependencies + defer_handler.forward(log) + raise e + + if not retry: + # On success, show the log output + defer_handler.forward(log) + else: + log.info("Selfcheck failed! This sometimes happens in the presence "\ + "of cross-iteration dependencies. 
Try fixup...") + self.fixup_preamble_postamble(log.getChild("fixup_preamble_postamble")) + + try: + self.selfcheck(log.getChild("after_fixup")) + except SlothySelfCheckException as e: + log.error("Here is the output of the original selfcheck before fixup") + defer_handler.forward(log) + raise e + def _selfcheck_core(self, log): _, old_source, new_source, tree_old, tree_new, reordering = \ - self.get_full_code(log) + self._get_full_code(log) edges_old = tree_old.edges() edges_new = tree_new.edges() @@ -461,9 +536,9 @@ def _selfcheck_core(self, log): def apply_reordering(x): src,dst,lbl=x if not src in reordering.keys(): - raise Exception(f"Source ID {src} not in remapping {reordering.items()}") + raise SlothyException(f"Source ID {src} not in remapping {reordering.items()}") if not dst in reordering: - raise Exception(f"Destination ID {dst} not in remapping {reordering.items()}") + raise SlothyException(f"Destination ID {dst} not in remapping {reordering.items()}") return (reordering[src], reordering[dst], lbl) edges_old_remapped = set(map(apply_reordering, edges_old)) @@ -480,8 +555,8 @@ def apply_reordering(x): log.error("Input/Output renaming") log.error(reordering) - SlothyBase.dump("old code", old_source, log, err=True) - SlothyBase.dump("new code", new_source, log, err=True) + SourceLine.log("old code", old_source, log, err=True) + SourceLine.log("new code", new_source, log, err=True) new_not_old = [e for e in edges_new if e not in edges_old_remapped] old_not_new = [e for e in edges_old_remapped if e not in edges_new] @@ -500,18 +575,6 @@ def apply_reordering(x): log.error(f"New ({src_idx}:{src})"\ f"---{lbl}--->({dst_idx}:{dst}) not present in old graph") - src_idx_old = reordering_inv[src_idx] - dst_idx_old = reordering_inv[dst_idx] - src_old = tree_old.nodes_by_id[src_idx_old] - dst_old = tree_old.nodes_by_id[dst_idx_old] - log.error(f"Instructions in old graph: {src_old}, {dst_old}") - deps = [(s,d,l) for (s, d, l) in edges_old if s==src_idx_old and d==dst_idx_old] - if len(deps) > 0: - for (s,d,l) in deps: - log.error(f"Edge: {src_old} --{l}--> {dst_old}") - else: - log.error("No dependencies in old graph!") - for (src_idx,dst_idx,lbl) in old_not_new: src_idx_old = reordering_inv[src_idx] dst_idx_old = reordering_inv[dst_idx] @@ -520,21 +583,6 @@ def apply_reordering(x): log.error(f"Old ({src_old})[id:{src_idx_old}]"\ f"---{lbl}--->{dst_old}[id:{dst_idx_old}] not present in new graph") - src = tree_new.nodes_by_id.get(src_idx, None) - dst = tree_new.nodes_by_id.get(dst_idx, None) - - if src is not None and dst is not None: - log.error(f"Instructions in new graph: {src} --> {dst}") - deps = [(s,d,l) for (s,d,l) in edges_new if s==src_idx and d==dst_idx] - if len(deps) > 0: - for (s, d, l) in deps: - log.error(f"Edge: {src} --{l}--> {dst}") - else: - log.error("No dependencies in new graph!") - else: - log.error(f"Indices {src_idx} ({src}) and {dst_idx} ({dst})" - "don't both exist in new DFG?") - log.error("Isomorphism between computation flow graphs: FAIL!") return False @@ -573,6 +621,10 @@ def output_renamings(self, v): @property def stalls(self): + """The number of stalls in the optimization result. 
+ + More precisely: The number of cycles c such that optimization succeeded with + up to c * issue_width unused issue slots.""" return self._stalls @stalls.setter def stalls(self, v): @@ -585,6 +637,8 @@ def _build_stalls_idxs(self): self.reordering_with_bubbles.values() } @property def stall_positions(self): + """The positions of instructions in the optimized assembly where SLOTHY + expects a stall or unused issue slot.""" if self._stalls_idxs is None: self._build_stalls_idxs() return self._stalls_idxs @@ -607,23 +661,19 @@ def kernel_input_output(self, val): self._kernel_input_output = val @property def preamble(self): - """When using software pipelining, the preamble to the loop kernel of the optimized loop.""" + """When using software pipelining, the preamble of the optimized loop.""" self._require_sw_pipelining() return self._preamble @preamble.setter def preamble(self, val): - # For now, double-check that we never set the preamble twice - # assert self._preamble is None self._preamble = val @property def postamble(self): - """When using software pipelining, the postamble to the loop kernel of the optimized loop.""" + """When using software pipelining, the postamble of the optimized loop.""" self._require_sw_pipelining() return self._postamble @postamble.setter def postamble(self, val): - # For now, double-check that we never set the preamble twice - # assert self._postamble is None self._postamble = val @property @@ -635,7 +685,7 @@ def config(self): def success(self): """Whether the optimization was successful""" if not self._valid: - raise Exception("Querying not-yet-populated result object") + raise SlothyException("Querying not-yet-populated result object") return self._success def __bool__(self): return self.success @@ -647,16 +697,236 @@ def success(self, val): @property def valid(self): + """Indicates whether the result object is valid.""" return self._valid - @valid.setter def valid(self, val): self._valid = val def _require_sw_pipelining(self): if not self.config.sw_pipelining.enabled: - raise Exception("Asking for SW-pipelining attribute in result of SLOTHY run" - " without SW pipelining") + raise SlothyException("Asking for SW-pipelining attribute in result " + "of SLOTHY run without SW pipelining") + + @staticmethod + def _fixup_reordered_pair(t0, t1, logger): + + def inst_changes_addr(inst): + return inst.increment is not None + + if not t0.inst.is_load_store_instruction(): + return + if not t1.inst.is_load_store_instruction(): + return + if not t0.inst.addr == t1.inst.addr: + return + if inst_changes_addr(t0.inst) and inst_changes_addr(t1.inst): + logger.error( "======================= ERROR ===============================") + logger.error(f" Cannot handle reordering of two instructions ({t0} and {t1}) ") + logger.error( " which both want to modify the same address ") + logger.error( "=================================================================") + raise SlothyException("Address fixup failure") + + if inst_changes_addr(t0.inst): + # t1 gets reordered before t0, which changes the address + # Adjust t1's address accordingly + logger.debug(f"{t0} moved after {t1}, bumping {t1.fixup} by {t0.inst.increment}, " + f"to {t1.fixup + int(simplify(t0.inst.increment))}") + t1.fixup += int(simplify(t0.inst.increment)) + elif inst_changes_addr(t1.inst): + # t0 gets reordered after t1, which changes the address + # Adjust t0's address accordingly + logger.debug(f"{t1} moved before {t0}, lowering {t0.fixup} by {t1.inst.increment}, " + f"to {t0.fixup - 
int(simplify(t1.inst.increment))}") + t0.fixup -= int(simplify(t1.inst.increment)) + + @staticmethod + def _fixup_reset(nodes): + for t in nodes: + t.fixup = 0 + + @staticmethod + def _fixup_finish(nodes, logger): + def inst_changes_addr(inst): + return inst.increment is not None + + for t in nodes: + if not t.inst.is_load_store_instruction(): + continue + if inst_changes_addr(t.inst): + continue + if t.fixup == 0: + continue + if t.inst.pre_index: + t.inst.pre_index = f"(({t.inst.pre_index}) + ({t.fixup}))" + else: + t.inst.pre_index = f"{t.fixup}" + logger.debug(f"Fixed up instruction {t.inst} by {t.fixup}, to {t.inst}") + + def _offset_fixup_sw(self, log): + n, _, _, _, tree_new, reordering = self._get_full_code(log) + iterations = n // self.codesize + + Result._fixup_reset(tree_new.nodes) + for _, _, ni, nj in Permutation.iter_swaps(reordering, n): + Result._fixup_reordered_pair(tree_new.nodes[ni], tree_new.nodes[nj], log) + Result._fixup_finish(tree_new.nodes, log) + + preamble_len = len(self.preamble) + postamble_len = len(self.postamble) + + assert n // iterations == self.codesize + + preamble_new = list(map(ComputationNode.to_source_line, tree_new.nodes[:preamble_len])) + postamble_new = [ ComputationNode.to_source_line(t) + for t in tree_new.nodes[-postamble_len:] ] \ + if postamble_len > 0 else [] + + code_new = [] + for i in range(iterations - self.num_exceptional_iterations): + code_new.append([ ComputationNode.to_source_line(t) for t in + tree_new.nodes[preamble_len + i*self.codesize: + preamble_len + (i+1)*self.codesize] ]) + + # Flag if address fixup makes the kernel instable. In this case, we'd have to + # widen preamble and postamble, but this is not yet implemented. + count = 0 + for i, (kcur, knext) in enumerate(zip(code_new, code_new[1:])): + if SourceLine.write_multiline(kcur) != SourceLine.write_multiline(knext): + count += 1 + if count != 0: + raise SlothyException("Instable loop kernel after post-optimization address fixup") + code_new = code_new[0] + + self.preamble = preamble_new + self.postamble = postamble_new + self.code = code_new + + def _offset_fixup_straightline(self, log): + n, _, _, _, tree_new, reordering = self._get_full_code(log) + + Result._fixup_reset(tree_new.nodes) + for _, _, ni, nj in Permutation.iter_swaps(reordering, n): + Result._fixup_reordered_pair(tree_new.nodes[ni], tree_new.nodes[nj], log) + Result._fixup_finish(tree_new.nodes, log) + + self.code = [ ComputationNode.to_source_line(t) for t in tree_new.nodes ] + + def offset_fixup(self, log): + """Fixup address offsets after optimization""" + if self.config.sw_pipelining.enabled: + self._offset_fixup_sw(log) + else: + self._offset_fixup_straightline(log) + + def fixup_preamble_postamble(self, log): + """Potentially fix up the preamble and postamble + + When software pipelining is used in the context of a loop with cross-iteration dependencies, + the core optimization step might lead to functionally incorrect preamble and postamble. + This function checks if this is the case and fixes preamble and postamble, if necessary. 
+ """ + + #if not self._has_cross_iteration_dependencies(): + if not self.config.sw_pipelining.enabled: + return + + iterations = self.num_exceptional_iterations + assert iterations in [1,2] + + kernel = self.get_unrolled_kernel(iterations=iterations) + + perm = self.periodic_reordering_inv + assert Permutation.is_permutation(perm, self.codesize) + + dfgc_orig = DFGConfig(self.config, outputs=self.orig_outputs) + dfgc_kernel = DFGConfig(self.config, outputs=self.kernel_input_output) + + tree_orig = DFG(self.orig_code, log.getChild("orig"), dfgc_orig) + + def is_in_preamble(t): + if t.orig_pos is None: + return False + if iterations == 1: + return self.is_pre(t.orig_pos, original_program_order=False) + assert iterations == 2 + if t.orig_pos < self.codesize: + return self.is_pre(t.orig_pos, original_program_order=False) + return not self.is_post(t.orig_pos % self.codesize, + original_program_order=False) + + def is_in_postamble(t): + if t.orig_pos is None: + return False + if iterations == 1: + return not self.is_pre(t.orig_pos, original_program_order=False) + assert iterations == 2 + if t.orig_pos < self.codesize: + return not self.is_pre(t.orig_pos, original_program_order=False) + return self.is_post(t.orig_pos % self.codesize, + original_program_order=False) + + tree_kernel = DFG(kernel, log.getChild("ssa"), dfgc_kernel) + tree_kernel.ssa() + + # Go through early instructions that depend on an instruction from + # the previous iteration. Remap those dependencies as input dependencies. + for (consumer, producer, _, _) in tree_kernel.iter_dependencies(): + producer = producer.reduce() + if not (is_in_preamble(consumer) and not is_in_preamble(producer.src)): + continue + if producer.src.is_virtual: + continue + orig_pos = perm[producer.src.orig_pos % self.codesize] + assert isinstance(producer, InstructionOutput) + producer.src.inst.args_out[producer.idx] = \ + tree_orig.nodes[orig_pos].inst.args_out[producer.idx] + + # Update input and in-out register names + for t in tree_kernel.nodes_all: + for i, v in enumerate(t.src_in): + t.inst.args_in[i] = v.name() + for i, v in enumerate(t.src_in_out): + t.inst.args_in_out[i] = v.name() + + new_preamble = [ ComputationNode.to_source_line(t) + for t in tree_kernel.nodes if is_in_preamble(t) ] + self.preamble = new_preamble + SourceLine.log("New preamble", self.preamble, log) + + dfgc_preamble = DFGConfig(self.config, outputs=self.kernel_input_output) + dfgc_preamble.inputs_are_outputs = False + DFG(self.preamble, log.getChild("new_preamble"), dfgc_preamble) + + tree_kernel = DFG(kernel, log.getChild("ssa"), dfgc_kernel) + tree_kernel.ssa() + + # Go through non-early instructions that feed into an instruction from + # the next iteration. Remap those dependencies as input dependencies. 
+ for (consumer, producer, _, _) in tree_kernel.iter_dependencies(): + producer = producer.reduce() + if not (is_in_postamble(producer.src) and not is_in_postamble(consumer)): + continue + orig_pos = perm[producer.src.orig_pos % self.codesize] + assert isinstance(producer, InstructionOutput) + producer.src.inst.args_out[producer.idx] = \ + tree_orig.nodes[orig_pos].inst.args_out[producer.idx] + + # Update input and in-out register names + for t in tree_kernel.nodes_all: + for i, v in enumerate(t.src_in): + t.inst.args_in[i] = v.reduce().name() + for i, v in enumerate(t.src_in_out): + t.inst.args_in_out[i] = v.reduce().name() + + new_postamble = [ ComputationNode.to_source_line(t) + for t in tree_kernel.nodes if is_in_postamble(t) ] + self.postamble = new_postamble + SourceLine.log("New postamble", self.postamble, log) + + dfgc_postamble = DFGConfig(self.config, outputs=self.orig_outputs) + DFG(self.postamble, log.getChild("new_postamble"), dfgc_postamble) + def __init__(self, config): super().__init__() @@ -678,15 +948,15 @@ def __init__(self, config): self._kernel_input_output = None self._pre_core_post_dict = None self._codesize_with_bubbles = None + self._register_used = None self.lock() class SlothySelfCheckException(Exception): - pass + """Exception thrown upon selfcheck failures""" class SlothyBase(LockAttributes): - """Stateless core of SLOTHY -- - [S]uper ([L]azy) [O]ptimization of [T]ricky [H]andwritten assembl[Y] + """Stateless core of SLOTHY. This class is the technical heart of the package: It implements the conversion of a software optimization problem into a constraint solving @@ -712,10 +982,12 @@ def target(self): @property def result(self): + """The result object of the last optimization.""" return self._result @property def success(self): + """Indicates whether the last optimization succeeded.""" return self._result.success def __init__(self, Arch, Target, *, logger=None, config=None): @@ -752,7 +1024,7 @@ def _reset(self): def _set_timeout(self, timeout): if timeout is None: return - self.logger.info(f"Setting timeout of %d seconds...", timeout) + self.logger.info("Setting timeout of %d seconds...", timeout) self._model.cp_solver.parameters.max_time_in_seconds = timeout def optimize(self, source, prefix_len=0, suffix_len=0, log_model=None, retry=False): @@ -823,7 +1095,7 @@ def optimize(self, source, prefix_len=0, suffix_len=0, log_model=None, retry=Fal self._add_constraints_scheduling() self._add_constraints_lifetime_bounds() self._add_constraints_loop_optimization() - self._add_constraints_N_issue() + self._add_constraints_n_issue() self._add_constraints_dependency_order() self._add_constraints_latencies() self._add_constraints_register_renaming() @@ -841,11 +1113,13 @@ def optimize(self, source, prefix_len=0, suffix_len=0, log_model=None, retry=Fal self._result = Result(self.config) # Do the actual work - self.logger.info(f"Invoking external constraint solver ({self._describe_solver()}) ...") - self.result._success = self._solve() - if not retry and self.result._success: - self.logger.info(f"Booleans in result: {self._model.cp_solver.NumBooleans()}") - self.result._valid = True + self.logger.info("Invoking external constraint solver (%s) ...", self._describe_solver()) + self.result.success = self._solve() + self.result.valid = True + + if not retry and self.success: + self.logger.info("Booleans in result: %d", self._model.cp_solver.NumBooleans()) + if not self.success: return False @@ -853,20 +1127,21 @@ def optimize(self, source, prefix_len=0, suffix_len=0, 
log_model=None, retry=Fal return True def _load_source(self, source, prefix_len=0, suffix_len=0): + assert SourceLine.is_source(source) + # TODO: This does not belong here if self.config.sw_pipelining.enabled and \ ( prefix_len >0 or suffix_len > 0 ): - raise Exception("Invalid arguments") + raise SlothyException("Invalid arguments") - source = AsmHelper.reduce_source(source) - SlothyBase.dump("Source code", source, self.logger.input) + source = SourceLine.reduce_source(source) + SourceLine.log("Source code", source, self.logger.input) self._orig_code = source.copy() - source = '\n'.join(source) # Convert source code to computational flow graph if self.config.sw_pipelining.enabled: - source = source + '\n' + source + source = source + source self._model.tree = DFG(source, self.logger.getChild("dataflow"), DFGConfig(self.config)) @@ -906,7 +1181,7 @@ def _init_model_internals(self): def _usage_check(self): if self._num_optimization_passes > 0: - raise Exception("At the moment, SlothyBase should be used for one-shot optimizations") + raise SlothyException("SlothyBase should be used for one-shot optimizations") self._num_optimization_passes += 1 def _reg_is_architectural(self,reg,ty): @@ -935,7 +1210,7 @@ def static_renaming(conf_val, t): arch_str = "arch" if is_arch else "symbolic" if not isinstance(conf_val, dict): - raise Exception(f"Couldn't make sense of renaming configuration {conf_val}") + raise SlothyException(f"Couldn't make sense of renaming configuration {conf_val}") # Try to look up register in dictionary. There are three ways # it can be specified: Directly by name, via the "arch/symbolic" @@ -947,7 +1222,7 @@ def static_renaming(conf_val, t): val = val if val is not None else conf_val.get( "other" , None ) if val is None: - raise Exception( f"Register {reg} not present in renaming config {conf_val}") + raise SlothyException( f"Register {reg} not present in renaming config {conf_val}") # There are three choices for the value: # - "static" for static assignment, which will statically assign a value @@ -957,12 +1232,12 @@ def static_renaming(conf_val, t): if val == "static": canonical_static_assignment = reg if is_arch else None return True, canonical_static_assignment - elif val == "any": + if val == "any": return False, None - else: - if not self._reg_is_architectural(val,ty): - raise Exception(f"Invalid renaming configuration {val} for {reg}") - return True, val + + if not self._reg_is_architectural(val,ty): + raise SlothyException(f"Invalid renaming configuration {val} for {reg}") + return True, val def tag_input(t): static, val = static_renaming(self.config.rename_inputs, t) @@ -993,7 +1268,7 @@ def get_fresh_renaming_reg(ty): try: # Iterate statically renamed inputs/outputs which have not yet been assigned for v in inputs_tagged + outputs_tagged: - if v.static == False or v.reg is not None: + if v.static is False or v.reg is not None: continue v.reg = get_fresh_renaming_reg(v.ty) except OutOfRegisters as e: @@ -1055,20 +1330,6 @@ def _backup_original_code(self): for t in self._get_nodes(): t.inst_orig = deepcopy(t.inst) - @staticmethod - def dump(name, s, logger=None, err=False): - if err: - fun = logger.error - else: - fun = logger.debug - if isinstance(s,str): - s = s.splitlines() - if len(s) == 0: - return - fun(f"Dump: {name}") - for l in s: - fun(f"> {l}") - class CpSatSolutionCb(cp_model.CpSolverSolutionCallback): """A solution callback class represents objects that are alive during CP-SAT operation and equipped with a callback that is triggered every time CP-SAT finds 
a new solution. @@ -1100,124 +1361,6 @@ def solution_count(self): """The number of solutions found so far""" return self.__solution_count - @staticmethod - def _fixup_reordered_pair(t0, t1, logger, unsafe_skip_address_fixup=False): - - def inst_changes_addr(inst): - return inst.increment is not None - - if not t0.inst.is_load_store_instruction(): - return - if not t1.inst.is_load_store_instruction(): - return - if not t0.inst.addr == t1.inst.addr: - return - if inst_changes_addr(t0.inst) and inst_changes_addr(t1.inst): - if not unsafe_skip_address_fixup: - logger.error( "======================= ERROR ===============================") - logger.error(f" Cannot handle reordering of two instructions ({t0} and {t1}) ") - logger.error( " which both want to modify the same address ") - logger.error( "=================================================================") - raise Exception("Address fixup failure") - - logger.warning( "========================= WARNING ============================") - logger.warning(f" Cannot handle reordering of two instructions ({t0} and {t1}) ") - logger.warning( " which both want to modify the same address ") - logger.warning( " Skipping this -- you have to fix the address offsets manually ") - logger.warning( "==================================================================") - return - if inst_changes_addr(t0.inst): - # t1 gets reordered before t0, which changes the address - # Adjust t1's address accordingly - logger.debug(f"{t0} moved after {t1}, bumping {t1.fixup} by {t0.inst.increment}, " - f"to {t1.fixup + int(simplify(t0.inst.increment))}") - t1.fixup += int(simplify(t0.inst.increment)) - elif inst_changes_addr(t1.inst): - # t0 gets reordered after t1, which changes the address - # Adjust t0's address accordingly - logger.debug(f"{t1} moved before {t0}, lowering {t0.fixup} by {t1.inst.increment}, " - f"to {t0.fixup - int(simplify(t1.inst.increment))}") - t0.fixup -= int(simplify(t1.inst.increment)) - - @staticmethod - def _fixup_reset(nodes): - for t in nodes: - t.fixup = 0 - - @staticmethod - def _fixup_finish(nodes, logger): - def inst_changes_addr(inst): - return inst.increment is not None - - for t in nodes: - if not t.inst.is_load_store_instruction(): - continue - if inst_changes_addr(t.inst): - continue - if t.fixup == 0: - continue - if t.inst.pre_index: - t.inst.pre_index = f"(({t.inst.pre_index}) + ({t.fixup}))" - else: - t.inst.pre_index = f"{t.fixup}" - logger.debug(f"Fixed up instruction {t.inst} by {t.fixup}, to {t.inst}") - - def _offset_fixup_sw(self, log): - n, _, _, _, tree_new, reordering = self._result.get_full_code(log) - iterations = n // self._result.codesize - - SlothyBase._fixup_reset(tree_new.nodes) - for _, _, ni, nj in Permutation.iter_swaps(reordering, n): - SlothyBase._fixup_reordered_pair(tree_new.nodes[ni], tree_new.nodes[nj], log) - SlothyBase._fixup_finish(tree_new.nodes, log) - - preamble_len = len(self._result.preamble) - postamble_len = len(self._result.postamble) - - assert n // iterations == self._result.codesize - - preamble_new = [ str(t.inst) for t in tree_new.nodes[:preamble_len] ] - postamble_new = [ str(t.inst) for t in tree_new.nodes[-postamble_len:] ] \ - if postamble_len > 0 else [] - - code_new = [] - for i in range(iterations - self._result.num_exceptional_iterations): - code_new.append([ str(t.inst) for t in - tree_new.nodes[preamble_len + i*self._result.codesize: - preamble_len + (i+1)*self._result.codesize] ]) - - # Flag if address fixup makes the kernel instable. 
In this case, we'd have to - # widen preamble and postamble, but this is not yet implemented. - count = 0 - for i, (kcur, knext) in enumerate(zip(code_new, code_new[1:])): - if kcur != knext: - count += 1 - if count != 0: - raise Exception("Instable loop kernel after post-optimization address fixup") - code_new = code_new[0] - - self._result.preamble = preamble_new - self._result.postamble = postamble_new - self._result.code = code_new - - def _offset_fixup_straightline(self, log): - n, _, _, _, tree_new, reordering = self._result.get_full_code(log) - - SlothyBase._fixup_reset(tree_new.nodes) - for _, _, ni, nj in Permutation.iter_swaps(reordering, n): - SlothyBase._fixup_reordered_pair(tree_new.nodes[ni], tree_new.nodes[nj], log) - SlothyBase._fixup_finish(tree_new.nodes, log) - - self._result.code = [ str(t.inst) for t in tree_new.nodes ] - - def offset_fixup(self): - """Fixup address offsets after optimization""" - log = self.logger.getChild("offset_fixup") - if self.config.sw_pipelining.enabled: - self._offset_fixup_sw(log) - else: - self._offset_fixup_straightline(log) - def fixup_preamble_postamble(self): """Potentially fix up the preamble and postamble @@ -1233,9 +1376,7 @@ def fixup_preamble_postamble(self): log = self.logger.getChild("fixup_preamble_postamble") iterations = self._result.num_exceptional_iterations - assert iterations == 1 or iterations == 2 - - n = self._result.codesize * iterations + assert iterations in [1,2] kernel = self._result.get_unrolled_kernel(iterations=iterations) @@ -1252,24 +1393,26 @@ def is_in_preamble(t): return False if iterations == 1: return self._result.is_pre(t.orig_pos, original_program_order=False) - elif iterations == 2: - if t.orig_pos < self._result.codesize: - return self._result.is_pre(t.orig_pos, original_program_order=False) - else: - return not self._result.is_post(t.orig_pos % self._result.codesize, - original_program_order=False) + + assert iterations == 2 + if t.orig_pos < self._result.codesize: + return self._result.is_pre(t.orig_pos, original_program_order=False) + + return not self._result.is_post(t.orig_pos % self._result.codesize, + original_program_order=False) def is_in_postamble(t): if t.orig_pos is None: return False if iterations == 1: return not self._result.is_pre(t.orig_pos, original_program_order=False) - elif iterations == 2: - if t.orig_pos < self._result.codesize: - return not self._result.is_pre(t.orig_pos, original_program_order=False) - else: - return self._result.is_post(t.orig_pos % self._result.codesize, - original_program_order=False) + + assert iterations == 2 + if t.orig_pos < self._result.codesize: + return not self._result.is_pre(t.orig_pos, original_program_order=False) + + return self._result.is_post(t.orig_pos % self._result.codesize, + original_program_order=False) tree_kernel = DFG(kernel, log.getChild("ssa"), dfgc_kernel) tree_kernel.ssa() @@ -1294,9 +1437,10 @@ def is_in_postamble(t): for i, v in enumerate(t.src_in_out): t.inst.args_in_out[i] = v.name() - new_preamble = [ str(t.inst) for t in tree_kernel.nodes if is_in_preamble(t) ] + new_preamble = [ ComputationNode.to_source_line(t) + for t in tree_kernel.nodes if is_in_preamble(t) ] self._result.preamble = new_preamble - SlothyBase.dump("New preamble", self._result.preamble, log) + SourceLine.log("New preamble", self._result.preamble, log) dfgc_preamble = DFGConfig(self.config, outputs=self._result.kernel_input_output) dfgc_preamble.inputs_are_outputs = False @@ -1323,9 +1467,10 @@ def is_in_postamble(t): for i, v in enumerate(t.src_in_out): 
t.inst.args_in_out[i] = v.reduce().name() - new_postamble = [ str(t.inst) for t in tree_kernel.nodes if is_in_postamble(t) ] + new_postamble = [ ComputationNode.to_source_line(t) + for t in tree_kernel.nodes if is_in_postamble(t) ] self._result.postamble = new_postamble - SlothyBase.dump("New postamble", self._result.postamble, log) + SourceLine.log("New postamble", self._result.postamble, log) dfgc_postamble = DFGConfig(self.config, outputs=self._result.orig_outputs) DFG(self._result.postamble, log.getChild("new_postamble"), dfgc_postamble) @@ -1341,48 +1486,8 @@ def _extract_result(self): self._extract_input_output_renaming() self._extract_code() - - # In the presence of cross iteration dependencies, the preamble and postamble - # may be functionally incorrect and need fixup. - # We therefore gather the log output of the initial selfcheck and only release - # it (a) on success, or (b) when even the selfcheck after fixup fails. - - log = self.logger.getChild("selfcheck") - defer_handler = DeferHandler() - log.propagate = False - log.addHandler(defer_handler) - - try: - retry = not self._result.selfcheck(log) - exception = None - except SlothySelfCheckException as e: - exception = e - - log.propagate = True - log.removeHandler(defer_handler) - - if exception and self._has_cross_iteration_dependencies(): - retry = True - elif exception: - # We don't expect a failure if there are no cross-iteration dependencies - defer_handler.forward(log) - raise e - - if not retry: - # On success, show the log output - defer_handler.forward(log) - else: - self.logger.info("Selfcheck failed! This sometimes happens in the presence of cross-iteration dependencies. Try fixup...") - self.fixup_preamble_postamble() - - try: - self._result.selfcheck(self.logger.getChild("selfcheck_after_fixup")) - except SlothySelfCheckException as e: - self.logger.error("Here is the output of the original selfcheck before fixup") - defer_handler.forward(log) - raise e - - self.offset_fixup() + self._result.selfcheck_with_fixup(self.logger.getChild("selfcheck")) + self._result.offset_fixup(self.logger.getChild("fixup")) def _extract_positions(self, get_value): @@ -1477,10 +1582,6 @@ def _extract_kernel_input_output(self): def _extract_code(self): - def add_indentation(src): - indentation = ' ' * self.config.indentation - src = [ indentation + s for s in src ] - def get_code(filter_func=None, top=False): if len(self._model.tree.nodes) == 0: return @@ -1494,7 +1595,7 @@ def get_code_line(line_no): t = self._model.tree.nodes[periodic_reordering_with_bubbles_inv[line_no]] if filter_func and not filter_func(t): return - yield str(t.inst) + yield ComputationNode.to_source_line(t) base = 0 lines = self._result.codesize_with_bubbles @@ -1510,42 +1611,54 @@ def get_code_line(line_no): preamble += list(get_code(filter_func=lambda t: t.pre, top=True)) if self._result.num_post > 0: preamble += list(get_code(filter_func=lambda t: not t.post)) - self._result.preamble = preamble postamble = [] if self._result.num_pre > 0: postamble += list(get_code(filter_func=lambda t: not t.pre, top=True)) if self._result.num_post > 0: postamble += list(get_code(filter_func=lambda t: t.post)) - self._result.postamble = postamble - self._result.code = list(get_code()) - self._extract_kernel_input_output() + kernel = list(get_code()) log = self.logger.result.getChild("sw_pipelining") log.debug("Kernel dependencies: %s", self._result.kernel_input_output) - SlothyBase.dump("Preamble", self._result.preamble, log) - SlothyBase.dump("Kernel", self._result.kernel, 
log) - SlothyBase.dump("Postamble", self._result.postamble, log) + SourceLine.log("Preamble", preamble, log) + SourceLine.log("Kernel", kernel, log) + SourceLine.log("Postamble", postamble, log) + + preamble = SourceLine.apply_indentation(preamble, self.config.indentation) + postamble = SourceLine.apply_indentation(postamble, self.config.indentation) + kernel = SourceLine.apply_indentation(kernel, self.config.indentation) - add_indentation(self._result.preamble) - add_indentation(self._result.kernel) - add_indentation(self._result.postamble) + if self.config.keep_tags is False: + SourceLine.drop_tags(preamble) + SourceLine.drop_tags(postamble) + SourceLine.drop_tags(kernel) + + self._result.preamble = preamble + self._result.postamble = postamble + self._result.code = kernel + + self._extract_kernel_input_output() else: - self._result.code = list(get_code()) + code = list(get_code()) + code = SourceLine.apply_indentation(code, self.config.indentation) + + if self.config.keep_tags is False: + SourceLine.drop_tags(code) + + self._result.code = code self.logger.result.debug("Optimized code") for s in self._result.code: - self.logger.result.debug("> " + s.strip()) - - add_indentation(self._result.code) + self.logger.result.debug("> " + str(s).strip()) if self.config.visualize_reordering: self._result._code += self._result.orig_code_visualized - def _add_path_constraint( self, consumer, producer, cb, force=False): + def _add_path_constraint( self, consumer, producer, cb): """Add model constraint cb() relating to the pair of producer-consumer instructions Outside of loop mode, this ignores producer and consumer, and just adds cb(). In loop mode, however, the condition has to be omitted in two cases: @@ -1607,16 +1720,15 @@ def _get_nodes_by_program_order(self, low=False, high=False, allnodes=False, inputs=False, outputs=False): if low: return self._model.tree.nodes_low - elif high: + if high: return self._model.tree.nodes_high - elif allnodes: + if allnodes: return self._model.tree.nodes_all - elif inputs: + if inputs: return self._model.tree.nodes_input - elif outputs: + if outputs: return self._model.tree.nodes_output - else: - return self._model.tree.nodes + return self._model.tree.nodes def _get_nodes_by_depth(self, **kwargs): return sorted(self._get_nodes_by_program_order(**kwargs), @@ -1625,8 +1737,7 @@ def _get_nodes_by_depth(self, **kwargs): def _get_nodes(self, by_depth=False, **kwargs): if by_depth: return self._get_nodes_by_depth(**kwargs) - else: - return self._get_nodes_by_program_order(**kwargs) + return self._get_nodes_by_program_order(**kwargs) # ================================================================ # VARIABLES (Instruction scheduling) # @@ -1694,7 +1805,7 @@ def _add_variables_functional_units(self): t.unique_unit = False t.exec_unit_choices = {} for unit_choices in units: - if type(unit_choices) != list: + if not isinstance(unit_choices, list): unit_choices = [unit_choices] for unit in unit_choices: unit_var = self._NewBoolVar(f"[{t.inst}].unit_choice.{unit}") @@ -1721,13 +1832,15 @@ def make_start_var(name=""): # When we optimize for longest register lifetimes, we allow the starting time of the # usage interval to be smaller than the program order position of the instruction. 
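        # (Sketch of the interval semantics, hedged: with flexible_lifetime_start,
        #  an output written at program position p may get lifetime_start <= p,
        #  so the usage interval [lifetime_start, lifetime_end] can stretch and
        #  the maximize_register_lifetimes objective can then maximize
        #  out_lifetime_duration over it.)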
if self.config.flexible_lifetime_start: - t.out_lifetime_start = [ make_start_var(f"{t.varname()}_out_{i}_lifetime_start") - for i in range(t.inst.num_out) ] - t.inout_lifetime_start = [ make_start_var(f"{t.varname()}_inout_{i}_lifetime_start") - for i in range(t.inst.num_in_out) ] + t.out_lifetime_start = [ + make_start_var(f"{t.varname()}_out_{i}_lifetime_start") + for i in range(t.inst.num_out) ] + t.inout_lifetime_start = [ + make_start_var(f"{t.varname()}_inout_{i}_lifetime_start") + for i in range(t.inst.num_in_out) ] else: - t.out_lifetime_start = [ t.program_start_var for i in range(t.inst.num_out) ] - t.inout_lifetime_start = [ t.program_start_var for i in range(t.inst.num_in_out) ] + t.out_lifetime_start = [ t.program_start_var for _ in range(t.inst.num_out) ] + t.inout_lifetime_start = [ t.program_start_var for _ in range(t.inst.num_in_out) ] t.out_lifetime_end = [ make_var(f"{t.varname()}_out_{i}_lifetime_end") for i in range(t.inst.num_out) ] @@ -1745,28 +1858,10 @@ def make_start_var(name=""): def _add_variables_register_renaming(self): """Add boolean variables indicating if an instruction uses a certain output register""" - def get_metric(t): - return int(t.id) // (max(t.depth,1)) - - if self.config.constraints.restricted_renaming is not None: - nodes_sorted_by_metric = [ t for t in self._get_nodes() ] # Refs only - nodes_sorted_by_metric.sort(key=get_metric) - start_idx = int(len(nodes_sorted_by_metric) * - self.config.constraints.restricted_renaming) - renaming_allowed_list = nodes_sorted_by_metric[start_idx:] - - def _allow_renaming(t): + def _allow_renaming(_): if not self.config.constraints.allow_renaming: return False - if self.config.constraints.restricted_renaming is None: - return True - if t.is_virtual: - return True - if t in renaming_allowed_list: - self.logger.info("Exceptionally allow renaming for %s, position %s, depth %d", - t, t.id, t.depth) - return True - return False + return True self.logger.debug("Adding variables for register allocation...") @@ -1787,7 +1882,8 @@ def _allow_renaming(t): self.logger.debug("- Output %s (%s)", arg_out, arg_ty) - # Locked output register aren't renamed, and neither are outputs of locked instructions. + # Locked output registers aren't renamed, and neither are + # outputs of locked instructions. 
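+            # Example (hedged): with config.locked_registers = ["x30"], any
+            # output named x30 keeps its name, while symbolic registers are
+            # still renamed regardless (see the checks just below).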
self.logger.debug("Locked registers: %s", self.config.locked_registers) is_locked = arg_out in self.config.locked_registers # Symbolic registers are always renamed @@ -1815,7 +1911,7 @@ def _allow_renaming(t): self.logger.error("Original candidates: %s", candidates) self.logger.error("Restricted candidates: %s", candidates_restricted) self.logger.error("Restrictions: %s", restrictions) - raise Exception() + raise SlothyException() self.logger.input.debug("Registers available for renaming of " f"[{t.inst}].{arg_out} ({t.orig_pos})") @@ -1877,7 +1973,7 @@ def add_arg_combination_vars(combinations, vs, name, t=t): ## Create intervals tracking the usage of registers for t in self._get_nodes(allnodes=True): - self.logger.debug(f"Create register usage intervals for {t}") + self.logger.debug("Create register usage intervals for %s", t) ivals = [] ivals += list(zip(t.inst.arg_types_out, t.alloc_out_var, @@ -1944,16 +2040,16 @@ def _iter_dependencies_with_lifetime(self): def _get_lifetime_start(src): if isinstance(src, InstructionOutput): return src.src.out_lifetime_start[src.idx] - elif isinstance(src, InstructionInOut): + if isinstance(src, InstructionInOut): return src.src.inout_lifetime_start[src.idx] - raise Exception("Unknown register source") + raise SlothyException("Unknown register source") def _get_lifetime_end(src): if isinstance(src, InstructionOutput): return src.src.out_lifetime_end[src.idx] - elif isinstance(src, InstructionInOut): + if isinstance(src, InstructionInOut): return src.src.inout_lifetime_end[src.idx] - raise Exception("Unknown register source") + raise SlothyException("Unknown register source") for (consumer, producer, ty, idx) in self._iter_dependencies(): start_var = _get_lifetime_start(producer) @@ -1997,8 +2093,9 @@ def _add_constraints_lifetime_bounds(self): # For every instruction depending on the output, add a lifetime bound for (consumer, producer, _, _, _, end_var, _) in \ self._iter_dependencies_with_lifetime(): - self._add_path_constraint(consumer, producer.src, lambda end_var=end_var, consumer=consumer: - self._Add(end_var >= consumer.program_start_var), force=True) + self._add_path_constraint(consumer, producer.src, + lambda end_var=end_var, consumer=consumer: + self._Add(end_var >= consumer.program_start_var)) # ================================================================ # CONSTRAINTS (Register allocation) # @@ -2040,7 +2137,9 @@ def _force_renaming_collision(self, var_dic_a, var_dic_b): def _force_allocation_restriction_single(self, valid_allocs, var_dict): for k,v in var_dict.items(): if k not in valid_allocs: - self._Add(v == False) + # Disabling pylint warning here since we're building a + # CP-SAT constraint here, rather than making a boolean comparison. + self._Add(v == False) # pylint:disable=singleton-comparison def _force_allocation_restriction_many(self, restriction_lst, var_dict_lst): for r, v in zip(restriction_lst, var_dict_lst, strict=True): @@ -2057,7 +2156,7 @@ def _add_constraints_register_renaming(self): if len(arr) > 0: self._model.AddMaxEquality(self._register_used[reg], arr) else: - self._Add(self._register_used[reg] == False) + self._Add(self._register_used[reg] is False) # Ensure that outputs are unambiguous for t in self._get_nodes(allnodes=True): @@ -2075,8 +2174,10 @@ def _add_constraints_register_renaming(self): t.alloc_in_out_combinations_vars) # Enforce individual input argument restrictions (for outputs this has already # been done at the time when we created the allocation variables). 
- self._force_allocation_restriction_many(t.inst.args_in_restrictions, t.alloc_in_var) - self._force_allocation_restriction_many(t.inst.args_in_out_restrictions, t.alloc_in_out_var) + self._force_allocation_restriction_many(t.inst.args_in_restrictions, + t.alloc_in_var) + self._force_allocation_restriction_many(t.inst.args_in_out_restrictions, + t.alloc_in_out_var) # Enforce exclusivity of arguments self._forbid_renaming_collision_many( t.inst.args_in_out_different, t.alloc_out_var, @@ -2090,10 +2191,10 @@ def find_out_node(t_in): c = list(filter(lambda t: t.inst.orig_reg == t_in.inst.orig_reg, self._model.tree.nodes_output)) if len(c) == 0: - raise Exception("Could not find matching output for input:" + + raise SlothyException("Could not find matching output for input:" + t_in.inst.orig_reg) if len(c) > 1: - raise Exception("Found multiple matching output nodes for input: " + + raise SlothyException("Found multiple matching output nodes for input: " + f"{t_in.inst.orig_reg}: {c}") return c[0] for t_in in self._model.tree.nodes_input: @@ -2123,9 +2224,28 @@ def _add_constraints_loop_optimization(self): self._AddExactlyOne([t.pre_var, t.post_var, t.core_var]) + # Check if source line was tagged pre/core/post + force_pre = t.inst.source_line.tags.get("pre", None) + force_core = t.inst.source_line.tags.get("core", None) + force_post = t.inst.source_line.tags.get("post", None) + if force_pre is not None: + assert force_pre is True or force_pre is False + self._Add(t.pre_var == force_pre) + self.logger.debug("Force pre=%s instruction for %s", force_pre, t.inst) + if force_core is not None: + assert force_core is True or force_core is False + self._Add(t.core_var == force_core) + self.logger.debug("Force core=%s instruction for %s", force_core, t.inst) + if force_post is not None: + assert force_post is True or force_post is False + self._Add(t.post_var == force_post) + self.logger.debug("Force post=%s instruction for %s", force_post, t.inst) + if not self.config.sw_pipelining.allow_pre: + # pylint:disable=singleton-comparison self._Add(t.pre_var == False) if not self.config.sw_pipelining.allow_post: + # pylint:disable=singleton-comparison self._Add(t.post_var == False) if self.config.hints.all_core: @@ -2136,8 +2256,9 @@ def _add_constraints_loop_optimization(self): # Allow early instructions only in a certain range if self.config.sw_pipelining.max_pre < 1.0 and self._is_low(t): relpos = t.orig_pos / len(self._get_nodes(low=True)) - if relpos < 1 and relpos > self.config.sw_pipelining.max_pre: - self._Add( t.pre_var == False ) + if self.config.sw_pipelining.max_pre < relpos < 1: + # pylint:disable=singleton-comparison + self._Add(t.pre_var == False) if self.config.sw_pipelining.pre_before_post: for t, s in [(t,s) for t in self._get_nodes(low=True) \ @@ -2154,14 +2275,17 @@ def _add_constraints_loop_optimization(self): # An instruction with forward dependency to the next iteration # cannot be an early instruction, and an instruction depending # on an instruction from a previous iteration cannot be late. 
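            # Illustration (hedged): for a cross-iteration edge
            #   producer (iteration i) ---> consumer (iteration i+1),
            # marking the producer as early or the consumer as late would move
            # it across the iteration boundary and reverse the dependency;
            # hence both variables are pinned to False below.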
+ + # pylint:disable=singleton-comparison self._Add(producer.src.pre_var == False) + # pylint:disable=singleton-comparison self._Add(consumer.post_var == False) # ================================================================ # CONSTRAINTS (Single issuing) # # ================================================================ - def _add_constraints_N_issue(self): + def _add_constraints_n_issue(self): self._AddAllDifferent([ t.program_start_var for t in self._get_nodes() ] ) if self.config.variable_size: @@ -2174,7 +2298,6 @@ def _add_constraints_N_issue(self): self._Add( t.program_start_var == t.cycle_start_var * self.target.issue_rate + t.slot_var ) - def _add_constraints_locked_ordering(self): def inst_changes_addr(inst): @@ -2203,11 +2326,52 @@ def _change_same_address(t0,t1): self._AddImplication( t0.pre_var, t1.post_var.Not() ) if _change_same_address(t0,t1): - self.logger.debug("Forbid reordering of (%s,%s) to avoid address fixup issues", t0, t1) + self.logger.debug("Forbid reordering of (%s,%s) to avoid address fixup issues", + t0, t1) self._add_path_constraint( t1, t0, lambda t0=t0, t1=t1: self._Add(t0.program_start_var < t1.program_start_var) ) + # Look for source annotations forcing orderings + + if self.config.sw_pipelining.enabled is True: + nodes = self._get_nodes(low=True) + else: + nodes = self._get_nodes() + + def find_node_by_source_id(src_id): + for t in nodes: + cur_id = t.inst.source_line.tags.get("id", None) + if cur_id == src_id: + return t + raise SlothyException(f"Could not find node with source ID {src_id}") + + for i, t1 in enumerate(nodes): + force_after = t1.inst.source_line.tags.get("after", []) + if not isinstance(force_after, list): + force_after = [force_after] + t0s = list(map(find_node_by_source_id, force_after)) + force_after_last = t1.inst.source_line.tags.get("after_last", False) + if force_after_last is True: + if i == 0: + # Ignore after_last tag for first instruction + continue + t0s.append(nodes[i-1]) + for t0 in t0s: + self.logger.info("Force %s < %s by source annotation", t0, t1) + self._add_path_constraint(t1, t0, + lambda t0=t0, t1=t1: self._Add(t0.program_start_var < t1.program_start_var)) + + for t0 in nodes: + force_before = t0.inst.source_line.tags.get("before", []) + if not isinstance(force_before, list): + force_before = [force_before] + for t1_id in force_before: + t1 = find_node_by_source_id(t1_id) + self.logger.info("Force %s < %s by source annotation", t0, t1) + self._add_path_constraint(t1, t0, + lambda t0=t0, t1=t1: self._Add(t0.program_start_var < t1.program_start_var)) + # ================================================================ # CONSTRAINTS (Single issuing) # # ================================================================ @@ -2303,11 +2467,19 @@ def _add_constraints_misc(self): self.target.add_further_constraints(self) def get_inst_pairs(self, cond=None): - if cond is None: - cond = lambda a,b: True + """Yields all instruction pairs satisfying the provided predicate. + + This can be useful for the specification of additional + microarchitecture-specific constraints. + + Args: + cond: Predicate on pairs of ComputationNode's. True by default. 
+ + Returns: + Generator of all instruction pairs satisfying the predicate.""" for t0 in self._model.tree.nodes: for t1 in self._model.tree.nodes: - if cond(t0,t1): + if cond is None or cond(t0,t1): yield (t0,t1) # ================================================================# @@ -2355,13 +2527,16 @@ def _add_constraints_loop_periodic(self): self._Add( t0.post_var == t1.post_var ) self._Add( t0.core_var == t1.core_var ) # Early - self._Add( t0.program_start_var == t1.program_start_var + self._model.program_padded_size_half )\ + self._Add(t0.program_start_var == \ + t1.program_start_var + self._model.program_padded_size_half) \ .OnlyEnforceIf(t0.pre_var) # Core - self._Add( t1.program_start_var == t0.program_start_var + self._model.program_padded_size_half )\ + self._Add(t1.program_start_var == \ + t0.program_start_var + self._model.program_padded_size_half) \ .OnlyEnforceIf(t0.core_var) # Late - self._Add( t0.program_start_var == t1.program_start_var + self._model.program_padded_size_half )\ + self._Add(t0.program_start_var == \ + t1.program_start_var + self._model.program_padded_size_half) \ .OnlyEnforceIf(t0.post_var) ## Register allocations must be the same assert t0.inst.arg_types_out == t1.inst.arg_types_out @@ -2379,7 +2554,7 @@ def _add_constraints_loop_periodic(self): for reg in t1_vars: v0 = t0.alloc_out_var[o][reg] v1 = t1.alloc_out_var[o][reg] - self._Add( v0 == v1 ) + self._Add(v0 == v1) def restrict_early_late_instructions(self, filter_func): """Forces all instructions not passing the filter_func to be `core`, that is, @@ -2387,7 +2562,7 @@ def restrict_early_late_instructions(self, filter_func): This is only meaningful if software pipelining is enabled.""" if not self.config.sw_pipelining.enabled: - raise Exception("restrict_early_late_instructions() only useful in SW pipelining mode") + raise SlothyException("restrict_early_late_instructions() only in SW pipelining mode") for t in self._get_nodes(): if filter_func(t.inst): @@ -2400,12 +2575,12 @@ def force_early(self, filter_func, early=True): This is only meaningful if software pipelining is enabled.""" if not self.config.sw_pipelining.enabled: - raise Exception("force_early() only useful in SW pipelining mode") + raise SlothyException("force_early() only useful in SW pipelining mode") invalid_pre = early and not self.config.sw_pipelining.allow_pre invalid_post = not early and not self.config.sw_pipelining.allow_post if invalid_pre or invalid_post: - raise Exception("Invalid SW pipelining configuration in force_early()") + raise SlothyException("Invalid SW pipelining configuration in force_early()") for t in self._get_nodes(): if filter_func(t.inst): @@ -2458,8 +2633,8 @@ def restrict_slots_for_instructions_by_class(self, cls_lst, slots): provided list of instruction classes. Args: - - cls_lst: A list of instruction classes - - slots: A list of issue slots represented as integers.""" + cls_lst: A list of instruction classes + slots: A list of issue slots represented as integers.""" self.restrict_slots_for_instructions( self.filter_instructions_by_class(cls_lst), slots ) @@ -2468,8 +2643,8 @@ def restrict_slots_for_instructions_by_property(self, filter_func, slots): filter function. 
Args: - - cls_lst: A predicate on instructions - - slots: A list of issue slots represented as integers.""" + filter_func: A predicate on instructions + slots: A list of issue slots represented as integers.""" self.restrict_slots_for_instructions( self.filter_instructions_by_property(filter_func), slots ) @@ -2497,7 +2672,8 @@ def _add_objective(self, force_objective=False): name = "minimize iteration overlapping" elif self.config.constraints.maximize_register_lifetimes: name = "maximize register lifetimes" - maxlist = [ v for t in self._get_nodes(allnodes=True) for v in t.out_lifetime_duration ] + maxlist = [ v for t in self._get_nodes(allnodes=True) + for v in t.out_lifetime_duration ] elif self.config.constraints.move_stalls_to_bottom is True: minlist = [ t.program_start_var for t in self._get_nodes() ] name = "move stalls to bottom" @@ -2542,8 +2718,7 @@ def _describe_solver(self): workers = self._model.cp_solver.parameters.num_workers if workers > 0: return f"OR-Tools CP-SAT v{ortools.__version__}, {workers} threads" - else: - return f"OR-Tools CP-SAT v{ortools.__version__}" + return f"OR-Tools CP-SAT v{ortools.__version__}" def _init_external_model_and_solver(self): self._model.cp_model = cp_model.CpModel() @@ -2557,44 +2732,38 @@ def _init_external_model_and_solver(self): self.logger.warning("Please consider upgrading OR-Tools to version >= 9.5.2040") self._model.cp_solver.parameters.symmetry_level = 1 - def _NewIntVar(self, minval, maxval, name=""): + def _NewIntVar(self, minval, maxval, name=""): # pylint:disable=invalid-name r = self._model.cp_model.NewIntVar(minval,maxval, name) self._model.variables.append(r) return r - def _NewIntervalVar(self, base, dur, end, name=""): - r = self._model.cp_model.NewIntervalVar(base,dur,end,name) - return r - def _NewOptionalIntervalVar(self, base, dur, end, cond,name=""): - r = self._model.cp_model.NewOptionalIntervalVar(base,dur,end,cond,name) - return r - def _NewBoolVar(self, name=""): + def _NewIntervalVar(self, base, dur, end, name=""): # pylint:disable=invalid-name + return self._model.cp_model.NewIntervalVar(base,dur,end,name) + def _NewOptionalIntervalVar(self, base, dur, end, cond,name=""): # pylint:disable=invalid-name + return self._model.cp_model.NewOptionalIntervalVar(base,dur,end,cond,name) + def _NewBoolVar(self, name=""): # pylint:disable=invalid-name r = self._model.cp_model.NewBoolVar(name) self._model.variables.append(r) return r - def _NewConstant(self, val): + def _NewConstant(self, val): # pylint:disable=invalid-name r = self._model.cp_model.NewConstant(val) return r - def _Add(self,c): + def _Add(self,c): # pylint:disable=invalid-name return self._model.cp_model.Add(c) - def _AddNoOverlap(self,lst): + def _AddNoOverlap(self,lst): # pylint:disable=invalid-name return self._model.cp_model.AddNoOverlap(lst) - def _AddExactlyOne(self,lst): + def _AddExactlyOne(self,lst): # pylint:disable=invalid-name return self._model.cp_model.AddExactlyOne(lst) - def _AddImplication(self,a,b): + def _AddImplication(self,a,b): # pylint:disable=invalid-name return self._model.cp_model.AddImplication(a,b) - def _AddAtLeastOne(self,lst): + def _AddAtLeastOne(self,lst): # pylint:disable=invalid-name return self._model.cp_model.AddAtLeastOne(lst) - def _AddAbsEq(self,dst,expr): + def _AddAbsEq(self,dst,expr): # pylint:disable=invalid-name return self._model.cp_model.AddAbsEquality(dst,expr) - def _AddAllDifferent(self,lst): - if len(lst) < 2: - return + def _AddAllDifferent(self,lst): # pylint:disable=invalid-name return 
self._model.cp_model.AddAllDifferent(lst) - def _AddHint(self,var,val): + def _AddHint(self,var,val): # pylint:disable=invalid-name return self._model.cp_model.AddHint(var,val) - def _AddNoOverlap(self,interval_list): - if len(interval_list) < 2: - return + def _AddNoOverlap(self,interval_list): # pylint:disable=invalid-name return self._model.cp_model.AddNoOverlap(interval_list) def _export_model(self, log_model): @@ -2649,7 +2818,7 @@ def retry(self, fix_stalls=None): self._set_timeout(self.config.retry_timeout) # - Objective - self._add_objective(force_objective = (fix_stalls is not None)) + self._add_objective(force_objective = fix_stalls is not None) # Do the actual work self.logger.info("Invoking external constraint solver...") @@ -2663,4 +2832,4 @@ def retry(self, fix_stalls=None): def _dump_model_statistics(self): # Extract and report results - SlothyBase.dump("Statistics", self._model.cp_model.cp_solver.ResponseStats(), self.logger) + SourceLine.log("Statistics", self._model.cp_model.cp_solver.ResponseStats(), self.logger) diff --git a/slothy/dataflow.py b/slothy/core/dataflow.py similarity index 88% rename from slothy/dataflow.py rename to slothy/core/dataflow.py index e25d61c0..965a6db6 100644 --- a/slothy/dataflow.py +++ b/slothy/core/dataflow.py @@ -26,7 +26,7 @@ # from functools import cached_property -from .helper import AsmHelper +from slothy.helper import SourceLine class SlothyUselessInstructionException(Exception): """An instruction was found whose outputs are neither used by a subsequent instruction @@ -127,6 +127,7 @@ def __init__(self, reg, reg_ty): self.num_in = 1 self.args_in = [reg] self.arg_types_in = [reg_ty] + self.args_in_restrictions = [None] def write(self): return f"// output renaming: {self.orig_reg} -> {self.args_in_out[0]}" @@ -141,6 +142,7 @@ def __init__(self, reg, reg_ty): self.num_out = 1 self.args_out = [reg] self.arg_types_out = [reg_ty] + self.args_out_restrictions = [None] def write(self): return f"// input renaming: {self.orig_reg} -> {self.args_out[0]}" @@ -209,6 +211,17 @@ def isinstancelist(l, c): self.dst_out = [ [] for _ in range(inst.num_out) ] self.dst_in_out = [ [] for _ in range(inst.num_in_out) ] + def to_source_line(self): + """Convert node in data flow graph to source line. + + This keeps original tags and comments from the source line that + gave rise to the node, but updates the text with the stringification + of the instruction underlying the node. + """ + line = self.inst.source_line.copy() + inst_txt = str(self.inst) + return line.set_text(inst_txt) + @cached_property + def is_virtual_input(self): + """Indicates whether the node is an input node.""" @@ -278,9 +291,9 @@ def arch(self): def typing_hints(self): """A dictionary of 'typing hints' explicitly assigning to symbolic register names a register type. - - This can be necessary to disambiguate the type of symbolic registers. - For example, the Helium vector extension has various instructions which + + This can be necessary to disambiguate the type of symbolic registers. + For example, the Helium vector extension has various instructions which accept either vector or GPR arguments.""" typing_hints = { name : ty for ty in self.arch.RegisterType \ for name in self.arch.RegisterType.list_registers(ty, with_variants=True) } @@ -291,12 +304,12 @@ def outputs(self): return self._outputs @property def inputs_are_outputs(self): - """Every input is automatically treated as an output. + """Every input is automatically treated as an output. 
This is typically set for loop kernels.""" return self._inputs_are_outputs @property def allow_useless_instructions(self): - """Indicates whether data flow creation should raise SlothyUselessInstructionException + """Indicates whether data flow creation should raise SlothyUselessInstructionException when a useless instruction is detected.""" return self._allow_useless_instructions @@ -325,6 +338,7 @@ def __init__(self, slothy_config=None, **kwargs): self._outputs = None self._inputs_are_outputs = None self._allow_useless_instructions = None + self._locked_registers = None self._load_slothy_config(slothy_config) for k,v in kwargs.items(): setattr(self,k,v) @@ -334,6 +348,7 @@ def _load_slothy_config(self, slothy_config): return self._slothy_config = slothy_config self._arch = slothy_config.arch + self._locked_registers = slothy_config.locked_registers self._typing_hints = self._slothy_config.typing_hints self._outputs = self._slothy_config.outputs self._inputs_are_outputs = self._slothy_config.inputs_are_outputs @@ -461,7 +476,7 @@ def _iter_edges_with_label(): def depth(self): """The depth of the data flow graph. - + Equivalently, the maximum length of a dependency chain in the assembly source represented by the graph.""" if self.nodes is None or len(self.nodes) == 0: @@ -479,6 +494,73 @@ def arch(self): """The underlying architectural model""" return self.config.arch + def apply_cbs(self, cb, logger, one_a_time=False): + """Apply callback to all nodes in the graph""" + + count = 0 + while True: + count += 1 + assert count < 100 # There shouldn't be many repeated modifications to the CFG + + some_change = False + + for t in self.nodes: + t.delete = False + t.changed = False + + for t in self.nodes: + if cb(t): + some_change = True + if one_a_time is True: + break + + if some_change is False: + break + + z = zip(self.nodes, self.src) + z = filter(lambda x: x[0].delete is False, z) + z = map(lambda x: ([x[0].inst], x[0].inst.write()), z) + + self.src = list(z) + + # Otherwise, parse again + changed = [t for t in self.nodes if t.changed is True] + deleted = [t for t in self.nodes if t.delete is True] + + logger.debug("Some instruction changed in callback -- need to build dataflow graph again...") + + for t in deleted: + logger.debug("* %s was deleted", t) + for t in changed: + logger.debug("* %s was changed", t) + + self._build_graph() + + def apply_parsing_cbs(self): + """Apply parsing callbacks to all nodes in the graph. + + Typically, we only build the computation flow graph once. However, sometimes we make + retrospective modifications to instructions afterwards, and then need to reparse. + + An example of this is the class of jointly destructive instruction patterns: sequences of + instructions where each instruction individually overwrites only part of a register, + but where the sequence jointly overwrites the register as a whole. In this case, we can remove the + output register as an input dependency for the first instruction in the sequence, + thereby creating more reordering and renaming flexibility. To do so, we change + the instruction and then rebuild the computation flow graph. 
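+ (Hypothetical illustration, not taken from one of the included architecture models: if `mov.lo r0, ...` wrote only the low half of `r0` and a subsequent `mov.hi r0, ...` wrote the high half, the pair would jointly overwrite `r0`, so the dependency of the first instruction on the old value of `r0` could be dropped.) 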
+ """ + logger = self.logger.getChild("parsing_cbs") + def parsing_cb(t): + return t.inst.global_parsing_cb(t, log=logger.info) + return self.apply_cbs(parsing_cb, logger) + + def apply_fusion_cbs(self): + """Apply fusion callbacks to nodes in the graph""" + logger = self.logger.getChild("fusion_cbs") + def fusion_cb(t): + return t.inst.global_fusion_cb(t, log=logger.info) + return self.apply_cbs(fusion_cb, logger, one_a_time=True) + def __init__(self, src, logger, config, parsing_cb=True): """Compute a data flow graph from a source code snippet. @@ -497,45 +579,10 @@ def __init__(self, src, logger, config, parsing_cb=True): self.config = config self.src = self._parse_source(src) - # Typically, we only build the computation flow graph once. However, sometimes we make - # retrospective modifications to instructions afterwards, and then need to reparse. - # - # An example for this are jointly destructive instruction patterns: A sequence of - # instructions where each instruction individually overwrites only part of a register, - # but jointly they overwrite the register as a whole. In this case, we can remove the - # output register as an input dependency for the first instruction in the sequence, - # thereby creating more reordering and renaming flexibility. In this case, we change - # the instruction and then rebuild the computation flow graph. - count = 0 - while True: - count += 1 - assert count < 10 # There shouldn't be many repeated modifications to the CFG - - self._build_graph() - - if not parsing_cb: - break + self._build_graph() - changed = [] - for t in self.nodes: - was_changed = t.inst.global_parsing_cb(t) - if was_changed: # remember to build the dataflow graph again - changed.append(t) - - changes = len(changed) - # If no instruction was modified, we're done - if changes == 0: - break - - self.src = list(zip([[t.inst] for t in self.nodes], [s[1] for s in self.src])) - - # Otherwise, parse again - logger.debug("%d instructions changed -- need to build dataflow graph again...", - changes) - logger.debug("The following instructions have changed:") - if changes > 0: - for t in changed: - logger.debug(t) + if parsing_cb is True: + self.apply_parsing_cbs() self._selfcheck_outputs() @@ -565,15 +612,24 @@ def outputs_unused(t): self.logger.warning(f"Instruction details: {t}, {t.inst.inputs}") self._dump_instructions("Source code", error=False) + def _parse_line(self, l): + assert SourceLine.is_source_line(l) + insts = self.arch.Instruction.parser(l) + # Remember options from source line + # TODO: Might not be the right place to remember options + for inst in insts: + inst.source_line = l + return (insts, l) + def _parse_source(self, src): - return [ (self.arch.Instruction.parser(l),l) for l in AsmHelper.reduce_source(src) ] + return list(map(self._parse_line, SourceLine.reduce_source(src))) def iter_dependencies(self): """Returns an iterator over all dependencies in the data flow graph. - + Each returned element has the form (consumer, producer, ty, idx), representing a dependency - from output producer to the idx-th input (if ty=="in") or input/output (if ty=="inout") of - consumer. The producer field is an instance of RegisterSource and contains the output index + from output producer to the idx-th input (if ty=="in") or input/output (if ty=="inout") of + consumer. 
The producer field is an instance of RegisterSource and contains the output index and source instruction as producer.idx and producer.src, respectively.""" for consumer in self.nodes_all: for idx, producer in enumerate(consumer.src_in): @@ -647,8 +703,8 @@ def get_fresh_reg(): no_ssa.append((producer.src, producer.idx)) for t in self.nodes: - for (i,_) in enumerate(t.inst.args_out): - if (t,i) in no_ssa: + for (i,c) in enumerate(t.inst.args_out): + if c in self.config._locked_registers or (t,i) in no_ssa: continue t.inst.args_out[i] = get_fresh_reg() @@ -763,6 +819,7 @@ def find_sources(types,names): step = ComputationNode(node_id=s_id, orig_pos=orig_pos, inst=s, src_in=src_in, src_in_out=src_in_out) + step.reg_state = self.reg_state.copy() def change_reg_ref(reg, ref): self._remember_type(reg, ref.get_type()) diff --git a/slothy/heuristics.py b/slothy/core/heuristics.py similarity index 58% rename from slothy/heuristics.py rename to slothy/core/heuristics.py index edb9a330..b278d50a 100644 --- a/slothy/heuristics.py +++ b/slothy/core/heuristics.py @@ -25,22 +25,33 @@ # Author: Hanno Becker # +"""SLOTHY heuristics + +The one-shot SLOTHY approach tends to become computationally infeasible above +200 assembly instructions. To optimize kernels beyond that threshold, this +module provides heuristics to split the optimization problem into several +smaller-sized problems amenable to one-shot SLOTHY. +""" + import math import random -import numpy as np -from slothy.dataflow import DataFlowGraph as DFG -from slothy.dataflow import Config as DFGConfig -from slothy.core import SlothyBase, Result -from slothy.helper import Permutation, AsmHelper +from slothy.core.dataflow import DataFlowGraph as DFG +from slothy.core.dataflow import Config as DFGConfig, ComputationNode +from slothy.core.core import SlothyBase, Result, SlothyException +from slothy.helper import Permutation, SourceLine from slothy.helper import binary_search, BinarySearchLimitException class Heuristics(): + """Break down large optimization problems into smaller ones. + + The one-shot SLOTHY approach tends to become computationally infeasible above + 200 assembly instructions. 
To optimize kernels beyond that threshold, this + class provides heuristics to split the optimization problem into several + smaller-sized problems amenable to one-shot SLOTHY.""" @staticmethod - def optimize_binsearch_core(source, logger, conf, **kwargs): - """Shim wrapper around Slothy performing a binary search for the - minimization of stalls""" + def _optimize_binsearch_core(source, logger, conf, **kwargs): logger_name = logger.name.replace(".","_") last_successful = None @@ -73,70 +84,139 @@ def try_with_stalls(stalls, timeout=None): try: return binary_search(try_with_stalls, - minimum= conf.constraints.stalls_minimum_attempt - 1, - start=conf.constraints.stalls_first_attempt, - threshold=conf.constraints.stalls_maximum_attempt, - precision=conf.constraints.stalls_precision, - timeout_below_precision=conf.constraints.stalls_timeout_below_precision) + minimum=conf.constraints.stalls_minimum_attempt - 1, + start=conf.constraints.stalls_first_attempt, + threshold=conf.constraints.stalls_maximum_attempt, + precision=conf.constraints.stalls_precision, + timeout_below_precision=conf.constraints.stalls_timeout_below_precision) + except BinarySearchLimitException: logger.error("Exceeded stall limit without finding a working solution") logger.error("Here's what you asked me to optimize:") - Heuristics._dump("Original source code", source, logger=logger, err=True, no_comments=True) - logger.error("Configuration") + + Heuristics._dump("Original source code", source, + logger=logger, err=True, no_comments=True) + logger.error("Configuration:") conf.log(logger.error) err_file = conf.log_dir + f"/{logger_name}_ERROR.s" - f = open(err_file, "w") - conf.log(lambda l: f.write("// " + l + "\n")) - f.write('\n'.join(source)) - f.close() + with open(err_file, "w", encoding="utf-8") as f: + conf.log(lambda l: f.write("// " + l + "\n")) + f.write('\n'.join(source)) + logger.error(f"Stored this information in {err_file}") @staticmethod def optimize_binsearch(source, logger, conf, **kwargs): + """Optimize for minimum number of stalls, and potentially a secondary objective. + + Args: + source: The source code to be optimized. Must be a list of + SourceLine instances. + logger: The logger to be used. + conf: The configuration to apply. This is fixed for all one-shot SLOTHY + runs invoked by this call, except for the variation of the stall count. + + The `variable_size` configuration option determines whether the minimization of + stalls happens internally or externally. Internal minimization means that the + number of stalls is part of the model, and its minimization registered as the + objective to the underlying solver. External minimization means that the number + of stalls is statically fixed per one-shot SLOTHY optimization, and that an + external binary search is used to minimize it. + + Returns: + The Result object for the succeeding optimization with the smallest + number of stalls. + """ if conf.variable_size: return Heuristics.optimize_binsearch_internal(source, logger, conf, **kwargs) - else: - return Heuristics.optimize_binsearch_external(source, logger, conf, **kwargs) + + return Heuristics.optimize_binsearch_external(source, logger, conf, **kwargs) + + @staticmethod + def _log_reoptimization_failure(log): + log.warning("Re-optimization with objective at minimum number of stalls failed. "\
" \ + "Will just pick previous result...") + + @staticmethod + def _log_input_output_warning(log): + log.warning("You are using SW pipelining without setting inputs_are_outputs=True."\ + "This means that the last iteration of the loop may overwrite inputs "\ + "to the loop (such as address registers), unless they are marked as " \ + "reserved registers. If this is intended, ignore this warning. " \ + "Otherwise, consider setting inputs_are_outputs=True to ensure that " \ + "nothing that is used as an input to the loop is overwritten, " \ + "not even in the last iteration.") @staticmethod def optimize_binsearch_external(source, logger, conf, flexible=True, **kwargs): - """Find minimum number of stalls without objective, then optimize - the objective for a fixed number of stalls.""" + """Externally optimize for minimum number of stalls, and potentially a secondary objective. + + This function uses an external binary search to find the minimum number of stalls + for which a one-shot SLOTHY optimization succeeds. If the provided configuration + has a secondary objective, it then re-optimizes the result for that secondary + objective, fixing the minimal number of stalls. + + Args: + source: The source code to be optimized. Must be a list of SourceLine instances. + logger: The logger to be used. + conf: The configuration to apply. This is fixed for all one-shot SLOTHY + runs invoked by this call, except for variation of stall count. + flexible: Indicates whether the number of stalls should be minimized + through a binary search, or whether a single one-shot SLOTHY optimization + for a fixed number of stalls (encoded in the configuration) should be + conducted. + + Returns: + A Result object representing the final optimization result. + """ if not flexible: core = SlothyBase(conf.arch, conf.target, logger=logger,config=conf) if not core.optimize(source): - raise Exception("Optimization failed") + raise SlothyException("Optimization failed") return core.result logger.info("Perform external binary search for minimal number of stalls...") c = conf.copy() c.ignore_objective = True - min_stalls, core = Heuristics.optimize_binsearch_core(source, logger, c, **kwargs) + min_stalls, core = Heuristics._optimize_binsearch_core(source, logger, c, **kwargs) if not conf.has_objective: return core.result - logger.info(f"Optimize again with minimal number of {min_stalls} stalls, with objective...") + logger.info("Optimize again with minimal number of %d stalls, with objective...", + min_stalls) first_result = core.result core.config.ignore_objective = False success = core.retry() if not success: - logger.warning("Re-optimization with objective at minimum number of stalls failed -- should not happen? Will just pick previous result...") + Heuristics._log_reoptimization_failure(logger) return first_result - # core = SlothyBase(conf.arch, conf.target, logger=logger, config=c) - # success = core.optimize(source, **kwargs) return core.result @staticmethod def optimize_binsearch_internal(source, logger, conf, **kwargs): - """Find minimum number of stalls without objective, then optimize - the objective for a fixed number of stalls.""" + """Internally optimize for minimum number of stalls, and potentially a secondary objective. + + This finds the minimum number of stalls for which a one-shot SLOTHY optimization succeeds. + If the provided configuration has a secondary objective, it then re-optimizes the result + for that secondary objective, fixing the minimal number of stalls. 
+ + Args: + source: The source code to be optimized. Must be a list of SourceLine instances. + logger: The logger to be used. + conf: The configuration to apply. This is fixed for all one-shot SLOTHY + runs invoked by this call, except for variation of stall count. + + Returns: + A Result object representing the final optimization result. + """ logger.info("Perform internal binary search for minimal number of stalls...") @@ -148,7 +228,7 @@ def optimize_binsearch_internal(source, logger, conf, **kwargs): c.variable_size = True c.constraints.stalls_allowed = cur_attempt - logger.info(f"Attempt optimization with max {cur_attempt} stalls...") + logger.info("Attempt optimization with max %d stalls...", cur_attempt) core = SlothyBase(c.arch, c.target, logger=logger, config=c) success = core.optimize(source, **kwargs) @@ -160,40 +240,63 @@ def optimize_binsearch_internal(source, logger, conf, **kwargs): cur_attempt = max(1,cur_attempt * 2) if cur_attempt > conf.constraints.stalls_maximum_attempt: logger.error("Exceeded stall limit without finding a working solution") - raise Exception("No solution found") + raise SlothyException("No solution found") logger.info(f"Minimum number of stalls: {min_stalls}") if not conf.has_objective: return core.result - logger.info(f"Optimize again with minimal number of {min_stalls} stalls, with objective...") + logger.info("Optimize again with minimal number of %d stalls, with objective...", + min_stalls) first_result = core.result success = core.retry(fix_stalls=min_stalls) if not success: - logger.warning("Re-optimization with objective at minimum number of stalls failed -- should not happen? Will just pick previous result...") + Heuristics._log_reoptimization_failure(logger) return first_result return core.result @staticmethod def periodic(body, logger, conf): - """Heuristics for the optimization of large loops - - Can be called if software pipelining is disabled. In this case, it just - forwards to the linear heuristic.""" + """Entrypoint for optimization of loops. + + If software pipelining is disabled, this function forwards to + the straightline optimization via Heuristics.linear(). + + If software pipelining is enabled but the halving heuristic + is disabled, this function performs a one-shot SLOTHY optimization + without heuristics. + + If software pipelining is enabled and the halving heuristic is + enabled, this function optimizes the loop body via straightline + optimization first, splits the result as `[A;B]`, and optimizes + `[B;A]` again via straightline optimizations. The optimized loop + is then given by the preamble `A`, kernel `opt(B;A)`, and postamble + `B`. The straightline optimizations applied in this heuristic are + done via Heuristics.linear() and thus themselves subject to the + splitting heuristic, if enabled. + + Args: + body: The loop body to be optimized. This must be a list of + SourceLine instances. + logger: The logger to be used. + conf: The configuration to be applied. + + Returns: + Tuple (preamble, kernel, postamble, num_exceptional_iterations) + of preamble, kernel and postamble (each as a list of SourceLine + objects), plus the number of iterations jointly accounted for by + the preamble and postamble (the caller will need this to adjust the + loop counter). + """ if conf.sw_pipelining.enabled and not conf.inputs_are_outputs: - logger.warning("You are using SW pipelining without setting inputs_are_outputs=True. 
This means that the last iteration of the loop may overwrite inputs to the loop (such as address registers), unless they are marked as reserved registers. If this is intended, ignore this warning. Otherwise, consider setting inputs_are_outputs=True to ensure that nothing that is used as an input to the loop is overwritten, not even in the last iteration.") + Heuristics._log_input_output_warning(logger) - def unroll(source): - if conf.sw_pipelining.enabled: - source = source * conf.sw_pipelining.unroll - source = '\n'.join(source) - return source - - body = unroll(body) + if conf.sw_pipelining.enabled: + body = body * conf.sw_pipelining.unroll if conf.inputs_are_outputs: dfg = DFG(body, logger.getChild("dfg_generate_outputs"), @@ -202,10 +305,10 @@ def unroll(source): conf.inputs_are_outputs = False # If we're not asked to do software pipelining, just forward to - # the heurstics for linear optimization. + # the heuristics for linear optimization. if not conf.sw_pipelining.enabled: - core = Heuristics.linear( body, logger=logger, conf=conf) - return [], core, [], 0 + res = Heuristics.linear( body, logger=logger, conf=conf) + return [], res.code, [], 0 if conf.sw_pipelining.halving_heuristic: return Heuristics._periodic_halving( body, logger, conf) @@ -224,6 +327,7 @@ def unroll(source): num_exceptional_iterations = result.num_exceptional_iterations kernel = result.code + assert SourceLine.is_source(kernel) # Second step: Separately optimize preamble and postamble preamble = result.preamble if conf.sw_pipelining.optimize_preamble: logger.debug("Optimize preamble...") Heuristics._dump("Preamble", preamble, logger) - logger.debug(f"Dependencies within kernel: "\ - f"{result.kernel_input_output}") + logger.debug("Dependencies within kernel: %s", result.kernel_input_output) c = conf.copy() c.outputs = result.kernel_input_output c.sw_pipelining.enabled=False - preamble = Heuristics.linear(preamble,conf=c, logger=logger.getChild("preamble")) + res_preamble = Heuristics.linear(preamble,conf=c, logger=logger.getChild("preamble")) + preamble = res_preamble.code postamble = result.postamble if conf.sw_pipelining.optimize_postamble: @@ -244,27 +348,43 @@ def unroll(source): Heuristics._dump("Postamble", postamble, logger) c = conf.copy() c.sw_pipelining.enabled=False - postamble = Heuristics.linear(postamble, conf=c, logger=logger.getChild("postamble")) + res_postamble = Heuristics.linear(postamble, conf=c, + logger=logger.getChild("postamble")) + postamble = res_postamble.code return preamble, kernel, postamble, num_exceptional_iterations @staticmethod - def linear(body, logger, conf, visualize_stalls=True): - """Heuristic for the optimization of large linear chunks of code. + def linear(body, logger, conf): + """Entrypoint for straightline optimization. + + If the split heuristic is disabled, this forwards to a one-shot optimization. + + If the split heuristic is enabled (conf.split_heuristic == True), the assembly + input is optimized by successively applying one-shot optimizations to a + 'sliding window' of code. - Must only be called if software pipelining is disabled.""" + Args: + body: The assembly input to be optimized. This must be a list of + SourceLine objects. + logger: The logger to be used. + conf: The configuration to be applied. Software pipelining must be disabled. + + Raises: + SlothyException: If software pipelining is enabled. 
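+ + Example (sketch; the concrete values are illustrative only, but all options appear in the implementation below): + + conf.split_heuristic = True + conf.split_heuristic_repeat = 2 # number of sweeps over the code + conf.split_heuristic_stepsize = 0.05 # window advance per step + conf.split_heuristic_region = [0.0, 1.0] # fraction of the code to consider 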
+ """ + assert SourceLine.is_source(body) if conf.sw_pipelining.enabled: - raise Exception("Linear heuristic should only be called with SW pipelining disabled") + raise SlothyException("Linear heuristic should only be called " + "with SW pipelining disabled") Heuristics._dump("Starting linear optimization...", body, logger) # So far, we only implement one heuristic: The splitting heuristic -- # If that's disabled, just forward to the core optimization if not conf.split_heuristic: - result = Heuristics.optimize_binsearch(body,logger.getChild("slothy"), conf) - return result.code + return Heuristics.optimize_binsearch(body,logger.getChild("slothy"), conf) - return Heuristics._split( body, logger, conf, visualize_stalls) + return Heuristics._split(body, logger, conf) @staticmethod def _naive_reordering(body, logger, conf, use_latency_depth=False): @@ -285,11 +405,11 @@ def _naive_reordering(body, logger, conf, use_latency_depth=False): else: # Calculate latency-depth of instruction nodes nodes_by_depth = dfg.nodes.copy() - nodes_by_depth.sort(key=(lambda t: t.depth)) + nodes_by_depth.sort(key=lambda t: t.depth) for t in dfg.nodes_all: t.latency_depth = 0 def get_latency(tp,t): - if tp.src.is_virtual(): + if tp.src.is_virtual: return 0 return conf.target.get_latency(tp.src.inst, tp.idx, t.inst) for t in nodes_by_depth: @@ -304,14 +424,15 @@ def get_latency(tp,t): perm = Permutation.permutation_id(l) - for i in range(l): - def get_inputs(inst): - return set(inst.args_in + inst.args_in_out) - def get_outputs(inst): - return set(inst.args_out + inst.args_in_out) + def get_inputs(inst): + return set(inst.args_in + inst.args_in_out) + def get_outputs(inst): + return set(inst.args_out + inst.args_in_out) - joint_prev_inputs = {} - joint_prev_outputs = {} + joint_prev_inputs = {} + joint_prev_outputs = {} + + for i in range(l): cur_joint_prev_inputs = set() cur_joint_prev_outputs = set() for j in range(i,l): @@ -339,24 +460,15 @@ def could_come_next(j): def pick_candidate(candidate_idxs): - # print("CANDIDATES: " + '\n* '.join(list(map(lambda idx: str((body[idx], conf.target.get_units(insts[idx]))), candidate_idxs)))) - # There a different strategies one can pursue here, some being: - # - Always pick the candidate instruction of the smallest depth - # - Peek into the uarch model and try to alternate between functional units - # It's a bit disappointing if this is necessary, since SLOTHY should do this. - # However, running it on really large snippets (1000 instructions) remains - # infeasible, even if latencies and renaming are disabled. 
- strategy = "minimal_depth" - # strategy = "alternate_functional_units" if strategy == "minimal_depth": candidate_depths = list(map(lambda j: depths[j], candidate_idxs)) logger.debug("Candidate %s: %s", depth_str, candidate_depths) choice_idx = candidate_idxs[candidate_depths.index(min(candidate_depths))] - elif strategy == "alternate_functional_units": - + else: + assert strategy == "alternate_functional_units" def flatten_units(units): res = [] for u in units: @@ -389,44 +501,28 @@ def units_different(a,b): candidate_depths = list(map(lambda j: depths[j], candidate_idxs)) logger.debug(f"Candidate {depth_str}s: {candidate_depths}") min_depth = min(candidate_depths) - refined_candidates = [ candidate_idxs[i] for i,d in enumerate(candidate_depths) if d == min_depth ] + refined_candidates = [ candidate_idxs[i] + for i,d in enumerate(candidate_depths) if d == min_depth ] choice_idx = random.choice(refined_candidates) - else: - raise Exception("Unknown preprocessing strategy") - return choice_idx - def move_entry_forward(lst, idx_from, idx_to, callback=None): + def move_entry_forward(lst, idx_from, idx_to): entry = lst[idx_from] del lst[idx_from] - - if callback is not None: - for before in lst[idx_to:idx_from]: - callback(before, entry) - return lst[:idx_to] + [entry] + lst[idx_to:] - def inst_reorder_cb(t0,t1): - SlothyBase._fixup_reordered_pair(t0,t1,logger) - - SlothyBase._fixup_reset(insts) choice_idx = None while choice_idx is None: - try: - choice_idx = pick_candidate(candidate_idxs) - insts = move_entry_forward(insts, choice_idx, i, inst_reorder_cb) - except: - candidate_idxs.remove(choice_idx) - choice_idx = None - SlothyBase._fixup_finish(insts, logger) + choice_idx = pick_candidate(candidate_idxs) + insts = move_entry_forward(insts, choice_idx, i) local_perm = Permutation.permutation_move_entry_forward(l, choice_idx, i) perm = Permutation.permutation_comp (local_perm, perm) - body = [ str(j.inst) for j in insts] + body = list(map(ComputationNode.to_source_line, insts)) depths = move_entry_forward(depths, choice_idx, i) - body[i] = f" {body[i].strip():100s} // {depth_str} {depths[i]}" + body[i].set_length(100).set_comment(f"{depth_str} {depths[i]}") Heuristics._dump("New code", body, logger) # Selfcheck @@ -441,6 +537,9 @@ def inst_reorder_cb(t0,t1): res.valid = True res.selfcheck(logger.getChild("naive_interleaving_selfcheck")) + res.offset_fixup(logger.getChild("naive_interleaving_fixup")) + body = res.code_raw + Heuristics._dump("Before naive interleaving", old, logger) Heuristics._dump("After naive interleaving", body, logger) return body, perm @@ -454,11 +553,11 @@ def _get_ssa_form(body, logger, conf): logger.info("Transform DFG into SSA...") dfg = DFG(body, logger.getChild("dfg_ssa"), DFGConfig(conf.copy()), parsing_cb=True) dfg.ssa() - ssa = [ str(t.inst) for t in dfg.nodes ] + ssa = [ ComputationNode.to_source_line(t) for t in dfg.nodes ] return ssa @staticmethod - def _split_inner(body, logger, conf, visualize_stalls=True, ssa=False): + def _split_inner(body, logger, conf, ssa=False): l = len(body) if l == 0: @@ -484,34 +583,22 @@ def _split_inner(body, logger, conf, visualize_stalls=True, ssa=False): c = conf.copy() c.constraints.allow_reordering = False c.constraints.functional_only = True - body = AsmHelper.reduce_source(body) - result = Heuristics.optimize_binsearch(body, log.getChild("remove_symbolics"),conf=c) + body = SourceLine.reduce_source(body) + result = Heuristics.optimize_binsearch(body, + log.getChild("remove_symbolics"),conf=c) body = result.code - body = 
AsmHelper.reduce_source(body) + body = SourceLine.reduce_source(body) else: perm = Permutation.permutation_id(l) - # log.debug("Remove symbolics...") - # c = conf.copy() - # c.constraints.allow_reordering = False - # c.constraints.functional_only = True - # body = AsmHelper.reduce_source(body) - # result = Heuristics.optimize_binsearch(body, log.getChild("remove_symbolics"),conf=c) - # body = result.code - # body = AsmHelper.reduce_source(body) - - # conf.outputs = result.outputs - - # Heuristics._dump("Source code without symbolic registers", body, log) - - # conf.outputs = result.outputs - def print_intarr(arr, l,vals=50): - m = max(10,max(arr)) + m = max(10, max(arr)) # pylint:disable=nested-min-max start_idxs = [ (l * i) // vals for i in range(vals) ] end_idxs = [ (l * (i+1)) // vals for i in range(vals) ] avgs = [] for (s,e) in zip(start_idxs, end_idxs): + if s == e: + continue avg = sum(arr[s:e]) // (e-s) avgs.append(avg) log.info(f"[{s:3d}-{e:3d}]: {'*'*avg}{'.'*(m-avg)} ({avg})") @@ -522,7 +609,8 @@ def print_stalls(stalls,l): stalls_arr = [ i in stalls for i in range(l) ] for v in stalls_arr: assert v in {0,1} - stalls_cumulative = [ sum(stalls_arr[max(0,i-math.floor(chunk_len/2)):i+math.ceil(chunk_len/2)]) for i in range(l) ] + stalls_cumulative = [ sum(stalls_arr[max(0,i-math.floor(chunk_len/2)) + :i+math.ceil(chunk_len/2)]) for i in range(l) ] print_intarr(stalls_cumulative,l) def optimize_chunk(start_idx, end_idx, body, stalls,show_stalls=True): @@ -549,7 +637,8 @@ def optimize_chunk(start_idx, end_idx, body, stalls,show_stalls=True): pre_pad = len(cur_pre) post_pad = len(cur_post) - Heuristics._dump(f"Optimizing chunk [{start_idx}-{prefix_len}:{end_idx}+{suffix_len}]", cur_body, log) + Heuristics._dump(f"Optimizing chunk [{start_idx}-{prefix_len}:{end_idx}+{suffix_len}]", + cur_body, log) if prefix_len > 0: Heuristics._dump("Using prefix", cur_prefix, log) if suffix_len > 0: @@ -571,7 +660,7 @@ def optimize_chunk(start_idx, end_idx, body, stalls,show_stalls=True): log.getChild(f"{start_idx}_{end_idx}"), c, prefix_len=prefix_len, suffix_len=suffix_len) Heuristics._dump(f"New chunk [{start_idx}:{end_idx}]", result.code, log) - new_body = cur_pre + AsmHelper.reduce_source(result.code) + cur_post + new_body = cur_pre + SourceLine.reduce_source(result.code) + cur_post perm = Permutation.permutation_pad(result.reordering, pre_pad, post_pad) @@ -606,8 +695,7 @@ def make_idx_list_consecutive(factor, increment): end_pos = [] while cur_end < 1.0: cur_end = cur_start + chunk_len - if cur_end > 1.0: - cur_end = 1.0 + cur_end = min(cur_end, 1.0) start_pos.append(cur_start) end_pos.append(cur_end) @@ -639,19 +727,14 @@ def not_empty(x): else: increment = conf.split_heuristic_stepsize - # orig_body = AsmHelper.reduce_source(cur_body).copy() - # perm = Permutation.permutation_id(len(orig_body)) - # Remember inputs and outputs dfgc = DFGConfig(conf.copy()) outputs = conf.outputs.copy() inputs = DFG(orig_body, log.getChild("dfg_infer_inputs"),dfgc).inputs.copy() - last_base = None - - for i in range(conf.split_heuristic_repeat): + for _ in range(conf.split_heuristic_repeat): - cur_body = AsmHelper.reduce_source(cur_body) + cur_body = SourceLine.reduce_source(cur_body) if conf.split_heuristic_chunks: start_pos = [ x[0] for x in conf.split_heuristic_chunks ] @@ -667,86 +750,78 @@ def not_empty(x): idx_lst.reverse() cur_body, stalls, local_perm = optimize_chunks_many(idx_lst, cur_body, stalls, - abort_stall_threshold=conf.split_heuristic_abort_cycle_at) + 
abort_stall_threshold=conf.split_heuristic_abort_cycle_at) perm = Permutation.permutation_comp(local_perm, perm) # Check complete result res = Result(conf) res.orig_code = orig_body - res.code = AsmHelper.reduce_source(cur_body).copy() + res.code = SourceLine.reduce_source(cur_body).copy() res.codesize_with_bubbles = res.codesize res.success = True res.reordering_with_bubbles = perm res.input_renamings = { s:s for s in inputs } res.output_renamings = { s:s for s in outputs } res.valid = True - res.selfcheck(log.getChild("full_selfcheck")) - cur_body = res.code - - maxlen = max(len(s.rstrip()) for s in cur_body) - for i in stalls: - if i > len(cur_body): - log.error("Something is wrong: Index %d, body length %d", i, len(cur_body)) - Heuristics._dump("Body:", cur_body, log, err=True) - cur_body[i] = f"{cur_body[i].rstrip():{maxlen+8}s} // gap(s) to follow" - - # Visualize model violations - if conf.split_heuristic_visualize_stalls: - cur_body = AsmHelper.reduce_source(cur_body) - c = conf.copy() - c.constraints.allow_reordering = False - c.constraints.allow_renaming = False - c.visualize_reordering = False - cur_body = Heuristics.optimize_binsearch( cur_body, log.getChild("visualize_stalls"), c).code - cur_body = ["// Start split region"] + cur_body + ["// End split region"] - - # Visualize functional units - if conf.split_heuristic_visualize_units: - dfg = DFG(cur_body, logger.getChild("visualize_functional_units"), DFGConfig(c)) - new_body = [] - for (l,t) in enumerate(dfg.nodes): - unit = conf.target.get_units(t.inst)[0] - indentation = conf.target.ExecutionUnit.get_indentation(unit) - new_body[i] = f"{'':{indentation}s}" + l - cur_body = new_body - - return cur_body + res.selfcheck(log.getChild("split_heuristic_full")) + return res @staticmethod - def _split(body, logger, conf, visualize_stalls=True): + def _split(body, logger, conf): c = conf.copy() # Focus on the chosen subregion - body = AsmHelper.reduce_source(body) + body = SourceLine.reduce_source(body) if c.split_heuristic_region == [0.0, 1.0]: - return Heuristics._split_inner(body, logger, c, visualize_stalls) + return Heuristics._split_inner(body, logger, c) + + inputs = DFG(body, logger.getChild("dfg_generate_inputs"), DFGConfig(c)).inputs start_end_idxs = Heuristics._idxs_from_fractions(c.split_heuristic_region, body) start_idx = start_end_idxs[0] end_idx = start_end_idxs[1] pre = body[:start_idx] - cur = body[start_idx:end_idx] + partial_body = body[start_idx:end_idx] post = body[end_idx:] # Adjust the outputs c.outputs = DFG(post, logger.getChild("dfg_generate_outputs"), DFGConfig(c)).inputs c.inputs_are_outputs = False - cur = Heuristics._split_inner(cur, logger, c, visualize_stalls) - body = pre + cur + post - return body + res = Heuristics._split_inner(partial_body, logger, c) + new_partial_body = res.code + + pre_pad = len(pre) + post_pad = len(post) + perm = Permutation.permutation_pad(res.reordering, pre_pad, post_pad) + + new_body = SourceLine.reduce_source(pre + new_partial_body + post) + + res2 = Result(conf) + res2.orig_code = body.copy() + res2.code = new_body + res2.codesize_with_bubbles = pre_pad + post_pad + res.codesize_with_bubbles + res2.success = True + res2.reordering_with_bubbles = perm + res2.input_renamings = { s:s for s in inputs } + res2.output_renamings = { s:s for s in conf.outputs } + res2.valid = True + res2.selfcheck(logger.getChild("split")) + + return res2 @staticmethod def _dump(name, s, logger, err=False, no_comments=False): + assert SourceLine.is_source(s) + s = [ str(l) for l in s] + def 
strip_comments(sl): return [ s.split("//")[0].strip() for s in sl ] fun = logger.debug if not err else logger.error - fun(f"Dump: {name}") - if isinstance(s, str): - s = s.splitlines() + fun(f"Dump: {name} (size {len(s)})") if no_comments: s = strip_comments(s) for l in s: @@ -759,6 +834,8 @@ def _periodic_halving(body, logger, conf): assert conf.sw_pipelining.enabled assert conf.sw_pipelining.halving_heuristic + body = SourceLine.reduce_source(body) + # Find kernel dependencies kernel_deps = DFG(body, logger.getChild("dfg_kernel_deps"), DFGConfig(conf.copy())).inputs @@ -770,11 +847,80 @@ def _periodic_halving(body, logger, conf): c.outputs = c.outputs.union(kernel_deps) if not conf.sw_pipelining.halving_heuristic_split_only: - kernel = Heuristics.linear(body,logger.getChild("slothy"),conf=c, - visualize_stalls=False) + res_halving_0 = Heuristics.linear(body,logger.getChild("slothy"),conf=c) + + # Split resulting kernel as [A;B] and synthesize result structure + # as if SW pipelining has been used and the result would have been + # [B;A], with preamble A and postamble B. + # + # Run the normal SW-pipelining selfcheck on this result. + # + # The overall goal here is to produce a result structure that's structurally + # the same as for normal SW pipelining, including checks and visualization. + # + # TODO: The 2nd optimization step below does not yet produce a Result structure. + reordering = res_halving_0.reordering + codesize = res_halving_0.codesize + def rotate_pos(p): + return p - (codesize // 2) + def is_pre(i): + return rotate_pos(reordering[i]) < 0 + + kernel = SourceLine.reduce_source(res_halving_0.code) + preamble = kernel[:codesize//2] + postamble = kernel[codesize//2:] + + # Swap halves around and consider new kernel [B;A] + kernel = postamble + preamble + + dfgc = DFGConfig(c.copy()) + dfgc.inputs_are_outputs = False + core_out = DFG(postamble, logger.getChild("dfg_kernel_deps"),dfgc).inputs + + dfgc = DFGConfig(conf.copy()) + dfgc.inputs_are_outputs = True + dfgc.outputs = core_out + new_kernel_deps = DFG(kernel, logger.getChild("dfg_kernel_deps"),dfgc).inputs + + c2 = c.copy() + c2.sw_pipelining.enabled = True + + reordering1 = { i : rotate_pos(reordering[i]) + for i in range(codesize) } + pre_core_post_dict1 = { i : (is_pre(i), not is_pre(i), False) + for i in range(codesize) } + + res = Result(c2) + res.orig_code = body + res.code = kernel + res.preamble = preamble + res.postamble = postamble + res.kernel_input_output = new_kernel_deps + res.codesize_with_bubbles = res_halving_0.codesize_with_bubbles + res.reordering_with_bubbles = reordering1 + res.pre_core_post_dict = pre_core_post_dict1 + res.input_renamings = { s:s for s in kernel_deps } + res.output_renamings = { s:s for s in c.outputs } + res.success = True + res.valid = True + + # Check result as if it has been produced by SW pipelining run + res.selfcheck(logger.getChild("halving_heuristic_1")) + else: logger.info("Halving heuristic: Split-only -- no optimization") - kernel = body + codesize = len(body) + preamble = body[:codesize//2] + postamble = body[codesize//2:] + kernel = postamble + preamble + + dfgc = DFGConfig(c.copy()) + dfgc.inputs_are_outputs = False + kernel_deps = DFG(postamble, logger.getChild("dfg_kernel_deps"),dfgc).inputs + + dfgc = DFGConfig(conf.copy()) + dfgc.inputs_are_outputs = True + kernel_deps = DFG(kernel, logger.getChild("dfg_kernel_deps"),dfgc).inputs # # Second step: @@ -792,27 +938,9 @@ def _periodic_halving(body, logger, conf): # iteration followed by the early half of the successive 
iteration. The hope is that this # enables good interleaving even without calling SLOTHY in SW pipelining mode. - kernel = AsmHelper.reduce_source(kernel) - kernel_len = len(kernel) - kernel_lenh = kernel_len // 2 - kernel_low = kernel[:kernel_lenh] - kernel_high = kernel[kernel_lenh:] - kernel = kernel_high.copy() + kernel_low.copy() - - preamble, postamble = kernel_low, kernel_high - - dfgc = DFGConfig(conf.copy()) - dfgc.outputs = kernel_deps - dfgc.inputs_are_outputs = False - kernel_deps = DFG(kernel_high, logger.getChild("dfg_kernel_deps"),dfgc).inputs - - dfgc = DFGConfig(conf.copy()) - dfgc.inputs_are_outputs = True - kernel_deps = DFG(kernel, logger.getChild("dfg_kernel_deps"),dfgc).inputs - logger.info("Apply halving heuristic to optimize two halves of consecutive loop kernels...") - # The 'periodic' version considers the 'seam' between loop iterations; otherwise, we consider + # The 'periodic' version considers the 'seam' between iterations; otherwise, we consider # [B;A] as a non-periodic snippet, which may still lead to stalls at the loop boundary. if conf.sw_pipelining.halving_heuristic_periodic: @@ -827,10 +955,50 @@ def _periodic_halving(body, logger, conf): getChild("periodic heuristic"), conf=c).code elif not conf.sw_pipelining.halving_heuristic_split_only: c = conf.copy() - c.outputs = kernel_deps - c.sw_pipelining.enabled=False - - kernel = Heuristics.linear( kernel, logger.getChild("heuristic"), conf=c) + c.outputs = new_kernel_deps + c.inputs_are_outputs = True + c.sw_pipelining.enabled = False + + res_halving_1 = Heuristics.linear(kernel, logger.getChild("heuristic"), conf=c) + final_kernel = res_halving_1.code + + reordering2 = res_halving_1.reordering_with_bubbles + + c2 = conf.copy() + + def get_reordering2(i): + is_pre = res.pre_core_post_dict[i][0] + p = reordering2[res.periodic_reordering[i]] + if is_pre: + p -= res_halving_1.codesize_with_bubbles + return p + reordering2 = { i : get_reordering2(i) for i in range(codesize) } + + res2 = Result(c2) + res2.orig_code = body + res2.code = final_kernel + res2.kernel_input_output = new_kernel_deps + res2.codesize_with_bubbles = res_halving_1.codesize_with_bubbles + res2.reordering_with_bubbles = reordering2 + res2.pre_core_post_dict = pre_core_post_dict1 + res2.input_renamings = res.input_renamings + res2.output_renamings = res.output_renamings + + new_preamble = [ final_kernel[i] for i in range(res2.codesize) + if res2.is_pre(i, original_program_order=False) is True ] + new_postamble = [ final_kernel[i] for i in range(res2.codesize) + if res2.is_pre(i, original_program_order=False) is False ] + + res2.preamble = new_preamble + res2.postamble = new_postamble + res2.success = True + res2.valid = True + + # TODO: This does not yet work since there can be renaming at the boundary between + # preamble and postamble that we don't account for in the selfcheck. 
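+ # In other words (sketch of the layout, matching the description in periodic()): with + # the first-pass kernel split as [A;B], the emitted loop is A; loop { opt(B;A) }; B, + # so preamble and postamble jointly account for one loop iteration, whence + # num_exceptional_iterations = 1 below. 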
+ # res2.selfcheck(logger.getChild("halving_heuristic_2")) + + kernel = res2.code num_exceptional_iterations = 1 return preamble, kernel, postamble, num_exceptional_iterations diff --git a/slothy/slothy.py b/slothy/core/slothy.py similarity index 53% rename from slothy/slothy.py rename to slothy/core/slothy.py index 44d1102c..e23e74be 100644 --- a/slothy/slothy.py +++ b/slothy/core/slothy.py @@ -1,4 +1,3 @@ - # # Copyright (c) 2022 Arm Limited # Copyright (c) 2022 Hanno Becker @@ -26,16 +25,51 @@ # Author: Hanno Becker # +"""SLOTHY optimizer + +SLOTHY - Super Lazy Optimization of Tricky Handwritten assemblY - is a +fixed-instruction assembly superoptimizer based on constraint solving. +It takes handwritten assembly as input and simultaneously super-optimizes: + +- Instruction scheduling +- Register allocation +- Software pipelining + +SLOTHY enables a development workflow where developers write 'clean' assembly by hand, +emphasizing the logic of the computation, while SLOTHY automates microarchitecture-specific +micro-optimizations. Since SLOTHY does not change instructions, and scheduling/allocation +optimizations are tightly controlled through configurable and extensible constraints, the +developer keeps close control over the final assembly, while being freed from the most tedious +and readability- and verifiability-impeding micro-optimizations. + +This module provides the Slothy class, which is a stateful interface to both +one-shot and heuristic optimizations using SLOTHY.""" + import logging from types import SimpleNamespace -from slothy.dataflow import DataFlowGraph as DFG -from slothy.dataflow import Config as DFGConfig -from slothy.core import Config -from slothy.heuristics import Heuristics -from slothy.helper import AsmAllocation, AsmMacro, AsmHelper, CPreprocessor - -class Slothy(): +from slothy.core.dataflow import DataFlowGraph as DFG +from slothy.core.dataflow import Config as DFGConfig, ComputationNode +from slothy.core.core import Config +from slothy.core.heuristics import Heuristics +from slothy.helper import AsmAllocation, AsmMacro, AsmHelper, CPreprocessor, SourceLine + +class Slothy: + """SLOTHY optimizer + + This class provides a stateful interface to both one-shot and heuristic + optimizations using SLOTHY. + + The basic flow of operation is the following: + - Initialize an instance, providing models of the target architecture + and microarchitecture as arguments. + - Load source code from file or raw string. + - Repeat: Adjust configuration and conduct an optimization of a loop body or + straightline block of code, using optimize() or optimize_loop(). + - Write source code to file or raw string. + + The use of heuristics is controlled through the configuration. 
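+ + Example (sketch; file names, the label, and the configuration are placeholders): + + slothy = Slothy(arch, target) + slothy.load_source_from_file("ntt_kernel.s") + slothy.config.sw_pipelining.enabled = True + slothy.optimize_loop("layer123_loop") + slothy.write_source_to_file("ntt_kernel_opt.s") 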
+ """ # Quick convenience access to architecture and target from the config def _get_arch(self): @@ -47,41 +81,64 @@ def _get_target(self): def __init__(self, arch, target, logger=None): self.config = Config(arch, target) - self.logger = logger if logger != None else logging.getLogger("slothy") - self.source = None + self.logger = logger if logger is not None else logging.getLogger("slothy") + + # The source, once loaded, is represented as a list of strings + self._source = None self.results = None self.last_result = None self.success = None + @property + def source(self): + """Returns the current source code as an array of SourceLine objects + + If you want the current source code as a multiline string, use get_source_as_string().""" + return self._source + + @source.setter + def source(self, val): + assert SourceLine.is_source(val) + self._source = val + + def get_source_as_string(self, comments=True, indentation=True, tags=True): + """Retrieve current source code as multi-line string""" + return SourceLine.write_multiline(self.source, comments=comments, + indentation=indentation, tags=tags) + + def set_source_as_string(self, s): + """Provide input source code as multi-line string""" + assert isinstance(s, str) + reduce = not self.config.ignore_tags + self.source = SourceLine.read_multiline(s, reduce=reduce) + def load_source_raw(self, source): - self.source = source.replace("\\\n", "") + """Load source code from multi-line string""" + self.set_source_as_string(source) self.results = [] def load_source_from_file(self, filename): + """Load source code from file""" if self.source is not None: self.logger.warning("Overwriting previous source code") - f = open(filename,"r") - self.load_source_raw(f.read()) - f.close() + with open(filename,"r", encoding="utf8") as f: + self.load_source_raw(f.read()) def write_source_to_file(self, filename): - f = open(filename,"w") - f.write(self.source) - f.close() - - def print_code(self): - print(self.source) + """Write current source code to file""" + with open(filename,"w", encoding="utf8") as f: + f.write(self.get_source_as_string()) def rename_function(self, old_funcname, new_funcname): + """Rename a function in the current source code""" self.source = AsmHelper.rename_function(self.source, old_funcname, new_funcname) @staticmethod def _dump(name, s, logger, err=False): + assert isinstance(s, list) fun = logger.debug if not err else logger.error fun(f"Dump: {name}") - if isinstance(s, str): - s = s.splitlines() for l in s: fun(f"> {l}") @@ -106,6 +163,7 @@ def optimize(self, start=None, end=None, loop_synthesis_cb=None, logname=None): loop_synthesis_cb: Optional (None by default) callback synthesis final source code from tuple of (preamble, kernel, postamble, # exceptional iterations). 
""" + # pylint:disable=too-many-locals if logname is None and start is not None: logname = start @@ -128,42 +186,43 @@ def optimize(self, start=None, end=None, loop_synthesis_cb=None, logname=None): self.logger.debug("Code after preprocessor:") Slothy._dump("preprocessed", body, self.logger, err=False) - body = AsmHelper.split_semicolons(body) + body = SourceLine.split_semicolons(body) body = AsmMacro.unfold_all_macros(pre, body) body = AsmAllocation.unfold_all_aliases(c.register_aliases, body) - body = AsmHelper.apply_indentation(body, indentation) + body = SourceLine.apply_indentation(body, indentation) self.logger.info("Instructions in body: %d", len(list(filter(None, body)))) early, core, late, num_exceptional = Heuristics.periodic(body, logger, c) def indented(code): - indent = ' ' * self.config.indentation - return [ indent + s for s in code ] + return [ SourceLine(l).set_indentation(indentation) for l in code] if start is not None: - core = [f"{start}:"] + core + core = [SourceLine(f"{start}:")] + core if end is not None: - core += [f"{end}:"] + core += [SourceLine(f"{end}:")] if not self.config.sw_pipelining.enabled: assert early == [] assert late == [] assert num_exceptional == 0 - optimized_source = indented(core) - elif loop_synthesis_cb != None: - optimized_source = indented(loop_synthesis_cb( pre, core, post, num_exceptional)) + optimized_source = core + elif loop_synthesis_cb is not None: + optimized_source = loop_synthesis_cb( pre, core, post, num_exceptional) else: optimized_source = [] optimized_source += indented([f"// Exceptional iterations: {num_exceptional}", "// Preamble"]) - optimized_source += indented(early) + optimized_source += early optimized_source += indented(["// Kernel"]) - optimized_source += indented(core) + optimized_source += core optimized_source += indented(["// Postamble"]) - optimized_source += indented(late) + optimized_source += late - self.source = '\n'.join(pre + optimized_source + post) + self.source = pre + optimized_source + post + assert SourceLine.is_source(self.source) def get_loop_input_output(self, loop_lbl): + """Find all registers that a loop body depends on""" logger = self.logger.getChild(loop_lbl) _, body, _, _, _ = self.arch.Loop.extract(self.source, loop_lbl) @@ -173,6 +232,7 @@ def get_loop_input_output(self, loop_lbl): return list(DFG(body, logger.getChild("dfg_kernel_deps"), dfgc).inputs) def get_input_from_output(self, start, end, outputs=None): + """For the piece of straightline code, infer which input registers affect its output""" if outputs is None: outputs = {} logger = self.logger.getChild(f"{start}_{end}_infer_input") @@ -188,11 +248,7 @@ def get_input_from_output(self, start, end, outputs=None): dfgc = DFGConfig(c) return list(DFG(body, logger.getChild("dfg_find_deps"), dfgc).inputs) - def ssa_region(self, start, end, outputs=None): - if outputs is None: - outputs = {} - logger = self.logger.getChild(f"{start}_{end}_infer_input") - pre, body, post = AsmHelper.extract(self.source, start, end) + def _fusion_core(self, pre, body, logger): c = self.config.copy() if c.with_preprocessor: @@ -200,20 +256,65 @@ def ssa_region(self, start, end, outputs=None): body = CPreprocessor.unfold(pre, body, c.compiler_binary) self.logger.debug("Code after preprocessor:") Slothy._dump("preprocessed", body, self.logger, err=False) - body = AsmHelper.split_semicolons(body) + body = SourceLine.split_semicolons(body) aliases = AsmAllocation.parse_allocs(pre) c.add_aliases(aliases) - c.outputs = outputs body = AsmMacro.unfold_all_macros(pre, 
@@ -200,20 +256,65 @@ def ssa_region(self, start, end, outputs=None):
             body = CPreprocessor.unfold(pre, body, c.compiler_binary)
             self.logger.debug("Code after preprocessor:")
             Slothy._dump("preprocessed", body, self.logger, err=False)
 
-        body = AsmHelper.split_semicolons(body)
+        body = SourceLine.split_semicolons(body)
         aliases = AsmAllocation.parse_allocs(pre)
         c.add_aliases(aliases)
-        c.outputs = outputs
 
         body = AsmMacro.unfold_all_macros(pre, body)
         body = AsmAllocation.unfold_all_aliases(c.register_aliases, body)
         dfgc = DFGConfig(c)
-        dfg = DFG(body, logger.getChild("dfg_find_deps"), dfgc)
+
+        dfg = DFG(body, logger.getChild("ssa"), dfgc, parsing_cb=False)
         dfg.ssa()
+        body = [ ComputationNode.to_source_line(t) for t in dfg.nodes ]
 
-        body_ssa = [ f"{start}:" ] + [ str(t.inst) for t in dfg.nodes ] + [ f"{end}:" ]
-        self.source = '\n'.join(pre + body_ssa + post)
+        dfg = DFG(body, logger.getChild("fusion"), dfgc, parsing_cb=False)
+        dfg.apply_fusion_cbs()
+        body = [ ComputationNode.to_source_line(t) for t in dfg.nodes ]
+
+        return body
+
+    def fusion_region(self, start, end):
+        """Run fusion callbacks on straightline code"""
+        logger = self.logger.getChild(f"ssa_{start}_{end}")
+        pre, body, post = AsmHelper.extract(self.source, start, end)
+
+        body_ssa = [ SourceLine(f"{start}:") ] + \
+            self._fusion_core(pre, body, logger) + \
+            [ SourceLine(f"{end}:") ]
+        self.source = pre + body_ssa + post
+        assert SourceLine.is_source(self.source)
+
+    def fusion_loop(self, loop_lbl):
+        """Run fusion callbacks on loop body"""
+        logger = self.logger.getChild(f"ssa_loop_{loop_lbl}")
+
+        pre, body, post, _, other_data = \
+            self.arch.Loop.extract(self.source, loop_lbl)
+
+        indentation = AsmHelper.find_indentation(body)
+
+        loop = self.arch.Loop(lbl_start=loop_lbl)
+        body_ssa = SourceLine.read_multiline(loop.start()) + \
+            SourceLine.apply_indentation(self._fusion_core(pre, body, logger), indentation) + \
+            SourceLine.read_multiline(loop.end(other_data))
+
+        self.source = pre + body_ssa + post
+        assert SourceLine.is_source(self.source)
+
+        c = self.config.copy()
+        self.config.keep_tags = True
+        self.config.constraints.functional_only = True
+        self.config.constraints.allow_reordering = False
+        self.config.sw_pipelining.enabled = False
+        self.config.split_heuristic = False
+        self.config.inputs_are_outputs = True
+        self.config.sw_pipelining.unknown_iteration_count = False
+        self.optimize_loop(loop_lbl)
+        self.config = c
+
+        assert SourceLine.is_source(self.source)
 
     def optimize_loop(self, loop_lbl, postamble_label=None):
         """Optimize the loop starting at a given label"""
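`fusion_region()` and `fusion_loop()` wrap `_fusion_core()`, which rewrites the selected code into SSA form and then applies the target's fusion callbacks; `fusion_loop()` additionally re-runs a purely functional (no-reordering) optimization pass over the rewritten loop. A sketch with hypothetical labels:

```
slothy.fusion_region("kernel_start", "kernel_end")  # straightline region between two labels
slothy.fusion_loop("loop_start")                    # loop body identified by its start label
```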
@@ -236,22 +337,33 @@ def optimize_loop(self, loop_lbl, postamble_label=None):
         self.logger.debug("Code after preprocessor:")
         Slothy._dump("preprocessed", body, self.logger, err=False)
 
-        body = AsmHelper.split_semicolons(body)
+        body = SourceLine.split_semicolons(body)
         body = AsmMacro.unfold_all_macros(early, body)
         body = AsmAllocation.unfold_all_aliases(c.register_aliases, body)
-        body = AsmHelper.apply_indentation(body, indentation)
+        body = SourceLine.apply_indentation(body, indentation)
 
-        insts = len(list(filter(None, body)))
-        self.logger.info("Optimizing loop %s (%d instructions) ...", loop_lbl, insts)
+        self.logger.info("Optimizing loop %s (%d instructions) ...",
+                         loop_lbl, len(body))
 
         preamble_code, kernel_code, postamble_code, num_exceptional = \
             Heuristics.periodic(body, logger, c)
+
         def indented(code):
-            indent = ' ' * self.config.indentation
-            return [ indent + s for s in code ]
+            if not SourceLine.is_source(code):
+                code = SourceLine.read_multiline(code)
+            return SourceLine.apply_indentation(code, self.config.indentation)
+
+        loop_lbl_end = f"{loop_lbl}_end"
+        def loop_lbl_iter(i):
+            return SourceLine(f"{loop_lbl}_iter_{i}")
 
-        loop = self.arch.Loop(lbl_start=loop_lbl)
         optimized_code = []
+
+        if self.config.sw_pipelining.unknown_iteration_count:
+            for i in range(1, num_exceptional):
+                optimized_code += indented(self.arch.Branch.if_equal(i, loop_lbl_iter(i)))
+
+        loop = self.arch.Loop(lbl_start=loop_lbl)
         optimized_code += indented(preamble_code)
 
         if self.config.sw_pipelining.unknown_iteration_count:
@@ -261,22 +373,34 @@ def indented(code):
         else:
             jump_if_empty = None
 
-        optimized_code += list(loop.start(
+        optimized_code += SourceLine.read_multiline(loop.start(
             indentation=self.config.indentation,
             fixup=num_exceptional,
             unroll=self.config.sw_pipelining.unroll,
             jump_if_empty=jump_if_empty))
         optimized_code += indented(kernel_code)
-        optimized_code += list(loop.end(other_data, indentation=self.config.indentation))
+        optimized_code += SourceLine.read_multiline(loop.end(other_data,
+            indentation=self.config.indentation))
         if postamble_label is not None:
-            optimized_code += [ f"{postamble_label}: // end of loop kernel" ]
+            optimized_code += [ SourceLine(f"{postamble_label}:")
+                                .add_comment("end of loop kernel") ]
         optimized_code += indented(postamble_code)
 
+        if self.config.sw_pipelining.unknown_iteration_count:
+            optimized_code += indented(self.arch.Branch.unconditional(loop_lbl_end))
+            for i in range(1, num_exceptional):
+                optimized_code += [SourceLine(f"{loop_lbl_iter(i)}:")]
+                optimized_code += i * indented(body)
+                optimized_code += [SourceLine(f"{loop_lbl_iter(i)}_end:")]
+                if i != num_exceptional - 1:
+                    optimized_code += indented(self.arch.Branch.unconditional(loop_lbl_end))
+            optimized_code += [SourceLine(f"{loop_lbl_end}:")]
+
         self.last_result = SimpleNamespace()
         dfgc = DFGConfig(c)
         dfgc.inputs_are_outputs = True
         self.last_result.kernel_input_output = \
             list(DFG(kernel_code, logger.getChild("dfg_kernel_deps"), dfgc).inputs)
 
-        self.source = '\n'.join(early + optimized_code + late)
+        self.source = early + optimized_code + late
         self.success = True
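With `sw_pipelining.unknown_iteration_count` enabled, the generated code now begins with a dispatch: each iteration count `i` below the number of exceptional iterations branches to a fully unrolled copy of the original body at `{loop_lbl}_iter_{i}`, while all other counts fall through to the software-pipelined loop. A configuration sketch (hypothetical labels):

```
slothy.config.sw_pipelining.enabled = True
slothy.config.sw_pipelining.unknown_iteration_count = True
slothy.optimize_loop("my_loop", postamble_label="my_loop_postamble")
```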
diff --git a/slothy/helper.py b/slothy/helper.py
index cb43560b..88f148fc 100644
--- a/slothy/helper.py
+++ b/slothy/helper.py
@@ -29,6 +29,348 @@
 import subprocess
 import logging
 
+class SourceLine:
+    """Representation of a single line of source code"""
+
+    def _extract_comments_from_text(self):
+        if not "//" in self._raw:
+            return
+        s = list(map(str.strip, self._raw.split("//")))
+        self._raw = s[0]
+        self._comments += s[1:]
+        self._trim_comments()
+
+    def _extract_indentation_from_text(self):
+        old = self._raw
+        new = old.lstrip()
+        self._indentation = len(old) - len(new)
+        self._raw = new
+
+    @staticmethod
+    def _parse_tags_in_string(s, tags):
+        def parse_value(v):
+            if v.lower() == "true":
+                return True
+            if v.lower() == "false":
+                return False
+            if v.isnumeric():
+                return int(v)
+            return v
+
+        def tag_value_callback(g):
+            tag = g.group("tag")
+            value = parse_value(g.group("value"))
+            tags[tag] = value
+            return ""
+
+        def tag_callback(g):
+            tag = g.group("tag")
+            tags[tag] = True
+            return ""
+
+        tag_value_regexp_txt = r"@slothy:(?P<tag>(\w|-)+)=(?P<value>\w+)"
+        tag_regexp_txt = r"@slothy:(?P<tag>(\w|-)+)"
+        s = re.sub(tag_value_regexp_txt, tag_value_callback, s)
+        s = re.sub(tag_regexp_txt, tag_callback, s)
+        return s
+
+    def _strip_comments(self):
+        self._comments = list(map(str.strip, self._comments))
+
+    def _trim_comments(self):
+        self._strip_comments()
+        self._comments = list(filter(lambda s: s != "", self._comments))
+
+    def _extract_tags_from_comments(self):
+        tags = {}
+        self._comments = list(map(lambda c: SourceLine._parse_tags_in_string(c, tags),
+                                  self._comments))
+        self._trim_comments()
+        self.add_tags(tags)
+
+    def reduce(self):
+        """Extract metadata (tags, comments, indentation) from raw text
+
+        The extracted components are removed from the text."""
+        self._extract_indentation_from_text()
+        self._extract_comments_from_text()
+        self._extract_tags_from_comments()
+        return self
+
+    def add_comment(self, comment):
+        """Add a comment to the metadata of a source line"""
+        self._comments.append(comment)
+        return self
+
+    def add_comments(self, comments):
+        """Add one or more comments to the metadata of a source line"""
+        for c in comments:
+            self.add_comment(c)
+        return self
+
+    def set_comments(self, comments):
+        """Set comments for source line.
+
+        Overwrites existing comments."""
+        self._comments = comments
+        return self
+
+    def set_comment(self, comment):
+        """Set single comment for source line.
+
+        Overwrites existing comments."""
+        self.set_comments([comment])
+        return self
+
+    def __init__(self, s, reduce=True):
+        """Create source line from string"""
+        assert isinstance(s, str)
+
+        self._raw = s
+        self._tags = {}
+        self._indentation = 0
+        self._fixlength = None
+        self._comments = []
+
+        if reduce is True:
+            self.reduce()
+
+    def set_tag(self, tag, value=True):
+        """Set source line tag"""
+        self._tags[tag] = value
+        return self
+
+    def set_length(self, length):
+        """Set the padded length of the text component of the source line
+
+        When printing the source line with to_string(), the source text will be
+        whitespace padded to the specified length before adding comments and tags.
+        This allows printing multiple commented source lines with a uniform
+        indentation for the comments, improving readability."""
+        self._fixlength = length
+        return self
+
+    @property
+    def tags(self):
+        """Return the dictionary of tags for the source line
+
+        Tags are source annotations of the form @slothy:(tag[=value]?).
+        """
+        return self._tags
+    @tags.setter
+    def tags(self, v):
+        self._tags = v
+
+    @property
+    def comments(self):
+        """Return the list of comments for the source line"""
+        return self._comments
+    @comments.setter
+    def comments(self, v):
+        self._comments = v
+
+    def has_text(self):
+        """Indicates if the source line contains some text"""
+        return self._raw.strip() != ""
+
+    @property
+    def text(self):
+        """Returns the (non-metadata) text in the source line"""
+        return self._raw
+
+    def to_string(self, indentation=False, comments=False, tags=False):
+        """Convert source line to a string
+
+        This includes formatting the metadata in a way reversing the
+        parsing done in the _extract_xxx() routines."""
+        if self._fixlength is None:
+            core = self._raw
+        else:
+            core = f"{self._raw:{self._fixlength}s}"
+
+        indentation = ' ' * self._indentation \
+            if indentation is True else ""
+        comments = ''.join(map(lambda s: f"// {s}", self._comments)) \
+            if comments is True else ""
+        tags = ' '.join(map(lambda tv: f" // @slothy:{tv[0]}={tv[1]}", self._tags.items())) \
+            if tags is True else ""
+
+        return f"{indentation}{core}{comments}{tags}"
+
+    def __str__(self):
+        return self.to_string()
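On construction, a `SourceLine` splits metadata off the instruction text, and `to_string()` reverses the split. For example:

```
line = SourceLine("    ldr q0, [x0] // load // @slothy:interleave=true")
line.text      # 'ldr q0, [x0]'
line.comments  # ['load']
line.tags      # {'interleave': True}
```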
+    @staticmethod
+    def reduce_source(src):
+        """Extract metadata (e.g. indentation, tags, comments) from source lines"""
+        assert SourceLine.is_source(src)
+        for l in src:
+            l.reduce()
+        return [ l for l in src if l.has_text() ]
+
+    @staticmethod
+    def log(name, s, logger=None, err=False):
+        """Send source to logger"""
+        assert isinstance(s, list)
+        if err:
+            fun = logger.error
+        else:
+            fun = logger.debug
+        if len(s) == 0:
+            return
+        fun(f"Dump: {name}")
+        for l in s:
+            fun(f"> {l}")
+
+    def set_text(self, s):
+        """Set the text of the source line
+
+        This only affects the instruction text of the source line, but leaves
+        metadata (such as comments, indentation or tags) unmodified."""
+        self._raw = s
+        return self
+
+    def add_text(self, s):
+        """Add text to a source line
+
+        This only affects the instruction text of the source line, but leaves
+        metadata (such as comments, indentation or tags) unmodified."""
+        self._raw += " " + s
+        return self
+
+    @property
+    def is_escaped(self):
+        """Indicates if line text ends with a backslash"""
+        return self.text.endswith("\\")
+
+    def remove_escaping(self):
+        """Remove escape character at end of line, if present"""
+        if not self.is_escaped:
+            return self
+        self._raw = self._raw[:-1]
+        return self
+
+    def __len__(self):
+        return len(self._raw)
+
+    def copy(self):
+        """Create a copy of a source line"""
+        return SourceLine(self._raw) \
+            .add_tags(self._tags.copy()) \
+            .set_indentation(self._indentation) \
+            .add_comments(self._comments.copy()) \
+            .set_length(self._fixlength)
+
+    @staticmethod
+    def read_multiline(s, reduce=True):
+        """Parse multi-line string or array of strings into list of SourceLine instances"""
+        if isinstance(s, str):
+            s = s.splitlines()
+        return SourceLine.merge_escaped_lines([ SourceLine(l, reduce=reduce) for l in s ])
+
+    @staticmethod
+    def merge_escaped_lines(s):
+        """Merge lines ending in a backslash with subsequent line(s)"""
+        assert SourceLine.is_source(s)
+        res = []
+        cur = None
+        for l in s:
+            if cur is not None:
+                cur.add_text(l.text)
+            else:
+                cur = l.copy()
+            if cur.is_escaped:
+                cur.remove_escaping()
+            else:
+                res.append(cur)
+                cur = None
+        assert cur is None
+        return res
+
+    @staticmethod
+    def copy_source(s):
+        """Create a copy of a list of source lines"""
+        assert SourceLine.is_source(s)
+        return [ l.copy() for l in s ]
+
+    @staticmethod
+    def write_multiline(s, comments=True, indentation=True, tags=True):
+        """Write source as multiline string"""
+        return '\n'.join(map(lambda t: t.to_string(
+            comments=comments, tags=tags, indentation=indentation), s))
+
+    def set_indentation(self, indentation):
+        """Set the indentation (in number of spaces) to be used in to_string()"""
+        self._indentation = indentation
+        return self
+
+    def add_tags(self, tags):
+        """Add one or more tags to the metadata of the source line
+
+        tags must be a tag:value dictionary."""
+        self._tags = {**self._tags, **tags}
+        return self
+
+    def add_tag(self, tag, value):
+        """Add a single tag-value pair to the metadata of the source line
+
+        If the tag is already specified, it is overwritten."""
+        return self.add_tags({ tag: value })
+
+    def inherit_tags(self, l):
+        """Inherits the tags from another source line
+
+        In case of overlapping tags, source line l takes precedence."""
+        assert SourceLine.is_source_line(l)
+        self.add_tags(l.tags)
+        return self
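`read_multiline()` and `write_multiline()` convert between multi-line strings and lists of `SourceLine` objects; on the way in, backslash-continued lines are merged by `merge_escaped_lines()`. A small sketch:

```
src = SourceLine.read_multiline("ldr q0, \\\n    [x0]\nadd x0, x0, #16")
len(src)                        # 2: the escaped pair was merged into one line
out = SourceLine.write_multiline(src)
```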
+    @staticmethod
+    def apply_indentation(source, indentation):
+        """Apply consistent indentation to assembly source"""
+        assert SourceLine.is_source(source)
+        if indentation is None:
+            return source
+        assert isinstance(indentation, int)
+        return [ l.copy().set_indentation(indentation) for l in source ]
+
+    @staticmethod
+    def drop_tags(source):
+        """Drop all tags from a source"""
+        assert SourceLine.is_source(source)
+        for l in source:
+            l.tags = {}
+        return source
+
+    @staticmethod
+    def split_semicolons(s):
+        """Split the text of source lines at semicolons
+
+        The resulting source lines inherit their metadata from the lines they
+        were split from."""
+        assert SourceLine.is_source(s)
+        res = []
+        for line in s:
+            for l in str(line).split(';'):
+                t = line.copy()
+                t.set_text(l)
+                res.append(t)
+        return res
+
+    @staticmethod
+    def is_source(s):
+        """Check if parameter is a list of SourceLine instances"""
+        if isinstance(s, list) is False:
+            return False
+        for t in s:
+            if isinstance(t, SourceLine) is False:
+                return False
+        return True
+
+    @staticmethod
+    def is_source_line(s):
+        """Checks if the parameter is a SourceLine instance"""
+        return isinstance(s, SourceLine)
+
 class NestedPrint():
     """Helper for recursive printing of structures"""
     def __str__(self):
@@ -44,7 +386,7 @@ def log(self, fun):
         for l in str(self).splitlines():
             fun(l)
 
-class LockAttributes(object):
+class LockAttributes:
     """Base class adding support for 'locking' the set of attributes, that is,
     preventing the creation of any further attributes. Note that the modification
     of already existing attributes remains possible.
@@ -62,7 +404,7 @@ def __setattr__(self, attr, val):
             varlist = [v for v in dir(self) if not v.startswith("_") ]
             varlist = '\n'.join(map(lambda x: '* ' + x, varlist))
             raise TypeError(f"Unknown attribute {attr}. \nValid attributes are:\n{varlist}")
-        elif self._locked and attr == "_locked":
+        if self._locked and attr == "_locked":
             raise TypeError("Can't unlock an object")
         object.__setattr__(self,attr,val)
@@ -79,6 +421,7 @@ def find_indentation(source):
         def get_indentation(l):
             return len(l) - len(l.lstrip())
 
+        source = map(str, source)
         # Remove empty lines
         source = list(filter(lambda t: t.strip() != "", source))
         l = len(source)
@@ -99,88 +442,24 @@ def get_indentation(l):
 
         return None
 
-    @staticmethod
-    def apply_indentation(source, indentation):
-        """Apply consistent indentation to assembly source"""
-        if indentation is None:
-            return source
-        assert isinstance(indentation, int)
-        indent = ' ' * indentation
-        return [ indent + l.lstrip() for l in source ]
-
     @staticmethod
     def rename_function(source, old_funcname, new_funcname):
         """Rename function in assembly snippet"""
         # For now, just replace function names line by line
-        def change_funcname(s):
+        def change_funcname(line):
+            s = str(line)
             s = re.sub( f"{old_funcname}:", f"{new_funcname}:", s)
             s = re.sub( f"\\.global(\\s+){old_funcname}", f".global\\1{new_funcname}", s)
             s = re.sub( f"\\.type(\\s+){old_funcname}", f".type\\1{new_funcname}", s)
-            return s
-        return '\n'.join([ change_funcname(s) for s in source.splitlines() ])
-
-    @staticmethod
-    def split_semicolons(body):
-        """Split assembly snippet across semicolons`"""
-        return [ l for s in body for l in s.split(';') ]
-
-    @staticmethod
-    def reduce_source_line(line):
-        """Simplify or ignore assembly source line"""
-        regexp_align_txt = r"^\s*\.(?:p2)?align"
-        regexp_req_txt = r"\s*(?P<alias>\w+)\s+\.req\s+(?P<reg>\w+)"
-        regexp_unreq_txt = r"\s*\.unreq\s+(?P<alias>\w+)"
-        regexp_label_txt = r"\s*(?P