A large design challenge in FINN is deciding on the correct amount of parallelism to set for each individual node in the dataflow graph (folding) to maximize throughput while minimizing resource consumption.
The current solution to folding in FINN is a simple transformation (SetFolding()) which traverses the dataflow graph node-by-node and increases each parameter responsible for parallelism incrementally until the given throughput target is met.
However, this solution overlooks the many implications that folding has for resource consumption, and the fact that two configurations with near-identical throughput can have vastly different resource consumption characteristics.
For example, a node with two degrees of parallelism could be folded as either (X,Y) or (Y,X), where X and Y are two possible parameter choices; both variants yield the same throughput for the node itself, yet they differ in several ways:

- The input and output stream widths differ (the input width equals the datawidth times X or Y, and likewise for the output). Stream widths, together with transaction processing behavior, determine which Data Width Converters (DWCs) and buffers (FIFOs) get inserted between nodes: if a node's output stream is folded to the same degree as the subsequent node's input stream and the data types are identical, no DWC is inserted at all.
- Between the two inverse folding choices for a dual-parameter node, the resource consumption of the node itself can differ, depending on which degree of parallelism carries the higher resource cost.
- When searching for the optimal throughput achievable within a given resource budget, some nodes may need to be folded in a highly specific way to avoid becoming a bottleneck while also not over-provisioning resources. For example, to speed up a node by 6x with folding factor options [1,3,7] and [1,2,4], the optimal configuration is the pair (3,2); every other pair multiplies to less or more than 6, making the node either a bottleneck or more expensive than necessary.
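To make the bottleneck example concrete, the pair selection can be sketched as a small search over the two factor lists (a hypothetical helper for illustration, not FINN code): pick the cheapest pair whose product reaches the required speedup.

```python
from itertools import product

def best_folding_pair(options_a, options_b, target_speedup):
    """Pick the (a, b) pair with product >= target that minimizes overshoot.
    Pairs below the target would leave the node a bottleneck."""
    feasible = [(a, b) for a, b in product(options_a, options_b)
                if a * b >= target_speedup]
    # prefer the smallest product, i.e. the least resource over-provisioning
    return min(feasible, key=lambda p: p[0] * p[1])

print(best_folding_pair([1, 3, 7], [1, 2, 4], 6))  # (3, 2): exactly 6x
```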
In light of this, this PR introduces a new SetFolding() transformation which accounts for these implications and produces a much more robust result, using FINN's resource estimate functions to capture the effects of folding on the dataflow graph without having to introduce complex rulesets.
The new transformation treats the problem of folding as a global optimization problem with bounded integer variables. By generalizing the problem in such a way, we are able to make use of robust minimization algorithms to find optimal or near-optimal solutions while incrementally introducing any necessary heuristics to improve the optimization outcome.
The current minimizer of choice is dual_annealing, an implementation of generalized simulated annealing from scipy's library of minimizers.
The cost function places a hard bound on the throughput target while penalizing each resource used (LUT, BRAM, URAM, DSP) proportionally to the selected platform's capacity.
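A minimal sketch of this cost shape, using scipy's dual_annealing over bounded variables. The resource and cycle estimators below are illustrative placeholders for FINN's per-node estimate functions, and the platform limits are made-up numbers:

```python
import numpy as np
from scipy.optimize import dual_annealing

# made-up platform capacities, standing in for a real FPGA part
PLATFORM = {"LUT": 200000, "BRAM": 400, "URAM": 96, "DSP": 900}

def estimate(folding):
    # stand-in for FINN's per-node resource estimates
    f = np.round(folding).sum()
    return {"LUT": 500 * f, "BRAM": 2 * f, "URAM": 0, "DSP": 4 * f}

def cycles(folding):
    # stand-in for a throughput estimate (fewer cycles = faster)
    return 1e6 / max(np.round(folding).prod(), 1)

def cost(folding, target_cycles=1e4):
    res = estimate(folding)
    # resources priced proportionally to platform capacity
    c = sum(res[k] / PLATFORM[k] for k in PLATFORM)
    # hard bound on throughput: large penalty if the target is missed
    if cycles(folding) > target_cycles:
        c += 1e6
    return c

# two bounded folding variables, rounded to integers inside the cost
result = dual_annealing(cost, bounds=[(1, 64)] * 2, maxfun=2000)
print(np.round(result.x))
```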
Because simulated annealing is a time-consuming algorithm to run (it requires many re-computations of the cost function), we introduce a divide-and-conquer-style breakdown of the problem. More specifically, using a parameter exposed to the user (max_parameters_per_partition), we perform the optimization on small subsets of the model's nodes one by one, using local resource consumption estimates for the relevant nodes. Once all partitions are optimized, we apply the chosen folding parameters all at once on the full model and compute the total resource consumption and throughput. Depending on whether the resulting resource consumption is within the constraints and whether the throughput target was reached, we either relax or tighten the throughput requirements and repeat the local optimizations until we arrive at a globally optimal or near-optimal solution. This way, the runtime scales linearly with model size.
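The partitioning step can be sketched as follows, assuming each node exposes a known number of tunable folding parameters (node names and parameter counts below are illustrative, not taken from a real model):

```python
def partition_nodes(nodes, max_parameters_per_partition):
    """Group consecutive nodes so each group stays within the parameter budget."""
    partitions, current, count = [], [], 0
    for name, n_params in nodes:
        # cut a new partition whenever adding this node would exceed the budget
        if current and count + n_params > max_parameters_per_partition:
            partitions.append(current)
            current, count = [], 0
        current.append(name)
        count += n_params
    if current:
        partitions.append(current)
    return partitions

nodes = [("MVAU_0", 2), ("Thresholding_0", 1), ("MVAU_1", 2), ("Pool_0", 1)]
print(partition_nodes(nodes, 3))
# [['MVAU_0', 'Thresholding_0'], ['MVAU_1', 'Pool_0']]
```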
The transformation now features a large set of additional parameters which control how it is performed. The most notable ones are:
- folding_style (str): one of ["naive", "optimizer"]. Either use the old simple automatic folding routine (extremely fast) or the new optimizer-based routine (slower, but far more efficient).
- folding_maximum_padding (int): the maximum number of elements by which channel dimensions may be padded to relax folding factor constraints on the optimizer; a value of 0 disables padding (more on this below).
- folding_pad_io_nodes (bool): whether to pad the IO nodes. If folding_maximum_padding is set to more than 0, this can change the IO node shapes, which would require modifications to the host code (padding of the input and cropping of the output).
- enable_folding_dwc_heuristic (bool): whether to account for the DWCs that folding may introduce, both in the optimizer's cost function and in the final resource consumption used in the transformation's global loop.
- enable_folding_fifo_heuristic (bool): whether to insert FIFOs once during each global pass on the model to compute the final resource consumption.
- folding_effort (int): how many function calls to allow the minimizer during folding.
- folding_max_attempts (int): how many times to attempt folding in sequence. A value of 1 means only one attempt is made to hit the specified throughput target. A value greater than 1 means that after the first attempt, the optimizer tries to maximize throughput up to the resource limit, using a binary search to zero in on the highest throughput that still meets the resource constraints.
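The binary search behind folding_max_attempts can be sketched like this; fits_budget() is a hypothetical stand-in for a full optimize-then-estimate pass on the model at a candidate cycle target:

```python
def search_best_target(lo_cycles, hi_cycles, fits_budget, attempts):
    """Binary-search the lowest cycle target (highest throughput)
    that the optimizer can still meet within the resource limits."""
    best = None
    for _ in range(attempts):
        mid = (lo_cycles + hi_cycles) // 2
        if fits_budget(mid):
            best, hi_cycles = mid, mid - 1   # feasible: tighten the target
        else:
            lo_cycles = mid + 1              # infeasible: relax the target
        if lo_cycles > hi_cycles:
            break
    return best

# toy budget: any target of at least 3000 cycles is achievable
print(search_best_target(1000, 10000, lambda t: t >= 3000, attempts=8))  # 3002
```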
On padding:
One of the most powerful design introductions with the new folding optimizer is the use of channel dimension padding to relax folding factor constraints. More specifically, currently in FINN, a folding factor choice must be a divisor of the bounding parameter (usually the channel size), since FINN nodes cannot inherently handle going out of bounds in their SIMD operations. During the new folding optimizer runs, an assumption can be made that node channel dimensions may be padded by some number of elements from a list. If the optimizer chooses a folding factor which is only achievable with a given padding, that padding is applied to the node's tensor shapes and bounding-parameter node attribute alongside the actual folding parameter.
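A small illustration of how padding widens the set of legal folding factors (a toy helper for illustration, not FINN code):

```python
def legal_folding_factors(channels, max_pad):
    """Return {factor: smallest padded size that makes it a divisor}."""
    factors = {}
    for padded in range(channels, channels + max_pad + 1):
        for f in range(1, padded + 1):
            if padded % f == 0 and f not in factors:
                factors[f] = padded
    return factors

# 10 channels with no padding only allow factors {1, 2, 5, 10};
# allowing up to 2 channels of padding also unlocks 3, 4, 6, 11, 12.
print(sorted(legal_folding_factors(10, 2)))
# [1, 2, 3, 4, 5, 6, 10, 11, 12]
```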
However, the functional padding support is only possible if the generalized DWC from the branch https://github.com/lstasytis/finn/tree/feature/generalized-datawidthconverter is also merged into the repository. Without this DWC performing the necessary padding, using folding_maximum_padding values above 0 will result in faulty models and so should not be used.
Architecture:
The minimizer function in the folding optimizer currently uses hyperparameters (accept, visit, maxfun, maxiter) chosen empirically during the initial design passes on the transformation. These parameters should be tuned further, which would lead to more optimal solutions and a reduction in runtime.
ram_style and resType are currently disabled as 'folding' parameters to optimize; however, they would be useful to include, especially for deciding on the correct memory type to use for weight storage.