-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
compiler: add insert-sync-barrier pass #21
Conversation
f6420df
to
ac38e46
Compare
Hi, we discussed offline what this would look like. There are some things that I like:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorendumoulin very nice PR.
Right now you have a lot of comments explaining what you are doing, which is great, but I think in this PR we should be extra careful adding some other comments that highlight the current assumptions maybe? Other than that it is very exciting to see where this is going!
type(op) in memref.MemRef.operations, | ||
type(op_use.operation) in memref.MemRef.operations, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you maybe alias memref.Memref.operations
to a variable DMACoreOps
and not memref.Memref.operations
= ComputeCoreOps
or something?
Your comment is quite clear, but I think we can make this more clear in the code itself, and then we can more easily what we mean with each of those later maybe?
"linalg.generic"(%0, %1, %2) <{"indexing_maps" = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>], "iterator_types" = [#linalg.iterator_type<parallel>], "operandSegmentSizes" = array<i32: 2, 1>}> ({ | ||
^2(%arg3_1 : i32, %arg4_1 : i32, %arg5_1 : i32): | ||
%4 = "arith.muli"(%arg3_1, %arg4_1) : (i32, i32) -> i32 | ||
"linalg.yield"(%4) : (i32) -> () | ||
}) : (memref<64xi32, 1 : i32>, memref<64xi32, 1 : i32>, memref<64xi32, 1 : i32>) -> () |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't make a lot of sense to me that this would operate on exactly the same data as before, can we put some different data in the linalg please?
The tests are now suddenly failing, but this is unrelated to this commit (see #25) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice! Let's await a fix for #25 first?
This pull request implements an xDSL pass to insert snax synchronisation barriers in a program. Synchronisation barriers are required when data is shared between multiple cores in a cluster, and the correct control flow of the program must be maintained. It also includes a first custom operation in the SNAX dialect:
snax.cluster_sync_op
Consider a function on core 1
c1
and a program on function 2c2
, both using operanda
:As the two programs may have a data dependency on the variable a,
c2
can only start afterc1
has finished. Therefore, we need a synchronization barrier between these two functions:The algorithm used for this pass investigates these dependencies across cores to insert synchronization passes at the correct times. This is done by walking through every op in the IR. For every operand used by the operation, we check if it also used by an operation on another core. If this is the case, we must insert a synchronization barrier between the two.
A problem with this implementation is that it could lead to multiple superfluous synchronization barriers. Consider the following program:
The previous algorithm would result in the following output:
The first synchronization barrier before
c2(a)
makes the second one beforec2(b)
unnecessary. A modified version of the algorithm handles this special case as follows:In this adaptation, if it is detected that a synchronization barrier is required, it will add it to the list of operations that need a synchronization barrier
ops_to_sync
. When that op is visited bymodule.walk()
, only then will the synchronization barrier be inserted. If however, since adding the synchronization barrier to the list, another barrier has already been inserted in the code, the barrier is cleared by resetting the listops_to_sync = []
, avoiding unnecessary barriers.This algorithm can be found in the code almost exactly as it is written here, with the following adaptations:
core(op)
function, we need another way to determine which piece of code runs on which core. The simple rule used for now is that allmemref
ops (copy
,alloc
, ...) are handled by the dma core, and all other ops run on the compute core