
compiler: add insert-sync-barrier pass #21

Merged
merged 19 commits into main on Nov 22, 2023
Conversation

jorendumoulin
Contributor

@jorendumoulin jorendumoulin commented Nov 20, 2023

This pull request implements an xDSL pass to insert SNAX synchronization barriers in a program. Synchronization barriers are required when data is shared between multiple cores in a cluster and the correct control flow of the program must be maintained. The PR also includes the first custom operation in the SNAX dialect: snax.cluster_sync_op.

Consider a function c1 running on core 1 and a function c2 running on core 2, both using operand a:

c1(a)
c2(a)

Because the two functions may have a data dependency on the variable a, c2 may only start after c1 has finished. Therefore, we need a synchronization barrier between these two functions:

c1(a)
sync()
c2(a)

The algorithm used for this pass investigates these dependencies across cores to insert synchronization barriers at the correct points. This is done by walking through every op in the IR. For every operand used by an operation, we check whether it is also used by an operation on another core. If so, we must insert a synchronization barrier between the two.

for op in module.walk():
    for operand in op.operands:
        for use in operand.uses:
            if core(op) != core(use):
                # data dependency across cores: insert a synchronization
                # barrier before the using op
                insert_sync_before(use)
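As a quick illustration, the walk above can be mimicked on a toy IR in which each op is a (name, core, operands) tuple. Everything here (the tuple model, the function names, the "sync" marker) is an illustrative stand-in for the real xDSL pass, and the sketch is written as a two-phase variant of the pseudocode: first mark the ops that need a barrier, then emit the program.

```python
# Toy model: an op is (name, core, operands). This is an illustrative
# stand-in for the real xDSL IR, not the actual pass implementation.
def insert_sync_naive(ops):
    # phase 1: mark every op that uses a value also used earlier
    # on a different core
    needs_sync = set()
    for i, (_, core, operands) in enumerate(ops):
        for operand in operands:
            for j in range(i + 1, len(ops)):
                if operand in ops[j][2] and ops[j][1] != core:
                    needs_sync.add(j)  # cross-core dependency: sync before op j
    # phase 2: emit the program with a sync() before every marked op
    output = []
    for i, op in enumerate(ops):
        if i in needs_sync:
            output.append(("sync", None, ()))
        output.append(op)
    return output

# the c1/c2 example from above: c2 uses a on another core than c1
program = [("c1", 1, ("a",)), ("c2", 2, ("a",))]
print([name for name, _, _ in insert_sync_naive(program)])
# → ['c1', 'sync', 'c2']
```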

A problem with this implementation is that it could lead to multiple superfluous synchronization barriers. Consider the following program:

c1(a)
c1(b)
c2(a)
c2(b)

The previous algorithm would result in the following output:

c1(a)
c1(b)
sync()
c2(a)
sync()
c2(b)

The first synchronization barrier before c2(a) makes the second one before c2(b) unnecessary. A modified version of the algorithm handles this special case as follows:

ops_to_sync = []
for op in module.walk():
    if op in ops_to_sync:
        insert_sync_before(op)
        ops_to_sync = []

    for operand in op.operands:
        for use in operand.uses:
            if core(op) != core(use):
                # data dependency across cores: remember to insert a
                # synchronization barrier before the using op
                ops_to_sync.append(use)

In this adaptation, when a cross-core dependency is detected, the using op is added to ops_to_sync, the list of operations that need a synchronization barrier before them. The barrier is only inserted once that op is visited by module.walk(). Inserting a barrier also resets the list with ops_to_sync = [], because that single barrier already covers every pending dependency; this avoids the unnecessary barriers produced by the first version.
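The deduplicated version can be sketched on the same kind of toy IR, where each op is a (name, core, operands) tuple. Again, the model and all names are illustrative stand-ins, not the actual xDSL pass:

```python
# Toy model of the deduplicated pass: an op is (name, core, operands),
# and ops_to_sync holds the indices of ops that still need a barrier
# inserted before them.
def insert_sync_barriers(ops):
    output = []
    ops_to_sync = set()
    for i, (name, core, operands) in enumerate(ops):
        if i in ops_to_sync:
            output.append(("sync", None, ()))
            ops_to_sync = set()  # one barrier covers all pending dependencies
        output.append((name, core, operands))
        # record every later use of this op's values on another core
        for operand in operands:
            for j in range(i + 1, len(ops)):
                if operand in ops[j][2] and ops[j][1] != core:
                    ops_to_sync.add(j)
    return output

# the four-op example: both c2 ops depend on values used by c1 ops
program = [
    ("c1", 1, ("a",)),
    ("c1", 1, ("b",)),
    ("c2", 2, ("a",)),
    ("c2", 2, ("b",)),
]
print([name for name, _, _ in insert_sync_barriers(program)])
# → ['c1', 'c1', 'sync', 'c2', 'c2']
```

Only a single sync is emitted, before the first c2 op, because inserting it clears the pending entry for the second c2 op.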

This algorithm can be found in the code almost exactly as it is written here, with the following adaptations:

  • due to the lack of a core(op) function, we need another way to determine which piece of code runs on which core. The simple rule used for now is that all memref ops (copy, alloc, ...) are handled by the DMA core, and all other ops run on the compute core
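That rule can be sketched as follows. The string-based dispatch on op names is an illustrative stand-in; the real pass checks against xDSL op classes (memref.MemRef.operations, per the review snippet below):

```python
# Illustrative sketch of the core-assignment rule: memref ops are
# handled by the DMA core, everything else by the compute core.
DMA_CORE = "dma"
COMPUTE_CORE = "compute"

def core(op_name: str) -> str:
    # memref.copy, memref.alloc, ... are DMA work; all other ops
    # (linalg, arith, ...) run on the compute core
    return DMA_CORE if op_name.startswith("memref.") else COMPUTE_CORE

print(core("memref.copy"))     # → dma
print(core("linalg.generic"))  # → compute
```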

@jorendumoulin jorendumoulin changed the title Joren/insert sync barrier compiler: add insert-sync-barrier pass Nov 20, 2023
Base automatically changed from Joren/raise-dma to main November 20, 2023 12:44
@jorendumoulin jorendumoulin force-pushed the Joren/insert-sync-barrier branch from f6420df to ac38e46 Compare November 20, 2023 12:50
@jorendumoulin jorendumoulin marked this pull request as ready for review November 20, 2023 13:48
@JosseVanDelm
Contributor

Hi, we discussed offline what this would look like.
I have a few concerns about this, but I think we both agree that running into problems fast now might be our best move.

There are some things that I like:

  • This is a nice way to get a first "raised" representation in MLIR besides C code!

There are some things that I don't like:

  • We are inserting barriers without looking at producers/consumers. In your example, c1(a) and c2(a) only need to be synced if they are altering a in place, not if they are only reading it, right? (Would tensor semantics solve this issue? --> Joren: "We are already past tensor semantics at this point, probably.") Anyhow, inserting more barriers rather than fewer is always safer, I guess.
  • We are not very explicit about which stuff we are running on which core, but maybe future work will make this more clear.

Contributor

@JosseVanDelm JosseVanDelm left a comment
@jorendumoulin very nice PR.

Right now you have a lot of comments explaining what you are doing, which is great, but in this PR I think we should be extra careful to also add comments that highlight the current assumptions. Other than that, it is very exciting to see where this is going!

compiler/dialects/snax.py (resolved)
Comment on lines 42 to 43
type(op) in memref.MemRef.operations,
type(op_use.operation) in memref.MemRef.operations,
Contributor
Could you maybe alias memref.MemRef.operations to a variable like DMACoreOps (and the remaining ops to ComputeCoreOps, or something similar)? Your comment is quite clear, but I think we can make this more explicit in the code itself, so that later we can more easily see what we mean by each of those.

compiler/transforms/insert_sync_barrier.py (resolved)
compiler/transforms/insert_sync_barrier.py (outdated, resolved)
Comment on lines 42 to 46
"linalg.generic"(%0, %1, %2) <{"indexing_maps" = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>], "iterator_types" = [#linalg.iterator_type<parallel>], "operandSegmentSizes" = array<i32: 2, 1>}> ({
^2(%arg3_1 : i32, %arg4_1 : i32, %arg5_1 : i32):
%4 = "arith.muli"(%arg3_1, %arg4_1) : (i32, i32) -> i32
"linalg.yield"(%4) : (i32) -> ()
}) : (memref<64xi32, 1 : i32>, memref<64xi32, 1 : i32>, memref<64xi32, 1 : i32>) -> ()
Contributor

It doesn't make a lot of sense to me that this would operate on exactly the same data as before; can we put some different data in the linalg, please?

@jorendumoulin
Contributor Author

The tests are now suddenly failing, but this is unrelated to this commit (see #25)

Contributor

@JosseVanDelm JosseVanDelm left a comment

Very nice! Let's await a fix for #25 first?

@jorendumoulin jorendumoulin merged commit 5feba1d into main Nov 22, 2023
3 of 4 checks passed
@jorendumoulin jorendumoulin deleted the Joren/insert-sync-barrier branch November 22, 2023 10:58