This is an archive of the discontinued LLVM Phabricator instance.

[MLIR] Add a utility to sort the operands of commutative ops
ClosedPublic

Authored by srishti-pm on May 1 2022, 10:31 PM.

Details

Summary

Added a commutativity utility pattern and a function to populate it. The pattern sorts the operands of an op in ascending order of the "key" associated with each operand iff the op is commutative. This sorting is stable.

The function is intended to be used inside passes to simplify the matching of commutative operations. After the above-mentioned pattern is applied, the commutative operands of an op occur in a deterministic order, so matching large DAGs becomes much simpler, i.e., it requires far fewer checks to be written by a user in their pattern matching function.

The operand of any op is produced by a set of ops and block arguments; each of these ops and block arguments is called an "ancestor" of this operand.

The "key" associated with an operand is the list of the "AncestorKeys" associated with the ancestors of this operand, visited in breadth-first order.

Now, the "AncestorKey" associated with:

  1. A block argument is {type: BLOCK_ARGUMENT, opName: ""}.
  2. A non-constant-like op, for example, arith.addi, is {type: NON_CONSTANT_OP, opName: "arith.addi"}.
  3. A constant-like op, for example, arith.constant, is {type: CONSTANT_OP, opName: "arith.constant"}.

So, if an operand, say A, was produced as follows:

`<block argument>`  `<block argument>`
             \          /
              \        /
              `arith.subi`           `arith.constant`
                         \            /
                         `arith.addi`
                                |
                           returns `A`

Then the block arguments and operations present in the backward slice of A, in breadth-first order, are:
arith.addi, arith.subi, arith.constant, <block argument>, and <block argument>.

Thus, the "key" associated with operand A is:

{
 {type: NON_CONSTANT_OP, opName: "arith.addi"},
 {type: NON_CONSTANT_OP, opName: "arith.subi"},
 {type: CONSTANT_OP, opName: "arith.constant"},
 {type: BLOCK_ARGUMENT, opName: ""},
 {type: BLOCK_ARGUMENT, opName: ""}
}

Now, if "keyA" is the key associated with operand A and "keyB" is the key associated with operand B, then:
"keyA" < "keyB" iff:

  1. At the first unequal pair of corresponding AncestorKeys, the AncestorKey in operand A's key is smaller, or,
  2. Every pair of corresponding AncestorKeys is equal and operand A's key is shorter.

AncestorKeys of type BLOCK_ARGUMENT are considered the smallest, those of type CONSTANT_OP, the largest, and NON_CONSTANT_OP types come in between. Within the types NON_CONSTANT_OP and CONSTANT_OP, the smaller ones are the ones with smaller op names (lexicographically).


Some examples of such a sorting:

Assume that the sorting is being applied to foo.commutative, which is a commutative op.

Example 1:

%1 = foo.const 0
%2 = foo.mul <block argument>, <block argument>
%3 = foo.commutative %1, %2

Here,

  1. The key associated with %1 is:
{
 {CONSTANT_OP, "foo.const"}
}
  2. The key associated with %2 is:
{
 {NON_CONSTANT_OP, "foo.mul"},
 {BLOCK_ARGUMENT, ""},
 {BLOCK_ARGUMENT, ""}
}

The key of %2 < the key of %1
Thus, the sorted foo.commutative is:

%3 = foo.commutative %2, %1

Example 2:

%1 = foo.const 0
%2 = foo.mul <block argument>, <block argument>
%3 = foo.mul %2, %1
%4 = foo.add %2, %1
%5 = foo.commutative %1, %2, %3, %4

Here,

  1. The key associated with %1 is:
{
 {CONSTANT_OP, "foo.const"}
}
  2. The key associated with %2 is:
{
 {NON_CONSTANT_OP, "foo.mul"},
 {BLOCK_ARGUMENT, ""},
 {BLOCK_ARGUMENT, ""}
}
  3. The key associated with %3 is:
{
 {NON_CONSTANT_OP, "foo.mul"},
 {NON_CONSTANT_OP, "foo.mul"},
 {CONSTANT_OP, "foo.const"},
 {BLOCK_ARGUMENT, ""},
 {BLOCK_ARGUMENT, ""}
}
  4. The key associated with %4 is:
{
 {NON_CONSTANT_OP, "foo.add"},
 {NON_CONSTANT_OP, "foo.mul"},
 {CONSTANT_OP, "foo.const"},
 {BLOCK_ARGUMENT, ""},
 {BLOCK_ARGUMENT, ""}
}

Thus, the sorted foo.commutative is:

%5 = foo.commutative %4, %2, %3, %1

Signed-off-by: Srishti Srivastava <srishti.srivastava@polymagelabs.com>

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden.

Fixing a comment typo and enhancing the commit summary even further.

Seems like this should be added to canonicalization? The "push constants to the right hand side" is there already.

I also don't understand the complexity of the implementation, I may need an example to understand why you're recursively operating on the producer ops for the operands.
From the revision description: "(1) the operands defined by non-constant-like ops come first, followed by (2) block arguments, and these are followed by (3) the operands defined by constant-like ops", which all seems like a fairly local check: a stable sort on the operands would be deterministic and local to a single operation.

mlir/lib/Transforms/Utils/CommutativityUtils.cpp
236

The string conversion seems unnecessary to me?

272

Is this a leak?

Seems like this should be added to canonicalization? The "push constants to the right hand side" is there already.

I think this was not added to canonicalization because we wanted it to be an independent utility that can be used if needed and not be used if not needed. Yes, the "push constants to the right hand side" is there already and that's actually the reason why this utility also pushes the constants to the right. I didn't want this utility to do something else and clash with the "push constants to the right hand side" canonicalization if and when both were being used together. So, when we decided what our sorting order will be, we made sure that this order kept constants to the right. For more context, you could refer to this RFC's discussion, starting from this comment: https://discourse.llvm.org/t/mlir-pdl-extending-pdl-pdlinterp-bytecode-to-enable-commutative-matching/60798/12?u=srishtisrivastava.

But, note this:
Right now, the code shows that I am sorting the constants alphabetically (as in, theoretically, we can say that arith.constant comes before tf.Const). I will remove this "alphabetical sorting" among the constants. It isn't consistent with the existing "pushing of constants" and moreover adds unnecessary computation too. I'll ensure that the sorting is stable and pushes all the constants to the right.

I also don't understand the complexity of the implementation, I may need an example to understand why you're recursively operating on the producer ops for the operands.
From the revision description: "(1) the operands defined by non-constant-like ops come first, followed by (2) block arguments, and these are followed by (3) the operands defined by constant-like ops", which all seems like a fairly local check: a stable sort on the operands would be deterministic and local to a single operation.

I do this because, firstly, in the description, if you look below this paragraph, you will see the following:
"And, if two operands come from the same op, the function backtracks and
looks even further to sort them. This backtracking is done over the
backward slice of the operand, in a breadth-first traversal."

So, in essence, we look at the entire origin of each operand, in a breadth-first fashion, to decide the ordering of the operands. Secondly, we need to sort the producers of the operands so that the sorting is deterministic (if the producers are not sorted, we don't know whether we are doing the right sorting, even if we look at the entire backward slice). This is because, when the producers are not sorted, the breadth-first traversal becomes non-deterministic to a user who is writing a pattern matching function (say, matchAndRewrite()).

srishti-pm updated this revision to Diff 428240.May 9 2022, 4:51 PM

Within constant-like ops, removed the requirement for them being sorted alphabetically. Basically, all constants will be treated as equals by the sorting algorithm and it will not distinguish between, say, arith.constant and tf.Const. This is because multiple canonicalizations exist in various dialects that push the constants to the right but do not make any distinction among constants. So, since we want this utility to not clash with those canonicalizations, this is being done.

Seems like this should be added to canonicalization? The "push constants to the right hand side" is there already.

I think this was not added to canonicalization because we wanted it to be an independent utility that can be used if needed and not be used if not needed.

You're telling me "what" while I'm actually more interested in the "why" here?

I also don't understand the complexity of the implementation, I may need an example to understand why you're recursively operating on the producer ops for the operands.
From the revision description: "(1) the operands defined by non-constant-like ops come first, followed by (2) block arguments, and these are followed by (3) the operands defined by constant-like ops", which all seems like a fairly local check: a stable sort on the operands would be deterministic and local to a single operation.

I do this because, firstly, in the description, if you look below this paragraph, you will see the following:
"And, if two operands come from the same op, the function backtracks and
looks even further to sort them. This backtracking is done over the
backward slice of the operand, in a breadth-first traversal."

Same as before: this does not tell me why, can you provide an example where this matters?

srishti-pm added a comment.EditedMay 9 2022, 6:19 PM

You're telling me "what" while I'm actually more interested in the "why" here?

I'm not sure what your question is, with a "why". Let me think about this a bit. I'll get back to you.

Same as before: this does not tell me why, can you provide an example where this matters?

Sure. This is a bit lengthy; I'm really sorry for that!

So, let's start with some basic understanding here. Let's say I am writing a matchAndRewrite() function where I take the following INPUT and convert it to the following OUTPUT:

INPUT:
a = div b, c
d = sub e, f
g = add d, a
h = const 0
i = mul h, g

OUTPUT:
i = some_op b, c, e, f

Now, when I'm writing a C++ code to match and rewrite:

If I only sort the i = mul h, g op, I get my canonicalized input as follows:

CANONICALIZED INPUT #1:

a = div b, c
d = sub e, f
g = add d, a
h = const 0
i = mul g, h

So, I'm basically sorting i = mul h, g to i = mul g, h using the utility and then writing the if-else statements to match CANONICALIZED INPUT #1.
So, the pseudocode would be:

if mul.operand[0].defOp != add OR mul.operand[1].defOp != const 0
  return failure

if add.operand[0].defOp != sub
  if add.operand[0].defOp != div OR add.operand[1].defOp != sub
    return failure
  else
    get the values of b, c, e, and f
else if add.operand[1].defOp == div
  get the values of b, c, e, and f
else
  return failure

rewrite<some_op>(mul, b, c, e, f)

But, if I had sorted the producers as well, my canonicalized input would be:
CANONICALIZED INPUT #2:

a = div b, c
d = sub e, f
g = add a, d
h = const 0
i = mul g, h

and thus my code will reduce to:

if mul.operand[0].defOp != add OR mul.operand[1].defOp != const 0
  return failure

if add.operand[0].defOp != div OR add.operand[1].defOp != sub
  return failure

get the values of b, c, e, and f

rewrite<some_op>(mul, b, c, e, f)

So, in essence, we can see that the effort of an end user writing a C++ pattern is reduced if I sort the producers as well. But one may argue that I could have sorted the add op after seeing it, and then my if-else statements would reduce. So, the above illustration doesn't explain why we sort the producers.

The real reason for sorting the producers is that, if this is not done, the sorting and this entire utility become virtually useless. A deterministic sorting of an op requires its producers to be sorted. Our sorting algorithm is based on a breadth-first traversal of backward slices. On the same level of the DAG, the traversal looks at operands from first to last; that is how the breadth-first traversal is defined here. Now, if this traversal is non-deterministic, the whole point of sorting collapses. Maybe this is best explained with the example below.

If I have this IR:
d = div b, c
s = sub e, f
x = xor k, l
g = add s, d
h = add d, x
i = add g, h

Then, i = add g, h will be sorted to i = add g, h (no change).

But, when I have the below IR (which is functionally the same as the above IR):
d = div b, c
s = sub e, f
x = xor k, l
g = add d, s
h = add d, x
i = add g, h

Then, i = add g, h will be sorted to i = add h, g.

So, we have two functionally identical IRs being sorted differently. This is clearly not useful: the sorting depends on what the input IR is, so even after sorting, functionally identical IRs can still look different. The pattern matcher (a human) then still has to write an excessive number of if-else statements to match the input, which is exactly what this sorting was supposed to avoid. This is as good as not having done any sorting at all!

Is the motivation clear now?

srishti-pm updated this revision to Diff 428300.EditedMay 9 2022, 11:51 PM

Made the sorting strictly stable.

srishti-pm edited the summary of this revision. (Show Details)May 9 2022, 11:54 PM

You're telling me "what" while I'm actually more interested in the "why" here?

I'm not sure what your question is, with a "why". Let me think about this a bit. I'll get back to you.

My "why?" question is about canonicalization: could this be a canonicalization and if so why / why not? This is an important thing to answer before looking into the discussion below actually:

Same as before: this does not tell me why, can you provide an example where this matters?

Sure. This is a bit lengthy. I'm really sorry for that !

No worry, thanks for being thorough.

...
So, in essence, we can see that the effort of an end user writing a C++ pattern is reduced if I sort the producers as well.

So far you explained "what is canonicalization and why are we canonicalizing", the same rationale applies to "push constant to the right" that we do already in canonicalization, and this is exactly why I asked before whether we could do what you're doing as a canonicalization.

The real reason for sorting the producers is that, if such a thing is not done, the sorting and this entire utility will be virtually useless.

So: I understand that the producers should be sorted for a pattern to apply, but our disconnect earlier is that my usual approach to canonicalization is to process an entire block/region, and as such we don't work on slices but on operations in order until fix-point. I'm a bit concerned about the efficiency of your approach, because when integrated in a framework that actually visits the entire block, you would recompute subsets of the same slice over and over and re-attempt to sort everything multiple times:

d = add b, c
s = add d, f
g = add s, d
h = add d, x
i = add g, h

Every time the algorithm considers a commutative op (that is, every op in this example), it would recurse and try to re-sort the current slice, processing the same ops over and over.

A deterministic sorting of an op requires its producers to be sorted.

Why this is the case isn't clear to me based on the sorting criteria you provided, but in any case a local sort with fixed-point iteration on an isolated region should converge (hopefully).

Our sorting algorithm is based on a breadth-first traversal of backward slices. On the same level of the DAG, the traversal looks at operands from first to last; that is how the breadth-first traversal is defined here. Now, if this traversal is non-deterministic, the whole point of sorting collapses. Maybe this is best explained with the example below.

If I have this IR:
d = div b, c
s = sub e, f
x = xor k, l
g = add s, d
h = add d, x
i = add g, h

Then, i = add g, h will be sorted to i = add g, h (no change).

But, when I have the below IR (which is functionally the same as the above IR):
d = div b, c
s = sub e, f
x = xor k, l
g = add d, s
h = add d, x
i = add g, h

Then, i = add g, h will be sorted to i = add h, g.

Why? Again, your description of the sort is as follows:

(1) the operands defined by non-constant-like ops come first, followed by
(2) block arguments, and these are followed by
(3) the operands defined by constant-like ops.
In addition to this, within the category (1), the order of operands is alphabetical w.r.t. the dialect name and op name.

In your example:

g = add d, s
h = add d, x
i = add g, h

The operands fit category (1), I believe, and so they should be sorted "alphabetically w.r.t. the dialect name and op name", so a stable sort would never re-order them in any way.

My "why?" question is about canonicalization: could this be a canonicalization and if so why / why not? This is an important thing to answer before looking into the discussion below actually:

I think, yes, it can be a canonicalization. I don't see why not.

So far you explained "what is canonicalization and why are we canonicalizing", the same rationale applies to "push constant to the right" that we do already in canonicalization, and this is exactly why I asked before whether we could do what you're doing as a canonicalization.

You are right. And the answer is yes, we could do this as a canonicalization. In fact, it is a canonicalization because we are converting various forms of an IR to a specific canonicalized form. It just hasn't been added to any op's canonicalizer here.

So: I understand that the producers should be sorted for a pattern to apply, but our disconnect earlier is that my usual approach to see canonicalization is to process an entire block/region, and as such we don't work on slices but on operation in order until fix-point. I'm a bit concerned about efficiency of your approach, because when integrated in a framework that actually visit the entire block you would recompute subset of the same slice over and over and re-attempt to sort everything multiple times:

I completely agree with your concern. I had the same concern myself (refer to discussion starting from here: https://discourse.llvm.org/t/mlir-pdl-extending-pdl-pdlinterp-bytecode-to-enable-commutative-matching/60798/12?u=srishtisrivastava). I think we can find a way to call this utility in the canonicalizers of all the commutative ops and remove the recursion (which won't be needed anymore, given the canonicalization happens from top to bottom). What are your views on this?

That was the sense of my question about canonicalization indeed :)

Every time the algorithm considers a commutative op (that is, every op in this example), it would recurse and try to re-sort the current slice, processing the same ops over and over.

Yes, exactly.

Why? Again, your description of the sort is as follows:

I think I need to update my commit summary and revision summary. I had thought that this was an intuitive way of explaining the utility. But, I understand that it is quite misleading.

I need to look at the algorithm in more detail, but I'm not a fan of using a string key. Concatenating strings to make compound keys is not very efficient and potentially brittle. Can you assign unique IDs and use an array of IDs instead?

mlir/include/mlir/Transforms/CommutativityUtils.h
25

Add documentation? Similar to what you have in the patch description.

Mogball added a comment.EditedMay 10 2022, 7:59 AM

On the matter of whether this should be a canonicalization, my concern with this is that if an operation has its own preferred ordering of operands that conflicts with the sort, then this will cause canonicalization to loop infinitely.

It's not actually the canonicalizer pass that moves constants to the right hand side; it's the folder. And it probably shouldn't be the folder that does this. So I'm open to making this part of canonicalization IF the sorted operand order produced by this utility is the canonical order of operands for commutative operations, so that conflicts are not possible.

(1) the operands defined by non-constant-like ops come first, followed by (2) block arguments, and these are followed by (3) the operands defined by constant-like ops.

I would have thought block-arguments would come first as we don't know their values, while non-constant-like ops could be folded at some point and then become constant-like. Meaning, they seem closer to constant than block arguments.

+1 to Mehdi's question about just stable sorting based on 4 criteria (3 buckets + ordering within (1)), and then we should be able to avoid all the string mangling too, as Jeff asked about.

+1 on all of the other comments, especially related to the use of strings.

On the matter of whether this should be a canonicalization, my concern with this is that if an operation has its own preferred ordering of operands that conflicts with the sort, then this will cause canonicalization to loop infinitely.

It's not actually the canonicalizer pass that moves constants to the right hand side; it's the folder. And it probably shouldn't be the folder that does this. So I'm open to making this part of canonicalization IF the sorted operand order produced by this utility is the canonical order of operands for commutative operations, so that conflicts are not possible.

We can decide whatever we want the canonical ordering of operands to be for the Commutative trait. We don't have to leave things up to operations if it doesn't make sense.

Improving the documentation of the functionality of this utility to make it more clear.

srishti-pm edited the summary of this revision. (Show Details)May 10 2022, 2:20 PM
srishti-pm edited the summary of this revision. (Show Details)
srishti-pm edited the summary of this revision. (Show Details)May 10 2022, 2:24 PM
srishti-pm edited the summary of this revision. (Show Details)
srishti-pm edited the summary of this revision. (Show Details)

Fixing a minor typo.

srishti-pm marked an inline comment as done.May 10 2022, 2:34 PM

Thanks for improving the doc! Are you moving this to be used in canonicalization next?
I think a first good step would be to make it a pattern and test it with a pass that applies it in the greedy rewriter. I would land this first and then try to enable this in the canonicalizer.

Also, have you thought already about how to get rid of string manipulation?

Thanks for improving the doc! Are you moving this to be used in canonicalization next?
I think a first good step would be to make it a pattern and test it with a pass that applies it in the greedy rewriter. I would land this first and then try to enable this in the canonicalizer.

Thanks! I hope it is unambiguous and clear now! Yes, I will do the "moving this to be used in canonicalization" now. Thanks for the suggestion. The plan sounds good.

Also, have you thought already about how to get rid of string manipulation?

No, I haven't. I'm still thinking about this. Meanwhile, I'll also address the other comments given here.

srishti-pm added a comment.EditedMay 10 2022, 4:19 PM

Replying to the various comments given in this revision:

1. Regarding string manipulation:

I need to look at the algorithm in more detail, but I'm not a fan of using a string key. Concatenating strings to make compound keys is not very efficient and potentially brittle. Can you assign unique IDs and use an array of IDs instead?

Sure, I'm currently brainstorming to come up with a way to do this. But, @Mogball, do you remember our discussion of including the attributes, etc. also in the unique ID/key (https://discourse.llvm.org/t/mlir-pdl-extending-pdl-pdlinterp-bytecode-to-enable-commutative-matching/60798/13)? We had decided that this was something that can be added if and when required but probably won't be included in the first revision of this utility. As in, the first revision will contain a TODO comment describing this enhancement. Will we still want to do this when we are using an array of unique IDs? I wish to understand this so that I know whether I have to think of an ID which can be easily extended to include the attributes, etc. of an op.

+1 to Mehdi's question about just stable sorting based on based on 4 criteria (3 buckets + ordering within (1)) and then we should be able to avoid all the string mangling too as Jeff asked about.

Actually, that question of Mehdi is now obsolete (I think). It was based on my incorrect documentation of the "sorting rule". I have now corrected the documentation of the rule (in the revision summary, the commit summary, and the code comments). The rule was finalized in a recent RFC (refer to comments starting from here: https://discourse.llvm.org/t/mlir-pdl-extending-pdl-pdlinterp-bytecode-to-enable-commutative-matching/60798/19). Does this sound okay?

2. Regarding reversing non-constant-like op and block argument order:

I would have thought block-arguments would come first as we don't know their values, while non-constant-like ops could be folded at some point and then become constant-like. Meaning, they seem closer to constant than block arguments.

Sure. This sounds much better actually. I think at some point during the implementation, I had also thought of the same :)
I'll do this. Thanks for the suggestion.

I'm open to iterating in tree. Landing this utility first and then try adding it as a canonicalization SGTM.

The string could be replaced with an array of tuples (yuck) for now. An enum (constant, block arg, everything else) plus the OperationName.

Attributes need to be handled somehow, possibly by adding the op's dictionary attribute and using a deep compare.

@Mogball @mehdi_amini @jpienaar, sorry there haven't been any updates from my side here for the past 10 or so days. I had been busy in some other tasks. I have started working on this again.

Addressed all the comments.

srishti-pm edited the summary of this revision. (Show Details)Jun 12 2022, 10:29 PM

I haven't thought too hard about the algorithm itself yet. I'm in the camp of "let's move forward if it works". I have mostly trivial comments.

clang/docs/tools/clang-formatted-files.txt
8452

I don't think this file needs to be modified.

mlir/include/mlir/Transforms/CommutativityUtils.h
25

Why do all of these need to be exposed publicly? I think this file should only contain SortCommutativeOperands.

35

using BlockArgumentOrOpKey = std::pair<BlockArgumentOrOpType, StringRef>

The default operator< for std::pair should work for you.

141

ArrayRef<OperandBFS *>

162

Pass these both by ArrayRef

175

ArrayRef<OperandBFS *>

187

ArrayRef<OperandBFS *>

195

SmallVectorImpl

202

ArrayRef<OperandBFS *>

212

Please move the body of matchAndRewrite into a source file. It only needs Operation *. And then all the structs/enums/utility functions in the header can be moved there as well.

227

memory leak?

243

sortedOperands(numOperands, nullptr);

mlir/lib/Transforms/Utils/CommutativityUtils.cpp
18

unused?

52

The doc of a public function shouldn't be repeated above the implementation.

Mogball added inline comments.Jun 12 2022, 10:57 PM
mlir/include/mlir/Transforms/CommutativityUtils.h
212

This could stand to be a static_assert

258

Could you not change getIndicesOfUnassigned... to populate two lists of operands and pass these to assignSortedPositionTo instead of using a set to track the indices? You could put the operand index inside OperandBFS to keep track.

mlir/lib/Transforms/Utils/CommutativityUtils.cpp
25

This function doesn't seem like it pays for itself -- llvm::any_of?

93

These could all be ArrayRefs since you aren't modifying them

168

drop mlir::

202

This function doesn't seem like it pays for itself.

srishti-pm edited the summary of this revision. (Show Details)Jun 12 2022, 11:13 PM
srishti-pm edited the summary of this revision. (Show Details)
srishti-pm edited the summary of this revision. (Show Details)Jun 12 2022, 11:15 PM
srishti-pm edited the summary of this revision. (Show Details)
srishti-pm edited the summary of this revision. (Show Details)Jun 12 2022, 11:17 PM
srishti-pm edited the summary of this revision. (Show Details)Jun 12 2022, 11:24 PM

Minor changes.

mehdi_amini added inline comments.Jun 14 2022, 1:18 PM
mlir/include/mlir/Transforms/CommutativityUtils.h
212

Can we make this not a template? This will be a code bloat otherwise.

333

There seems to me to be far too much code in the public head: I can't isolate the public API here, if this is just about a pattern then the usual way is to have a "populate" method.

srishti-pm marked 20 inline comments as done.

Addressing comments.

Addressed most of the comments. A few remaining.

srishti-pm edited the summary of this revision. (Show Details)Jun 15 2022, 11:04 AM

Increasing pattern benefit + minor typo correction.

Right now I'm not yet understanding all of the algorithm (haven't spent enough time on it), but I'm mostly concerned by the runtime cost of this normalization.


Right now I'm not yet understanding all of the algorithm (haven't spent enough time on it), but I'm mostly concerned by the runtime cost of this normalization.

I understand your concern. I'll go through my implementation once and optimize things that need it.

In principle, I think the algorithm is fine. I'm pretty sure you can rewrite bits of it to get rid of the map/sets. I'm only concerned about handling attributes. (e.g. cmpi slt vs cmpi sgt)

srishti-pm marked 6 inline comments as done.

Addressed the final comments.

@mehdi_amini, I have made sure that the algorithm is good in terms of both time and space complexity.

@Mogball, "handling attributes (e.g. cmpi slt vs cmpi sgt)" doesn't seem hard to me. I think this algorithm can be extended with ease to take attributes into account. But, I don't think that it should be a part of this revision (I believe you agree) because it seems like an incremental change which should be added only when the need arises. A new user should first get accustomed to this utility (and how it sorts operands with different backward slices), and then the utility can be extended to differentiate between backward slices containing ops with the same name but different attribute values.

mlir/include/mlir/Transforms/CommutativityUtils.h
35

I have added a constructor to BlockArgumentOrOpKey (now renamed to AncestorKey), and thus I think this comment is obsolete now. I hope that this is fine. Adding a constructor made the code look cleaner.

227

Fixed this by adding a struct called "Ancestor" which refers to either an op or a block argument.

258

I think that doing this might not be a good idea. It will increase the space complexity unnecessarily (OperandBFS is a big structure) and not help much with the time complexity, because the sets of indices are expected to be small. At worst, the number of indices will be <= the total number of operands, and each element in these sets occupies very little space (the size of an unsigned).

mlir/lib/Transforms/Utils/CommutativityUtils.cpp
272

Fixed this by adding a struct called "Ancestor" which refers to either an op or a block argument.

srishti-pm edited the summary of this revision. (Show Details)Jun 28 2022, 5:28 AM
srishti-pm edited the summary of this revision. (Show Details)
srishti-pm marked 2 inline comments as done.

Fixed memory leak.

Mogball added inline comments.Jun 28 2022, 9:41 PM
mlir/lib/Transforms/Utils/CommutativityUtils.cpp
179

Please remove these. They don't improve readability.

Addressed all of Jeff's comments.

srishti-pm marked an inline comment as done.Jun 29 2022, 9:24 PM

I'm glad the DenseSets are gone, but my three-ish biggest gripes are:

  • The algorithm is conceptually simple, but there is way more code than is necessary to achieve it.
  • More comments (excluding "doc" comments) than code is generally not a good sign.
  • The implementation is still inefficient in a lot of obvious ways.
mlir/lib/Transforms/Utils/CommutativityUtils.cpp
11–12
22–39

This class isn't necessary. Ancestor can just be Operation *. If it's null, then we know it's a block argument (the bool flag is redundant).
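
A minimal sketch of that suggestion (using a stand-in Operation struct, not MLIR's real mlir::Operation):

```cpp
#include <cassert>

// Stand-in for mlir::Operation, purely for illustration.
struct Operation {
  const char *name;
};

// A nullable pointer already encodes both cases: a non-null pointer is the
// defining op, and nullptr means "block argument" -- no extra bool is needed.
using Ancestor = Operation *;

bool isBlockArgument(Ancestor ancestor) { return ancestor == nullptr; }
```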

41
66–67

My biggest complaint with respect to readability is that there are more comments than code. This is fine if the comment has a big explanation about the algorithm and how keys are represented, especially with a nice ASCII diagram as you have below. But if this constructor had 0 comments except maybe "Only non-constant ops are sorted by name", it would be perfectly fine.

77

Constant ops could be sorted by name as well.

89–97

This should behave the same as the manually written comparator. operator< of std::tuple compares the first element and then the next if the first is equal.
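
To illustrate std::tuple's lexicographic operator< (the (rank, name) key shape here is a hypothetical stand-in, not the patch's exact type):

```cpp
#include <cassert>
#include <string>
#include <tuple>

// A key sketched as (type rank, op name). std::tuple's operator< is
// lexicographic: it compares the first elements and only consults the
// second on a tie, matching a hand-written comparator.
using AncestorKey = std::tuple<int, std::string>;

bool ancestorKeyLess(const AncestorKey &lhs, const AncestorKey &rhs) {
  return lhs < rhs;  // tuple's built-in lexicographic compare
}
```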

111

Since this is a set of pointers expected to be small, you can use SmallPtrSet for a big efficiency boost (linear scan instead of hashing when small).

149–150

Since you are moving sorted operands into their sorted position and tracking the unsorted range, you shouldn't even need this flag because you will always know the sorted and unsorted subranges.

There are multiple loops in which you iterate over the entire operand list but skip those where this flag is set/unset. In those cases, you can always just iterate the subrange of interest.

156–160
163–178

I would drop these helpers. std::queue will already assert if pop or front are called on an empty queue.

201–218
268
270

This flag is not necessary because you can just check bfsOfOperandsWithKey.size() == 1

278
279–280

You don't need to make a copy. In fact, I think you should just track the indices.

299–305

This shouldn't be necessary if you're tracking the unsorted subrange. Just quit when it gets to 1 element.

308–309
313

Argument names in comments are not necessary when the passed variable has a descriptive name.

330–354

There is no way you need this much code. A std::swap between the current operand and the first unsorted position should be enough.

361

This is possibly the longest function name I've ever seen. Please make it more concise.

401

This check could be moved into pushAncestor

536–539

You could just be returning a flag to indicate whether any swapping occurred so that you don't have to track before and after.

Mogball added inline comments.Jun 30 2022, 12:13 AM
mlir/lib/Transforms/Utils/CommutativityUtils.cpp
524

And then you'll never need to check isSorted again.

srishti-pm marked 4 inline comments as done.Jun 30 2022, 12:42 AM
srishti-pm added inline comments.
mlir/lib/Transforms/Utils/CommutativityUtils.cpp
77

The only reason we separated constant ops from the non-constant ops was because the former are canonicalized to the right (as a stable sort) by existing canonicalizations. And, we didn't want our algorithm to conflict with these existing canonicalizations. That is the reason I am not sorting them by name and just keeping them to the right (as a stable sort).

270

.size() is an O(N) operation and that is why I usually try to avoid it. Do you still agree we should use it here? I understand that N is an expectedly small value.

279–280

I agree. But, we had discussed to store operands instead of indices and that's why I did this. I will change this to use indices again (keeping other things unchanged).

330–354

If I do a swap, the sorting will no longer be stable and I believe that there was a discussion that concluded with the fact that "we want stable sorting".

361

Could you please give a suggestion for the name? After a long thought, I came up with this name. It was better than all my other ideas.

Mogball added inline comments.Jun 30 2022, 7:36 AM
mlir/lib/Transforms/Utils/CommutativityUtils.cpp
77

I know. You can sort them separately from regular ops and also by name and the overall behaviour would be the same.
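
A sketch of what that could look like, assuming a hypothetical (isConstant, opName) key: comparing the tuples keeps every constant to the right of every non-constant (false < true) while still ordering the constants among themselves by name:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical per-operand key for illustration only.
struct Key {
  bool isConstant;
  std::string opName;
};

void sortKeys(std::vector<Key> &keys) {
  std::stable_sort(keys.begin(), keys.end(), [](const Key &a, const Key &b) {
    // false < true, so non-constants sort before (to the left of) constants;
    // ties fall through to the op name.
    return std::tie(a.isConstant, a.opName) < std::tie(b.isConstant, b.opName);
  });
}
```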

270

Size is constant time

279–280

I mean you can have a list (not a set) of indices to shift.

330–354

That's true, but shifting like this is very slow as well. At this point, you might want to give std::stable_sort with a custom comparator that does extra BFS iterations on demand a try.
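
A minimal sketch of the std::stable_sort approach, with fully materialized string keys standing in for the on-demand BFS keys (a real comparator would extend the keys lazily to break ties):

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Stand-in for an operand whose BFS key has already been computed.
struct Operand {
  std::string key;
  int originalIndex;
};

// std::stable_sort preserves the relative order of operands whose keys
// compare equal, which is exactly the stability property discussed above.
void sortOperands(std::vector<Operand> &operands) {
  std::stable_sort(operands.begin(), operands.end(),
                   [](const Operand &lhs, const Operand &rhs) {
                     return lhs.key < rhs.key;
                   });
}
```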

srishti-pm marked 4 inline comments as done.Jun 30 2022, 8:12 AM
srishti-pm added inline comments.
mlir/lib/Transforms/Utils/CommutativityUtils.cpp
330–354

So, this is what I think:

The number of commutative operands is not expected to be huge. So, we can afford to do shifting. In most cases, we wouldn't have to shift more than 1 or 2 positions. But, the custom comparator might cost us a lot, considering that each BFS could potentially be very large, especially for deep learning models. So, doing the BFS traversals again and again for the same operand, even though caching will be involved, doesn't sound like a good idea to me.

What are your views?

Mogball added inline comments.Jun 30 2022, 8:28 AM
mlir/lib/Transforms/Utils/CommutativityUtils.cpp
330–354

Very rough estimate: on its own, this function is N. Finding the smallest key is N, and then finding all matching elements is N. This function is called for each operand that needs to be moved, but the number of such operands decreases. So the sort itself averages out to be 3N^2 iterations over the operand list.

Now, for traversals, doing BFS on demand inside the comparator doesn't mean it has to restart every time. It would do extra iterations on top of existing iteration results only when needed to break ties. In your case, you do an extra iteration of BFS for all operands if the current smallest key is identical, not just for the ones that need it. It's hard to estimate the number of iterations of BFS, but it's certainly more in your case. Using std::stable_sort would also bring the complexity down to N log N.

For this to be a usable canonicalization, it is really the case where the operands are already sorted (a no-op) that needs to be heavily optimized (that is, no complex data structures to populate, etc.).

I'm not sure how to check the no-op case without constructing at least a queue. Even assuming only 2 commutative operands, if they look the same at depth=1, then the comparison has to iterate.

Used the stable_sort function.

srishti-pm marked 21 inline comments as done.Jul 20 2022, 4:28 PM
srishti-pm edited the summary of this revision. (Show Details)
srishti-pm edited the summary of this revision. (Show Details)Jul 20 2022, 4:38 PM
srishti-pm edited the summary of this revision. (Show Details)
Mogball added inline comments.Jul 20 2022, 9:10 PM
mlir/lib/Transforms/Utils/CommutativityUtils.cpp
239

Why can't you sort the OperandBFS directly to avoid the hash map?

287

Can you use unique_ptr so that the memory doesn't leak?

Mogball requested changes to this revision.Jul 20 2022, 9:11 PM
This revision now requires changes to proceed.Jul 20 2022, 9:11 PM
srishti-pm marked an inline comment as done.Jul 20 2022, 9:29 PM
srishti-pm added inline comments.
mlir/lib/Transforms/Utils/CommutativityUtils.cpp
239

Because comparators are not allowed to modify their input arguments.

Mogball added inline comments.Jul 20 2022, 9:35 PM
mlir/lib/Transforms/Utils/CommutativityUtils.cpp
239

The arguments have to be const references? If so, just const_cast
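
A sketch of that workaround, with a hypothetical OperandBFS holding lazily updated state: the comparator receives const references, but it can still mutate cached state (e.g. extend a BFS on demand) by stripping the constness with const_cast:

```cpp
#include <cassert>

// Hypothetical structure with lazily updated, cached state.
struct OperandBFS {
  int key;
  int timesCompared = 0;  // stand-in for cached BFS progress
};

// Comparators take const references, so const_cast is used to update the
// cached state from inside the comparison.
bool compareOperands(const OperandBFS &lhs, const OperandBFS &rhs) {
  const_cast<OperandBFS &>(lhs).timesCompared++;
  const_cast<OperandBFS &>(rhs).timesCompared++;
  return lhs.key < rhs.key;
}
```

Note that const_cast is only safe here because the underlying objects are not actually declared const.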

srishti-pm marked an inline comment as done.Jul 21 2022, 3:50 AM
srishti-pm added inline comments.
mlir/lib/Transforms/Utils/CommutativityUtils.cpp
111

This set isn't expected to be small, right? There can be many ancestors. The only thing that is small in this context is the number of operands, right?

srishti-pm marked 5 inline comments as done.
  1. Addressed all the comments.
  2. Refactored the code.
srishti-pm edited the summary of this revision. (Show Details)Jul 27 2022, 10:36 AM

Updated an outdated comment.

Mogball accepted this revision.Jul 29 2022, 10:25 AM

Got a few small nits but otherwise LGTM. Thanks for all the hard work! This looks really solid now. I haven't thought too hard about the performance of that while loop but it seems good enough to land for now.

mlir/lib/Transforms/Utils/CommutativityUtils.cpp
247
253
256
This revision is now accepted and ready to land.Jul 29 2022, 10:25 AM

I haven't thought too hard about the performance of that while loop but it seems good enough to land for now.

What's the end goal of it? That is: outside of canonicalization, what is its purpose?

I'm referring to the nitty-gritty details of the while loop inside the comparator. It looks pretty tight to me right now. If the operands are already sorted, worst case each operand is compared against only its neighbours. Unfortunately, without extra bookkeeping, BFS will still need to iterate when necessary.

srishti-pm marked 3 inline comments as done.

Addressed the final NITs.