This is an archive of the discontinued LLVM Phabricator instance.

[mlir] Replace iterOperand with a neutral element
AbandonedPublic

Authored by ayzhuang on Jun 10 2022, 6:27 PM.

Details

Summary
  1. Fix a bug in unroll-and-jam utility. Currently we would get wrong result if any

of the iterOperands of the loop being unroll-and-jammed is not a neutral element
for the corresponding reduction kind. This patch replaces such iterOperand with
a neutral element and adds an op after the loop to combine the original iterOperand
and the related result of the loop.

  1. Modify getIdentityValue to support non-scalar type.

Diff Detail

Event Timeline

ayzhuang created this revision.Jun 10 2022, 6:27 PM
ayzhuang requested review of this revision.Jun 10 2022, 6:27 PM
ayzhuang updated this revision to Diff 442648.Jul 6 2022, 11:35 AM

Update comments.

Can you please improve the commit summary to state what the rationale for the change is? For eg., is it fixing a bug, is it doing something functionally equivalent in a different way or a better way, etc.? This is important to specify here.

ayzhuang edited the summary of this revision. (Show Details)Aug 25 2022, 6:16 PM
ayzhuang added a comment.EditedAug 25 2022, 6:25 PM

@bondhugula Thank you for the suggestion. I have modified the summary. Example: if the original loop is doing a reduction add and its iterOperand is not 0, we modify the original loop to use 0 as its iterOperand and add the original iterOperand back to the result of the original loop. As a result, the unroll-and jammed loop is using 0 as its iterOperands. Otherwise, we would have added the original iterOperand unroll-factor times instead of once.

bondhugula requested changes to this revision.EditedAug 26 2022, 12:10 AM

@bondhugula Thank you for the suggestion. I have modified the summary. Example: if the original loop is doing a reduction add and its iterOperand is not 0, we modify the original loop to use 0 as its iterOperand and add the original iterOperand back to the result of the original loop. As a result, the unroll-and jammed loop is using 0 as its iterOperands. Otherwise, we would have added the original iterOperand unroll-factor times instead of once.

I still don't see why you need to replace the init value by a neutral element. Let's take this example (unroll-and-jam here is the same as unroll). Any cleanup loop will start with the return value of the main loop. Within the body, the former yield of iteration i would provide the "init" value for i + 1.

%sum = affine.for %iv = 0 to 9 iter_args(%sum_iter = %c5) -> (f32) {
    %next = arith.addf %sum_iter, %arg1 : f32
    affine.yield %next : f32
  }

It looks like the change is circumventing an existing bug? . What output do you see for this currently? We don't perform any such replacement for unrolling in the presence of iter args.

This revision now requires changes to proceed.Aug 26 2022, 12:10 AM
  1. original loop:
%c5 = arith.constant 5.0 : f32
%sum = affine.for %iv = 0 to 9 iter_args(%sum_iter = %c5) -> (f32) {
    %next = arith.addf %sum_iter, %arg1 : f32
    affine.yield %next : f32
}
  1. without this patch:
%c5 = arith.constant 5.000000e+00 : f32
%sum:4 = affine.for %iv = 0 to 8 step 4 iter_args(%sum_iter0 = %c5, %sum_iter1 = %c5, %sum_iter2 = %c5, %sum_iter3 = %c5) -> (f32, f32, f32, f32) {
  %next0 = arith.addf %sum_iter0, %arg1 : f32
  %next1 = arith.addf %sum_iter1, %arg1 : f32
  %next2 = arith.addf %sum_iter2, %arg1 : f32
  %next3 = arith.addf %sum_iter3, %arg1 : f32
  affine.yield %next0, %next1, %next2, %next3 : f32, f32, f32, f32
}
%1 = arith.addf %sum#0, %sum#3 : f32
%2 = arith.addf %1, %sum#2 : f32
%3 = arith.addf %2, %sum#1 : f32
%4 = arith.addf %3, %arg1 : f32

3.with this patch:

%c5 = arith.constant 5.000000e+00 : f32
%cst_0 = arith.constant 0.000000e+00 : f32
%sum:4 = affine.for %iv = 0 to 8 step 4 iter_args(%arg2 = %cst_0, %arg3 = %cst_0, %arg4 = %cst_0, %arg5 = %cst_0) -> (f32, f32, f32, f32) {
  %next0 = arith.addf %sum_iter0, %arg1 : f32
  %next1 = arith.addf %sum_iter1, %arg1 : f32
  %next2 = arith.addf %sum_iter2, %arg1 : f32
  %next3 = arith.addf %sum_iter3, %arg1 : f32
  affine.yield %next0, %next1, %next2, %next3 : f32, f32, f32, f32
}
%1 = arith.addf %sum#0, %sum#3 : f32
%2 = arith.addf %1, %sum#2 : f32
%3 = arith.addf %2, %sum#1 : f32
%4 = arith.addf %3, %arg1 : f32
%5 = arith.addf %4, %c5 : f32

In the example you provide, we add %arg1 9 times and %c5 1 time. Without this patch we add %arg1 9 times and %c5 4 times after unroll-and-jam. With this patch we add %arg1 9 times and %c5 1 time after unroll-and-jam.

@bondhugula @dcaballe Please review, thanks. Please read my last comment for an example. We don't need this for unroll because unroll does not widen the loop. This is the test I copied from unroll.mlir:

// UNROLL-BY-4-LABEL: unroll_with_iter_args_and_promotion
func.func @unroll_with_iter_args_and_promotion(%arg0 : f32, %arg1 : f32) -> f32 {
  %from = arith.constant 0 : index
  %to = arith.constant 10 : index
  %step = arith.constant 1 : index
  %sum = affine.for %iv = 0 to 9 iter_args(%sum_iter = %arg0) -> (f32) {
    %next = arith.addf %sum_iter, %arg1 : f32
    affine.yield %next : f32
  }
  // UNROLL-BY-4:      %[[SUM:.*]] = affine.for %{{.*}} = 0 to 8 step 4 iter_args(%[[V0:.*]] =
  // UNROLL-BY-4-NEXT:   %[[V1:.*]] = arith.addf %[[V0]]
  // UNROLL-BY-4-NEXT:   %[[V2:.*]] = arith.addf %[[V1]]
  // UNROLL-BY-4-NEXT:   %[[V3:.*]] = arith.addf %[[V2]]
  // UNROLL-BY-4-NEXT:   %[[V4:.*]] = arith.addf %[[V3]]
  // UNROLL-BY-4-NEXT:   affine.yield %[[V4]]
  // UNROLL-BY-4-NEXT: }
  // UNROLL-BY-4-NEXT: %[[RES:.*]] = arith.addf %[[SUM]],
  // UNROLL-BY-4-NEXT: return %[[RES]]
  return %sum : f32
}

There is only one iterArg which is only used once in the unrolled loop.

ayzhuang updated this revision to Diff 506613.Mar 20 2023, 9:03 AM
kuhar resigned from this revision.Jun 11 2023, 1:14 PM
ayzhuang abandoned this revision.Jun 23 2023, 2:07 PM