This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Consider widening instructions in cast cost calculation
ClosedPublic

Authored by mssimpso on May 1 2017, 11:04 AM.

Details

Summary

The AArch64 instruction set has a few "widening" instructions (e.g., uaddl, saddl, uaddw, etc.) that take one or more doubleword operands and produce quadword results. The operands are automatically sign- or zero-extended as appropriate. However, In LLVM IR, these extends are explicit. This patch updates TTI to consider these widening instructions when determining the cost of sign- and zero-extends. If an extend is a part of a widening operation, it will be eliminated during code generation, so the reported cost should be zero.

Event Timeline

mssimpso created this revision.May 1 2017, 11:04 AM
mssimpso updated this revision to Diff 97324.May 1 2017, 11:58 AM
kristof.beyls added inline comments.May 2 2017, 12:31 AM
lib/Target/AArch64/AArch64TargetTransformInfo.cpp
195–196

If the idea is that this function only return true if the lengthening/widening is zero-cost; maybe the function name should be isLengtheningFree instead of isLengthening?
Or something similar indicating the expected cost of widening is zero?

lib/Target/AArch64/AArch64TargetTransformInfo.h
46–47

I think the term "widening" is used for this operation, both in the LLVM code base, as well as in the ARMARM.
So, probably best to use isWidening here too.

test/Analysis/CostModel/AArch64/lengthening-casts.ll
2–15

Next to tests that verify that a 0 cost is returned, maybe it'd also be useful to explicitly test that the widening instruction variant actually gets generated?

Thanks Kristof! Responses inline.

lib/Target/AArch64/AArch64TargetTransformInfo.cpp
195–196

Sounds good! I like isLengtheningFree.

lib/Target/AArch64/AArch64TargetTransformInfo.h
46–47

From what I've read, ARM uses "lengthening" to refer to the case where both operands are doublewords (e.g., uaddl) and "widening" to refer to the case when only one operand is a doubleword (e.g., uaddw). This patch only deals with the lengthening cases. We might handle the widening cases in the future, but they may not be as straightforward because the operand order matters. I can add a comment here explaining all of this.

test/Analysis/CostModel/AArch64/lengthening-casts.ll
2–15

We actually already have code generation tests for all of these cases over in test/CodeGen/AArch64 (see arm64-vadd.ll for the uaddl cases, for example). I just copied them over here to test the cost model. I'm fine adding an additional "llc" run-line to this test, though, that verifies the code generation property in this context. I checked and there are other cost model tests that do this (e.g., X86/scalarize.ll), so this wouldn't be that unusual.

kristof.beyls added inline comments.May 2 2017, 8:01 AM
lib/Target/AArch64/AArch64TargetTransformInfo.h
46–47

I was making my comment based on Table C3-67 which contains UADDL in the ARMARM copy at https://static.docs.arm.com/ddi0487/b/DDI0487B_a_armv8_arm.pdf.
The table header is "SIMD widening and narrowing arithmetic instructions".
The table entry for UADDL has "UADDL, UADDL2 Unsigned add long (vector form)".
Therefore, I am assuming the L in the mnemonic stands for "long" not "lengthening".

Maybe there is other documentation that specifically calls this specific case "lengthening"?

test/Analysis/CostModel/AArch64/lengthening-casts.ll
2–15

Oh, we already have the tests to check the actual code generation. Than I think this is fine.
I think that you could argue both for and against the actual code generation test being in the same location as the cost model test.
I don't feel strongly enough either way to argue.

mssimpso added inline comments.May 2 2017, 8:19 AM
lib/Target/AArch64/AArch64TargetTransformInfo.h
46–47

I was looking at different documentation (possibly outdated): http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/ch07s03.html. Yours looks better to me, so let's go with isWideningFree, like you suggested.

mssimpso added inline comments.May 2 2017, 8:38 AM
lib/Target/AArch64/AArch64TargetTransformInfo.h
46–47

Also, it looks like we already handle the W variants as well, where only one operand is extended. I was thinking the operand order in the IR might affect our ability to match these instructions, but it doesn't.

I'll update the patch to consider W variants (this will actually make it simpler since we don't need to care about the other operand of the binop), and add W tests.

mssimpso updated this revision to Diff 97481.May 2 2017, 11:52 AM
mssimpso marked 3 inline comments as done.
mssimpso retitled this revision from [AArch64] Consider lengthening instructions in cast cost calculation to [AArch64] Consider widening instructions in cast cost calculation.
mssimpso edited the summary of this revision. (Show Details)

Addressed Kristof's comments.

  • Changed wording to refer to "widening" instructions rather than "lengthening" instructions.
  • Changed function name to is isWideningFree.
  • Updated the logic to consider the W instruction variants in addition to the L variants.
  • Added code generation tests along side the cost test cases.
  • Added additional tests for the W variants.
kristof.beyls edited edge metadata.May 3 2017, 12:42 AM

Addressed Kristof's comments.

  • Changed wording to refer to "widening" instructions rather than "lengthening" instructions.
  • Changed function name to is isWideningFree.
  • Updated the logic to consider the W instruction variants in addition to the L variants.
  • Added code generation tests along side the cost test cases.
  • Added additional tests for the W variants.

Thanks Matt.
I'm not familiar enough with this code to feel comfortable approving, so I think someone else should do that.
Other than that, I only have one more remark, now that I've looked at the patch again: maybe there should also be a "negative" test, i.e. verifying that a non-zero cost is computed when a widening instruction cannot be generated, e.g. when the type operated on cannot map to a native instruction?

Thanks Matt.
I'm not familiar enough with this code to feel comfortable approving, so I think someone else should do that.
Other than that, I only have one more remark, now that I've looked at the patch again: maybe there should also be a "negative" test, i.e. verifying that a non-zero cost is computed when a widening instruction cannot be generated, e.g. when the type operated on cannot map to a native instruction?

Thanks Kristof! I'll add some negative tests and wait for more reviews.

mssimpso updated this revision to Diff 97693.May 3 2017, 11:30 AM

Added some negative tests.

evandro edited edge metadata.May 3 2017, 1:54 PM

It shouldn't be assumed that such widening instructions are always free. Their latency may be longer than the equivalent instruction sequences. Preferably, the cost of both choices should weighed after consulting the backend.

It shouldn't be assumed that such widening instructions are always free. Their latency may be longer than the equivalent instruction sequences. Preferably, the cost of both choices should weighed after consulting the backend.

Hi Evandro,

Thanks for the comment.

We mean "free" in the sense that the extends considered in this patch don't exist in the generated code. They're only there in the IR to make it well-typed (i.e., an IR "add" can't operate on <2 x i32> vectors and produce a <2 x i64> result). This is consistent with the way we use "free" in other places in the cost model, when applied to extends.

If you're suggesting, for example, that an ADD and a UADDL might in theory have different max latencies, I suppose this could be true for some hardware, but in a quick look through the sub-target scheduling models, none of them seem make a distinction that I saw. I think we generally prefer to omit sub-target details from the cost-model unless necessary.

I'm not sure what you mean by "both choices" since there's only once code sequence that is generated. Do you mean that the backed should decide whether to generate, for example, an extend plus ADD vs. UADDL based on cost? I don't think that's something we can do today,

We mean "free" in the sense that the extends considered in this patch don't exist in the generated code. They're only there in the IR to make it well-typed (i.e., an IR "add" can't operate on <2 x i32> vectors and produce a <2 x i64> result). This is consistent with the way we use "free" in other places in the cost model, when applied to extends.

I see.

If you're suggesting, for example, that an ADD and a UADDL might in theory have different max latencies, I suppose this could be true for some hardware, but in a quick look through the sub-target scheduling models, none of them seem make a distinction that I saw. I think we generally prefer to omit sub-target details from the cost-model unless necessary.

Exynos M1-2 do: 1 cycle for ASIMD ADD and 3 cycles for ASIMD SADDL. I tried this patch on Exynos and noticed minor regressions and no improvements in a handful of workloads.

I'm not sure what you mean by "both choices" since there's only once code sequence that is generated. Do you mean that the backed should decide whether to generate, for example, an extend plus ADD vs. UADDL based on cost? I don't think that's something we can do today,

I'll have to think more about this.

Exynos M1-2 do: 1 cycle for ASIMD ADD and 3 cycles for ASIMD SADDL. I tried this patch on Exynos and noticed minor regressions and no improvements in a handful of workloads.

OK, yes I see that now. I missed it on the first look! What do you think about adding a new sub-target feature for this then? Something like "FastWideningOps"? We could condition this TTI change on the feature to prevent it from hurting performance on Exynos. And regarding your last point, you could then later use it to guide instruction selection to not generate the widening ops if the non-widening versions (with added extends) are more profitable.

OK, yes I see that now. I missed it on the first look! What do you think about adding a new sub-target feature for this then? Something like "FastWideningOps"? We could condition this TTI change on the feature to prevent it from hurting performance on Exynos. And regarding your last point, you could then later use it to guide instruction selection to not generate the widening ops if the non-widening versions (with added extends) are more profitable.

Please, not yet another feature. It should be possible, methinks, to getSubtarget().getSchedModel() to compare the relative costs.

Do you have a publicly available benchmark where this change is meaningful so that I can evaluate it better? For I haven't seen any change in our workloads after this change. Maybe, in the end, it doesn't matter; or maybe it's not that common.

Please, not yet another feature. It should be possible, methinks, to getSubtarget().getSchedModel() to compare the relative costs.

I can't really comment on whether that would be possible or not, because I haven't looked at it. It seems to be somewhat tangential to this patch, though, since we currently always generate the widening instruction variants if possible, eliminating the extends. So I'm not sure how that would affect TTI. It seems like you're wanting to change instruction selection?

Do you have a publicly available benchmark where this change is meaningful so that I can evaluate it better? For I haven't seen any change in our workloads after this change. Maybe, in the end, it doesn't matter; or maybe it's not that common.

Yes, this change is primarily to encourage more aggressive SLP in spec2006/h264ref. It provides a ~17% improvement on Falkor.

mssimpso updated this revision to Diff 97995.May 5 2017, 12:14 PM
mssimpso added a subscriber: gberry.

Addressed Evandro's concerns.

Hey Evando,

I appreciate all the feedback. After discussing your concerns with @gberry, he suggested an alternative solution, which I've implemented in this updated version of the patch. The basic idea is to treat the widening instructions as single operations (rather than operations composed of two or three instructions), whose cost is attached to the arithmetic instruction. This design conceptually better matches the code that we generate.

I've rearranged the patch so that we (1) mark the extends as "free" and (2) apply a sub-target specified overhead to the arithmetic instructions that are widening. I've left the default overhead zero (so this version of the patch is functionally the same as the last one), but sub-targets will now have the flexibility to increase this overhead to better match the capabilities of their hardware. Thus, an "add" ultimately mapping to UADDL can have a different cost than an "add" mapping to ADD, for example. Does this make sense?

Please let me know what you think.

evandro accepted this revision.May 5 2017, 2:20 PM

@mssimpso, yes, it makes a lot of sense! This is a more sensible approach, methinks.

This revision is now accepted and ready to land.May 5 2017, 2:20 PM

Thanks Evandro!

This revision was automatically updated to reflect the committed changes.