This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Correct lane zero optimization in insert/extract costs
AbandonedPublic

Authored by mssimpso on May 3 2017, 1:14 PM.

Details

Summary

In the TTI calculation of vector insert and extract costs, we have an optimization that returns a cost of zero if we are inserting into or extracting from vector lane zero. All other inserts and extracts cost the base amount specified by the sub-target. However, the lane zero optimization only makes sense for floating-point types (i.e., within-class moves). For integer types, we should incur a cost for moving data from vector to general purpose registers, even for lane zero.

This patch modifies the lane zero optimization so that it applies only to floating-point types. Additionally, we now fall back to the base TTI implementation for all other floating-point inserts and extracts. The existing sub-target specified insert/extract costs are used only for the cross-class moves, which I think was probably the original intent. Since the existing code looks like a bug to me, I checked the X86 target, and it implements something similar to what is in this patch.

I've added a new cost model test case in Analysis/CostModel/AArch64/inserts-extracts.ll. All other test case changes are trivial (e.g., they lower the SLP threshold to ensure tests still vectorize).

Event Timeline

mssimpso created this revision.May 3 2017, 1:14 PM
anemet edited edge metadata.May 3 2017, 1:39 PM

Hi Matt,

In the TTI calculation of vector insert and extract costs, we have an optimization that returns a cost of zero if we are inserting into or extracting from vector lane zero. All other inserts and extracts cost the base amount specified by the sub-target. However, the lane zero optimization only makes sense for floating-point types (i.e., within-class moves). For integer types, we should incur a cost for moving data from vector to general purpose registers, even for lane zero.

Actually, this is also true if the insert is fed by a load. In this case we can just directly load into the vector register. In my recent experience with SLP this seemed like a pretty important case which is changing with this patch. How does performance look?

This patch modifies the lane zero optimization so that it applies only to floating-point types. Additionally, we now fall back to the base TTI implementation for all other floating-point inserts and extracts. The existing sub-target specified insert/extract costs are used only for the cross-class moves, which I think was probably the original intent. Since the existing code looks like a bug to me, I checked the X86 target, and it implements something similar to what is in this patch.

What does this mean in terms of the change of cost? 3->1? Inserts are I think pretty expensive even within-class because they represent partial-writes. I was surprised to discover recently that extracts and inserts have the same costs.

Thanks,
Adam

Hi Adam,

Actually, this is also true if the insert is fed by a load. In this case we can just directly load into the vector register. In my recent experience with SLP this seemed like a pretty important case which is changing with this patch. How does performance look?

That's right. That actually should be true for all load/insert sequences of legal types, not just ones that insert into lane zero - we should generate LD1s for all lanes. So I think that could be an additional optimization, probably in a separate patch? Regarding performance, this is generally beneficial for our cores (Kryo, Falkor), but our base insert/extract cost (2) is already lower than the default (3), so the effect may be somewhat different.

What does this mean in terms of the change of cost? 3->1? Inserts are I think pretty expensive even within-class because they represent partial-writes. I was surprised to discover recently that extracts and inserts have the same costs.

The changes to the default costs can be seen in the new cost model test I added. Basically, for integer and pointer types, things stay the same at 3 (but now lane zero is also 3). For floating-point, lane zero stays the same at 0 (but now the other lanes are at 1). But I'm happy leaving the float-point non-zero lanes at 3 if you prefer, and considering them instead in a follow-on patch if necessary.

My priority here is fixing the "extracting from lane zero costs nothing" issue, which is not true for integer types. If this change is too scary all at once, we could start with that case and tackle the issues/optimizations one-by-one. What do you think?

anemet added a comment.May 8 2017, 2:32 PM

Hi Adam,

Actually, this is also true if the insert is fed by a load. In this case we can just directly load into the vector register. In my recent experience with SLP this seemed like a pretty important case which is changing with this patch. How does performance look?

That's right. That actually should be true for all load/insert sequences of legal types, not just ones that insert into lane zero - we should generate LD1s for all lanes. So I think that could be an additional optimization, probably in a separate patch?

Sounds good.

Regarding performance, this is generally beneficial for our cores (Kryo, Falkor), but our base insert/extract cost (2) is already lower than the default (3), so the effect may be somewhat different.

For perf changes like this, it would be great if we could have a more details analysis of the changed hotspots (like what I did for the 64-bit SLP). Opt-viewer is a great tool for this especially with opt-diff which will tell you the changes in SLP vectorization in the hotspots (if you run it with PGO). Unfortunately, I didn't have time to clean up my patches that add SLP vectorization remarks so you can't use it yet :(.

What does this mean in terms of the change of cost? 3->1? Inserts are I think pretty expensive even within-class because they represent partial-writes. I was surprised to discover recently that extracts and inserts have the same costs.

The changes to the default costs can be seen in the new cost model test I added. Basically, for integer and pointer types, things stay the same at 3 (but now lane zero is also 3). For floating-point, lane zero stays the same at 0 (but now the other lanes are at 1). But I'm happy leaving the float-point non-zero lanes at 3 if you prefer, and considering them instead in a follow-on patch if necessary.

These make sense to me.

I am wondering if a better abstraction would be if we just had a cost for cross-domain moves rather then using these magic values. I guess it doesn't matter if this is the only place where we have this logic.

My priority here is fixing the "extracting from lane zero costs nothing" issue, which is not true for integer types. If this change is too scary all at once, we could start with that case and tackle the issues/optimizations one-by-one. What do you think?

Yes, we should definitely commit this in two pieces.

evandro added a subscriber: az.May 8 2017, 3:41 PM

Hey Adam,

Thanks for the feedback. It sounds like we can start the with the "extracting from lane zero" piece and then follow up with patches for inserts fed by loads and then within-domain moves. Hopefully this sequence will minimize the affects of any temporary imprecision. And I will benchmark the changes with some of the hardware I have access to (Kryo, Falkor, Cortex-A57), and provide some analysis.

Thanks. I've added opt-remarks to SLP in rL302811. Hopefully you can use those for your analysis too.

mssimpso abandoned this revision.Jun 16 2017, 10:03 AM

I'm abandoning this change for now. The patch resulted in a few performance regressions that need to be investigated. Also, making the insert/extract cost more precise may require that we re-tune the default cost (3) as well.