This is an archive of the discontinued LLVM Phabricator instance.

Generalize strided store pattern in interleave access pass
ClosedPublic

Authored by asbirlea on Aug 17 2016, 11:46 PM.

Details

Summary

This patch aims to generalize matching of strided store accesses to more general masks.
The more general rule is to have consecutive accesses within each lane of the stride:
[x, y, ..., z, x+1, y+1, ..., z+1, x+2, y+2, ..., z+2, ...]
and requires the starting element of each lane (x, y, ..., z) to be aligned.
However, the elements in the mask need not form a contiguous space; there may be gaps between lanes.
As before, undefs are allowed and filled in with adjacent element loads.
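The generalized rule can be modeled in Python as follows. This is a hypothetical sketch of the matching logic, not the actual C++ code in InterleavedAccessPass.cpp: a mask of length Factor * LaneLen matches if, within each lane, the defined elements are consecutive (undefs, modeled as -1, are skipped), with no constraint tying the lanes' start values together.

```python
UNDEF = -1  # models an undef element in a shufflevector mask

def matches_strided_store(mask, factor):
    # Split the mask into `factor` lanes; lane j holds elements
    # mask[j], mask[j + factor], mask[j + 2*factor], ...
    # Each lane must be internally consecutive once undefs are skipped;
    # lane starts are arbitrary, so gaps between lanes are allowed.
    if factor <= 0 or len(mask) % factor != 0:
        return False
    lane_len = len(mask) // factor
    for j in range(factor):
        lane = [mask[j + i * factor] for i in range(lane_len)]
        # Every defined element at lane position i must equal start + i
        # for a single start value per lane.
        starts = {v - i for i, v in enumerate(lane) if v != UNDEF}
        if len(starts) > 1:
            return False
    return True
```

For example, a factor-3 mask [0, 4, 8, 1, 5, 9, 2, 6, 10] matches even though the lane starts (0, 4, 8) are not a multiple of the lane length apart.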

Note this patch is not final, but I would like to get feedback on the approach.
There are still some pending TODOs.

Diff Detail

Repository
rL LLVM

Event Timeline

asbirlea updated this revision to Diff 68484.Aug 17 2016, 11:46 PM
asbirlea retitled this revision from to Generalize strided store pattern in interleave access pass.
asbirlea updated this object.
asbirlea added reviewers: HaoLiu, mssimpso.
asbirlea added subscribers: llvm-commits, delena, mkuper.
mssimpso edited edge metadata.Aug 19 2016, 1:35 PM

Hi Alina,

I think I understand this, but I just want to be sure I get how this differs from what we currently have before going further. Currently, we only match [x, y, ..., z, x+1, y+1, z+1, ...] where each y-x and each z-y equals the number of sub-elements for the given factor. Or said another way, if I create a list of all the x's followed by all the y's and then all the z's, the entire list would be consecutive. With your patch, the only requirement is that each sub-list be consecutive. Is this right?

The current approach was designed to match the shuffle patterns produced by the loop vectorizer. I'm curious to know where we are generating these more general patterns. Have you run across some code examples?

Also, another high level comment before I start looking at the details: you'll want to include some IR test cases as well (to be run with opt instead of llc).

Matt.

Hi Matt,

Thanks for looking to review this. Please find my answers below.

Hi Alina,

I think I understand this, but I just want to be sure I get how this differs from what we currently have before going further. Currently, we only match [x, y, ..., z, x+1, y+1, z+1, ...] where each y-x and each z-y equals the number of sub-elements for the given factor. Or said another way, if I create a list of all the x's followed by all the y's and then all the z's, the entire list would be consecutive. With your patch, the only requirement is that each sub-list be consecutive. Is this right?

That's right. Also, from my understanding, x is always 0, so all elements form one consecutive list which always starts at 0.
My first approach was actually to generalize this by just adding a prefix, removing the "starts with 0" restriction, and allowing a more general stride with gaps. But that still didn't cover all the test cases I came across, such as the example I added in "store_general_mask_factor4".
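To make the difference concrete, here is a Python illustration with hypothetical masks (factor 4, lane length 4; these are made-up values, not taken from the actual tests):

```python
# Lane j of a factor-4 mask consists of elements j, j+4, j+8, j+12.
def lane(mask, j, factor):
    return mask[j::factor]

# Old rule: lanes start at 0, LaneLen, 2*LaneLen, ... so concatenating
# lane 0, lane 1, ... yields one fully consecutive list starting at 0.
old_style = [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15]

# Generalized rule: each lane is internally consecutive, but the lane
# starts (here 4, 32, 16, 8) are arbitrary and may leave gaps.
general = [4, 32, 16, 8, 5, 33, 17, 9, 6, 34, 18, 10, 7, 35, 19, 11]
```

Only the first mask would have matched before this patch; both match under the generalized rule.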

To answer your question below, the use cases I'm looking at are generated by Halide (https://github.com/halide/Halide).
Halide generates LLVM IR and relies on LLVM's optimization pipeline and lowering, but it needs to generate explicit intrinsics (including strided loads and stores) for ARM and AArch64, because its patterns are not lowered to intrinsics by LLVM.
Since that approach predates the interleaved-access pass, it's quite understandable, but LLVM is more powerful now; I'm trying to make use of this and, in the process, cover the cases LLVM is missing.
For example, for strided loads the interleaved-access pass does cover the code patterns generated by Halide, so the "custom" intrinsic code generation in Halide will soon be removed. My goal is to improve the pass to make this happen for stores as well.
The tests I will add are actually simplified versions of what Halide is generating.

The current approach was designed to match the shuffle patterns produced by the loop vectorizer. I'm curious to know where we are generating these more general patterns. Have you run across some code examples?

Also, another high level comment before I start looking at the details: you'll want to include some IR test cases as well (to be run with opt instead of llc).

Agreed, the plan is to add more tests, including IR tests.

Matt.

mssimpso added inline comments.Aug 26 2016, 10:42 AM
lib/CodeGen/InterleavedAccessPass.cpp
159–164 ↗(On Diff #68484)

You should probably update this to define the more general pattern.

181–207 ↗(On Diff #68484)

This looks fairly reasonable to me, but the parts dealing with undef are pretty difficult to follow. I think some more high-level comments would help people better understand what's going on here.

asbirlea updated this revision to Diff 70479.Sep 6 2016, 3:33 PM
asbirlea edited edge metadata.

Address review comments regarding code comments.
Complete TODOs.
Requesting help on whether to include the TLI.misalignedAccess check, and on the correct way to do it.

asbirlea added inline comments.Sep 6 2016, 3:37 PM
lib/CodeGen/InterleavedAccessPass.cpp
400 ↗(On Diff #70479)

This is a part that I'm not sure is needed, and how to address it.

The goal was to check the alignment of each of the strides, i.e. BaseStoreAddress + StartingIncrementInStride, for each stride in [0, Factor).
The commented-out attempt has a series of problems and does not achieve this. Should this check exist, and what's the correct way to handle it?

Hi Alina,

Sorry for the delay. I'm not quite sure I understand this patch anymore. I'm adding Tim Northover (ARM/AArch64 code owner) as a reviewer to hopefully get this unstuck.

SG, thank you!

I'm going back to look at the alignment check this afternoon (that's the big commented out block).
I'd really like to understand why some basic alignment checks lead to ARM tests failing and not their AArch64 counterparts.

mssimpso added inline comments.Sep 13 2016, 1:51 PM
lib/CodeGen/InterleavedAccessPass.cpp
400 ↗(On Diff #70479)

I could be wrong about this, but I don't think you need to worry about alignment here. I'm not seeing how the memory behavior with this patch would be different than the current situation.

asbirlea added inline comments.Sep 13 2016, 4:09 PM
lib/CodeGen/InterleavedAccessPass.cpp
400 ↗(On Diff #70479)

I agree with you that there should be no significant difference from the current situation. There is one small difference though: before, there could be one misaligned access; now there may be Factor such accesses.
That's why I'd still like to understand the alignment issue: whether the check is needed or not, and in what form. Perhaps it would be better to have it in a separate patch, though.

asbirlea updated this revision to Diff 71376.Sep 14 2016, 9:52 AM

Remove comment block checking for alignment. Will revisit in a future patch.

asbirlea updated this revision to Diff 73312.Oct 3 2016, 11:34 AM

Pinging patch.

Also, working around the case when masks are larger than 16 elements.
This can happen now if Halide takes advantage of this pass for strided stores.
This is not the only use-case of larger shuffle masks, but the topic is beyond the scope of this patch.

asbirlea updated this revision to Diff 73332.Oct 3 2016, 1:35 PM

Minor edit of temporary variables.

rengolin added inline comments.Oct 14 2016, 7:36 AM
lib/CodeGen/InterleavedAccessPass.cpp
193 ↗(On Diff #73332)

Nit: ij is hard to follow; try lane or something more expressive. (This is not a matrix. :)

204 ↗(On Diff #73332)

This is really confusing. Can you factor the comparison elements out with expressive names, so the if becomes a comparison of obvious terms?

208 ↗(On Diff #73332)

PreviousMask is always used in conjunction with PreviousPos, so you don't need the mask to be signed and you can compare the pos in the block above and get rid of the static casts. Or you could have an additional boolean flag and make them both unsigned.

lib/Target/AArch64/AArch64ISelLowering.cpp
7243 ↗(On Diff #73332)

Nit. Use brackets here:

if (...) {
  ...
} else {
  ...
}
7247 ↗(On Diff #73332)

I don't get the - j here.

lib/Target/ARM/ARMISelLowering.cpp
13126 ↗(On Diff #73332)

Better to duplicate the comment, I think. These back-ends evolve at different paces.

test/CodeGen/ARM/arm-interleaved-accesses.ll
321 ↗(On Diff #73332)

Is this really guaranteed to reproduce? They don't seem connected to the PCS directly...

asbirlea updated this revision to Diff 74721.Oct 14 2016, 11:27 AM

Address comments.

asbirlea added inline comments.Oct 14 2016, 11:34 AM
lib/CodeGen/InterleavedAccessPass.cpp
193 ↗(On Diff #73332)

Renamed to Lane. I hope changing the NumSubElts to LaneLen makes more sense too.
I'm inclined to change it in the lowering files as well for consistency.

lib/Target/AArch64/AArch64ISelLowering.cpp
7247 ↗(On Diff #73332)

Assuming the mask starts with a few undefs, this computes what the start of the mask would be, based on the first non-undef value.
The computation is done first in the pass to make sure the start is a positive value (hence the correctness comment below that "StartMask cannot be negative").
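The "- j" computation can be modeled in Python like this (a sketch with hypothetical names, not the actual lowering code):

```python
UNDEF = -1  # models an undef shuffle mask element

def lane_start(lane):
    # If the lane begins with undefs, recover what its start must be from
    # the first non-undef value: the element at lane position j holds
    # start + j, so start = lane[j] - j (hence the "- j").
    for j, v in enumerate(lane):
        if v != UNDEF:
            return v - j
    return None  # an all-undef lane places no constraint on the start
```

For a lane [undef, undef, 6, 7], the first defined element 6 sits at position 2, so the start is 6 - 2 = 4, which is non-negative for any valid mask.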

test/CodeGen/ARM/arm-interleaved-accesses.ll
321 ↗(On Diff #73332)

I'm not sure about this TBH, and not sure how to verify it.
Should I replace it by a simple vst4.32 check?

rengolin edited edge metadata.Oct 14 2016, 12:37 PM

Hi Alina,

This is looking much better, thanks!

The code has a lot of undef handling, but not much in the way of testing it. I think we should have at least the following:

  • one and two undefs in the middle
  • one undef at the beginning and one at the end
  • all undefs in one lane
  • one undef in each lane, at different positions

Repeating the pattern of your current tests but adding undefs should be enough.
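The suggested undef placements could be derived from an existing mask roughly like this (hypothetical helper and base mask for illustration; factor 3, lane length 4, so lane j occupies indices j, j+3, j+6, j+9):

```python
UNDEF = -1
base = [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]

def with_undefs(mask, positions):
    out = list(mask)
    for p in positions:
        out[p] = UNDEF
    return out

two_mid   = with_undefs(base, [5, 6])              # two undefs in the middle
both_ends = with_undefs(base, [0, len(base) - 1])  # one at the beginning, one at the end
lane1_all = with_undefs(base, [1, 4, 7, 10])       # all of lane 1 undef
scattered = with_undefs(base, [0, 4, 8])           # one undef per lane, different positions
```

Each variant still describes the same interleaved store, so all of them should be accepted by the pass.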

cheers,
--renato

lib/CodeGen/InterleavedAccessPass.cpp
193 ↗(On Diff #73332)

Nice, much better! I agree with renaming the lowering code, too.

188 ↗(On Diff #74721)

Better to declare I and J inside the for declaration.

200 ↗(On Diff #74721)

A comment here would help...

// If both defined, only sequential values allowed
216 ↗(On Diff #74721)

What about the case where the first lane is undef, but the others aren't?

226 ↗(On Diff #74721)

Instead of using a SavedNonUndef above, you could save the last non-undef value and the number of undefs since that value. That'd make the next-value computation easier:

if (NextValue != SavedValue + NumUndefs)
  break;

and also help get the StartMask here, for free.
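The suggestion can be sketched in Python (hypothetical names mirroring the review comment; the real pass is C++):

```python
UNDEF = -1

def check_lane(lane):
    # Track the last defined value and the number of undefs seen since it;
    # each new defined value must continue the sequence.  Note the
    # off-by-one the follow-up reply discusses: the expected next value is
    # saved_value + num_undefs + 1.
    saved_value = None
    num_undefs = 0
    start = None
    for j, v in enumerate(lane):
        if v == UNDEF:
            num_undefs += 1
            continue
        if saved_value is not None and v != saved_value + num_undefs + 1:
            return None  # lane is not consecutive
        saved_value, num_undefs = v, 0
        if start is None:
            start = v - j  # recovers the lane start even after leading undefs
    return start
```

Here the lane start (i.e. the StartMask contribution) indeed falls out for free from the same bookkeeping.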

237 ↗(On Diff #74721)

"Found an interleaved..."

lib/Target/AArch64/AArch64ISelLowering.cpp
7247 ↗(On Diff #73332)

Right, I agree you could repeat the naming pattern above, here.

test/CodeGen/ARM/arm-interleaved-accesses.ll
321 ↗(On Diff #73332)

Something like:

vst4.32 {d{{\n+}}, d{{\n+}}, d{{\n+}}, d{{\n+}}}, [r0]

would do. (I'm not sure of the triple brackets there...)

asbirlea updated this revision to Diff 74748.Oct 14 2016, 4:22 PM
asbirlea edited edge metadata.

Address comments. One pending.

Great point on the lack of testing. I also wasn't happy with the coverage the vst4/st4 tests had, so I added a pattern for vst3/st3 that covers undefs in the middle of a lane.

lib/CodeGen/InterleavedAccessPass.cpp
188 ↗(On Diff #74721)

There's a check on I and J following each loop. I could add an additional flag to check that we broke out of the loop early, but it seemed overkill to do that when I and J could be used if declared outside the loop.

216 ↗(On Diff #74721)

Nothing wrong with that (unless I'm missing something).
It'll check correctness for the lanes that follow, and the first one will receive a value based on the following values; that's the start mask value.

226 ↗(On Diff #74721)

I'm still looking into this one.
I can do without SavedNonUndef and update the condition to something like "SavedLaneValue + SavedNoUndefs (+1)".
That needs an additional if clause in the loop to increment SavedNoUndefs, and at least one more check to help compute the mask. The second check is needed because right now I only store SavedLaneValue when a value is followed by an undef, but at the end of the mask it would need updating too, to get the correct StartMask as something like SavedLaneValue + SavedNoUndefs - LaneLen (+/- 1).
Right now I find it easier to just compute the StartMask in the same j loop.
So, yeah, I'm still looking for the cleanest way to do this.

asbirlea updated this revision to Diff 75329.Oct 20 2016, 11:47 AM

Address remaining comment. Add additional testcases.

asbirlea updated this revision to Diff 75332.Oct 20 2016, 11:54 AM

[clang-format]

Gentle ping.

Re-pinging patch.

Pinging again.

rengolin accepted this revision.Dec 13 2016, 7:31 AM
rengolin edited edge metadata.

Hi,

Sorry to keep you waiting; this completely fell off my radar.

I think the code looks good now; we just need to make sure the test is generic enough on the CHECK line (see inline comment).

cheers,
--renato

test/CodeGen/AArch64/aarch64-interleaved-accesses.ll
285 ↗(On Diff #75332)

Please, also use:

st4.32 {d{{\n+}}, d{{\n+}}, d{{\n+}}, d{{\n+}}}, [x0]

here.

This revision is now accepted and ready to land.Dec 13 2016, 7:31 AM
asbirlea updated this revision to Diff 81255.Dec 13 2016, 10:25 AM
asbirlea edited edge metadata.

Update aarch64 test.

Thank you for the review, Renato!

This revision was automatically updated to reflect the committed changes.