Currently, loading from or storing to a stack location with a structured load
or store crashes in isAArch64FrameOffsetLegal, as the opcodes are not handled by
getMemOpInfo. This patch adds the opcodes for structured load/store instructions
with an immediate index to getMemOpInfo and getLoadStoreImmIdx, setting appropriate
values for the scale, width and min/max offsets.
Hi @kmclaughlin, this looks like a nice fix! I've got a few comments so far about the load tests. Some of the comments about choosing the right size for the alloca probably apply to the store tests too.
llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
  Line 2935: nit: I think you can just write TypeSize::Scalable(32) and similarly for the others.
llvm/test/CodeGen/AArch64/ldN-reg-imm-alloca.ll
  Line 11: I'm not sure if this is really exercising the code path you've changed, because the stack pointer looks wrong here, i.e. we've not actually allocated anything and are just loading random data. I think it's because nothing got stored into the alloca and so it was optimised away. I wonder if it's worth adding some kind of normal store before the load, i.e. something like:
    store [16 x i8] zeroinitializer, [16 x i8]* %alloc, align 16
  Line 28: I think that all the allocas here have to at least match the size of the data you're storing out. The test above has the same problem too, I think. I think for all these tests you can just do:
    %alloc = alloca [NumVecs * 32 x i8], i32 1, align 16
  I chose '32' here because you've set vscale_range to (2,2). For ld2 NumVecs=2, ld3 NumVecs=3, etc.
  Line 72: Again, here you're indexing into somewhere bigger than the alloca. You need to allocate at least 3x the space required for <vscale x 32 x i8> here.
  Line 170: I think this is a valid test, but perhaps worth adding a comment on why it's useful? It's because this exercises the path where isAArch64FrameOffsetLegal returns a non-zero stack offset due to the extra alloca.
llvm/test/CodeGen/AArch64/ldN-reg-imm-alloca.ll
  Line 6: Tests that rely on vscale_range become more readable if they have the vscale_range specified directly with the definition. If the test file is long (like this one), the #<some number> makes it really obscure what vscale_range it refers to. I fixed something similar in 11cea7e5ce4d3f6a0d2fac016d503f99c52cdc96.
- Ensure the correct amount of space is allocated in each test by increasing the size of the allocas.
- Added a store to each of the tests to ensure the allocas aren't optimised away.
- Moved vscale_range into the definitions of the tests.
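As an illustration of that last change, inlining vscale_range on the definition looks roughly like the following. The function name, signature and body here are hypothetical, not taken from the actual test file; the two definitions show the before/after alternatives, not a single valid module.

```llvm
; Before: the attribute is hidden behind an attribute group, so a reader
; has to scroll to the bottom of the file to find the vscale_range value.
define void @ld2b_example(<vscale x 16 x i1> %pg, i8* %addr) #0 {
  ret void
}
attributes #0 = { nounwind vscale_range(2,2) }

; After: vscale_range is spelled out directly on the definition.
define void @ld2b_example(<vscale x 16 x i1> %pg, i8* %addr) nounwind vscale_range(2,2) {
  ret void
}
```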
This is looking good now @kmclaughlin! I just had a few more minor comments on the tests ...
llvm/test/CodeGen/AArch64/ldN-reg-imm-alloca.ll
  Line 184: I think this also needs to be: %alloc = alloca [48 x i16], i32 4
  Line 205: This needs to be: %alloca1 = alloca <vscale x 4 x float>, i32 12 due to the GEP index (9) + the fact you're actually loading <vscale x 12 x float> worth of data.
  Line 314: This needs to be: %alloca1 = alloca <vscale x 2 x double>, i32 13 due to the GEP offset (9) + the data size loaded.
llvm/test/CodeGen/AArch64/stN-reg-imm-alloca.ll
  Line 19: Hi @kmclaughlin, I'm afraid I may have sent you down a pointless path here, and that's my fault for not explaining this more clearly! For the structured stores you don't actually need the extra store here, because the alloca cannot be folded away, since the structured store already prevented that. If it's not too much trouble, I think it's worth removing these extra stores to make the tests simpler.
  NOTE: The only exception to this is where you have two allocas in the test, in which case having a store for each alloca is right!
  Line 89: I think this needs to be: %alloc = alloca [8 x i64], i32 4 to account for the GEP offset + store data.
  Line 345: I think this needs to be: %alloc = alloca [32 x i32], i32 2
  Line 373: I think this needs to be: %alloca1 = alloca <vscale x 2 x double>, i32 8
- Removed unnecessary stores from tests in stN-reg-imm-alloca.ll which use only one alloca.
- Increased the number of elements for allocas in a number of tests in which a gep was attempting to access data beyond the allocated space.
LGTM! Thanks for making the changes @kmclaughlin.
llvm/test/CodeGen/AArch64/sve-fixed-ld2-alloca.ll
  Line 2: nit: I think maybe we don't need this file anymore, since all the various loads/stores are now covered in the other tests?
llvm/test/CodeGen/AArch64/ldN-reg-imm-alloca.ll
  Line 15: Looking at the code changes, I actually don't think that the fixed-width property (for the alloca and vscale_range) matters for these tests. I think you only really need the scalable tests.
  Line 97: Can you make sure that all these tests have a positive test for the maximum range, and a negative test for one beyond the maximum range?
  Lines 201–202: nit: I'd recommend adding the nounwind attribute to avoid the CFI info, as it's not relevant to the test.
  Line 318: alloca2 is unused? (also true for other cases)
llvm/test/CodeGen/AArch64/sve-fixed-ld2-alloca.ll
  Line 2: Perhaps we can keep this one test if the other fixed-width tests are removed, just to ensure we cover that case too (even if it doesn't necessarily exercise anything specific in this patch that isn't already tested otherwise). If so, then I'd recommend expressing it with an explicit st2 intrinsic instead of a shufflevector+store.
  Line 3: Redundant if vscale_range is set.
llvm/test/CodeGen/AArch64/ldN-reg-imm-alloca.ll
  Line 318: This is something I asked @kmclaughlin to do, because it's the only way to expose some of the code changes in this patch. All the tests ending in _valid_imm do this for that reason. If you look at isAArch64FrameOffsetLegal, we return a StackOffset, which is always zero for all tests in this file except ones like this. Having a non-zero StackOffset helped to ensure we were calculating the remainder/offset correctly using the Scale property set in getMemOpInfo. We can remove the test, but I'm worried we're not fully testing the changes, that's all. For example, in ld3b_f32_valid_imm you'll notice the addvl just before the ld3b, which happens precisely because StackOffset is non-zero.
llvm/test/CodeGen/AArch64/ldN-reg-imm-alloca.ll
  Line 318: I assumed that was what the gep was for. Maybe it's because of how this is written. If you write:

    %alloca1 = alloca <vscale x 64 x double>, align 4
    %alloca1.bc = bitcast <vscale x 64 x double>* %alloca1 to <vscale x 2 x double>*
    %base = getelementptr <vscale x 2 x double>, <vscale x 2 x double>* %alloca1.bc, i64 28, i64 0
    %ld4 = call <vscale x 8 x double> @llvm.aarch64.sve.ld4.nxv8f64(<vscale x 2 x i1> %pg, double* %base)

  then that results in:

    ld4d { z0.d, z1.d, z2.d, z3.d }, p0/z, [sp, #28, mul vl]

  Whereas:

    %alloca1 = alloca <vscale x 64 x double>, align 4
    %alloca1.bc = bitcast <vscale x 64 x double>* %alloca1 to <vscale x 2 x double>*
    %base = getelementptr <vscale x 2 x double>, <vscale x 2 x double>* %alloca1.bc, i64 32, i64 0
    %ld4 = call <vscale x 8 x double> @llvm.aarch64.sve.ld4.nxv8f64(<vscale x 2 x i1> %pg, double* %base)

  results in:

    <x8 = calculations for sp + 28 * sizeof(VL)>
    ld4d { z0.d, z1.d, z2.d, z3.d }, p0/z, [x8]
llvm/test/CodeGen/AArch64/ldN-reg-imm-alloca.ll
  Line 318: Sure, I'd be happy with that if it works and @kmclaughlin can see it leads to the non-zero StackOffset - if we can avoid the second alloca then all the better!
- Changed fixed-width allocas used in the tests to scalable.
- Added tests for offsets which are at the min/max range & tests outside the min/max range.
- Added the nounwind attribute to all tests.
- Changed the tests for non-zero offsets to remove the second alloca.
Hi @kmclaughlin thanks for updating the tests to work only on scalable types. I think the tests are still a bit inconsistent, and I've tried to highlight some of the inconsistencies in one of the files. Perhaps iteratively updating this file will just make things more complicated, so maybe it's easier to start with a clean slate and test:
- ld2b valid min offset
- ld2b valid max offset
- ld2b one below min offset
- ld2b one beyond max offset
- ld2h valid min offset
- ld2h valid max offset
- ld2h one below min offset
- ld2h one beyond max offset
- ...
and doing that also for ld3, ld4, st2, st3 and st4.
llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
  Line 2926: nit: remove redundant newlines here and below.
llvm/test/CodeGen/AArch64/ldN-reg-imm-alloca.ll
  Line 2: These files are all specific to SVE; can you prefix them with sve-?
  Line 7: vscale_range can now be removed.
  Line 19: What value do these stores add to these tests?
  Lines 24–39: Personally I don't see a lot of value in tests where it's indexing into the alloca at offset 0.
  Line 76: ld2b for loading f16's? (I see the same in other tests)
  Line 326: I'm a bit confused why you're mixing tests for floating-point and integer element types. We can just use only integer types, because loads/stores don't distinguish the types. It is not relevant for these tests, since they're trying to test getMemOpInfo (as opposed to, say, ISel).
- After some discussion about the tests in this patch offline, I have removed ldN-reg-imm-alloca.ll & stN-reg-imm-alloca.ll in favour of adding mir tests.
- Removed newlines introduced in AArch64InstrInfo.cpp.
Hi @kmclaughlin, thanks for updating the tests as MIR tests, these look really good now! I've still requested changes on the patch, because the scaling doesn't look right yet. It seems like this offsets the base pointer with the wrong value/offset. It should be pretty trivial to fix that though.
llvm/test/CodeGen/AArch64/sve-ldN.mir
  Lines 94–95 (On Diff #408443): I don't think this is correct, as I expect this to be a total offset of ptr + 2 + 14, i.e. an ld2w instruction loads 2 vectors' worth of data. The immediate is a multiple of 2, so the maximum immediate value to load from is ptr + 14 * sizeof(vector). When we load 2 * sizeof(vector) beyond that, I would expect it to load from ptr + 16, not ptr + 18.
  It depends on where the scaling is done. In this case, I think the scaling for this operand should be TypeSize::Scalable(16), because the immediate is already expected to be scaled (it is already a multiple of 2/3/4). The encoding of the immediate will then divide the offset by 2, 3 or 4. If we would have implemented the operand as an operand that is not yet scaled, then we'd need to do the scaling here, as well as in the operand printer.
- Changed Scale to TypeSize::Scalable(16) for all opcodes added to getMemOpInfo, fixing incorrect scaling when the immediate is out of range
- Reverted the previous changes which set Scale to TypeSize::Scalable(16) for all opcodes.
- Corrected the Min & Max values added to getMemOpInfo, as these should be the indices -8 to 7 for all structured loads & stores.