This is an archive of the discontinued LLVM Phabricator instance.

AARCH64_BE load/store rules fix for ARM ABI
Needs ReviewPublic

Authored by akadlec on Feb 26 2014, 12:30 AM.

Details

Reviewers
t.p.northover
Summary

For Big Endian (BE) systems: Switch from LD1/ST1 loads to LDR/STR for NEON regs.
Apart from having better addressing modes and being specified in the ABI, LDR/STR do correct byte-swapping for BE, as opposed to the "element-swapping" taking place with LD1/ST1.

For Little Endian (LE), nothing changes in this step - although the shorter LDR/STR instructions should be enabled for LE as well - in LE, both instruction types do the same things and can be mixed.

For BE, initialization from literals must use vector load intrinsics - or the literals need to be rearranged before emit.

Diff Detail

Event Timeline

Hi Albrecht,

Thanks for working on this.

I think you've been a bit too liberal with your IsLE predicate, applying it to both patterns that you don't want to disable on BE (if I've understood properly) and to instruction definitions without any patterns (currently harmless, but pointless too).

I've also made one comment about the IsBE use. Is it really necessary?

Finally, there should definitely be regression tests for changes like this. Preferably for each pattern you're introducing or changing. (This is particularly important at the moment because (as Chris said) eventually we'd like to merge Apple's ARM64 LLVM port with this one, which will mean big changes everywhere).

Cheers.

Tim.

lib/Target/AArch64/AArch64InstrInfo.td
4856

Could we have capital letters at the start of sentences?

4858–4859

It'd probably be a good idea to refer people to the AAPCS here for more details.

4860

Why is this only eventually? Couldn't it be enabled now if it's got better addressing-mode properties?

4893

Commented code.

4901

If the CPU doesn't have FPARMv8 then f64 won't be a legal type and the DAG shouldn't contain any instances of it by this time.

4904

LLVM normally uses "FIXME" rather than "TODO". A consistent choice makes grepping a bit easier.

lib/Target/AArch64/AArch64InstrNEON.td
3391

There aren't any patterns in this multiclass so this is superfluous. If patterns *are* added, it's not clear that they'll be wrong for BE either: they could be the int_arm_neon_vldN version which you do want.

3888

No patterns and you probably *would* want any that existed since the layout issue doesn't exist for the duplicating loads.

3907

Ah, here they are. I think these patterns should be endian-independent.

3996

Loading a single lane is also layout independent (and these are not the patterns).

As they say for complex reviews: start early & iterate :-)

I forgot to add the bigger roadmap towards BE support

  1. disable LDx/STx
  2. regain matcher coverage by adding LDR/STR
  3. fix BE calling conventions to gain code correctness in BE (almost there internally)
  4. optionally enable LDR/STR for LE
  5. re-enable some LDn rules with extensive testing for nice interaction with STRed data structures

This patch covers 1-2, 3 is a prerequisite for 5 (inlining printf sucks)
Ideally these would be separate patches each - but matcher fails don't go down well.

YES, I've been liberal with the disabling - anything that looked dangerous had to go for now - for BE only.
I figured that single element duplicating loads may be fine.
But whether v8i8 or just 128bit elements (how about 64 bits??) are fine is still work in progress - needs testing.
Documentation is plentiful, and I haven't found what I'd really need, yet.

Then I added the new patterns conservatively for BE only - don't want to change LE code just now (we're comparing LE to BE, for example). Also not going to ruffle feathers of any LE guys (one pandora's box at the same time)

Right now, the focus is on getting BE to actually work (correctness first).
That needs another upcoming patch to the calling conventions - step 3) above.

Then we have correct code coverage, and we'll extend from there - within project limitations.

lib/Target/AArch64/AArch64InstrInfo.td
4856

Done

4858–4859

Done. the URL just so fits the 80cols limit :-)

4860
  1. we're using LE as reference - so trying not to change that, yet.
  2. people working on LE might oppose the code changes.

the comment was intended to start discussion that eventually leads to enabling.

4893

Yeah - I'm still wondering why there's a v1f64 non terminal, but no v1i32.
Any idea?
Symmetry would suggest, both should exist.

4901

ARM32 ended up having a few options for hard float units.

wasn't sure since there're other uses above - but not at all consistently.
E.g:
let Predicates = [HasFPARMv8] in {
def : Pat<(i32 (fp_to_sint f32:$Rn)), (FCVTZSws $Rn)>; ...

What do we do NOW ?
Cleanup (which way?) / leave as is (-> add guards for new code or not?) ?

4904

DONE

lib/Target/AArch64/AArch64InstrNEON.td
3391

True - It's more a reminder for the guy who adds patterns, that LDn/STn make trouble in BE, while it's fine in LE.
LE implementation is farther ahead - adding patterns without considering BE will be troublesome.
-> Shall I convert that to a comment ?

3888

True for the single-element replications below - first candidate for re-enabling.
Probably not true for the vector replicating loads here (if a single vector is loaded in reverse-element order, the duplication won't fix that ?)

3907

Need to do more reading to fully understand all details of the swapping - but single element reads should be fine, if they do internal byte-swapping within the element (-> v1x nonterminals and scalar nonterminals)

  • so conservatively "not yet" - i.e. until we have working code (CallingConv)
3996

need to check whether ld4ln ({1,2,3,4}) yields the same result on BE & LE

probably also depends on whether you feed that with an array (then yes) or a vector stored by STR (then probably NOT) -> pattern-dependent ?

akadlec updated this revision to Unknown Object (????).Feb 26 2014, 5:46 AM

changes from Tim's comment - PTAL

jmolloy requested changes to this revision.Feb 26 2014, 10:24 AM
jmolloy added inline comments.
lib/Target/AArch64/AArch64InstrInfo.td
4859

As the AAPCS is also not too clear on this, could you please add in the comment the *reason* ld1/st1 can't be used? More specifically than "wrong arg memory layout", it is because the LD1 performs lane-by-lane byte swapping, and LDR swaps the entire D/Q register.

4878

What does this mean? is it a paste from the comment below? is fp16 always available in a64? Also capitalization, as Tim said.

4879

Please add a FIXME and remove trailing "??"

4897

v1f64 is there because some instructions that act on f64's have both NEON and VFP variants, and in order that NEON intrinsics written by the user select the instruction that the user requested, there must be a way to distinguish between the two at the type level.

That is my understanding anyway - Hao or Jiangning will know more about this.

lib/Target/AArch64/AArch64InstrNEON.td
3383

This line of the comment doesn't make sense. LD1 is disallowed, but it has nothing to do with 12bit offset adds! You give the reasoning below.

Also in LLVM we have a tendency to write comments in a fairly prose-like form to make it easier to read. For example, instead of "LD1 disallowed in BE", "LD1 is disallowed in BE mode". Also instead of "reason: ", "This is because..."

3390

The comment says this should work in BE mode... then the predicate stops it working in BE mode. Why?

3461

Failed to reindent this line?

3888

Shouldn't a splatting LD1 still work in BE mode?

3908

This comment is cryptic. Also, why should we care about byte swapping here, we're under the IsLE predicate.

3931

ld1.64 should be fine too, right? Because ld1.64 acts the same as LDR (byte swapping on a 64-bit value).

3996

Will 1-element to 1-lane also work in BE mode?

Hi Albrecht,

  1. disable LDx/STx

I still think this is misguided; the lane and duplicating instructions
are very different beasts, without any suggestion (as far as I can
see) that they're problematic except that they share part of their
mnemonic with the dodgy ones.

Anything disabled should be because you can explain why it goes wrong,
not through guilt by association.

  1. optionally enable LDR/STR for LE

I think this should happen first. I realise your main concern is
big-endian support, but it's best not to get too bogged down in that
if there's an opportunity to improve other parts of LLVM at the same
time. Since it looks like these are more flexible generally, and
identical to ld1 for LE, I'd suggest a separate patch adding these
patterns for all both BE & LE (with tests, obviously).

I'd support that change from what I've heard. *If* people object, then
we can go back to this suggestion (reluctantly). I think that's
unlikely though.

Ideally these would be separate patches each - but matcher fails don't go down well.

Agreed, but I'm not too bothered by that personally. I'm not sure
where you're getting matcher fails with a partial patch though; that
might need looking into if it's in the regression tests.

Then I added the new patterns conservatively for BE only - don't want to change LE
code just now (we're comparing LE to BE, for example). Also not going to ruffle feathers
of any LE guys (one pandora's box at the same time)

I'd say you're more likely to do that by restricting all changes to BE
than by changing LE when it's a benefit to both.

Finally, the newest patch still doesn't have any regression tests. The
other stuff's definitely still up for debate, but those really are
essential.

Cheers.

Tim.

Hmm, uploading a partially updated patch mostly hides previous comments (unexpected for me).
Seems to have killed Tim's and my discussion on the inline-comments for the 1st diff.

So here's the open question again:

Is it ok to guard instruction defs with IsLE to hint that they might not be safe to use in BE ?
Or add a comment instead ?
If somebody adds a valid pattern (e.g. intrinsic) that predicate can be removed,
If somebody adds a pattern valid only for either BE or LE, the predicate must go there.

Fixes included in the next revison (today)

lib/Target/AArch64/AArch64InstrInfo.td
4859

Hmmm, somebody should tell ARM, so they can clarify that section ;-)

I'd written that to the beginning of the AArch64InstrNEON.td - but left it out again.
I'd still prefer to put that next to the NEON load-store instructions, where it belongs.
If you can bear with a larger paragraph of comments ...
DONE

4878

The fp* rules have already been there before I added the vector types. They were unguarded. I kept it that way until we clarify the necessity of these guards.
It's not unimaginable that other FP hardware might emerge besides neon vfp.
I don't know and even manufacturers don't see the future market demands, yet.

4879

I was waiting for a reply from Tim -> remove the comment and do the right thing whatever that is.

I since figured, that if the predicates exist I'd better use them - ignoring the non-consistent picture so far.

Better a fail to match in the presence of upstream errors, than blowing up in the face of a customer's customer.

4897

hmm I don't see the necessity for a separate nonterminal, there.
either
a) Intrinsics are matched directly - so no need there.
b) As long as the semantic isn't different, the user couldn't care less.
(-> pseudo instruction that can be lowered later as fits)

Nonterminals are data-types for registers.
Separate nonterminals are for different in-register representations, e.g. they might be handy for "reversed" vectors, together with the chain rules that do the element-swaps.
They're really handy for high'low positioning of values in register.
But that all requires a PBQP matcher that finds the optimal coverage for the function's DAG.

lib/Target/AArch64/AArch64InstrNEON.td
3383

right - was mixing the BE & LE issues, there
fixed by adding the whole ugly story, right at the start of the store section.
Wish we could separate the stores out from that 10.000 lines file. :-(

ad Comments: - that's probably why there were so many helpful comments in that file :-(
At many places I'd have preferred a short comment over none at all.

3390

first guess at step 5 in our roadmap - and being conservative.

Anyway - fixed, now that I had the time to think through it all.

3461

Done.

3888

At most the ones that only read one element and duplicate that.

The multi-element reads will have unexpected order (that struct was STRed!), so can only be used via intrinsics to read from arrays.

3908

That's the reason for the IsLE - because it doesn't work for BE.
Pulled the comments out.

Now I think it's still wrong, as the elements would be read in ascending address order "array-like", while they have been stored reversed by STR.

3931

yes

3996

added the following comment to the pattern and removed the predicate.

This will not work as intended in BE mode, if the matcher generates it to
load a vector to a lane. (STR q0 stored the elements swapped)
Must always use an intrinsic, so the user knows it's loading from an array
layout.

+let Predicates = [IsLE] in {

// Load single 1-element structure to all lanes of 1 register

James Molloy wrote:

Shouldn't a splatting LD1 still work in BE mode?

At most the ones that only read one element and duplicate that.

The multi-element reads will have unexpected order (that struct was STRed!), so can only be used via intrinsics to read from arrays.

I'm not sure I follow here. Struct's aren't short vectors, so their
layout is dictated by the normal C rules and I think they will have
the expected order on both little and big-endian machines. The example
I'm thinking of might be written as:

#include <arm_neon.h>
typedef struct { uint8_t r, g, b; } RGB;
uint8x8x3_t read(RGB *colours) {
  uint8x8x3_t result;
  result.val[0] = vdup_n_u8(colours->r);
  result.val[1] = vdup_n_u8(colours->g);
  result.val[2] = vdup_n_u8(colours->b);
  return result;
}

I think this would be best implemented as an ld3r on both big and
little-endian systems, and is the intended use of that instruction.

Could you give a snippet of either LLVM IR or C that you think we
might naively use ldNr for, but would be invalid on big-endian
systems? Just so I can get a better idea of what you're thinking of.

-defm LD1LN : LDN_Lane_BHSD<0b0, 0b0, "VOne", "ld1">;
+let Predicates = [IsLE] in {

+ // Load single 1-element structure to one lane of 1 register.

James Molloy wrote:

Will 1-element to 1-lane also work in BE mode?

added the following comment to the pattern and removed the predicate.

This will not work as intended in BE mode, if the matcher generates it to
load a vector to a lane. (STR q0 stored the elements swapped)
Must always use an intrinsic, so the user knows it's loading from an array
layout.

I don't believe this is true either. Consider the alternatives for the IR:

define <4 x i32> @foo(<4 x i32> %vec, i32* %addr) {
  %elt = load i32* %addr
   %newvec = insertelement <4 x i32> %vec, i32 %elt, i32 0
   ret <4 x i32> %newvec
}

This is the obvious, canonical situation where we'd want a pattern for
"ld1 (lane)". And indeed we generate "ld1 {v0.4s}[0], [x0]". But
what's the alternative if the ld1 is disabled? I strongly suspect
you'll find it's

ldr w0, [x0]
ins v0.4s[0], w0

which has exactly the same semantics.

I think the problem will actually come with the intrinsics, where we
probably want to generate this sequence from "vld1_lane_s32(addr, vec,
3)" but I'd strongly suggest approaching that from the front-end since
it should be mapping to that LLVM IR anyway.

Cheers.

Tim.

akadlec updated this revision to Unknown Object (????).Mar 3 2014, 12:19 PM

The promised updated patch including:
all single element LDn/STn allowed
LDR/STR for vector regs for BE & LE
LE tests fixed
BE tests added

PTAL

Hi Albrecht,

AAPCS64 requires to use LDR/STR only for short vectors defined in AAPCS64. The definition of short vector in AAPCS64 requires the monolithic alignment of the whole short vector rather than element alignment.

For some reason, LLVM compiler could generate element alignment short vector for storing array purpose. This type should be different from the short vector defined in AAPCS64. All of the instruction using this data type should fall into LD1/ST1. LD1/ST1 should not make difference for element ordering between LE/BE, and the only difference is the type ordering inside the element. We will be supporting this element alignment short vector access soon. Refer to the example inlined.

There are some other comments inlined.

Thanks,
-Jiangning

lib/Target/AArch64/AArch64InstrNEON.td
107

This comment is misleading. Every instruction should be valid for big-endian, although the same instruction can have different behaviors for LE/BE.

3362

How do we come across a case mixing the uses of LDR and LD1? If it's type casting, end-user should guarantee the correctness by program logic itself rather than by compiler.

3417

This is not the only case. Auto-vectorizer could generate element alignment short vector ld/st. For example, middle-end could generate

store <4 x i16> %val, <4 x i16>* %ptr, align 2

We should generate instruction like st1 v0.4h, [x0].

Unfortunately, we can't generate this instruction yet with trunk. We will get it fixed as soon as possible.

3424

Is this to disable LD1/LD2/LD3/LD4 for big-endian? If yes, why the test cases using those instructions can pass with big-endian configuration?

This piece of code is to define encodings, and LE/BE should always cover them. If we don't want to generate any instruction, we should control them with pattern match.

3483

ST1/ST2/ST3/ST4 essentially use aggregate short vector type like,

typedef struct int16x4x3_t {

int16x4_t val[3];

} int16x4x3_t;

which is defined in arm_neon.h.

With this data type, LE/BE should only make difference for the layout inside element int16. The data layout among different elements should be always the same.

Hi Jiangning,

I'm not sure I understand your comments. Do you mean ARM is intending to add C level types to ACLE & AAPCS that *will* behave as if loaded and stored with ld1/st1 soon?

Cheers.

Tim.

lib/Target/AArch64/AArch64InstrNEON.td
3362

The compiler was mixing them at will previously (e.g. storeRegToStackSlot uses str, but this address could escape and be used in a normal load which we'd use ld1 for). I believe Albrecht's comment is designed to warn against this, and I support it.

3417

I don't believe we're forced to generate either and there are arguments in favour of both, but being consistent is *very* important. As Albrecht said, we can't mix the two kinds of load/store.

I agree that using ld1/st1 exclusively would make LLVM's semantics easier to get right, but it would make getting the AAPCS right harder (bitcasts would become non-trivial operations and be needed at all potentially ABI-visible boundaries).

I suspect (but don't know) that the ldr/str route is capable of producing better code on average.

3424

The "IsBE" predicate is codegen-level rather than an AssemblerPredicate so MC tests won't be affected anyway. And there's only one CodeGen test mentioning them that's not based on intrinsics (which gets more substantial changes), so I think that part's OK.

Your comment about only applying IsBE to patterns is a good one though.

3483

I believe this is incorrect for the simple instructions. "ld1 {v0.4h, v1.4h}, [x0]" is equivalent to "ld1 {v0.4h}, [x0]; ld1 {v1.4h}, [x0, #8]" and different from "ldr d0, [x0]; ldr d1, [x0, #8]" on big-endian systems.

Hi Tim,

I'm not sure I understand your comments. Do you mean ARM is intending to add C level types to ACLE & AAPCS that *will* behave as if loaded and stored with ld1/st1 soon?

No, I didn't mean that. We should follow AAPCS64. AAPCS64 says,

"Elements in a short vector are numbered such that the lowest numbered element (element 0) occupies the lowest numbered bit (bit zero) in the vector and successive elements take on progressively increasing bit positions in the vector. When a short vector transferred between registers and memory it is treated as an opaque object. That is a short vector is stored in memory as if it were stored with a single STR of the entire register; a short vector is loaded from memory using the corresponding LDR instruction. On a little-endian system this means that element 0 will always contain the lowest addressed element of a short vector; on a big-endian system element 0 will contain the highest-addressed element of a short vector."

All these statements are talking about the short vector with total size alignment. However, for the LLVM IR, we have the case of element size alignment short vector, which should not simply fall into this category. It should be treated as an array of elements, and using ld1/st1 to completely match this semantic, and we don't have semantic difference for ld1/st1 between LE and BE except the data layout inside element.

For total size aligned short vector, ld1/st1 have the same semantics as ldr/str on little-endian. We prefer to use ldr/str because they have better addressing modes than ld1/st1. On big-endian, we should only use ldr/str to meet semantic requirement.

Thanks,
-Jiangning

lib/Target/AArch64/AArch64InstrNEON.td
3362

We should avoid mixing the use of ld1 and ldr. storeRegToStackSlot should decide to use ld1 or ldr by checking the alignment. If it is not an element alignment, but a whole short vector alignment, we should use ldr, while for other cases, we should use ld1. This way, we should be able to always keep endianess correctness and we should not have mixing issue.

3417

I don't think we're forced to generate either as well, but we should keep semantic correctness by choosing either in terms of alignment.

Actually we don't really violate AAPCS at all. AAPCS says, "A short vector has a base type that is the fundamental integral or floating-point type from which it is composed, but its alignment is always the same as its total size.".

If the memory address is not total size aligned, it is not a "short vector" definition in AAPCS. It should be treated as an array, which is usually generated from auto vectorizer, so we prefer to generate ld1/st1 for it.

3483

I don't think I meant ld1 and ldr have the same sementic between LE and BE systems. I agree with your statement. What I meant is ld1/st1 should always have the same semantic between LE and BE systems except the data layout inside the element. We should choose ldr or ld1 in terms of alignment on IR. If it is total size aligned access, we use ldr, and otherwise we use ld1.

Hi Jiangning,

I'm afraid I still can't quite see what you're proposing. First, are you sure you mean "alignment" in your post? If so, you seem to be advocating treating these two instructions differently:

%val = load <4 x i16>* %addr, align 8 ; gets ldr
%val = load <4 x i16>* %addr, align 2 ; gets ld1

My opinion is that would be madness, and almost impossible to produce a consistent code from. I'll try to think up some examples if you like, but just want to make sure I understand what you're saying first.

If that's not what you mean, could you give some IR examples and the code you'd like them to generate (particularly showing the distinctions)?

Cheers.

Tim.

Hi Tim,

I'm afraid I still can't quite see what you're proposing. First, are you sure you mean "alignment" in your post? If so, you seem to be advocating treating these two instructions differently:

%val = load <4 x i16>* %addr, align 8 ; gets ldr
%val = load <4 x i16>* %addr, align 2 ; gets ld1
My opinion is that would be madness, and almost impossible to produce a consistent code from. I'll try to think up some examples if you like, but just want to make sure I understand what you're saying first.

Yes. This is my point. Could you please give me some examples to articulate it is "madness". :-)

If we don't use ld1 for "align 2" case, which instruction we should use? ldr requires total size alignment, otherwise exception would be raised if strict alignment is enabled.

Thanks,
-Jiangning

Just a short comment since I don't work for Abix any more (due to differences in promised/actual payment) -> Christian, the other compiler guy at Abix will probably pick this up shortly.

@Jiangning:
alignment is a minimum-attribute of a type. The type can always be better-aligned, and

  1. there's a tendency to do that for many chips to better utilize memory bus cycles.
  2. unioning with a 128bit int will boost a vector's alignment (although you get lucky now in that it's not an HVA any more) -> Any such boost in alignment would suddenly have it stored in a different format (STR). Giving the address of the vector to a function has the function not knowing the actual alignment *) -> it will assume minimum alignment and load via LD1 - unless we have a clean type-separation between HVAs and array-like aggregates.

*) pointer-to-T args have to demote alignemnt to the minimum alignment of T, as its' sufficient that any passed argument might be aligned that low.

The bad example may not work exactly like this, but it's bound to be found eventually.
For me, relying on alignment alone is just asking for disaster:

It's not inconceivable, that someone might write an alignment analysis that tries to supply better-than guaranteed-for-the-type alignment to the backend in order to exploit aligned loads with lower memory bus resource usage (e.g. malloc/new -ed variables are typically 128bit aligned anyway).
For AARCH64 it might not be a bad idea to boost alignment for types similar to vector types anyway, if you want to get performance out of NEON (actually, I think that's what ARM tried to achieve with that definition).
It would be very strange to enforce higher alignment for HVAs and force array-type data to lower alignment, when LD1 also benefits from higher alignment.

Solution:
The frontend must give the backend a totally different type for short vectors (HVAs), so that the memory layouts cannot be mixed.
Then you can re-eanble LD1 for array type loads in BE.

Yes. This is my point.

Oh good, at least we're communicating properly!

Could you please give me some examples to articulate it is "madness". :-)

Well, take a look at the file I attached for example. Running "opt" on
it allows inlining and the real alignment of 8 is propagated to the
load. As a result, you'd get different results depending on
optimisation level (if we used "ld1 {v0.4h}").

If we don't use ld1 for "align 2" case, which instruction we should
use? ldr requires total size alignment, otherwise exception would
be raised if strict alignment is enabled.

If we decided to support strict alignment mode efficiently, we would
probably want to emit an "ld1 {v0.8b}" (i.e. always use the .8b or
.16b version), since that's got the same semantics as ldr. At the
moment neither gets emitted so it's not really a pressing issue (it
would be part of "support strict align" rather than "support
big-endian" in my view).

Cheers.

Tim.

  • {F47028, layout=link}

Hi Tim,
Hi Jiangning,

are you ok with committing the initial submission?

Thanks,
Christian

Hi Christian,

The original patch doesn't work any longer on trunk. Can this be merged
with trunk and sent out again?

Thanks,
-Jiangning

2014-03-31 20:31 GMT+08:00 Christian Pirker <cpirker@a-bix.com>:

Hi Tim,
Hi Jiangning,

are you ok with committing the initial submission?

Thanks,
Christian

http://llvm-reviews.chandlerc.com/D2884

Hi Christian,

I think the most recent patch was still too conservative (w.r.t. duplicating & lane operations) and intrusive (in the testing).

Cheers.

Tim.

Hi,

I restarted a new revision (D3345) so that I can patch new diffs.

Thanks,
Christian

jmolloy removed a subscriber: jmolloy.