This is an archive of the discontinued LLVM Phabricator instance.

[ARM][MVE] vcreateq lane ordering for big endian
ClosedPublic

Authored by tmatheson on Apr 30 2021, 1:36 AM.

Details

Summary

Use of bitcast resulted in lanes being swapped for vcreateq with big
endian. Fix this by using vreinterpret. No code change for little
endian. Adds IR lit test.

For example, the following code will print different results for big and little endian:

#include <arm_mve.h>
extern "C" {
void printf(const char *, ...);
}


int main() {
    int16x8_t x = vcreateq_s16(0x0003000200010000, 0x0007000600050004);

    printf("8x16 lanes:\n");
    printf("%d:%d\n", 0, vgetq_lane_s16(x, 0));
    printf("%d:%d\n", 1, vgetq_lane_s16(x, 1));
    printf("%d:%d\n", 2, vgetq_lane_s16(x, 2));
    printf("%d:%d\n", 3, vgetq_lane_s16(x, 3));
    printf("%d:%d\n", 4, vgetq_lane_s16(x, 4));
    printf("%d:%d\n", 5, vgetq_lane_s16(x, 5));
    printf("%d:%d\n", 6, vgetq_lane_s16(x, 6));
    printf("%d:%d\n", 7, vgetq_lane_s16(x, 7));

    int8x16_t y = vcreateq_s8(0x0706050403020100, 0x0f0e0d0c0b0a0908);
    printf("16x8 lanes:\n");
    printf("%d:%d\n", 0, vgetq_lane_s8(y, 0));
    printf("%d:%d\n", 1, vgetq_lane_s8(y, 1));
    printf("%d:%d\n", 2, vgetq_lane_s8(y, 2));
    printf("%d:%d\n", 3, vgetq_lane_s8(y, 3));
    printf("%d:%d\n", 4, vgetq_lane_s8(y, 4));
    printf("%d:%d\n", 5, vgetq_lane_s8(y, 5));
    printf("%d:%d\n", 6, vgetq_lane_s8(y, 6));
    printf("%d:%d\n", 7, vgetq_lane_s8(y, 7));
    printf("%d:%d\n", 8, vgetq_lane_s8(y, 8));
    printf("%d:%d\n", 9, vgetq_lane_s8(y, 9));
    printf("%d:%d\n", 10, vgetq_lane_s8(y, 10));
    printf("%d:%d\n", 11, vgetq_lane_s8(y, 11));
    printf("%d:%d\n", 12, vgetq_lane_s8(y, 12));
    printf("%d:%d\n", 13, vgetq_lane_s8(y, 13));
    printf("%d:%d\n", 14, vgetq_lane_s8(y, 14));
    printf("%d:%d\n", 15, vgetq_lane_s8(y, 15));
    return 0;
}

For little endian (correct) (clang++ -target arm-none-none-eabi -march=armv8.1-m.main+mve.fp -O0):

8x16 lanes:
0:0
1:1
2:2
3:3
4:4
5:5
6:6
7:7
16x8 lanes:
0:0
1:1
2:2
3:3
4:4
5:5
6:6
7:7
8:8
9:9
10:10
11:11
12:12
13:13
14:14
15:15

For big endian (incorrect) (clang++ -target arm-none-none-eabi -march=armv8.1-m.main+mve.fp -O0 -mbig-endian):

8x16 lanes:
0:3
1:2
2:1
3:0
4:7
5:6
6:5
7:4
16x8 lanes:
0:7
1:6
2:5
3:4
4:3
5:2
6:1
7:0
8:15
9:14
10:13
11:12
12:11
13:10
14:9
15:8

This patch brings the big endian output in line with the little endian output.

Bitcast documentation: https://llvm.org/docs/LangRef.html#bitcast-to-instruction
Helium intrinsics documentation: https://developer.arm.com/architectures/instruction-sets/simd-isas/helium/helium-intrinsics

Diff Detail

Event Timeline

tmatheson created this revision.Apr 30 2021, 1:36 AM
tmatheson requested review of this revision.Apr 30 2021, 1:36 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 30 2021, 1:36 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript
tmatheson edited reviewers, added: dmgreen; removed: greened.Apr 30 2021, 1:40 AM

Not sure yet.

clang/test/CodeGen/arm-mve-intrinsics/admin.c
90

Surely there is a problem here also?

158

And a problem here also (with BE)?

tmatheson retitled this revision from [ARM] vcreateq lane ordering for big endian to [ARM][MVE] vcreateq lane ordering for big endian.Apr 30 2021, 1:55 AM
tmatheson edited the summary of this revision. (Show Details)

Sounds good to me.

Whilst we are here, are any of the other uses of bitcast in arm_mve.td potentially a problem? I took a quick look and because they both converting the inputs and the outputs, I believe they will be OK. (Two wrongs make a right, if you will).

clang/test/CodeGen/arm-mve-intrinsics/admin.c
1–2

Is this updated with update_cc_test_checks?

It may make the output more verbose, but it will be more standard.

90

I don't see why these would be a problem. Can you elaborate?

MarkMurrayARM added inline comments.Apr 30 2021, 2:04 AM
clang/test/CodeGen/arm-mve-intrinsics/admin.c
90

I'm wondering if they need to be swapped in the BE case.

tmatheson added inline comments.Apr 30 2021, 2:35 AM
clang/test/CodeGen/arm-mve-intrinsics/admin.c
90

vcreateq is not endianness aware, it just inserts the two given 64 bit values a and b into the low and high lanes respectively. The bit representation of each 64 bit int will be different but that is not shown here. Therefore the IR is the same for big and little endian.

I have also confirmed locally with runtime output:

uint64x2_t w = vcreateq_u64(0x0000000000000001, 0x0000000000000002);
printf("%d:%llu\n", 0, vgetq_lane_u64(w, 0));
printf("%d:%llu\n", 1, vgetq_lane_u64(w, 1));

which gives for both little and bit endian (with this patch):

0:1
1:2
158

See above

tmatheson updated this revision to Diff 341818.Apr 30 2021, 2:43 AM
tmatheson edited the summary of this revision. (Show Details)

Use update_cc_test_checks

dmgreen added inline comments.Apr 30 2021, 2:46 AM
clang/test/CodeGen/arm-mve-intrinsics/admin.c
86

You have to remove the old checks - the script isn't very good at that.

What would probably be even better would be if it used --check-prefixes=CHECK,CHECK-LE. That way it should be able to common the snippets that don't change between LE and BE.

tmatheson updated this revision to Diff 341819.Apr 30 2021, 2:48 AM

remove old check lines that were not automatically removed

tmatheson updated this revision to Diff 341823.Apr 30 2021, 2:55 AM

Use --check-prefixes=CHECK,CHECK-BE etc to combine common blocks.
Sorry for the churn.

dmgreen accepted this revision.Apr 30 2021, 2:58 AM

Thanks for the updates. LGTM.

This revision is now accepted and ready to land.Apr 30 2021, 2:58 AM
tmatheson marked 2 inline comments as done.Apr 30 2021, 3:00 AM

Sounds good to me.

Whilst we are here, are any of the other uses of bitcast in arm_mve.td potentially a problem? I took a quick look and because they both converting the inputs and the outputs, I believe they will be OK. (Two wrongs make a right, if you will).

I had a look and came to the same conclusion, I couldn't find any way to make them break. Worth noting that they are all converting between vectors with the same number of lanes, e.g. typically between the signed and unsigned versions of NxM vectors.

This revision was automatically updated to reflect the committed changes.