The ARM/ARM64 AESE and AESD instructions have a built-in XOR as the first step of the instruction. Therefore, if the AES key operand is zero and the AES data operand was previously XORed with another value, the XOR and the AES instruction can be combined into a single instruction.
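In IR terms, the fold looks roughly like this (a sketch against the existing llvm.aarch64.crypto.aese intrinsic; the value names are illustrative):

Before:
  %xored = xor <16 x i8> %data, %key
  %res = call <16 x i8> @llvm.aarch64.crypto.aese(<16 x i8> %xored, <16 x i8> zeroinitializer)

After:
  %res = call <16 x i8> @llvm.aarch64.crypto.aese(<16 x i8> %data, <16 x i8> %key)

The same reasoning applies to llvm.aarch64.crypto.aesd for decryption.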
Diff Detail
Repository: rL LLVM
Event Timeline
I am not an expert in encryption, but how likely is it for 'key' to be zero?
Other than that, LGTM.
test/Transforms/InstCombine/AArch64/aes-intrinsics.ll:21 ↗ (On Diff #148138)
Would it be useful to add some negative tests in case the second operand is not all zero?
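For example, something along these lines (a hypothetical sketch, not taken from the diff — a nonzero constant key, so the xor must not be folded away):

define <16 x i8> @aese_key_not_zero(<16 x i8> %data, <16 x i8> %key) {
  %xored = xor <16 x i8> %data, %key
  %res = call <16 x i8> @llvm.aarch64.crypto.aese(<16 x i8> %xored, <16 x i8> <i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1>)
  ret <16 x i8> %res
}

declare <16 x i8> @llvm.aarch64.crypto.aese(<16 x i8>, <16 x i8>)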
Hi, I can try to give a little more background on the motivation behind this patch, and why the key might be zero.
I’m trying to cross-compile some code that uses x86 AES intrinsics to AArch64 with LLVM. However, the x86 AES instructions have slightly different semantics from the ARM instructions. One important difference is that x86 performs an XOR at the end of its AESENC/AESENCLAST instructions, whereas ARM does the XOR at the beginning of its AESE instruction.
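Roughly, in pseudo-notation (my paraphrase of the Intel and Arm reference manuals, so treat this as a sketch rather than the normative definition):

// x86: the round-key XOR comes last, and MixColumns is built in
AESENC(state, key)     = MixColumns(SubBytes(ShiftRows(state))) ^ key
AESENCLAST(state, key) = SubBytes(ShiftRows(state)) ^ key

// ARM: the round-key XOR comes first, and MixColumns is a separate instruction
AESE(state, key) = SubBytes(ShiftRows(state ^ key))
AESMC(state)     = MixColumns(state)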
To emulate the x86 AES intrinsics on AArch64, I defined some inline functions as drop-in replacements, to avoid rewriting the algorithm by hand. Here is what they look like:
#include <stdint.h>
#include <arm_neon.h>

typedef uint8x16_t __m128i; // __m128i is an x86 type, map it to an ARM NEON type

// https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_aesenc_si128&expand=221
static inline __m128i _mm_aesenc_si128(__m128i a, __m128i RoundKey) {
  return vaesmcq_u8(vaeseq_u8(a, (__m128i){})) ^ RoundKey;
}

// https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_aesenclast_si128&expand=222
static inline __m128i _mm_aesenclast_si128(__m128i a, __m128i RoundKey) {
  return vaeseq_u8(a, (__m128i){}) ^ RoundKey;
}
To skip the XOR performed by the ARM AESE instruction, I pass in a key value of zero, and then manually XOR the real round key at the end of the function. The only downside of this approach is that, when the inline functions are substituted into the algorithm, they leave an unnecessary EOR instruction whenever multiple rounds of AES encryption are done back to back. Consider the following test code:
// Performs 3 rounds of AES encryption
// Note: In proper 128-bit AES encryption, 10 rounds are used
__m128i aes_block(__m128i data, __m128i k0, __m128i k1, __m128i k2, __m128i k3) {
  data = data ^ k0;
  data = _mm_aesenc_si128(data, k1);
  data = _mm_aesenc_si128(data, k2);
  data = _mm_aesenclast_si128(data, k3);
  return data;
}
Compiles down to this:
0000000000000024 <aes_block>:
  24: 6e201c20  eor   v0.16b, v1.16b, v0.16b   ; EOR can be combined with AESE
  28: 6f00e401  movi  v1.2d, #0x0
  2c: 4e284820  aese  v0.16b, v1.16b
  30: 4e286800  aesmc v0.16b, v0.16b
  34: 6e221c00  eor   v0.16b, v0.16b, v2.16b   ; EOR can be combined with AESE
  38: 4e284820  aese  v0.16b, v1.16b
  3c: 4e286800  aesmc v0.16b, v0.16b
  40: 6e231c00  eor   v0.16b, v0.16b, v3.16b   ; EOR can be combined with AESE
  44: 4e284820  aese  v0.16b, v1.16b
  48: 6e241c00  eor   v0.16b, v0.16b, v4.16b
  4c: d65f03c0  ret
My goal with this patch is to teach LLVM how to eliminate some of these extra EOR instructions. With it applied, LLVM produces this:
0000000000000024 <aes_block>:
  24: 4e284820  aese  v0.16b, v1.16b
  28: 4e286800  aesmc v0.16b, v0.16b
  2c: 4e284840  aese  v0.16b, v2.16b
  30: 4e286800  aesmc v0.16b, v0.16b
  34: 4e284860  aese  v0.16b, v3.16b
  38: 6e241c00  eor   v0.16b, v0.16b, v4.16b
  3c: d65f03c0  ret