The ARM/ARM64 AESE and AESD instructions perform a built-in XOR as the first step of the instruction. Therefore, if the AES key is zero and the AES data was previously XORed with the real key, the XOR and the AES instruction can be combined into a single instruction.
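In LLVM IR terms, the fold looks roughly like this (a minimal sketch using the existing llvm.aarch64.crypto.aese intrinsic; the function names are illustrative, and the same reasoning applies to aesd):

```llvm
declare <16 x i8> @llvm.aarch64.crypto.aese(<16 x i8>, <16 x i8>)

; Before the combine: an explicit xor feeds aese with an all-zero key,
; so aese's built-in XOR is a no-op.
define <16 x i8> @aese_xor_zero_key(<16 x i8> %data, <16 x i8> %key) {
  %xored = xor <16 x i8> %data, %key
  %res = call <16 x i8> @llvm.aarch64.crypto.aese(<16 x i8> %xored, <16 x i8> zeroinitializer)
  ret <16 x i8> %res
}

; After the combine: the xor is absorbed into aese's built-in XOR,
; lowering to a single AESE instruction.
define <16 x i8> @aese_combined(<16 x i8> %data, <16 x i8> %key) {
  %res = call <16 x i8> @llvm.aarch64.crypto.aese(<16 x i8> %data, <16 x i8> %key)
  ret <16 x i8> %res
}
```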
Event Timeline
I am not an expert in encryption, but how likely is it for 'key' to be zero?
Other than that, LGTM.
test/Transforms/InstCombine/AArch64/aes-intrinsics.ll, line 22: Would it be useful to add some negative tests in case the second operand is not all zero?
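For instance, such a test could look like this (an illustrative sketch, not taken from the patch; the function name is hypothetical):

```llvm
declare <16 x i8> @llvm.aarch64.crypto.aese(<16 x i8>, <16 x i8>)

; Negative test: the second operand of aese is not all-zero, so folding
; the xor into it would change the result and must not happen.
define <16 x i8> @aese_xor_nonzero_key(<16 x i8> %data, <16 x i8> %key, <16 x i8> %key2) {
  %xored = xor <16 x i8> %data, %key
  %res = call <16 x i8> @llvm.aarch64.crypto.aese(<16 x i8> %xored, <16 x i8> %key2)
  ret <16 x i8> %res
}
```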
Hi, I can try to give a little more background on the motivation behind this patch, and why the key might be zero.
I'm trying to cross-compile some code that uses x86 AES intrinsics on AArch64 with LLVM. However, the x86 AES instructions have slightly different semantics from the ARM instructions. One important difference is that x86 performs an XOR at the end of its AESENC/AESENCLAST instructions, whereas ARM performs the XOR at the beginning of its AESE instruction.
To emulate the x86 AES intrinsics on AArch64, I defined some inline functions as drop-in replacements, to avoid rewriting the algorithm by hand. Here is what they look like:
```c
#include <stdint.h>
#include <arm_neon.h>

typedef uint8x16_t __m128i; // __m128i is an x86 type, map it to ARM neon type

// https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_aesenc_si128&expand=221
static inline __m128i _mm_aesenc_si128(__m128i a, __m128i RoundKey)
{
    return vaesmcq_u8(vaeseq_u8(a, (__m128i){})) ^ RoundKey;
}

// https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_aesenclast_si128&expand=222
static inline __m128i _mm_aesenclast_si128(__m128i a, __m128i RoundKey)
{
    return vaeseq_u8(a, (__m128i){}) ^ RoundKey;
}
```
To skip the XOR performed by the ARM AESE instruction, I pass in a key value of zero and then manually XOR the real key at the end of the function. The only downside to this approach is that when the inline functions are substituted into the algorithm, an unnecessary EOR instruction is left behind between back-to-back rounds of AES encryption. Consider the following test code:
```c
// Performs 3 rounds of AES encryption
// Note: In proper 128-bit AES encryption, 10 rounds are used
__m128i aes_block(__m128i data, __m128i k0, __m128i k1, __m128i k2, __m128i k3)
{
    data = data ^ k0;
    data = _mm_aesenc_si128(data, k1);
    data = _mm_aesenc_si128(data, k2);
    data = _mm_aesenclast_si128(data, k3);
    return data;
}
```
It compiles down to this:
```
0000000000000024 <aes_block>:
  24: 6e201c20  eor   v0.16b, v1.16b, v0.16b  ; EOR can be combined with AESE
  28: 6f00e401  movi  v1.2d, #0x0
  2c: 4e284820  aese  v0.16b, v1.16b
  30: 4e286800  aesmc v0.16b, v0.16b
  34: 6e221c00  eor   v0.16b, v0.16b, v2.16b  ; EOR can be combined with AESE
  38: 4e284820  aese  v0.16b, v1.16b
  3c: 4e286800  aesmc v0.16b, v0.16b
  40: 6e231c00  eor   v0.16b, v0.16b, v3.16b  ; EOR can be combined with AESE
  44: 4e284820  aese  v0.16b, v1.16b
  48: 6e241c00  eor   v0.16b, v0.16b, v4.16b
  4c: d65f03c0  ret
```
My goal with this patch is to teach LLVM how to eliminate these extra EOR instructions. With my patch, LLVM produces this:
```
0000000000000024 <aes_block>:
  24: 4e284820  aese  v0.16b, v1.16b
  28: 4e286800  aesmc v0.16b, v0.16b
  2c: 4e284840  aese  v0.16b, v2.16b
  30: 4e286800  aesmc v0.16b, v0.16b
  34: 4e284860  aese  v0.16b, v3.16b
  38: 6e241c00  eor   v0.16b, v0.16b, v4.16b
  3c: d65f03c0  ret
```