This is an archive of the discontinued LLVM Phabricator instance.

Add tips for generic IR vs architecture specific code.
Needs Revision · Public

Authored by asbirlea on Oct 28 2016, 3:54 PM.


Details

Reviewers

Summary

The patch is a start for encouraging frontends to codegen generic IR, while tuning for specific architectures.
Starting off with two examples: strided loads and stores for ARM/AArch64.
I expect the doc to be expanded to cover more patterns, some of which are still under discussion.

The doc could also refer to something more specific, such as "codegen of vector code".
If the content gets too large, it could be moved to a subpage.

Please suggest who is best to review this.
Adding Philip as the doc owner, and Michael as fyi for future AVX512 doc.

Diff Detail

Build Status

Buildable 880
Build 880: arc lint + arc unit

Event Timeline

asbirlea updated this revision to Diff 76264. Oct 28 2016, 3:54 PM

asbirlea retitled this revision from to Add tips for generic IR vs architecture specific code..

asbirlea updated this object.

asbirlea added a reviewer: reames.

asbirlea added subscribers: llvm-commits, mkuper.

Herald added a subscriber: aemerson. Oct 28 2016, 3:54 PM

RKSimon added a subscriber: RKSimon. Oct 29 2016, 6:46 AM

After reading through the draft text a couple of times, I'm really not clear what your message is and why it belongs here. Having target specific lowering details in generic documentation seems strange?

docs/Frontend/PerformanceTips.rst
126	What is the takeaway from this piece of advice? Small edits: use "e.g. intrinsics or inline asm"; define "generic IR" or use an alternate phrase.
130	From this sentence, I'm not sure what to expect. Are these patterns where generic IR does work, or does not work?
132	Why should this pass get special treatment in target neutral documentation? We don't talk about ISEL here for instance.
143	Lowered by whom, and why does a frontend author care?
205	This sentence does not parse for me.

This revision now requires changes to proceed. Nov 30 2016, 6:02 PM

I tend to agree with that, which is why I suggested this could go into a separate page. The generic documentation would point to each target specific subpage.
This is part of the feedback I was hoping for; there aren't currently any such pages, so as a draft I dropped the content in here, but I believe it would be better on its own.

docs/Frontend/PerformanceTips.rst
126	I tried to explain more in the last comment. Is it better to replace "generic IR" with "architecture independent (generic) IR"? All suggestions to make the doc clearer are more than welcome.
130	Patterns where generic IR does work. This draft certainly does not cover everything, I'd expect it to be expanded. I found no other documentation (other than the comments in ISEL code) that would help a frontend writer find these.
132	Agreed - separate page for each architecture where we do talk about ISEL?
143	In this example, by ISEL. The purpose is to have the frontend authors not generate custom intrinsics when generating generic IR should give the same asm in the end. The example I've dealt with is Halide, which has special code generation of ARM/AArch64 code in some particular cases. These (used to) generate intrinsics (some still do) for cases where the lowering would not get the right asm instruction. The example of the interleaved access pass is a case where there's no reason for intrinsics to be generated. Changing their code generation to use the right patterns makes the resulting IR architecture independent, gets the same performance on the arm targets and at least the same on x86 and is easier to maintain. The high-level idea I'm trying to convey is: if llvm's lowering passes can get the same performance, try to rely on those and generate architecture independent IR; if not, use target specific IR (intrinsics, inline asm) but please let the LLVM community know about it and perhaps it's something that should be addressed.
205	It was meant to be as a sort of disclaimer; if you could suggest how to make this clearer that would be great. The idea is that such patterns are lowered to a particular asm instruction on this arch, known to be effective there. It may not give the best performance on another architecture. The aim is to give the frontend authors the info on existing patterns, encourage them to generate generic IR whenever possible, while still testing if they get the expected performance on other archs. Then get their feedback when they do see such performance regressions, or when lowering could do a better job.
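
As a rough illustration of the index arithmetic behind the strided load and store patterns in this patch, the shuffle masks can be generated with a short Python sketch. The helper names `deinterleave_masks` and `interleave_mask` are hypothetical and not part of LLVM. The sketch also assumes the sub-vectors start at element 0 and are laid out contiguously, which is only a special case of the general rule given in the diff.

```python
def deinterleave_masks(factor, lanes):
    """Masks for a strided load (hypothetical helper, not an LLVM API).

    Output vector i takes elements i, i + factor, ..., i + (lanes - 1) * factor
    from the wide vector of factor * lanes elements.
    """
    return [[i + j * factor for j in range(lanes)] for i in range(factor)]


def interleave_mask(factor, lanes):
    """Mask for a strided store (hypothetical helper, not an LLVM API).

    Assumes `factor` sub-vectors of `lanes` elements each are concatenated,
    so sub-vector i starts at index i * lanes. The mask takes lane 0 of every
    sub-vector, then lane 1 of every sub-vector, and so on.
    """
    return [i * lanes + j for j in range(lanes) for i in range(factor)]


if __name__ == "__main__":
    # Factor = 2, Lane Length = 4: even and odd elements of an <8 x i32> load.
    print(deinterleave_masks(2, 4))
    # Factor = 3, Lane Length = 4: the <12 x i32> interleaved store mask.
    print(interleave_mask(3, 4))
```

With Factor = 2 and Lane Length = 4 the first helper yields the even and odd masks used in the load example, and with Factor = 3 and Lane Length = 4 the second yields the 12-element mask used in the store example.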

Revision Contents

Path: docs/Frontend/PerformanceTips.rst
Size: 87 lines

Diff 76264

docs/Frontend/PerformanceTips.rst

	perform transforms that require that alignment). For x86, it doesn’t make
	much difference, as almost all instructions are alignment-independent. For
	MIPS, it can make a big difference.

	Note that if your loads and stores are atomic, the backend will be unable to
	lower an under aligned access into a sequence of natively aligned accesses.
	As a result, alignment is mandatory for atomic loads and stores.

				Architecture-specific code
				^^^^^^^^^^^^^^^^^^^^^^^^^^
				Whenever possible, the IR generated should be generic IR, instead of architecture
				specific IR (i.e. intrinsics).
				If LLVM cannot lower the generic code to the desired intrinsic, start a discussion
				on `llvm-dev <http://lists.llvm.org/mailman/listinfo/llvm-dev>`_
				for the missing lowering opportunity.
				A few known patterns that lead to lowering to intrinsics are listed below.

				The interleaved access pass performs the following lowerings (tests can be found in CodeGen/ARM/arm-interleaved-accesses.ll and CodeGen/AArch64/aarch64-interleaved-accesses.ll):

				#. ARM/AArch64: lower an interleaved/strided load into a vldN/ldN intrinsic.
				* General rule: Factor = F, Lane Length = L:
				::

				%wide.vec = load %ptr
				%v1 = shufflevector %wide.vec, undef, <m1, m1+F, ..., m1+(L-1)*F>
				[...]
				%vF = shufflevector %wide.vec, undef, <m1+F-1, m1+2F-1, ..., m1+LF-1>

				Is lowered to:
				::

				%ldF = call @llvm.arm.neon.vldF(%ptr, L)
				; @llvm.aarch64.neon.ldF(%ptr)
				%vec1 = extractvalue %ldF, 0
				[...]
				%vecF = extractvalue %ldF, F-1

				* E.g. Factor = 2, Lane Length = 4:
				.. code-block:: llvm

				%wide.vec = load <8 x i32>, <8 x i32>* %ptr
				%v0 = shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6> ; Extract even elements
				%v1 = shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7> ; Extract odd elements

				Is lowered to:
				::

				%ld2 = call { <4 x i32>, <4 x i32> } @llvm.arm.neon.vld2(<8 x i32>* %ptr, i32 4)
				; @llvm.aarch64.neon.ld2(<8 x i32>* %ptr)
				%vec0 = extractvalue { <4 x i32>, <4 x i32> } %ld2, 0
				%vec1 = extractvalue { <4 x i32>, <4 x i32> } %ld2, 1

				#. ARM/AArch64: lower an interleaved/strided store into a vstN/stN intrinsic.
				* General rule: Factor = F, Lane Length = L:
				::

				%i.vec = shufflevector %v0, %v1,
				<m1, m2, ..., mF,
				m1+1, m2+1, ..., mF+1,
				...,
				m1+L-1, m2+L-1, ..., mF+L-1>
				store %i.vec, %ptr
				Is lowered to:
				::

				%sub.v1 = shufflevector %v0, %v1, <m1, ..., m1+L-1>
				[...]
				%sub.vF = shufflevector %v0, %v1, <mF, ..., mF+L-1>
				call void @llvm.arm.neon.vstF(%ptr, %sub.v1, ..., %sub.vF, L)
				; @llvm.aarch64.neon.stF(%sub.v1, ..., %sub.vF, %ptr)

				* E.g. Factor = 3, Lane Length = 4:
				.. code-block:: llvm

				%i.vec = shufflevector <8 x i32> %v0, <8 x i32> %v1,
				<i32 0, i32 4, i32 8,
				i32 1, i32 5, i32 9,
				i32 2, i32 6, i32 10,
				i32 3, i32 7, i32 11>
				store <12 x i32> %i.vec, <12 x i32>* %ptr

				Is lowered to:
				.. code-block:: llvm

				%sub.v0 = shufflevector <8 x i32> %v0, <8 x i32> %v1, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
				%sub.v1 = shufflevector <8 x i32> %v0, <8 x i32> %v1, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
				%sub.v2 = shufflevector <8 x i32> %v0, <8 x i32> %v1, <4 x i32> <i32 8, i32 9, i32 10, i32 11>
				call void @llvm.arm.neon.vst3(<8 x i32> %ptr, <4 x i32> %sub.v0, <4 x i32> %sub.v1, <4 x i32> %sub.v2, i32 4)
				; @llvm.aarch64.neon.st3(<4 x i32> %sub.v0, <4 x i32> %sub.v1, <4 x i32> %sub.v2, <8 x i32> %ptr)

				LLVM does not promise to be performance aware, so the above patterns, while generic IR, are still recommended for the particular platforms.
				For more suggestions of architecture specific patterns, please send
				a patch to `llvm-commits
				<http://lists.llvm.org/mailman/listinfo/llvm-commits>`_ for review.

	Other Things to Consider
	^^^^^^^^^^^^^^^^^^^^^^^^

	#. Use ptrtoint/inttoptr sparingly (they interfere with pointer aliasing
	analysis), prefer GEPs

	#. Prefer globals over inttoptr of a constant address - this gives you
	dereferencability information. In MCJIT, use getSymbolAddress to provide