This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Enable the lowering of implicitly shared variables in OpenMP GPU-offloaded target regions to the GPU shared memory
Abandoned · Public

Authored by gtbercea on Oct 16 2017, 2:29 PM.

Details

Summary

This patch is part of the development effort to add support, in the current OpenMP GPU offloading implementation, for implicitly sharing variables between the team master thread executing a target region and the worker threads within that team.

This patch is the second of three required for successfully performing the implicit sharing of master-thread variables with the worker threads within a team:
- Patch D38976 extends the Clang code generation with code that handles shared variables.
- A patch (coming soon) extends the functionality of libomptarget to maintain a list of references to shared variables.

This patch adds a shared memory stack to the prolog of the kernel function representing the device-offloaded OpenMP target region. The new passes, along with the changes to existing ones, ensure that any OpenMP variable which needs to be shared across several threads will be allocated on this new stack, in the shared memory of the device. This patch covers the case of sharing variables from the master thread to the worker threads:

#pragma omp target
{
   // master thread only
   int v;
   #pragma omp parallel
   {
      // worker threads
      // use v
   }
}
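
To make the intended end result concrete, here is a rough, illustration-only IR-level sketch of what sharing such a variable amounts to: the value lives in a depot placed in the NVPTX shared address space (addrspace(3)) rather than in a thread-private alloca. The depot name, its size, the slot offset, and the idea of doing the rewrite at the IR level are assumptions made for this sketch; the patch itself implements the allocation through the NVPTX frame lowering and PTX emission changes reviewed below.

// Sketch only, not the patch's mechanism: rewrite an entry-block alloca so
// that it refers to a slot inside a depot living in the NVPTX shared
// address space (addrspace(3)). The depot name, its 128-byte size and the
// slot offset are assumptions made for this illustration.
#include "llvm/IR/Constants.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Module.h"

using namespace llvm;

static void moveAllocaToSharedDepot(AllocaInst *AI, uint64_t SlotOffset) {
  Module *M = AI->getModule();
  LLVMContext &Ctx = M->getContext();
  const unsigned SharedAS = 3; // NVPTX shared memory address space

  // A fixed-size byte array acting as the per-team "shared depot".
  ArrayType *DepotTy = ArrayType::get(Type::getInt8Ty(Ctx), 128);
  auto *Depot = new GlobalVariable(
      *M, DepotTy, /*isConstant=*/false, GlobalValue::InternalLinkage,
      UndefValue::get(DepotTy), "__omp_shared_depot",
      /*InsertBefore=*/nullptr, GlobalValue::NotThreadLocal, SharedAS);

  // Point the variable at its slot in the depot and drop the thread-private
  // alloca, so the master thread and the workers address the same storage.
  IRBuilder<> Builder(AI);
  Value *Slot =
      Builder.CreateConstInBoundsGEP2_64(DepotTy, Depot, 0, SlotOffset);
  Value *Replacement =
      Builder.CreatePointerBitCastOrAddrSpaceCast(Slot, AI->getType());
  AI->replaceAllUsesWith(Replacement);
  AI->eraseFromParent();
}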

Event Timeline

gtbercea created this revision. Oct 16 2017, 2:29 PM
tra added a subscriber: tra. Oct 16 2017, 5:06 PM

Please add tests for the cases where such local->shared conversion should and should not happen.
I would appreciate it if you could add details on what exactly your passes are supposed to move to shared memory.

Considering that device-side code tends to be heavily inlined, it may be prudent to add an option to control the total size of shared memory we allow to be used for this purpose.

In case your passes are not executed (or didn't move anything to shared memory), is there any impact on the generated PTX? I.e., can ptxas successfully optimize unused shared memory away?

If the code intentionally wants to allocate something in local memory, would the allocation ever be moved to shared memory by your pass? If so, how would I prevent that?

lib/Target/NVPTX/NVPTXAsmPrinter.cpp
1753

Nit: the name should end with S, as the L in SPL was for the 'local' address space, which then gets converted to the generic AS. In your case it will be in the shared space, hence S would be more appropriate.

lib/Target/NVPTX/NVPTXAssignValidGlobalNames.cpp
68 (On Diff #119210)

The name cleanup changes in this file should probably be committed by themselves as they have nothing to do with the rest of the patch.

lib/Target/NVPTX/NVPTXFunctionDataSharing.cpp
10

Please add details about what the pass is supposed to do.

gtbercea updated this revision to Diff 119327. Oct 17 2017, 8:27 AM

Eliminate variable and function name clean-up. That has been moved into a separate patch: D39005.

gtbercea marked an inline comment as done. Oct 17 2017, 8:28 AM
gtbercea updated this revision to Diff 124243. Nov 24 2017, 3:22 PM

Add regression tests and allow shared memory lowering to be disabled at the function level.

gtbercea marked 2 inline comments as done. Nov 24 2017, 3:24 PM
Hahnfeld edited subscribers, added: llvm-commits; removed: cfe-commits. Nov 26 2017, 6:45 AM
yaxunl added a subscriber: yaxunl. Dec 6 2017, 5:14 PM

Here is a question: do we require the alloca size to be a compile-time constant?

hfinkel added inline comments. Dec 11 2017, 7:53 PM
lib/Target/NVPTX/NVPTXAsmPrinter.cpp
1737

Line too long.

lib/Target/NVPTX/NVPTXFrameLowering.cpp
71

In other places in this patch you refer explicitly to OpenMP, so it probably makes sense to say "the OpenMP runtime" here as well (but just saying "the runtime" seems potentially confusing).

85

Line too long.

lib/Target/NVPTX/NVPTXLowerSharedFrameIndicesPass.cpp
13

Can you be more specific? I believe that we fixed PEI to handle virtual registers, so if that's the only motivation, can we use the regular PEI now?

lib/Target/NVPTX/NVPTXRegisterInfo.cpp
134

Line too long.

lib/Target/NVPTX/NVPTXUtilities.cpp
322

Can't you use PointerMayBeCaptured (include/llvm/Analysis/CaptureTracking.h) instead of this function? If so, please do.
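
For reference, a minimal sketch of the suggested query; the wrapper name and the flag choices below are illustrative and not taken from the patch:

// Sketch only: use LLVM's generic capture tracking instead of a
// hand-rolled "is this pointer stored anywhere" walk.
#include "llvm/Analysis/CaptureTracking.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Returns true if the alloca's address may escape the defining thread,
// e.g. by being stored to memory or passed to a call such as the
// outlined parallel region.
static bool mayBeObservedByOtherThreads(const AllocaInst *AI) {
  return PointerMayBeCaptured(AI, /*ReturnCaptures=*/true,
                              /*StoreCaptures=*/true);
}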

gtbercea updated this revision to Diff 127492. Dec 19 2017, 4:22 AM

Use the LLVM function for checking whether a pointer is stored.

gtbercea marked 5 inline comments as done. Dec 19 2017, 4:23 AM
gtbercea marked an inline comment as done. Dec 19 2017, 9:31 AM
tra added a comment. Jan 4 2018, 10:30 AM

Dotting the 'i's on the questions that were not replied to directly.

In D38978#899205, @tra wrote:

Considering that device-side code tends to be heavily inlined, it may be prudent to add an option to control the total size of shared memory we allow to be used for this purpose.

I'm still curious to hear what you plan to do when your depot use grows beyond a certain limit. At the very least, there's the physical limit on shared memory size. Shared memory use also affects how many threads can be launched, which has a large impact on performance. IMO having some sort of user-controllable threshold would be very desirable.

In case your passes are not executed (or didn't move anything to shared memory), is there any impact on the generated PTX? I.e., can ptxas successfully optimize unused shared memory away?

This may have been addressed by the no-shared-depot.ll test. It would be nice to add a few comments in the tests explaining what they do.

If the code intentionally wants to allocate something in local memory, would the allocation ever be moved to shared memory by your pass? If so, how would I prevent that?

AFAICT this functionality only applies to functions with the has-nvptx-shared-depot attribute. Works for me.

lib/Target/NVPTX/NVPTXFunctionDataSharing.cpp
99

Nit: return false would match the intent better.

lib/Target/NVPTX/NVPTXRegisterInfo.td
75

Line too long.

test/CodeGen/NVPTX/insert-shared-depot.ll
5–6

You could put common checks under the same label (e.g. CHECK) and run the tests with -check-prefixes=PTX32,CHECK.

30

'LABEL' is not a check-prefix and @linsert_shared_depot is not this function's name, so I'm puzzled about what this line is supposed to do. Did you intend <prefix>-LABEL: @kernel?

This appears in all the test cases in the patch.

gtbercea updated this revision to Diff 128725. Jan 5 2018, 2:54 AM
gtbercea marked 3 inline comments as done.

Address comments.

In D38978#967485, @tra wrote:

Dotting the 'i's on the questions that were not replied to directly.

In D38978#899205, @tra wrote:

Considering that device-side code tends to be heavily inlined, it may be prudent to add an option to control the total size of shared memory we allow to be used for this purpose.

I'm still curious to hear what you plan to do when your depot use grows beyond a certain limit. At the very least, there's the physical limit on shared memory size. Shared memory use also affects how many threads can be launched, which has a large impact on performance. IMO having some sort of user-controllable threshold would be very desirable.

When shared memory isn't enough to hold the shared depot, global memory will be used instead. That scheme will be covered by a future patch.

In case your passes are not executed (or didn't move anything to shared memory), is there any impact on the generated PTX? I.e., can ptxas successfully optimize unused shared memory away?

This may have been addressed by the no-shared-depot.ll test. It would be nice to add a few comments in the tests explaining what they do.

Done.

If the code intentionally wants to allocate something in local memory, would the allocation ever be moved to shared memory by your pass? If so, how would I prevent that?

AFAICT this functionality only applies to functions with the has-nvptx-shared-depot attribute. Works for me.

That's right.

test/CodeGen/NVPTX/insert-shared-depot.ll
30

This is modeled after the lower-alloca.ll test which has a similar label. The label is always equal to the name of the test file. In this particular case there is a typo: it should be "insert_shared_depot", not "linsert_shared_depot".

tra added a comment. Jan 5 2018, 10:09 AM

I'm still curious to hear what you plan to do when your depot use grows beyond a certain limit. At the very least, there's the physical limit on shared memory size. Shared memory use also affects how many threads can be launched, which has a large impact on performance. IMO having some sort of user-controllable threshold would be very desirable.

When shared memory isn't enough to hold the shared depot, global memory will be used instead. That scheme will be covered by a future patch.

Good luck with that. IMO if your kernel requires all the shared memory available per multiprocessor, you are almost guaranteed suboptimal performance because you will not have enough threads running -- neither for peak compute, nor to hide global memory access latency. My bet is that you will eventually end up limiting shared memory use to a fairly small fraction of it.

Given that the impact is limited to explicitly annotated functions only, this lack of tunability is OK with me for now. I'd add a TODO item somewhere to describe that tuning specific limits is WIP.

test/CodeGen/NVPTX/insert-shared-depot.ll
30

This is modeled after the lower-alloca.ll test which has a similar label.

lower-alloca.ll indeed has the same problem.

The label is always equal to the name of the test file.

I don't think FileCheck has such a feature. Nor do I see anything matching this description in the FileCheck documentation. Nor does it work. See below.

In this particular case there is a typo: it should be "insert_shared_depot", not "linsert_shared_depot".

The line does not check *anything* right now. In this test FileCheck only pays attention to lines that have CHECK or PTX64/PTX32. This line contains neither and is ignored. You can do an experiment -- replace the line with ; LABEL: this should never match and run the test.

I've tried that on lower-alloca.ll and the test, as expected, passes regardless of the nonsense I put after the LABEL:.

In D38978#968565, @tra wrote:

I'm still curious to hear what you plan to do when your depot use grows beyond a certain limit. At the very least, there's the physical limit on shared memory size. Shared memory use also affects how many threads can be launched, which has a large impact on performance. IMO having some sort of user-controllable threshold would be very desirable.

When shared memory isn't enough to hold the shared depot, global memory will be used instead. That scheme will be covered by a future patch.

Good luck with that. IMO if your kernel requires all the shared memory available per multiprocessor, you are almost guaranteed suboptimal performance because you will not have enough threads running -- neither for peak compute, nor to hide global memory access latency. My bet is that you will eventually end up limiting shared memory use to a fairly small fraction of it.

I completely agree. This scheme will be efficient only when modest amounts of shared memory are required; for larger memory footprints, a global memory scheme will be used instead.

Given that the impact is limited to explicitly annotated functions only, this lack of tunability is OK with me for now. I'd add a TODO item somewhere to describe that tuning specific limits is WIP.

I'll choose a sensible default for the cut-off point/condition and make it tunable by the user once we have the global memory scheme in place.

gtbercea updated this revision to Diff 129093 (edited). Jan 9 2018, 8:29 AM

Remove LABEL from tests and add a TODO comment for the shared memory limit.

Not my area of expertise

gtbercea abandoned this revision. Jun 12 2019, 10:13 AM

An alternative solution was implemented.