
Feb 6 2020

rampitec updated the diff for D74177: [AMDGPU] Cleanup assumptions about generated subregs.

Fixed typo.
Found 2 more places: SIAddIMGInit and computeIndirectRegAndOffset.

Feb 6 2020, 4:14 PM · Restricted Project
rampitec created D74177: [AMDGPU] Cleanup assumptions about generated subregs.
Feb 6 2020, 3:47 PM · Restricted Project

Feb 5 2020

rampitec accepted D74089: AMDGPU: Make LDS_DIRECT an artifical register.

LGTM

Feb 5 2020, 2:07 PM · Restricted Project

Feb 4 2020

rampitec accepted D73989: AMDGPU: Fix isAlwaysUniform for simple asm SGPR results.

LGTM

Feb 4 2020, 1:18 PM · Restricted Project
rampitec accepted D73939: [AMDGPU] Fix infinite loop with fma combines.

LGTM

Feb 4 2020, 10:40 AM · Restricted Project

Feb 3 2020

rampitec accepted D73925: AMDGPU: Cleanup SMRD buffer selection.

LGTM

Feb 3 2020, 5:27 PM · Restricted Project
rampitec accepted D73915: AMDGPU: Add flag to control mem intrinsic expansion.
Feb 3 2020, 1:34 PM · Restricted Project
rampitec added inline comments to D73915: AMDGPU: Add flag to control mem intrinsic expansion.
Feb 3 2020, 1:24 PM · Restricted Project
rampitec added inline comments to D73915: AMDGPU: Add flag to control mem intrinsic expansion.
Feb 3 2020, 12:36 PM · Restricted Project
rampitec accepted D73909: AMDGPU: Analyze divergence of inline asm.

LGTM

Feb 3 2020, 11:57 AM · Restricted Project
rampitec accepted D73877: AMDGPU: Fix splitting wide f32 s.buffer.load intrinsics.

LGTM

Feb 3 2020, 11:48 AM · Restricted Project
rampitec accepted D73868: [ANDGPU] getMemOperandsWithOffset: support BUF non-stack-access instructions with resource but no vaddr.

LGTM

Feb 3 2020, 9:19 AM · Restricted Project

Feb 1 2020

rampitec accepted D73831: AMDGPU/GFX10: Fix NSA reassign pass when operands are undef.
Feb 1 2020, 10:18 AM · Restricted Project

Jan 31 2020

rampitec added a comment to D73815: AMDGPU: Fix divergence analysis of control flow intrinsics.

Let's have Alex review it, but could you run a PSDB in the meantime?

Jan 31 2020, 4:07 PM · Restricted Project
rampitec accepted D73814: AMDGPU: Switch some tests to use generated checks.
Jan 31 2020, 4:07 PM · Restricted Project

Jan 30 2020

rampitec accepted D73737: AMDGPU: Cleanup SMRD buffer selection.

LGTM

Jan 30 2020, 2:55 PM · Restricted Project
rampitec accepted D73731: AMDGPU: Cleanup and fix SMRD offset handling.

LGTM

Jan 30 2020, 2:45 PM · Restricted Project
rampitec accepted D73694: AMDGPU: Don't use separate cache arguments for s_buffer_load node.

LGTM

Jan 30 2020, 10:10 AM · Restricted Project
rampitec added a reviewer for D73694: AMDGPU: Don't use separate cache arguments for s_buffer_load node: kzhuravl.
Jan 30 2020, 10:10 AM · Restricted Project
rampitec added inline comments to D73483: [AMDGPU] fixed divergence driven shift operations selection.
Jan 30 2020, 9:59 AM · Restricted Project
rampitec accepted D73695: AMDGPU: Replace subtarget check with an assert.

LGTM

Jan 30 2020, 9:39 AM · Restricted Project

Jan 29 2020

rampitec added inline comments to D73483: [AMDGPU] fixed divergence driven shift operations selection.
Jan 29 2020, 10:35 AM · Restricted Project
rampitec accepted D73073: AMDGPU: Add option to expand 64-bit integer division in IR.

LGTM

Jan 29 2020, 10:26 AM · Restricted Project
rampitec accepted D72852: AMDGPU/GlobalISel: Select permlane16/permlanex16.

LGTM

Jan 29 2020, 10:16 AM · Restricted Project
rampitec accepted D73634: [AMDGPU] Cluster FLAT instructions with both vaddr and saddr.

LGTM

Jan 29 2020, 8:53 AM · Restricted Project
rampitec committed rGc2ad7ee1a9ad: [AMDGPU] override isHighLatencyDef (authored by rampitec).
[AMDGPU] override isHighLatencyDef
Jan 29 2020, 8:07 AM
rampitec closed D73582: [AMDGPU] override isHighLatencyDef.
Jan 29 2020, 8:07 AM · Restricted Project
rampitec added a comment to D71717: [MachineScheduler] Ignore artificial edges when forming store chains.

I have realized that BaseMemOpClusterMutation::apply() in fact does not check all control dependencies and just breaks at the first one. I.e. this change just skips some SDeps, preferring another one. More or less, we are lucky to find a correct SDep which may form a useful chain. We may not be that lucky if the order of the SDeps is different and we somehow use another register (for example, a data register). We probably need a callback to check whether that register belongs to the pointer operand and skip it otherwise. Alternatively, we may need a full search to find the best SDep in the list.

This is LGTM, but can you please add cluster_load_valu_cluster_store function from the testcase in D73509? At the moment stores are not properly clustered:

flat_store_dword v[2:3], v4
v_add_u32_e32 v1, 1, v5
flat_store_dword v[2:3], v6 offset:16
flat_store_dword v[2:3], v1 offset:8
flat_store_dword v[2:3], v0 offset:24

I can add it but I get different results. With D73509:

	flat_store_dword v[2:3], v4
	flat_store_dword v[2:3], v6 offset:16
	flat_store_dword v[2:3], v0 offset:8
	flat_store_dword v[2:3], v7 offset:24

With this patch:

	flat_store_dword v[2:3], v4
	flat_store_dword v[2:3], v0 offset:8
	flat_store_dword v[2:3], v6 offset:16
	flat_store_dword v[2:3], v7 offset:24

I can see from the debug output that all four stores are being clustered now.

Do you prefer tests that just check the generated code, instead of checking the -debug-only output? It seems to me that there is a high chance of stores getting clustered by accident, even if the scheduler is not doing the right thing. E.g. the scheduler could do nothing at all and the test would still pass, because the loads and stores are already in the correct order before scheduling!

Jan 29 2020, 7:49 AM · Restricted Project
rampitec added a comment to D73582: [AMDGPU] override isHighLatencyDef.

I have no objection to this patch, but I'm not sure isHighLatencyDef will ever be called by generic code. I think the latency information usually comes from the sched model instead. Also note that TargetInstrInfo::defaultDefLatency never calls isHighLatencyDef on a load instruction!

Jan 29 2020, 7:49 AM · Restricted Project

Jan 28 2020

rampitec created D73582: [AMDGPU] override isHighLatencyDef.
Jan 28 2020, 2:25 PM · Restricted Project
rampitec accepted D71717: [MachineScheduler] Ignore artificial edges when forming store chains.

I have realized that BaseMemOpClusterMutation::apply() in fact does not check all control dependencies and just breaks at the first one. I.e. this change just skips some SDeps, preferring another one. More or less, we are lucky to find a correct SDep which may form a useful chain. We may not be that lucky if the order of the SDeps is different and we somehow use another register (for example, a data register). We probably need a callback to check whether that register belongs to the pointer operand and skip it otherwise. Alternatively, we may need a full search to find the best SDep in the list.
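
The callback idea in this comment can be sketched with a toy model. Everything below is hypothetical: `ToyDep`, `firstDepChain`, and `pointerDepChain` are illustration names, not MachineScheduler APIs, and a real SDep carries different information. The sketch only contrasts "take the first dependency" with "skip dependencies whose register is not part of the pointer operand".

```cpp
#include <functional>
#include <vector>

// Hypothetical stand-in for a dependency edge: the register it is
// attached to and the node that would become the chain predecessor.
struct ToyDep {
  unsigned Reg;
  int ChainID;
};

// Models breaking at the first dependency: whichever edge happens to be
// first in the list decides the chain. Returns -1 if there is none.
int firstDepChain(const std::vector<ToyDep> &Deps) {
  return Deps.empty() ? -1 : Deps.front().ChainID;
}

// Models the suggested callback: skip edges whose register does not
// belong to the pointer operand and use the first one that does.
int pointerDepChain(const std::vector<ToyDep> &Deps,
                    const std::function<bool(unsigned)> &IsPointerReg) {
  for (const ToyDep &D : Deps)
    if (IsPointerReg(D.Reg))
      return D.ChainID;
  return -1;
}
```

In this model the first function depends on the order of the list, while the second gives the same answer regardless of where the data-register edge sits.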

Jan 28 2020, 12:58 PM · Restricted Project
rampitec added a comment to D73509: [MachineScheduler] relax successor chain on clustering.

After some meditation in gdb I have found that BaseMemOpClusterMutation::apply() in fact does not check all control dependencies and just breaks at the first one. I.e. D71717 just skips some SDeps, preferring another one. More or less, we are lucky to find a correct SDep which may form a useful chain. We may not be that lucky if the order of the SDeps is different and we somehow use another register (for example, a data register). We probably need a callback to check whether that register belongs to the pointer operand and skip it otherwise.

Jan 28 2020, 12:58 PM · Restricted Project
rampitec added a comment to D73509: [MachineScheduler] relax successor chain on clustering.

Yeah, it was because our threshold is too low. I have increased it and I see my change misses one cluster which D71717 does. This change:

Jan 28 2020, 11:34 AM · Restricted Project
rampitec added a comment to D73509: [MachineScheduler] relax successor chain on clustering.

This doesn't fix the problem that inspired D71717. Consider the first test case in memory_clause.ll. With baseline llvm I get:

$ bin/llc -march=amdgcn -mcpu=gfx902 -verify-machineinstrs -amdgpu-enable-global-sgpr-addr -o /dev/null ~/git/llvm-project/llvm/test/CodeGen/AMDGPU/memory_clause.ll -debug-only=machine-scheduler |& egrep "^Cluster|Machine code for function"
# Machine code for function vector_clause: NoPHIs, TracksLiveness
Cluster ld/st SU(2) - SU(3)
Cluster ld/st SU(6) - SU(7)
Cluster ld/st SU(8) - SU(9)
Cluster ld/st SU(11) - SU(13)
# Machine code for function vector_clause: NoPHIs, TracksLiveness

My problem is I cannot reproduce it. All I have is

# Machine code for function vector_clause: NoPHIs, TracksLiveness
Cluster ld/st SU(2) - SU(3)
# Machine code for function vector_clause: NoPHIs, TracksLiveness

It just does not try to cluster all global loads and stores at all! It also does not do it even with D71717.
I will debug it...

Jan 28 2020, 11:10 AM · Restricted Project
rampitec added a comment to D73509: [MachineScheduler] relax successor chain on clustering.

This doesn't fix the problem that inspired D71717. Consider the first test case in memory_clause.ll. With baseline llvm I get:

$ bin/llc -march=amdgcn -mcpu=gfx902 -verify-machineinstrs -amdgpu-enable-global-sgpr-addr -o /dev/null ~/git/llvm-project/llvm/test/CodeGen/AMDGPU/memory_clause.ll -debug-only=machine-scheduler |& egrep "^Cluster|Machine code for function"
# Machine code for function vector_clause: NoPHIs, TracksLiveness
Cluster ld/st SU(2) - SU(3)
Cluster ld/st SU(6) - SU(7)
Cluster ld/st SU(8) - SU(9)
Cluster ld/st SU(11) - SU(13)
# Machine code for function vector_clause: NoPHIs, TracksLiveness
Jan 28 2020, 10:49 AM · Restricted Project
rampitec accepted D73485: [AMDGPU] Simplify DS and SM cases in getMemOperandsWithOffset.

LGTM

Jan 28 2020, 10:11 AM · Restricted Project
rampitec retitled D73509: [MachineScheduler] relax successor chain on clustering from [MachineScheduler] relax successfor chain on clustering to [MachineScheduler] relax successor chain on clustering.
Jan 28 2020, 10:11 AM · Restricted Project

Jan 27 2020

rampitec added a comment to D71717: [MachineScheduler] Ignore artificial edges when forming store chains.

Maybe logic of artificial edges creation needs to be revised instead?

Maybe. I don't really understand this logic. If it is not required for correctness, maybe it should add Weak edges instead of Artificial edges?

AFAIR the logic is to cross-transfer all successors to all successors to prevent any of them from being scheduled inside the cluster. The same goes for predecessors, to prevent any predecessor from being scheduled inside the cluster. But that way we end up with a forest of cross edges.

Also, any user transformation may insert artificial edges for any other reason. I do not think we can simply ignore them without removing them.

Maybe what we need instead of cross edges is a single post-dominator node which will become a single predecessor for all successors of the nodes in a cluster. The same goes for predecessors: we could just use a single dominator as a common successor of the predecessors. I.e. have just two guard nodes instead of all of them being guards. I think the first and the last node in a cluster could become such single-use guards.
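
To make the difference in edge count concrete, here is a toy sketch that builds both edge sets for one cluster. It is not scheduler code: `ToyEdge`, `crossEdges`, and `guardEdges` are hypothetical names, nodes are plain integers, and the guard variant assumes the cluster members stay ordered between the first and the last node.

```cpp
#include <utility>
#include <vector>

using ToyEdge = std::pair<unsigned, unsigned>; // (from, to)

// Cross edges: every outside predecessor points at every cluster node,
// and every cluster node points at every outside successor.
std::vector<ToyEdge> crossEdges(const std::vector<unsigned> &Cluster,
                                const std::vector<unsigned> &Preds,
                                const std::vector<unsigned> &Succs) {
  std::vector<ToyEdge> Edges;
  for (unsigned P : Preds)
    for (unsigned C : Cluster)
      Edges.emplace_back(P, C);
  for (unsigned C : Cluster)
    for (unsigned S : Succs)
      Edges.emplace_back(C, S);
  return Edges;
}

// Guard edges: predecessors only point at the first cluster node, and
// only the last cluster node points at the successors.
std::vector<ToyEdge> guardEdges(const std::vector<unsigned> &Cluster,
                                const std::vector<unsigned> &Preds,
                                const std::vector<unsigned> &Succs) {
  std::vector<ToyEdge> Edges;
  if (Cluster.empty())
    return Edges;
  for (unsigned P : Preds)
    Edges.emplace_back(P, Cluster.front());
  for (unsigned S : Succs)
    Edges.emplace_back(Cluster.back(), S);
  return Edges;
}
```

For a cluster of k nodes with p outside predecessors and s outside successors, the first function produces k * (p + s) edges and the second p + s.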

Jan 27 2020, 2:54 PM · Restricted Project
rampitec created D73509: [MachineScheduler] relax successor chain on clustering.
Jan 27 2020, 2:52 PM · Restricted Project
rampitec committed rG53eb0f8c0713: [AMDGPU] Attempt to reschedule withou clustering (authored by rampitec).
[AMDGPU] Attempt to reschedule withou clustering
Jan 27 2020, 10:30 AM
rampitec closed D73386: [AMDGPU] Attempt to reschedule withou clustering.
Jan 27 2020, 10:30 AM · Restricted Project
rampitec added a comment to D73483: [AMDGPU] fixed divergence driven shift operations selection.

Is there a test for scalar selection?

Jan 27 2020, 10:28 AM · Restricted Project
rampitec updated the diff for D73386: [AMDGPU] Attempt to reschedule withou clustering.

Switched to BitVector.

Jan 27 2020, 10:18 AM · Restricted Project

Jan 24 2020

rampitec accepted D73402: AMDGPU: Fix not using f16 fsin/fcos.

LGTM

Jan 24 2020, 8:38 PM · Restricted Project
rampitec added a comment to D73386: [AMDGPU] Attempt to reschedule withou clustering.

This tries scheduling with all mutations disabled, including macro fusion, right?

Jan 24 2020, 4:18 PM · Restricted Project
rampitec created D73386: [AMDGPU] Attempt to reschedule withou clustering.
Jan 24 2020, 3:32 PM · Restricted Project
rampitec committed rGbe8e38cbd978: Correct NumLoads in clustering (authored by rampitec).
Correct NumLoads in clustering
Jan 24 2020, 12:46 PM
rampitec closed D73292: [AMDGPU] Correct NumLoads in clustering.
Jan 24 2020, 12:46 PM · Restricted Project
rampitec updated the diff for D73292: [AMDGPU] Correct NumLoads in clustering.

Added operand description to TargetInstrInfo::shouldClusterMemOps()

Jan 24 2020, 12:46 PM · Restricted Project
rampitec accepted D72258: AMDGPU: Don't error on ds.ordered intrinsic in function.
Jan 24 2020, 12:45 PM · Restricted Project
rampitec added inline comments to D72258: AMDGPU: Don't error on ds.ordered intrinsic in function.
Jan 24 2020, 12:16 PM · Restricted Project
rampitec requested review of D73292: [AMDGPU] Correct NumLoads in clustering.
Jan 24 2020, 12:08 PM · Restricted Project
rampitec updated the diff for D73292: [AMDGPU] Correct NumLoads in clustering.

Changed the patch to only correct clusterNeighboringMemOps. Threshold logic adjusted accordingly.
We will need to retune our threshold in AMDGPU separately, likely after the scheduler gains the ability to break clustering under high pressure.

Jan 24 2020, 12:07 PM · Restricted Project
rampitec accepted D73375: AMDGPU/GlobalISel: Fix tablegen selection for scalar bin ops.

LGTM

Jan 24 2020, 11:57 AM · Restricted Project
rampitec committed rG555d8f4ef5eb: [AMDGPU] Bundle loads before post-RA scheduler (authored by rampitec).
[AMDGPU] Bundle loads before post-RA scheduler
Jan 24 2020, 11:39 AM
rampitec closed D72737: [AMDGPU] Bundle loads before post-RA scheduler.
Jan 24 2020, 11:38 AM · Restricted Project
rampitec committed rG44b865fa7fea: [AMDGPU] Allow narrowing muti-dword loads (authored by rampitec).
[AMDGPU] Allow narrowing muti-dword loads
Jan 24 2020, 11:20 AM
rampitec closed D73133: [AMDGPU] Allow narrowing muti-dword loads.
Jan 24 2020, 11:19 AM · Restricted Project
rampitec accepted D73372: Revert "AMDGPU: Temporary drop s_mul_hi_i/u32 patterns".

LGTM

Jan 24 2020, 11:10 AM · Restricted Project
rampitec accepted D73371: AMDGPU/GlobalISel: Eliminate SelectVOP3Mods_f32.

LGTM

Jan 24 2020, 11:10 AM · Restricted Project
rampitec committed rG7a94d4f4ee43: Allow combining of extract_subvector to extract element (authored by rampitec).
Allow combining of extract_subvector to extract element
Jan 24 2020, 11:04 AM
rampitec closed D73132: Allow combining of extract_subvector to extract element.
Jan 24 2020, 11:03 AM · Restricted Project
rampitec accepted D73364: AMDGPU: Don't check constant address space for atomic stores.

LGTM

Jan 24 2020, 11:01 AM · Restricted Project
rampitec accepted D73365: AMDGPU/GlobalISel: Fix not using global atomics on gfx9+.

LGTM

Jan 24 2020, 11:01 AM · Restricted Project
rampitec accepted D73338: [AMDGPU] Fix GCN regpressure trackers for INLINEASM instructions.

LGTM. Thanks.

Jan 24 2020, 10:53 AM · Restricted Project
rampitec added inline comments to D72737: [AMDGPU] Bundle loads before post-RA scheduler.
Jan 24 2020, 8:04 AM · Restricted Project
rampitec accepted D73323: [DA] Don't propagate from unreachable blocks.

LGTM

Jan 24 2020, 1:41 AM · Restricted Project
rampitec accepted D73315: Resubmit: [DA][TTI][AMDGPU] Add option to select GPUDA with TTI.

LGTM as long as parent is submitted.

Jan 24 2020, 1:41 AM · Restricted Project
rampitec added a comment to D73292: [AMDGPU] Correct NumLoads in clustering.

I tried something similar in D72325.

Comments there argue about how much we should cluster, but regardless, I do not think we should use wrong data. If we want more clustering we need to increase thresholds, but still rely on correct input.

I agree. I also think we should fix this properly in MachineScheduler:

--- a/llvm/lib/CodeGen/MachineScheduler.cpp
+++ b/llvm/lib/CodeGen/MachineScheduler.cpp
@@ -1584,7 +1584,7 @@ void BaseMemOpClusterMutation::clusterNeighboringMemOps(
     SUnit *SUb = MemOpRecords[Idx+1].SU;
     if (TII->shouldClusterMemOps(MemOpRecords[Idx].BaseOps,
                                  MemOpRecords[Idx + 1].BaseOps,
-                                 ClusterLength)) {
+                                 ClusterLength + 1)) {
       if (SUa->NodeNum > SUb->NodeNum)
         std::swap(SUa, SUb);
       if (DAG->addEdge(SUb, SDep(SUa, SDep::Cluster))) {

... and adjust any other target implementations of shouldClusterMemOps accordingly.
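
The off-by-one behind this diff can be illustrated with a toy model of the clustering loop. This is a sketch under assumptions, not the MachineScheduler code: `shouldClusterToy` is a hypothetical hook that accepts a cluster while the reported count does not exceed `MaxLoads`, every pair of neighbouring ops is assumed to be otherwise clusterable, and `ClusterLength` is assumed to start at 1 and reset to 1 when the hook refuses.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical target hook: accept while the reported number of loads
// stays within the threshold.
bool shouldClusterToy(unsigned NumLoads, unsigned MaxLoads) {
  return NumLoads <= MaxLoads;
}

// Walks NumOps neighbouring memory ops and returns the size of the
// largest cluster formed. CountCandidate selects what is reported to
// the hook: the current cluster size (false) or the size the cluster
// would have after adding the next op (true, i.e. ClusterLength + 1).
unsigned largestCluster(std::size_t NumOps, unsigned MaxLoads,
                        bool CountCandidate) {
  unsigned ClusterLength = 1;
  unsigned Largest = NumOps ? 1 : 0;
  for (std::size_t Idx = 0; Idx + 1 < NumOps; ++Idx) {
    unsigned Query = CountCandidate ? ClusterLength + 1 : ClusterLength;
    if (shouldClusterToy(Query, MaxLoads)) {
      ++ClusterLength;
      Largest = std::max(Largest, ClusterLength);
    } else {
      ClusterLength = 1; // start a new cluster at the next op
    }
  }
  return Largest;
}
```

In this model, with eight ops and a threshold of four, reporting the current size lets a cluster grow to five ops, while reporting `ClusterLength + 1` caps it at four, which is why the threshold logic has to be adjusted together with the count.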

Jan 24 2020, 1:32 AM · Restricted Project

Jan 23 2020

rampitec added a comment to D73315: Resubmit: [DA][TTI][AMDGPU] Add option to select GPUDA with TTI.

What has caused the failure? Will it happen on Windows in real life?

Jan 23 2020, 5:29 PM · Restricted Project
rampitec added a comment to D73292: [AMDGPU] Correct NumLoads in clustering.

I also have an early version of a patch which can reschedule without clustering if no optimal schedule was found for a region.

Jan 23 2020, 4:40 PM · Restricted Project
rampitec added a comment to D73292: [AMDGPU] Correct NumLoads in clustering.

I tried something similar in D72325.

Jan 23 2020, 2:31 PM · Restricted Project
rampitec created D73292: [AMDGPU] Correct NumLoads in clustering.
Jan 23 2020, 1:33 PM · Restricted Project

Jan 22 2020

rampitec accepted D73247: AMDGPU/GlobalISel: Select V_ADD3_U32/V_XOR3_B32.

LGTM

Jan 22 2020, 10:11 PM · Restricted Project
rampitec updated the diff for D73132: Allow combining of extract_subvector to extract element.

Pre-committed NFC changes and rebased.

Jan 22 2020, 9:27 AM · Restricted Project
rampitec committed rG2d0fcf786c5c: Precommit NFC part of DAGCombiner change. NFC. (authored by rampitec).
Precommit NFC part of DAGCombiner change. NFC.
Jan 22 2020, 9:09 AM
rampitec accepted D73192: [tests] Use host-based XFAIL for test/MC/AMDGPU/hsa-gfx10-v3.s.

LGTM

Jan 22 2020, 9:09 AM · Restricted Project
rampitec committed rGfb8a3d18340e: Regenerate test/CodeGen/ARM/vext.ll. NFC. (authored by rampitec).
Regenerate test/CodeGen/ARM/vext.ll. NFC.
Jan 22 2020, 9:00 AM
rampitec added inline comments to D73132: Allow combining of extract_subvector to extract element.
Jan 22 2020, 8:12 AM · Restricted Project
rampitec added inline comments to D73132: Allow combining of extract_subvector to extract element.
Jan 22 2020, 12:33 AM · Restricted Project

Jan 21 2020

rampitec created D73133: [AMDGPU] Allow narrowing muti-dword loads.
Jan 21 2020, 1:00 PM · Restricted Project
rampitec added a parent revision for D73133: [AMDGPU] Allow narrowing muti-dword loads: D73132: Allow combining of extract_subvector to extract element.
Jan 21 2020, 1:00 PM · Restricted Project
rampitec added a child revision for D73132: Allow combining of extract_subvector to extract element: D73133: [AMDGPU] Allow narrowing muti-dword loads.
Jan 21 2020, 1:00 PM · Restricted Project
rampitec created D73132: Allow combining of extract_subvector to extract element.
Jan 21 2020, 12:42 PM · Restricted Project
rampitec added inline comments to D72737: [AMDGPU] Bundle loads before post-RA scheduler.
Jan 21 2020, 10:39 AM · Restricted Project

Jan 20 2020

rampitec accepted D73069: AMDGPU: Look through casted selects to constant fold bin ops.

LGTM

Jan 20 2020, 5:22 PM · Restricted Project
rampitec accepted D72187: AMDGPU: Prepare to use scalar register indexing.

LGTM. A verifier update is desirable though.

Jan 20 2020, 1:04 PM · Restricted Project
rampitec accepted D72185: AMDGPU: Partially merge indirect register write handling.

LGTM

Jan 20 2020, 12:55 PM · Restricted Project
rampitec accepted D73049: [DA][TTI][AMDGPU] Add option to select GPUDA with TTI.

LGTM

Jan 20 2020, 11:50 AM · Restricted Project
rampitec added a comment to D72185: AMDGPU: Partially merge indirect register write handling.

The change seems to be missing the parent revision that introduces V_INDIRECT_REG_WRITE_*.

Jan 20 2020, 11:31 AM · Restricted Project
rampitec added inline comments to D73049: [DA][TTI][AMDGPU] Add option to select GPUDA with TTI.
Jan 20 2020, 11:31 AM · Restricted Project
rampitec accepted D73033: AMDGPU: Cleanup and generate 64-bit div tests.

LGTM

Jan 20 2020, 11:31 AM · Restricted Project
rampitec accepted D73008: AMDGPU: Do binop of select of constant fold in AMDGPUCodeGenPrepare.

LGTM

Jan 20 2020, 11:22 AM · Restricted Project
rampitec accepted D73009: AMDGPU: Don't create weird sized integers.

LGTM

Jan 20 2020, 11:22 AM · Restricted Project
rampitec added inline comments to D72187: AMDGPU: Prepare to use scalar register indexing.
Jan 20 2020, 11:22 AM · Restricted Project

Jan 17 2020

rampitec updated the diff for D72737: [AMDGPU] Bundle loads before post-RA scheduler.

Prevent bundling of loads with overlapping destinations. They are dependent.
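
A minimal sketch of such a check, assuming each load writes a contiguous range of registers. `ToyLoad`, `overlaps`, and `canJoinBundle` are hypothetical names for illustration and do not reflect how the patch inspects operands.

```cpp
#include <vector>

// Hypothetical load: writes DstCount registers starting at DstFirst.
struct ToyLoad {
  unsigned DstFirst;
  unsigned DstCount;
};

// Two half-open register ranges overlap if each starts before the
// other one ends.
bool overlaps(const ToyLoad &A, const ToyLoad &B) {
  return A.DstFirst < B.DstFirst + B.DstCount &&
         B.DstFirst < A.DstFirst + A.DstCount;
}

// A candidate may join the bundle only if its destination is disjoint
// from the destination of every load already in the bundle.
bool canJoinBundle(const std::vector<ToyLoad> &Bundle,
                   const ToyLoad &Candidate) {
  for (const ToyLoad &L : Bundle)
    if (overlaps(L, Candidate))
      return false;
  return true;
}
```

Loads that fail this test write at least one common register, so their order matters and they are treated as dependent.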

Jan 17 2020, 1:36 PM · Restricted Project
rampitec committed rGeebdd85e7df4: [AMDGPU] allow multi-dword flat scratch access since GFX9 (authored by rampitec).
[AMDGPU] allow multi-dword flat scratch access since GFX9
Jan 17 2020, 10:52 AM
rampitec closed D72865: [AMDGPU] allow multi-dword flat scratch access since GFX9.
Jan 17 2020, 10:52 AM · Restricted Project
rampitec accepted D72927: AMDGPU/GlobalISel: Select llvm.amdgcn.mov.dpp.

LGTM

Jan 17 2020, 10:51 AM · Restricted Project
rampitec accepted D72925: AMDGPU/GlobalISel: Select llvm.amdgcn.update.dpp.

LGTM

Jan 17 2020, 10:51 AM · Restricted Project