Page MenuHomePhabricator

mareko (Marek Olšák)
User

Projects

User does not belong to any projects.

User Details

User Since
Aug 17 2015, 4:04 AM (396 w, 12 h)

Recent Activity

Dec 13 2022

mareko accepted D130170: [AMDGPU] Stop running IR code sinking pass.
Dec 13 2022, 9:02 AM · Restricted Project, Restricted Project
mareko added a comment to D130170: [AMDGPU] Stop running IR code sinking pass.

This can be merged. I have a patch for Mesa.

Dec 13 2022, 9:01 AM · Restricted Project, Restricted Project

Aug 8 2022

Herald added a project to D120312: [StructurizeCFG] Fix boolean not bug: Restricted Project.

@arsenm @foad Can we backport it to LLVM 14? LLVM 14 is unusable because of this bug.

Aug 8 2022, 12:09 PM · Restricted Project, Restricted Project

Jul 20 2022

mareko added a comment to D130170: [AMDGPU] Stop running IR code sinking pass.

I think we don't need the pass, but it's hard to tell.

Jul 20 2022, 10:29 AM · Restricted Project, Restricted Project

May 4 2022

mareko added a comment to D120544: [AMDGPU] Omit unnecessary waitcnt before barriers.

I suppose the only difference this makes here is potentially in the counter value we wait for? I.e., we may wait for vmcnt(N) with N != 0?

May 4 2022, 12:12 PM · Restricted Project, Restricted Project

May 3 2022

mareko added a comment to D120544: [AMDGPU] Omit unnecessary waitcnt before barriers.

The IR (NIR) that we translate to LLVM typically contains this sequence of instructions:

  1. memory_barrier (fence)
  2. control_barrier (s_barrier)
May 3 2022, 9:26 AM · Restricted Project, Restricted Project

May 2 2022

mareko added a comment to D120544: [AMDGPU] Omit unnecessary waitcnt before barriers.

We are going to insert amdgcn.s.waitcnt instead of a fence because fences wait for memory stores, which we don't want.

May 2 2022, 4:11 PM · Restricted Project, Restricted Project
mareko added a comment to D120544: [AMDGPU] Omit unnecessary waitcnt before barriers.

I believe HIP adds a release fence before the barrier and an acquire directly after. Would something like that make sense?

Yes, generally I expect that is what's required.

May 2 2022, 3:00 PM · Restricted Project, Restricted Project

Apr 21 2022

mareko added a comment to D120544: [AMDGPU] Omit unnecessary waitcnt before barriers.

Oh, I misunderstood what this patch does. It's possibly an unfortunate interaction between Mesa and LLVM, where Mesa assumed that the waitcnt was there, and your removal of it causes problems. In that case, can you please coordinate with Mesa to make sure it starts adding the required fence instructions in the LLVM IR? Cc @mareko

Apr 21 2022, 9:49 PM · Restricted Project, Restricted Project

Jan 15 2021

mareko added a comment to D94645: [AMDGPU] Fix llvm.amdgcn.init.exec and frame materialization.

BTW I felt the input in the name of llvm.amdgcn.init.exec.from.input some kind of means the argument is the function input argument in LLVM IR. I think we can clarify that the value to this intrinsic should be an function input argument. That is the only requirement I can see currently from frontend. So that we don't need to consider various hard to solve situations. @mareko What do you think?

Jan 15 2021, 11:22 PM · Restricted Project

Jan 14 2021

mareko added a comment to D94645: [AMDGPU] Fix llvm.amdgcn.init.exec and frame materialization.

This looks good to me, but I'm not very familiar with LLVM.

Jan 14 2021, 2:32 AM · Restricted Project

Sep 24 2020

mareko added a comment to D87674: [AMDGPU] Insert waitcnt after returning from call.

Using the new return behavior only for the calling conventions that Mesa doesn't use sounds like the right solution.

Sep 24 2020, 2:58 AM · Restricted Project

Sep 23 2020

mareko added a comment to D87674: [AMDGPU] Insert waitcnt after returning from call.

Yes, this commit is incorrect. It completely breaks code linking in Mesa OpenGL. s_waitcnt is required at the end of all global functions that return values.

Please revert. @nhaehnle

I don't understand why would it fail. This patch just moves s_waitcnt to the caller so they would be executed anyway. I think I am missing something. It would be helpful to root cause if we can isolate to a small test case.

Shader returns aren't real returns and the "caller" doesn't wait

Sep 23 2020, 9:57 AM · Restricted Project
mareko added a comment to D87674: [AMDGPU] Insert waitcnt after returning from call.

Yes, this commit is incorrect. It completely breaks code linking in Mesa OpenGL. s_waitcnt is required at the end of all global functions that return values.

Sep 23 2020, 9:40 AM · Restricted Project

Sep 9 2020

mareko added a comment to D84403: [AMDGPU] Use ds_read/write_b96/b128 when possible for SDag.

This breaks LDS. LLVMSetAlignment(inst, 4) on loads and stores has no effect. The IR says "align 4", yet the backend still selects b128.

On what subtargets? GFX9 and 10 should select b128 for align 4. That is the purpose of the patch. Are you saying it selects it for SI, CI or VI?

On GFX10. Apparently b128 with align 4 doesn't work there.

I've checked a couple Vulkan CTS tests that now produce b128 instructions for SDag and they work fine. I also did not find any regressions on others. Can you give us any more details? Or a test to reproduce the issue?

More information:

  • (gfx9 hasn't been tested)
  • gfx10.1 has the corruption in WGP mode only (CU mode works)
  • gfx10.3 works

It looks like it's a gfx10.1 hw bug in WGP mode, so a fix or workaround is needed. The driver always uses WGP mode.

Do you know what the driver sets SH_MEM_CONFIG.alignment_mode to?

Sep 9 2020, 12:05 AM · Restricted Project

Sep 2 2020

mareko added a comment to D84403: [AMDGPU] Use ds_read/write_b96/b128 when possible for SDag.

This breaks LDS. LLVMSetAlignment(inst, 4) on loads and stores has no effect. The IR says "align 4", yet the backend still selects b128.

On what subtargets? GFX9 and 10 should select b128 for align 4. That is the purpose of the patch. Are you saying it selects it for SI, CI or VI?

On GFX10. Apparently b128 with align 4 doesn't work there.

I've checked a couple Vulkan CTS tests that now produce b128 instructions for SDag and they work fine. I also did not find any regressions on others. Can you give us any more details? Or a test to reproduce the issue?

Sep 2 2020, 7:19 PM · Restricted Project
mareko added a comment to D84403: [AMDGPU] Use ds_read/write_b96/b128 when possible for SDag.

This breaks LDS. LLVMSetAlignment(inst, 4) on loads and stores has no effect. The IR says "align 4", yet the backend still selects b128.

On what subtargets? GFX9 and 10 should select b128 for align 4. That is the purpose of the patch. Are you saying it selects it for SI, CI or VI?

On GFX10. Apparently b128 with align 4 doesn't work there.

I've checked a couple Vulkan CTS tests that now produce b128 instructions for SDag and they work fine. I also did not find any regressions on others. Can you give us any more details? Or a test to reproduce the issue?

Sep 2 2020, 12:09 PM · Restricted Project

Sep 1 2020

mareko added a comment to D84403: [AMDGPU] Use ds_read/write_b96/b128 when possible for SDag.

This breaks LDS. LLVMSetAlignment(inst, 4) on loads and stores has no effect. The IR says "align 4", yet the backend still selects b128.

On what subtargets? GFX9 and 10 should select b128 for align 4. That is the purpose of the patch. Are you saying it selects it for SI, CI or VI?

Sep 1 2020, 6:16 AM · Restricted Project

Aug 31 2020

mareko added a comment to D84403: [AMDGPU] Use ds_read/write_b96/b128 when possible for SDag.

This breaks LDS. LLVMSetAlignment(inst, 4) on loads and stores has no effect. The IR says "align 4", yet the backend still selects b128.

Aug 31 2020, 8:20 PM · Restricted Project

Jul 29 2020

mareko added a comment to D84223: [AMDGPU] Don't combine memory intrs to v3i16.

Can you please cherry-pick this to the LLVM 11 branch?

Jul 29 2020, 4:04 AM · Restricted Project

Jan 6 2020

mareko added a comment to D71358: AMDGPU: Remove denormal subtarget features.

Can you push this after (not before) LLVM 10 has been branched?

Jan 6 2020, 12:33 PM · Restricted Project

Dec 30 2019

mareko updated subscribers of D71209: InstCombine: Don't rewrite phi-of-bitcast when the phi has other users.

@nhaehnle

Dec 30 2019, 12:36 PM · Restricted Project
mareko added a comment to D71209: InstCombine: Don't rewrite phi-of-bitcast when the phi has other users.

Would somebody please push this?

Dec 30 2019, 9:46 AM · Restricted Project

Dec 18 2019

mareko added a comment to D70315: [InstCombine][AMDGPU] Trim more components of *buffer_load.

This breaks gfx8. Can you revert?

Dec 18 2019, 11:35 AM · Restricted Project

Sep 30 2019

mareko accepted D67745: AMDGPU/GlobalISel: Add support for init.exec intrinsics.
Sep 30 2019, 8:44 AM

Sep 12 2019

mareko accepted D67338: AMDGPU: Add immarg tto llvm.amdgcn.init.exec.from.input.

Yes.

Sep 12 2019, 1:57 PM

Aug 27 2019

mareko accepted D66388: AMDGPU/GlobalISel: Implement addrspacecast for 32-bit constant addrspace.
Aug 27 2019, 3:04 PM
mareko accepted D66812: AMDGPU Reduce reported maximum group size to 1024.
Aug 27 2019, 3:04 PM
mareko added a comment to D66812: AMDGPU Reduce reported maximum group size to 1024.

It's supported by Mesa on gfx6-8. I'm OK with this though.

Aug 27 2019, 2:37 PM

Aug 19 2019

mareko added inline comments to D66388: AMDGPU/GlobalISel: Implement addrspacecast for 32-bit constant addrspace.
Aug 19 2019, 11:28 AM

Jul 2 2019

mareko created D64114: AMDGPU: Add missing code for GDS.
Jul 2 2019, 5:10 PM · Restricted Project

Jun 26 2019

mareko added a comment to D63420: AMDGPU: Fix s.buffer.load being marked as readnone.

This has a possibly negative impact on Mesa. Both SGPR usage and SGPR spilling increased. I don't see anything suspicious in the generated assembly other than instructions being reordered. I'd like to better understand the side effects of this patch.

Jun 26 2019, 10:21 AM

Jun 21 2019

mareko added a comment to D63420: AMDGPU: Fix s.buffer.load being marked as readnone.

This breaks Mesa:
LLVM ERROR: Cannot select: intrinsic %llvm.amdgcn.s.buffer.load

Jun 21 2019, 9:26 AM

Feb 7 2019

mareko accepted D57894: AMDGPU: Fix @llvm.amdgcn.wqm.vote implementation.
Feb 7 2019, 12:06 PM · Restricted Project

Jan 18 2019

mareko accepted D56880: AMDGPU: Remove llvm.SI.load.const.
Jan 18 2019, 7:54 AM

Jan 16 2019

mareko committed rL351351: AMDGPU: Add llvm.amdgcn.ds.ordered.add & swap.
AMDGPU: Add llvm.amdgcn.ds.ordered.add & swap
Jan 16 2019, 7:47 AM
mareko closed D52944: AMDGPU: Add llvm.amdgcn.ds.ordered.add & swap.
Jan 16 2019, 7:47 AM

Jan 14 2019

mareko committed rL351150: AMDGPU: Add a fast path for icmp.i1(src, false, NE).
AMDGPU: Add a fast path for icmp.i1(src, false, NE)
Jan 14 2019, 6:17 PM
mareko closed D52060: AMDGPU: Add a fast path for icmp.i1(src, false, NE).
Jan 14 2019, 6:17 PM
mareko added a comment to D52944: AMDGPU: Add llvm.amdgcn.ds.ordered.add & swap.

FYI, if there is no feedback on January 15, I'll commit this to get it into LLVM 8.0.

Jan 14 2019, 5:56 PM

Jan 11 2019

mareko added a comment to D52944: AMDGPU: Add llvm.amdgcn.ds.ordered.add & swap.

Ping

Jan 11 2019, 6:14 AM

Jan 10 2019

mareko accepted D55181: AMDGPU: Convert tests away from llvm.SI.load.const.
Jan 10 2019, 6:24 PM

Nov 28 2018

mareko added inline comments to D55030: [AMDGPU] Fold brcond (setcc zext(i1 x), 1, ne) -> brcond x.
Nov 28 2018, 5:04 PM

Nov 26 2018

mareko updated the diff for D52944: AMDGPU: Add llvm.amdgcn.ds.ordered.add & swap.

Use report_fatal_error, add SourceOfDivergence lines

Nov 26 2018, 8:40 PM
mareko added a comment to D52944: AMDGPU: Add llvm.amdgcn.ds.ordered.add & swap.

Any comments?

Nov 26 2018, 8:33 PM
mareko updated the diff for D52060: AMDGPU: Add a fast path for icmp.i1(src, false, NE).

Add InstCombine tests.

Nov 26 2018, 8:30 PM

Nov 20 2018

mareko added inline comments to D52060: AMDGPU: Add a fast path for icmp.i1(src, false, NE).
Nov 20 2018, 3:18 PM

Nov 12 2018

mareko added a comment to D52944: AMDGPU: Add llvm.amdgcn.ds.ordered.add & swap.

If I make m0 integer, DS_ORDERED_COUNT won't be a mem node.

Nov 12 2018, 9:14 PM
mareko added a comment to D52944: AMDGPU: Add llvm.amdgcn.ds.ordered.add & swap.

I might change the intrinsic to add the option to insert "s_and_saveexec s[N:M], 1" and "s_mov_b64 exec, s[N:M]" around the intrinsic to get an optimal single-lane block.

Nov 12 2018, 7:48 PM
mareko added a comment to D52944: AMDGPU: Add llvm.amdgcn.ds.ordered.add & swap.

Does this need to be marked as isSourceOfDivergence?

Now that you mention it, yes, even though it's for stupid reasons: I believe the ds_ordered_count instruction executes only in a single lane, so it's intuitively a uniform operation; however, it returns its result only in lane 0, so it's formally non-uniform.

Nov 12 2018, 7:43 PM
mareko abandoned D33865: Mark llvm.*annotation intrinsics as NoMem and Speculatable.
Nov 12 2018, 7:38 PM
mareko abandoned D52907: AMDGPU: Don't merge DS opcodes on SI to fix corruption in Hitman.

Why exactly? Is it possible the Mesa lds allocation isn’t aligning properly?

Since the issue only occurs on SI, I don't think Mesa is doing anything bad. Unless there is some LDS hw difference on SI...

I do remember one bug we have that may be related. We try to use the ds_read2_b32 with 4-byte signed trick on SI, without checking that we can use the offsets if the base address isn't known positive

Nov 12 2018, 7:37 PM

Nov 2 2018

mareko added a comment to D54042: [AMDGPU] Extend the SI Load/Store optimizer to combine more things..

I'm concerned that x8 and x16 loads will significantly increase SGPR usage and therefore SGPR spilling. We have a shader database with over 70 games and benchmarks and I guess the results will not be good after this is committed.

Nov 2 2018, 1:39 PM · Unknown Object (Project)

Oct 8 2018

mareko added a comment to D52907: AMDGPU: Don't merge DS opcodes on SI to fix corruption in Hitman.

Why exactly? Is it possible the Mesa lds allocation isn’t aligning properly?

Oct 8 2018, 6:25 PM
mareko added inline comments to D52944: AMDGPU: Add llvm.amdgcn.ds.ordered.add & swap.
Oct 8 2018, 2:59 PM
mareko added inline comments to D52944: AMDGPU: Add llvm.amdgcn.ds.ordered.add & swap.
Oct 8 2018, 2:56 PM
mareko added inline comments to D52944: AMDGPU: Add llvm.amdgcn.ds.ordered.add & swap.
Oct 8 2018, 8:14 AM

Oct 5 2018

mareko created D52944: AMDGPU: Add llvm.amdgcn.ds.ordered.add & swap.
Oct 5 2018, 1:34 PM
mareko added a comment to D52907: AMDGPU: Don't merge DS opcodes on SI to fix corruption in Hitman.

Multi-dword LDS opcodes seem to be the culprit.

Oct 5 2018, 1:34 PM
mareko updated the diff for D52060: AMDGPU: Add a fast path for icmp.i1(src, false, NE).

AMDGPU: Add a fast path for icmp.i1(src, false, NE)

Oct 5 2018, 1:31 PM
mareko added a comment to D52907: AMDGPU: Don't merge DS opcodes on SI to fix corruption in Hitman.

This isn't a correct fix. If there's an issue with 64-bit DS instructions, it's a lowering problem. If we can't use them for some reason, changing this here might be a helpful heuristic but as-is this is not a real fix

Oct 5 2018, 10:59 AM

Oct 4 2018

mareko created D52907: AMDGPU: Don't merge DS opcodes on SI to fix corruption in Hitman.
Oct 4 2018, 2:18 PM

Sep 29 2018

mareko accepted D52683: [AMDGPU] Fix for negative offsets in buffer/tbuffer intrinsics.
Sep 29 2018, 5:07 PM

Sep 27 2018

mareko added a comment to D52577: [AMDGPU] Fold copy (copy vgpr).

On a related note, another way to decrease VGPR usage is to fold immediates with more than 1 uses. The backend currently folds immediates with only 1 use.

Sep 27 2018, 10:03 AM

Sep 24 2018

mareko added a comment to D51969: [AMDGPU] Add an AMDGPU specific atomic optimizer..

What happens if a shader already does "if (threadID == 0) { do_atomic(); }"? Is the optimization skipped in this case?

Sep 24 2018, 6:41 PM · Unknown Object (Project)

Sep 21 2018

mareko added a comment to D52060: AMDGPU: Add a fast path for icmp.i1(src, false, NE).

Should the instcombine part change also to allow creation of i1 uses?

Sep 21 2018, 12:03 AM

Sep 13 2018

mareko created D52060: AMDGPU: Add a fast path for icmp.i1(src, false, NE).
Sep 13 2018, 2:30 PM

Aug 29 2018

mareko committed rL340959: AMDGPU: Handle 32-bit address wraparounds for SMRD opcodes.
AMDGPU: Handle 32-bit address wraparounds for SMRD opcodes
Aug 29 2018, 1:04 PM
mareko closed D51203: AMDGPU: Handle 32-bit address wraparounds for SMRD opcodes.
Aug 29 2018, 1:03 PM

Aug 28 2018

mareko updated the diff for D51203: AMDGPU: Handle 32-bit address wraparounds for SMRD opcodes.

This fixes GPU hangs with OpenGL bindless handle arithmetic.

Aug 28 2018, 11:50 PM
mareko added a comment to D51203: AMDGPU: Handle 32-bit address wraparounds for SMRD opcodes.

@arsenm Where do you have the patches that preserve NUW?

Aug 28 2018, 8:34 PM

Aug 27 2018

mareko added a comment to D51203: AMDGPU: Handle 32-bit address wraparounds for SMRD opcodes.

We can ignore old Mesa + new LLVM, because LLVM 7 is the first release to have 32-bit pointers, and I think we can fix that before release.

Aug 27 2018, 11:32 AM

Aug 24 2018

mareko added inline comments to D51203: AMDGPU: Handle 32-bit address wraparounds for SMRD opcodes.
Aug 24 2018, 10:26 PM
mareko added inline comments to D51203: AMDGPU: Handle 32-bit address wraparounds for SMRD opcodes.
Aug 24 2018, 9:20 AM

Aug 23 2018

mareko created D51203: AMDGPU: Handle 32-bit address wraparounds for SMRD opcodes.
Aug 23 2018, 6:08 PM

Aug 22 2018

mareko added a comment to D51098: [AMDGPU] Add support for multi-dword s.buffer.load intrinsic.
In D51098#1209699, @tpr wrote:

Marek, I will correct the spelling of your name in the commit message when I land this. :-)

Aug 22 2018, 12:46 PM

Aug 21 2018

mareko added inline comments to D47261: AMDGPU: bump AS.MAX_COMMON_ADDRESS to 6 since 32-bit addr space.
Aug 21 2018, 5:41 PM

Aug 20 2018

mareko accepted D50306: [AMDGPU] New buffer intrinsics.
Aug 20 2018, 4:23 PM
mareko added inline comments to D47261: AMDGPU: bump AS.MAX_COMMON_ADDRESS to 6 since 32-bit addr space.
Aug 20 2018, 2:28 PM

Aug 11 2018

mareko accepted D50469: AMDGPU: Check NSZ MI flag when folding omod.

Accepted.

Aug 11 2018, 11:02 AM

Aug 10 2018

mareko accepted D50486: MachineScheduler: Refactor setPolicy() to limit computing remaining latency.
Aug 10 2018, 9:31 AM

Aug 1 2018

mareko accepted D50148: AMDGPU: Partially fix handling of packed amdgpu_ps arguments.
Aug 1 2018, 12:10 PM

Jul 30 2018

mareko added a comment to D47261: AMDGPU: bump AS.MAX_COMMON_ADDRESS to 6 since 32-bit addr space.

The patch is missing a test. shader-db can reproduce it.

Jul 30 2018, 2:17 PM

Jul 13 2018

mareko added a comment to D49065: AMDGPU: Stop wasting argument registers with v3i32/v3f32.

This patch along with the prerequisite patch doesn't break Mesa.

What's the prerequisite patch?

Jul 13 2018, 11:36 AM

Jul 12 2018

mareko added a comment to D49065: AMDGPU: Stop wasting argument registers with v3i32/v3f32.

This patch along with the prerequisite patch doesn't break Mesa.

Jul 12 2018, 7:15 PM
mareko accepted D49128: AMDGPU: Properly handle shader inputs with split arguments.
Jul 12 2018, 7:12 PM

Jul 11 2018

mareko added a comment to D49128: AMDGPU: Properly handle shader inputs with split arguments.

This is a no-op change, right? Because the previous code also works.

Jul 11 2018, 11:52 AM

Jul 9 2018

mareko added inline comments to D49065: AMDGPU: Stop wasting argument registers with v3i32/v3f32.
Jul 9 2018, 6:34 PM

Jul 5 2018

mareko added a comment to D47261: AMDGPU: bump AS.MAX_COMMON_ADDRESS to 6 since 32-bit addr space.

FYI, this bug is also reproducible on Mesa OpenGL now.

Jul 5 2018, 12:37 PM

May 23 2018

mareko accepted D47261: AMDGPU: bump AS.MAX_COMMON_ADDRESS to 6 since 32-bit addr space.

LGTM.

May 23 2018, 7:59 AM

May 19 2018

mareko added a comment to D46992: [AMDGPU] Add perf hints to functions.

How does this pass affect shaders that use a lot of memory instructions but no pointers?

Can you give an example? What is a memory instruction without a pointer? As you may see, pass processes something which can cast to load, store, atomic or memory intrinsic. Everything else considered an ordinary instruction. For example if have an image in mind it is conservatively not considered memory instruction.

May 19 2018, 7:33 PM
mareko added a comment to D46992: [AMDGPU] Add perf hints to functions.

How does this pass affect shaders that use a lot of memory instructions but no pointers?

May 19 2018, 10:01 AM

May 16 2018

mareko added a comment to D46992: [AMDGPU] Add perf hints to functions.

How can UMDs disable this optimization?

May 16 2018, 9:56 PM

May 15 2018

mareko closed D46351: StructurizeCFG: fix inverting conditions.

Committed.

May 15 2018, 2:52 PM
mareko committed rL332404: AMDGPU: Add a missing test for the 128-bit local addr space option.
AMDGPU: Add a missing test for the 128-bit local addr space option
May 15 2018, 2:45 PM
mareko committed rL332403: StructurizeCFG: fix inverting conditions.
StructurizeCFG: fix inverting conditions
May 15 2018, 2:45 PM

Apr 10 2018

mareko committed rL329764: AMDGPU: enable 128-bit for local addr space under an option.
AMDGPU: enable 128-bit for local addr space under an option
Apr 10 2018, 3:53 PM

Apr 9 2018

mareko committed rL329591: AMDGPU: enable 128-bit for local addr space under an option.
AMDGPU: enable 128-bit for local addr space under an option
Apr 9 2018, 10:02 AM

Mar 16 2018

mareko added a comment to D44401: [AMDGPU] Always use IDX for load/store format intrinsics..

I think the unit change is a hint we should use different intrinsics for this

Mar 16 2018, 1:31 PM · Restricted Project

Feb 21 2018

mareko added inline comments to D42647: AMDGPU: Track physreg uses in SILoadStoreOptimizer.
Feb 21 2018, 9:48 AM