This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: shrink the size of BUFFER_LOAD_FORMAT_XYZW when possible
AbandonedPublic

Authored by nhaehnle on Oct 9 2015, 4:26 AM.

Details

Reviewers
tstellarAMD
Summary

This reduces register pressure and helps avoid some unnecessary waits
as well.

Diff Detail

Event Timeline

nhaehnle updated this revision to Diff 36937.Oct 9 2015, 4:26 AM
nhaehnle retitled this revision from to AMDGPU: shrink the size of BUFFER_LOAD_FORMAT_XYZW when possible.
nhaehnle updated this object.
nhaehnle added a reviewer: tstellarAMD.

You also need to add lit tests for this. Take a look in test/CodeGen/AMDGPU for examples.

lib/Target/AMDGPU/AMDGPUInstrInfo.cpp
326–327

Coding style braces go on the same line as function signature.

lib/Target/AMDGPU/SIISelLowering.cpp
2098–2099

Coding style.

2128–2131

Coding style for the brace, and also function arguments should go on the same line. You should reivew. http://llvm.org/docs/CodingStandards.html

You can also run clang-format on the file to get the correct style, but I'm sure there are other errors in the file, so this will probably update more than just your code.

lib/Target/AMDGPU/SIInstrInfo.td
2787–2788

You only need to create the mapping for the pseudo instructions, so I think this means you can drop isPseudo and Subtarget from the list.

lib/Target/AMDGPU/SIInstructions.td
921

Interesting, I did not now it was possible to inherit from multiclasses and classes, but as mentioned above, I think the pseudo instructions should inherit from ResizableLoad instead.

arsenm added a comment.Oct 9 2015, 9:28 AM

I think this should be done on the IR intrinsics in instcombine. Is there some advantage to doing it in the DAG? We don't produce formatted loads for any other reason

Matt, can you give me some hints as to how to go about this? Keep in mind that things are icky when reducing to a single f32, because we need to bypass the EXTRACT_SUBREG.

Do you think adjustWritemask (for texture sampling) should be moved analogously? There may be a reason why that is done where it's done.

lib/Target/AMDGPU/SIISelLowering.cpp
2128–2131

Heh. I actually read those standards, and produced what I feel is the most readable code following the rules spelled out in there (e.g., the braces thing is not actually in there). Function arguments on the same line is difficult and counter-productive for readability with the 80 columns limit (and also not in that document), etc.

Anyway, thanks for telling me about clang-format (yet another thing not mentioned in that document...) - I'll run that later and see what it has to say.

lib/Target/AMDGPU/SIInstrInfo.td
2787–2788

True, I only need them for the pseudo instructions, but how can I tell TableGen about that? The issue is that the _si and _vi variants are generated in the inner-most defm, and I attach the ResizableLoad class outside, so the _si and _vi variants are added to the InstrMapping table even though I don't need them, and dropping isPseudo and Subtarget leads to a legitimate error about multiple matches in the same mapping table row.

The thing is, it _would_ be nicer to add only the pseudos to the mapping table, but I ultimately gave up on that because I don't know how to add the ResizableLoad class only to the pseudo instructions that are loads. The defm at which the split into pseudo, si and vi happens is used by all MUBUF instructions, including stores and atomics.

If you have a cleaner way to do it, I'd be delighted, but at some point I just gave up fighting TableGen and went with what you see here.

lib/Target/AMDGPU/SIInstructions.td
921

See the above. I see no way of doing that other than duplicating MUBUF_m into MUBUF_m_load, MUBUF_m_store, and MUBUF_m_atomic (or something like that). And even then, there's still the issue of what to do about the other BUFFER_LOAD_ instruction that also go via MUBUF_Load_Helper.

Hmm.... conceptually, the issue with TableGen is that we have multiple dimensions in which instructions are created (pseudo/si/vi, offen/idxen/both, different types of load/store/atomic), and we want to filter out the rows in the InstrMapping based on _two_ (or more?) of those dimensions, which is something that InstrMapping does not support. So perhaps the correct way is to extend what InstrMapping can do - perhaps allow filtering rows based on values instead of just based on an attached class.

What do you think?

I still need to discuss with Matt the best way to implement this.

lib/Target/AMDGPU/SIISelLowering.cpp
2128–2131

Ok, you are right, those things aren't in the document, I think clang-format may be the definitive coding style now. Let's just keep things consistent with the rest of the file, which means braces on the same line and when lines are too long arguments, argument on the next line are aligned with arguments on the first line.

lib/Target/AMDGPU/SIInstructions.td
921

I think it will work if you have MUBUF_Psuedo class inherit from ResizableLoad.

arsenm added a comment.Oct 9 2015, 6:22 PM

Matt, can you give me some hints as to how to go about this? Keep in mind that things are icky when reducing to a single f32, because we need to bypass the EXTRACT_SUBREG.

That's why I think doing it in instcombine is easier. The set of possible users is shrunk, so you don't need to worry about as many of the variants of copy. You just need to see if you only have constant extractelement and shufflevector users

I think this should be done on the IR intrinsics in instcombine. Is there some advantage to doing it in the DAG? We don't produce formatted loads for any other reason

Do you mean LLVM instcombine or a new AMDGPU specific instcombine?

arsenm added a comment.Oct 9 2015, 6:41 PM

I think this should be done on the IR intrinsics in instcombine. Is there some advantage to doing it in the DAG? We don't produce formatted loads for any other reason

Do you mean LLVM instcombine or a new AMDGPU specific instcombine?

It can be done in LLVM instcombine. The intrinsics would have to be moved to include/llvm/IR/IntrinsicsAMDGPU.td (and ideally changed to use the amdgcn prefix and clean any other mistakes in how these are currently defined). There are already a few intrinsics handled there.

I think this should be done on the IR intrinsics in instcombine. Is there some advantage to doing it in the DAG? We don't produce formatted loads for any other reason

Do you mean LLVM instcombine or a new AMDGPU specific instcombine?

It can be done in LLVM instcombine. The intrinsics would have to be moved to include/llvm/IR/IntrinsicsAMDGPU.td (and ideally changed to use the amdgcn prefix and clean any other mistakes in how these are currently defined). There are already a few intrinsics handled there.

I thought about it some more and I think it's probably a bit much for instcombine. It should probably go in a new AMDGPU CodeGenPrepare like IR pass

Just to clarify, adjustWritemask (for MIMG) should then be similarly moved, right?

nhaehnle added inline comments.Oct 12 2015, 6:55 AM
lib/Target/AMDGPU/SIInstructions.td
921

Sorry, missed that.

That would cause buffer_store and _atomic instructions to also inherit from ResizableLoad, if I'm reading the code right.

nhaehnle abandoned this revision.Feb 21 2018, 6:55 AM