Page MenuHomePhabricator

Move core RDF files from lib/Target/Hexagon to CodeGen
Needs ReviewPublic

Authored by kparzysz on Jan 30 2017, 12:20 PM.

Details

Summary

The RDF now supports regmasks, which was the major missing feature preventing it from being used on other targets besides Hexagon.

[More tests have been done since the first patch was posted. See below for more details.]
This move includes the data-flow graph construction and recalculation of block live-ins. I have tested 3 targets: ARM, PowerPC and X86 using this scaffold, usually placed in post-RA pseudo-instruction expansion:

  1. Construct the DFG.
  2. Recalculate the block live-ins using the DFG.
  3. Replace the existing live-ins with the calculated ones.
  4. Carry on with the rest of the pass.

Then I checked each target via make check-llvm-codegen-<target>.

All tests on ARM and PowerPC pass, the X86 backend has 14 failures. At the first glance, about half of them are caused by the issue with EAX and AX having the same set of register units (in general E_X and _X). The remaining half seems to be caused by (as of yet) undetermined issues related to exception handling on Windows.

This patch does not introduce any functional changes, the only changes are in the locations of the files. Putting the files in CodeGen will make it easier to further test it and enable any potentially interested users to try it out.

New round of tests:

Implemented a target-independent RDF-optimization pass similar to Hexagon's RDF optimizations, i.e.:

  • copy propagation
  • dead code elimination
  • liveness recomputation

Not surprisingly, many tests failed in "make check-llvm-codegen". I have manually checked the differences for AArch64, ARM and PowerPC, and the final code appeared to be correct, and the differences seemed to have been caused by the optimizations taking effect.

These tests also involved RDFCopy.cpp and RDFDeadCode.cpp, but these files are not included in this patch (for no other reason than that they were not included in the original patch).

Diff Detail

Repository
rL LLVM

Event Timeline

kparzysz created this revision.Jan 30 2017, 12:20 PM
MatzeB edited edge metadata.Jan 30 2017, 1:38 PM

My intuition would be that needing/computing a full dataflow graph that late in the compilation process is a bad sign. Having said that this probably won't go away for Hexagon and sharing code between targets is a good idea. Do you know of any plans to use this in other targets?

Some random nitpicks following but I didn't really read the whole patch yet.

include/llvm/CodeGen/RDFGraph.h
11

should be /// \file

279

I usually declare bitsets like this:

enum MyBitSet {
   A = 1u << 0,
   B = 1u << 1,
   C = 1u << 2,
   // ...
   // or 1u << FirstBit, 1u << FirstBit+1, if you cannot start at 1
   Mask = 0xf // Though I usually avoid masks and rather try to pack stuff into separate bitfield
              // members instead of putting them all in a uint16_t
};

which makes it more obvious what is happening.

361

Use doxygen comments.

lib/CodeGen/RDFLiveness.cpp
20

There is an alternative algorithm floating around, for example https://reviews.llvm.org/D28934

Which I usually prefer because it is simpler and doesn't require dominance tree construction. But I may be biased here :)

hfinkel edited edge metadata.Jan 30 2017, 9:04 PM

My intuition would be that needing/computing a full dataflow graph that late in the compilation process is a bad sign. Having said that this probably won't go away for Hexagon and sharing code between targets is a good idea. Do you know of any plans to use this in other targets?

I'm certainly interested in testing this for PowerPC, and I'm happy to see this as target independent code. There are certainly cases where we want to do transformations post-RA where this kind of information is helpful (e.g. where we do register scavenging, shrink wrapping, etc.), and I'm definitely open to the possibility that this is the right way to get it.

We should definitely address the question of "Why is this needed currently?". If you're using this for copy propagation and DCE, why do you have dead code post-RA that was not eliminated by our pre-RA DCE? Why do you have copy-propagation opportunities that present themselves only after RA? Does this mean that you/we're missing opportunities to make better RA decisions?

Some random nitpicks following but I didn't really read the whole patch yet.

We should definitely address the question of "Why is this needed currently?". If you're using this for copy propagation and DCE, why do you have dead code post-RA that was not eliminated by our pre-RA DCE? Why do you have copy-propagation opportunities that present themselves only after RA? Does this mean that you/we're missing opportunities to make better RA decisions?

CP/DCE was part of the original motivation for this effort. The idea of doing some post-RA optimizations had been floating around for quite some time before this code was written. Then after it was written, it took some time before it was upstreamed. Back then, the code coming out of the register allocator was not great---there were quite a few cases of register copies left over after PHI elimination, which we thought could be removed. We were receiving bug reports from our customers about redundant register assignments (crossing basic block boundary), etc., and we thought that having a general framework would help solve that, as well as provide a way to address any future concerns related to register allocation specifics.[1] Eventually, the register allocator became a lot better, and so that issue was no longer the most important. The CP/DCE code uses target-specific hooks, which on Hexagon, allow us to do some cleanup in cases that the register allocator cannot handle. For example, we can convert a post-increment load into a regular load where the address register is not used afterwards (i.e. the "increment" part of the instruction is "dead"). We could also recognize certain Hexagon instructions as copies, for example "r1:0 = combine(r2, r3)", which is equivalent to "r1 = r2; r0 = r3". In a more specific case, like "r1:0 = combine(r1, r3)", it could be replaced with "r0 = r3".[2] Those are not sources of major improvements in code size of performance, but still do some useful things that help certain benchmarks. Another role that they play at the moment is helping to test developments in the graph construction and liveness computations.

One of existing use cases for CP are situations where a register copy is immediately followed by uses of the destination register. This creates a flow dependency and restricts packetization. Even in the absence of DCE, CP itself could make a difference. The note [2] also refers to such cases. I don't have a testcase handy that shows this, but it still does happen in our code.

[1] Not all such issues are correctible in the allocator itself, as not all of them qualify as "bugs". For example, having a pair of copies "r1 = r10; r2 = r18" takes two instructions, but a pair "r1 = r10; r0 = r18" could be replaced with a single instruction "r1:0 = combine(r10, r18)". This form of register renaming is not implemented yet, but it would be a good use case for this framework. So far most of the time has been spent on developing the framework itself, and not so much on actually utilizing it (although we do have other users than the CP/DCE).

[2] Other things unchanged, this wouldn't make much of a difference. The benefit of doing it is to relax restrictions on post-RA scheduling (which on Hexagon includes packetization).

kparzysz updated this revision to Diff 86631.Feb 1 2017, 6:57 AM

Converted comments with descriptions to doxygen format.

kparzysz marked 2 inline comments as done.Feb 1 2017, 6:58 AM
kparzysz added inline comments.
lib/CodeGen/RDFLiveness.cpp
20

The algorithm you linked to is for building/updating SSA, this one is for calculating block live-ins from existing SSA.

hiraditya added inline comments.Feb 23 2017, 10:50 AM
lib/CodeGen/RDFGraph.cpp
597

This function findBlock appears to be O(n) algorithm, and it is used in multiple places where this function is called for each BasicBlock in the function. Is there a way to speed up this algorithm.

kparzysz added inline comments.Mar 3 2017, 5:48 AM
lib/CodeGen/RDFGraph.cpp
597

The DFG has a private cache where it stores the addresses for quick lookup and it uses its own function to locate block nodes.

This function is restricted to use the public interfaces available from its class hierarchy and from the DFG, so it has to search all blocks. The search itself could be improved a bit, but it would remain O(n), and the code would become more complex.

asb added a subscriber: asb.Mar 26 2017, 6:26 AM
kparzysz updated this revision to Diff 95438.Apr 17 2017, 7:47 AM

Multiple fixes to RDF, addressed comments.

Increased the scope of testing: created an extra post-RA pass that ran the RDF-based copy propagation and dead code elimination, followed by liveness recomputation (in a similar way to what HexagonRDFOpt.cpp does). I manually checked "make check-llvm-codegen" failures on AArch64, ARM and PowerPC, and they seem to be caused by these optimizations modifying the code to a form not expected by the testcase. In all those cases the code appeared to still be correct.

kparzysz edited the summary of this revision. (Show Details)Apr 17 2017, 7:53 AM
MatzeB resigned from this revision.Aug 15 2017, 11:45 AM

This is a huge chunk of code to review, and to really form an opinion about it I believe it needs to be used in practice.
Overal the code looks fine to me, but I didn't dive in too deeply. I wonder if inventing a new allocator was really necessary here or what the point of TargetOperandInfo is (I would expect the register and regmask operand to have a specific clear meaning and this looks like a way to have targets override that?).

But all in all I'll leave review to whoever plans to use this code.

include/llvm/CodeGen/RDFGraph.h
11

seems unchanged.

926–941

You could use the Printable helper class here.

include/llvm/CodeGen/RDFLiveness.h
23–28

no indentation for namespaces, also a strange de-dent when the rdf namespace starts. Similar in the other files.

lib/CodeGen/RDFLiveness.cpp
20

Sorry, indeed.

arsenm added a subscriber: arsenm.Aug 15 2017, 11:48 AM

I've briefly looked at this before to use in AMDGPU, but was never entirely sure if this would actually help with my problems.