Overall this seems like a natural extension of the existing EarlyCSE handling of load/store. I don't see any major issues.
Mostly looks fine; couple minor suggestions.
LGTM. Sorry about the delay.
Among all the attributes that can be attached to an argument of a tail call, byval is unusual. It significantly changes the semantics of the call: when the call executes, instead of passing the pointer itself, it makes a copy of the data behind the pointer.
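As a rough illustration of that semantic difference (the struct and function names here are made up, not from the patch under review):

```llvm
%struct.S = type { i32, i32 }

declare void @callee(ptr byval(%struct.S) align 4)

define void @caller(ptr %p) {
  ; With byval, the call conceptually copies the %struct.S pointed to
  ; by %p into the callee's argument area; the pointer itself is not
  ; what gets passed. A tail call therefore can't just forward %p.
  tail call void @callee(ptr byval(%struct.S) align 4 %p)
  ret void
}
```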
If there are transformations that are specifically illegal with coroutines, I'd like to see a section in https://llvm.org/docs/Coroutines.html describing what transforms are and are not legal. Just saying that the current version of the coroutine lowering code can't handle it isn't sufficient.
Further, because of MachO limitations on relocation addends (have I said how much I hate that format in the last week? Time to do it again), it can't be anything inside the function, leaving only the function symbol itself.
I have a proposal in https://bugs.llvm.org/show_bug.cgi?id=46786#c20 which would solve this, among other issues. Not sure when I'll get around to actually implementing it.
Please fix clang-format warnings. Otherwise LGTM.
Please make sure we have test coverage for the SVE-only extracts (e.g. extracting the third element of a <vscale x 2 x i64>).
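A sketch of the kind of test case meant here (function name is illustrative):

```llvm
; Extracting lane 2 of a <vscale x 2 x i64> is only meaningful when the
; runtime vector length provides at least 3 elements, i.e. a wider-than-
; minimum SVE register, so NEON-style lowering can't handle it.
define i64 @extract_third_element(<vscale x 2 x i64> %v) {
  %e = extractelement <vscale x 2 x i64> %v, i32 2
  ret i64 %e
}
```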
even though it _doesn't_ match what the HW does, because the HW can signal and APFloat can't
The other codepaths should do the right thing in theory, so I'm fine with leaving them, I think.
I'm okay with taking this as an incremental improvement, and then looking into a different approach later.
If I'm following this correctly, with this change, getABIRegCopyCC has one remaining caller, and that call always returns None. So getABIRegCopyCC is useless, and the CallConv member of RegsForValue is always None? It seems weird that all this code got built up around a useless operation, but I guess that sort of thing can happen.
Wed, Sep 16
Sure, seems fine for now.
On a side-note, should we be trying to emit movw+vmov instead of a constant-pool load under any circumstances?
Err, actually, thinking about it a bit more, we should probably delete GenWidenVectorTruncStores in favor of using scalarizeVectorStore(), since they do basically the same thing.
This is by inspection - as was mentioned on the ticket, this code isn't active at all in the tests, and I've not been able to get it to fire.
I think the NEON code is trying to avoid producing an extra instruction if the result ends up getting used by a vector operation. SelectionDAG technically allows instructions that produce an i64 result to return it in a floating-point register, but it doesn't work very well in practice: the operations that consume it will force it back into a GPR. This is one of the issues GlobalISel's RegBankSelect solves.
Tue, Sep 15
Also, it would be nice to have some regression test coverage; add a Solaris RUN line to llvm/test/CodeGen/X86/stack-align2.ll ?
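Something along these lines, maybe (the exact triple is a guess; pick whichever Solaris triple the patch targets):

```llvm
; RUN: llc < %s -mtriple=x86_64-pc-solaris2.11 | FileCheck %s --check-prefix=SOLARIS
```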
Mon, Sep 14
Fri, Sep 11
The point of the cutoff wouldn't be to handle anything in the LLVM testsuite; it would be to prevent it from blowing up on someone's giant machine-generated function which isn't represented in the testsuite.
Maybe we should be explicitly checking for isCall here? It's not really right to assume that any instruction that doesn't have mayLoad/mayStore modifies memory.
Having a bunch of boolean modifiers floating around on memcpy sounds miserable to deal with; it makes it hard for code to just ignore the "complicated" memcpy operations if it doesn't want to reason about them. (It's already messy just dealing with volatile operations.)
I think I called this out in the comments on the aarch64 implementation, but ultimately, I think we want specialized MCFragments for pdata/xdata. The current AArch64 implementation has a couple of FIXMEs which can't be fixed in the current architecture (in particular, it doesn't interact with relaxation correctly).
Maybe worth noting somewhere that willreturn implies mustprogress.
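For illustration, the two attributes side by side (declarations are made up):

```llvm
; willreturn: the call is guaranteed to eventually return.
declare void @f() willreturn

; mustprogress only promises forward progress (eventually return, unwind,
; touch volatile memory, synchronize, ...). Since returning is one way to
; make progress, willreturn is the strictly stronger guarantee.
declare void @g() mustprogress
```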
Instead of trying to make the table relative to the jump/adr, can we use the approach we use for compressed jump tables (using the address of the first destination as the base) unconditionally? Seems less complicated overall. I guess making it work costs one extra instruction in the general case: if you want to handle functions larger than the range of adr, you need to use adrp instead of adr. But functions that large are rare, and the potential upside from duplicating jumptable jumps is pretty large.
Thu, Sep 10
Add an extra test so the divide path is exercised.
The length of the synthesized canonical prologue might differ a little from the actual one.
LGTM, but I still would like to see some numbers before merge.
Given that most jump tables are probably compressed anyway, I don't expect any big performance impact. But it would be nice to have some numbers.
I will point out that of all the modern compilers I could test, only Clang trunk deletes volatile stores, even followed by clearly unreachable code: https://compiler-explorer.com/z/EGTofn
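In IR terms, the pattern in question looks roughly like this (a minimal sketch, not the exact reproducer from the link):

```llvm
define void @f(ptr %p) {
  ; Everything after the store is provably dead, but deleting the
  ; volatile store itself changes observable behavior, so it must stay.
  store volatile i32 1, ptr %p
  unreachable
}
```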
You can't blindly specify "-mcx16" without checking the target.
Wed, Sep 9
I'm still a little skeptical about the short-term plan with LTO, but I'm not strongly against it.
Unfortunately, in practice, this isn't ever matched by the code generated by LLVM.