My initial reaction was that the PHI was invalid, but according to @qcolombet it's OK to have different register banks on the operands.
It looks like the build isn't passing the new flag to cmake. Does the buildbot need to be restarted?
I don't like fiddling with the ABI based on CPU features; it's common to mix code with different CPU features enabled (particularly for desktop/mobile CPUs; less so on microcontrollers, but it can still happen). If we want to support some other ABI for performance reasons, we should make the user request it explicitly.
What is the restriction on the register classes of PHI operands, if they don't have to be the same register class as the result? I don't see any relevant documentation, or checks in MachineVerifier::checkPHIOps().
Thu, Jan 16
We currently don't have any mechanism for restricting builtins to specific operating systems. I guess we could add one, but this change doesn't seem like a compelling argument to add that capability.
Did you audit the other uses of dyn_cast<MCEncodedFragment>?
This makes it impossible to do a neat trick when using NEON intrinsics: one can load a number of constants using a single vector load, which are then repeatedly used to multiply whole vectors by one of the constants. This trick is used for a nice performance upside (2.5% to 4% on one microbenchmark) in libjpeg-turbo.
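A rough IR-level sketch of the pattern (hypothetical constants; the lane splat plus multiply is what the backend matches to NEON's by-scalar vmul):

```
@consts = global <4 x i16> <i16 3, i16 5, i16 7, i16 9>

define <4 x i16> @scale(<4 x i16> %v) {
  ; one vector load fetches all of the constants at once
  %c = load <4 x i16>, <4 x i16>* @consts
  ; splat lane 2 of the constant vector and multiply the whole input by it
  %s = shufflevector <4 x i16> %c, <4 x i16> undef,
                     <4 x i32> <i32 2, i32 2, i32 2, i32 2>
  %m = mul <4 x i16> %v, %s
  ret <4 x i16> %m
}
```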
So I believe that after this change, we are no longer guaranteed that FP + 16 = SP at exit, correct?
Apparently the ARM-mode LDRD is a bit stranger than I realized. From the ARM manual: if t2 == 15 || m == 15 || m == t || m == t2 then UNPREDICTABLE;. I guess we've managed to avoid running into this in the past by never generating the register form of ldrd.
This is very hacky, but it might be the least-bad alternative. I mean, we could change D71082 so it doesn't allow system headers to define memset, but that seems worse.
Wed, Jan 15
LGTM with one question
I'm not really tracking performance for 32-bit armv8-a at the moment... but I would be fine with changing the default with appropriate testing. I don't think we should have core-specific heuristics unless there's evidence some core is actually different in a meaningful way.
Can we also add mul patterns for targets that have SVE, but not SVE2?
It is not redundant with respect to -mfloat-abi=soft, as it leaves the door open to passing integer vector arguments via the so-called "FP registers".
Please add a testcase to make sure we can round-trip from .ll->.bc->.ll.
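A minimal sketch of what I have in mind (hypothetical content; any construct touched by this patch would do):

```
; RUN: llvm-as < %s | llvm-dis | FileCheck %s

; CHECK: shufflevector
define <4 x i32> @t(<4 x i32> %v) {
  %s = shufflevector <4 x i32> %v, <4 x i32> undef, <4 x i32> zeroinitializer
  ret <4 x i32> %s
}
```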
Tue, Jan 14
This makes sense.
Okay. Please let me know if you want me to review anything.
The check is false for most combinations of int and FP types, except maybe i32 and f64.
In terms of the user-visible behavior, I guess my question is whether it would make sense to add "-mfpu=mve"/"-mfpu=mve.fp", and make "-mfpu=none" mean "no FP registers". I'm not sure why a user would specify "-mfpu=none" if they didn't want to disable the floating-point registers altogether.
Mon, Jan 13
Fixed one more issue. While I was at it, got rid of the old overload of ConstantExpr::getShuffleVector, which only had a few remaining uses.
Rebase. Address review comment. More work on bitcode.
I'd prefer to just add writeonly markings in places where they're missing. For llvm.memset in particular, can we mark it IntrWriteMem?
LGTM. Do you have commit access?
Do we have any negative tests for immediates that are outside the range of the SVE mul immediate form?
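For example (a sketch, assuming the immediate form only accepts a signed 8-bit immediate):

```
; 257 is out of range, so this should select the register form of mul
define <vscale x 4 x i32> @mul_oob(<vscale x 4 x i32> %a) {
  %ins   = insertelement <vscale x 4 x i32> undef, i32 257, i32 0
  %splat = shufflevector <vscale x 4 x i32> %ins, <vscale x 4 x i32> undef,
                         <vscale x 4 x i32> zeroinitializer
  %r = mul <vscale x 4 x i32> %a, %splat
  ret <vscale x 4 x i32> %r
}
```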
Are there any cores where we don't want FeatureDontRestrictIT? I mean, multi-instruction IT blocks were formally deprecated, but I'm not sure anyone actually took advantage of that.
Please add a comment explaining each isScalableVector(), to make it clear why we're excluding them. Otherwise LGTM
Target-independent changes look fine.
The new IR intrinsic makes more sense.
Eagerly evaluating based on MaxAtomicPromoteWidth seems fine... assuming we're actually setting MaxAtomicPromoteWidth to something appropriate. The value on PowerPC looks wrong.
Probably if I was designing these flags from scratch, I wouldn't choose the semantics of "-mfpu=none is *very* similar to -mfloat-abi=soft, only that it should not disable MVE-I.", but I guess we're following gcc here?
Fri, Jan 10
LGTM with one minor request.
Does it make sense to have a special case for vscale * 1 or vscale * -1? Not that it's likely to come up in most cases, but it would look pretty silly to generate a madd that multiplies by 1.
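Something like this hypothetical sketch, assuming the intrinsic spelling from this patch:

```
declare i64 @llvm.vscale.i64()

define i64 @f() {
  %vs = call i64 @llvm.vscale.i64()
  %n = mul i64 %vs, 1   ; emitting a madd that multiplies by 1 would be silly
  ret i64 %n
}
```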
My concern with this is that there might be some places where we're assuming the backedge of a loop latch is exactly one edge, and you might not have found them. There are a lot of places that call getLoopLatch or isLoopSimplifyForm, we don't really have much regression test coverage for this sort of construct, and the C/C++ tests most people run probably have few switches like this.
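As a concrete sketch, a single latch block whose switch contributes two backedges (@pick is a hypothetical opaque call):

```
declare i32 @pick()

define void @g() {
entry:
  br label %header

header:                       ; header and latch are the same block here
  %x = call i32 @pick()
  switch i32 %x, label %exit [ i32 0, label %header
                               i32 1, label %header ]

exit:
  ret void
}
```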
Figured out the changes necessary to get bitcode reading/writing working. ninja check now passes.
Add more comments, fixed the GlobalISel representation of shuffles, misc other cleanups. Still haven't addressed the bitcode reader/writer issues.
I'd like to see a few more testcases here, to cover the new functionality.
Thu, Jan 9
Instead of returning a symbol difference from evaluatePCRelLo and folding it in MCAssembler, can we just fold the symbol difference to a constant in evaluatePCRelLo itself? MCExpr::evaluateAsRelocatableImpl does something like that in EvaluateSymbolicAdd.
(Reupload with context.)
Wed, Jan 8
It might be possible to rearrange Low Overhead Loops to run before ConstantIslands, but you'd probably need to do more to make it work properly. I don't think ConstantIslands knows how to handle the branches generated by LowOverheadLoop. If you think it's necessary, please split it into a separate patch with proper tests.
Tue, Jan 7
the approach that's here has proved sufficient in all the cases we can come up with
it requires . and .Ltmp0 be in the same fragment [...] This is checked earlier by ensuring findAssociatedFragment() matches for each.
I think ultimately, the problem here is that the relocation is marked FKF_IsPCRel. That has a specific meaning that doesn't apply here: the ultimate value encoded into the addi is based on the distance between two symbols: the symbol in the pcrel_lo, and the symbol in the corresponding pcrel_hi. The address of the addi itself isn't relevant.
I don't understand what you're doing here.
http://lab.llvm.org:8011/builders/aosp-O3-polly-before-vectorizer-unprofitable is currently broken, I think due to this patch. It looks like polly isn't getting linked into clang by default anymore?
Mon, Jan 6
better reflect the difference in the semantics of the corresponding instructions
Needs a LangRef update.
I think you need to check that the elements are byte-sized?
Sun, Dec 29
What happened to this patch?
Mon, Dec 23
Wrong SMAX; you're referring to VECREDUCE_SMAX, which isn't relevant here. There is no IR intrinsic for ISD::SMAX; it's pattern-matched from select instructions.
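That is, the backend recognizes the compare-plus-select idiom, e.g.:

```
define i32 @smax(i32 %a, i32 %b) {
  %c = icmp sgt i32 %a, %b
  %m = select i1 %c, i32 %a, i32 %b   ; matched to ISD::SMAX during lowering
  ret i32 %m
}
```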
Fri, Dec 20
and it seems to involve a lot of AST traversal
Do we actually need an intrinsic for this, as opposed to adding patterns for ISD::SMAX etc.?
Would it make sense to merge int_aarch64_sve_ld1_gather and int_aarch64_sve_ld1_gather_imm into one intrinsic? They're sort of similar... maybe not worth messing with it, though.
It is expected that targets will implement custom lowering to proper machine instructions for better performance.
Can we guarantee "sizeof(<vscale x 1 x i8>) == vscale" for all future targets?