This is an archive of the discontinued LLVM Phabricator instance.

[DAG] Enable ISD::SRL SimplifyMultipleUseDemandedBits handling inside SimplifyDemandedBits
Closed, Public

Authored by RKSimon on Apr 9 2020, 7:20 AM.

Details

Summary

This patch allows SimplifyDemandedBits to call SimplifyMultipleUseDemandedBits in cases where the ISD::SRL source operand has other uses, enabling us to peek through the shifted value if we don't demand all the bits/elts.

This is another step towards removing SelectionDAG::GetDemandedBits and just using TargetLowering::SimplifyMultipleUseDemandedBits.

There are a few cases where we end up with extra register moves, which I think we can accept in exchange for the increased ILP.
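For illustration, a rough sketch of the shape of this change inside the ISD::SRL handling of TargetLowering::SimplifyDemandedBits (this is not the exact committed code; the surrounding checks are simplified and the local names are assumed):

// ISD::SRL case, with a constant shift amount ShAmt already extracted.
// The bits we need from the shifted value are the demanded bits moved
// back up by the shift amount.
APInt DemandedSrcBits = DemandedBits << ShAmt;

// Even if Op0 has other uses, ask SimplifyMultipleUseDemandedBits for a
// simpler expression that matches Op0 on just those bits/elts, and shift
// that instead.
if (SDValue DemandedOp0 = SimplifyMultipleUseDemandedBits(
        Op0, DemandedSrcBits, DemandedElts, TLO.DAG, Depth + 1)) {
  SDValue NewShift =
      TLO.DAG.getNode(ISD::SRL, dl, VT, DemandedOp0, Op.getOperand(1));
  return TLO.CombineTo(Op, NewShift);
}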

Diff Detail

Unit Tests: Failed

Event Timeline

RKSimon created this revision.Apr 9 2020, 7:20 AM
Herald added a project: Restricted Project.Apr 9 2020, 7:20 AM
RKSimon added a subscriber: foad.Apr 9 2020, 8:20 AM
RKSimon added inline comments.
llvm/test/CodeGen/AMDGPU/trunc-combine.ll
148

@arsenm @foad Not sure if pulling out the immediate is a good idea or not - shouldn't a u16 immediate be cheap?

arsenm added inline comments.Apr 9 2020, 9:26 AM
llvm/test/CodeGen/AMDGPU/trunc-combine.ll
148

This is worse. Integer constants -16 to 64 and a handful of FP values are free, but 0xffff is not so it requires materialization.

RKSimon planned changes to this revision.Jun 22 2020, 12:15 PM
RKSimon planned changes to this revision.Aug 2 2020, 10:22 AM

still looking at the remaining regressions

RKSimon planned changes to this revision.Sep 9 2020, 8:57 AM
RKSimon updated this revision to Diff 309539.Dec 4 2020, 7:57 AM
RKSimon edited the summary of this revision.

rebase

yubing added a subscriber: yubing.Dec 7 2020, 5:04 AM
RKSimon added inline comments.Jan 26 2021, 4:25 AM
llvm/test/CodeGen/RISCV/rv64Zbp.ll
1105

Looks like we've defeated the RISCVISD::GORCI matching code

craig.topper added inline comments.Jan 26 2021, 12:43 PM
llvm/test/CodeGen/RISCV/rv64Zbp.ll
1105

Running the tests through instcombine also breaks GORCI matching.

craig.topper added inline comments.Jan 26 2021, 12:47 PM
llvm/test/CodeGen/RISCV/rv64Zbp.ll
1105

It's also worth noting that the failing tests repeat the same gorc pattern twice, which is redundant. The test was trying to check that we could detect that redundancy. I guess this patch may have seen through some of the redundancy?

RKSimon planned changes to this revision.Jun 3 2021, 4:29 AM
RKSimon updated this revision to Diff 361510.Jul 25 2021, 8:29 AM

rebase (still needs work)

RKSimon planned changes to this revision.Jul 25 2021, 8:29 AM

I've raised https://bugs.llvm.org/show_bug.cgi?id=51209 about the poor quality of the gorc2 pattern matching and the gorc2, gorc2 -> gorc2 tests.

@RKSimon are there other problems with this patch than just the GORCI matching?

@RKSimon are there other problems with this patch than just the GORCI matching?

The GORCI matching is the main one.

There are also some minor issues with MatchRotate - we should be allowed to match rotate/funnel by constant pre-legalization (see ARM/ror.ll), as that can be re-expanded later without any harm done, before we see through the pattern and lose it. Although now that we match this quite well in InstCombine, I'm not sure this is as likely to happen.
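To make the idea concrete, a minimal sketch of the kind of relaxation meant here (illustrative only; the names are assumed and the real DAGCombiner::MatchRotate logic is more involved):

// Whether it is reasonable to form a rotate node at this point. A rotate by a
// constant amount could be allowed even before the target reports ROTL/ROTR
// as legal, because legalization can expand it back into the original
// shift+or sequence with no loss.
static bool canFormRotate(const TargetLowering &TLI, EVT VT,
                          bool LegalOperations, bool AmtIsConstant) {
  if (TLI.isOperationLegalOrCustom(ISD::ROTL, VT) ||
      TLI.isOperationLegalOrCustom(ISD::ROTR, VT))
    return true;
  // Hypothetical extra case discussed above: constant rotate amounts are
  // harmless to form pre-legalization.
  return AmtIsConstant && !LegalOperations;
}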

lenary removed a subscriber: lenary.Nov 2 2021, 6:05 AM
RKSimon updated this revision to Diff 391281.Dec 2 2021, 5:10 AM

rebase - squashed a few more regressions...

RKSimon planned changes to this revision.Dec 10 2021, 2:19 AM
RKSimon planned changes to this revision.Jan 23 2022, 11:39 AM
Herald added a project: Restricted Project.Apr 6 2022, 2:53 AM
RKSimon planned changes to this revision.Apr 6 2022, 2:53 AM
RKSimon planned changes to this revision.May 8 2022, 6:10 AM

Waiting for D124839 to land

RKSimon updated this revision to Diff 429424.May 14 2022, 2:16 AM
RKSimon retitled this revision from [DAG] Enable ISD::SHL/SRL SimplifyMultipleUseDemandedBits handling (WIP) to [DAG] Enable ISD::SRL SimplifyMultipleUseDemandedBits handling inside SimplifyDemandedBits (WIP).
RKSimon edited the summary of this revision.

Rebased after D124839 to just handle ISD::SRL shifts

RKSimon added inline comments.May 14 2022, 2:22 AM
llvm/test/CodeGen/AMDGPU/trunc-combine.ll
148

@arsenm @foad At EuroLLVM Matt suggested that maybe we should increase the tolerance to 2 uses of the large immediates before pulling out the constant?

llvm/test/CodeGen/ARM/uxtb.ll
112

I'm going to take a look at this, but I'm really not familiar with the UXTB matching code, so any pointers would be appreciated.

arsenm added inline comments.May 16 2022, 6:17 AM
llvm/test/CodeGen/AMDGPU/trunc-combine.ll
148

s_mov_b32 K + 2 * v_and_b32_e32 = 16 bytes, 12 cycles
2 * (v_and_b32_e32 K) = 16 bytes, 8 cycles, which is clearly better.

3 * (v_and_b32_e32 K) = 24 bytes, 12 cycles

So 2 uses of a constant seems plainly better for VOP1/VOP2 ops. Above that it becomes a code size vs. latency tradeoff.

arsenm added inline comments.May 16 2022, 6:23 AM
llvm/test/CodeGen/AMDGPU/trunc-combine.ll
148

This decision is also generally made by SIFoldOperands. Probably need to fix it there and not in the DAG

foad added inline comments.May 16 2022, 6:35 AM
llvm/test/CodeGen/AMDGPU/trunc-combine.ll
148

I'm strongly in favour of never pulling out the constant (or rather, always folding into the instruction) and I have patches to that effect starting with D114643, which I'm hoping to get back to pretty soon.

RKSimon updated this revision to Diff 431345.May 23 2022, 5:40 AM
RKSimon edited the summary of this revision.

rebase

foad added a comment.May 23 2022, 5:54 AM

AMDGPU changes LGTM.

RKSimon added inline comments.
llvm/test/CodeGen/ARM/uxtb.ll
112

instcombine optimises this as well:

define i32 @test10(i32 %p0) {
  %tmp1 = lshr i32 %p0, 7
  %tmp2 = and i32 %tmp1, 16253176
  %tmp4 = lshr i32 %p0, 12
  %tmp5 = and i32 %tmp4, 458759
  %tmp7 = or i32 %tmp5, %tmp2
  ret i32 %tmp7
}

which has the same problem:

_test10:
@ %bb.0:
        mov     r1, #248
        mov     r2, #7
        orr     r1, r1, #16252928
        orr     r2, r2, #458752
        and     r1, r1, r0, lsr #7
        and     r0, r2, r0, lsr #12
        orr     r0, r0, r1
        bx      lr
RKSimon added inline comments.May 26 2022, 10:38 AM
llvm/test/CodeGen/Thumb2/thumb2-uxtb.ll
175

same problem - instcombine will have already optimized this to:

define i32 @test10(i32 %p0) {
  %tmp1 = lshr i32 %p0, 7
  %tmp2 = and i32 %tmp1, 16253176
  %tmp4 = lshr i32 %p0, 12
  %tmp5 = and i32 %tmp4, 458759
  %tmp7 = or i32 %tmp5, %tmp2
  ret i32 %tmp7
}

It feels like I'm avoiding the issue - but should I update the arm/thumb2 UXTB16 tests to match what the middle-end will have generated?

dmgreen added inline comments.May 27 2022, 6:37 AM
llvm/test/CodeGen/ARM/uxtb.ll
112

I was taking a look. The test is super old now, so old that it had signed types when it was originally added.

I was surprised to see that the 'and 0x70007' is being recognised via an 'and 0xff00ff' tablegen pattern - it goes into SelectionDAGISel::CheckAndMask, which checks that the other mask bits are already 0.

I think that is what this is trying to test - that a smaller and mask still matches the UXTB16. Is it possible to change it to something that still captures that, without relying on the multi-use fold of the %tmp2 not happening?

Maybe something like this?

%p = and i32 %p0, 3
%a = shl i32 65537, %p
%b = lshr i32 %a, 1
%tmp7 = and i32 %b, 458759
RKSimon added inline comments.May 30 2022, 1:59 PM
llvm/test/CodeGen/ARM/uxtb.ll
112

Thanks for the hint - I'll give it a try

RKSimon updated this revision to Diff 433345.Jun 1 2022, 3:23 AM

rebase with alternative uxtb16 tests

RKSimon added inline comments.Jun 1 2022, 3:25 AM
llvm/test/CodeGen/ARM/uxtb.ll
112

Thanks @dmgreen - those still match fine. Should I pre-commit these new tests and possibly alter the existing test10 variants with the -instcombine optimized IR to show they already fail to match?

dmgreen added inline comments.Jun 1 2022, 7:03 AM
llvm/test/CodeGen/ARM/uxtb.ll
112

That sounds good to me.

RKSimon updated this revision to Diff 433398.Jun 1 2022, 7:42 AM
RKSimon edited the summary of this revision.

rebase

RKSimon added inline comments.
llvm/test/CodeGen/SystemZ/store_nonbytesized_vecs.ll
128–139

@jonpa @uweigand These tests are proving very fragile depending on the order of ands/shifts - do you think SystemZ should prefer masking leading/trailing bits with shift pairs over shift+and / and+shift? We have TLI::shouldFoldConstantShiftPairToMask to handle that.
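For reference, a minimal sketch of how a target can steer that choice through this hook (the target name and the unconditional policy here are purely illustrative, not an existing override):

// DAGCombiner consults this hook before it folds a constant shift pair
// (e.g. srl (shl X, C1), C2) into a single shift plus an AND mask.
// Returning true allows the and-mask form; returning false keeps the
// shift pair.
bool MyTargetLowering::shouldFoldConstantShiftPairToMask(
    const SDNode *N, CombineLevel Level) const {
  // Illustrative policy: always prefer the and-mask form, matching the
  // preference expressed for SystemZ later in this thread.
  return true;
}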

uweigand added inline comments.Jun 10 2022, 5:23 AM
llvm/test/CodeGen/SystemZ/store_nonbytesized_vecs.ll
128–139

Well, this specific test only loads a 3 x i31 vector and then stores it back unmodified, so ideally, however the masking is done, it should be optimized away as unnecessary in either case. That's what currently happens; I'm not sure why this is changing with this PR.

In general, I think using an and-mask would be preferable over a shift pair on SystemZ.

RKSimon planned changes to this revision.Jun 10 2022, 5:32 AM

Thanks @uweigand I'll take another look at this soon

RKSimon updated this revision to Diff 438291.Jun 20 2022, 1:36 AM

rebase after D125836

RKSimon planned changes to this revision.Jun 20 2022, 1:36 AM
RKSimon updated this revision to Diff 443924.Jul 12 2022, 5:36 AM
RKSimon edited the summary of this revision.

rebase and prefer SimplifyDemandedBits over GetDemandedBits for trunc stores

RKSimon updated this revision to Diff 444009.Jul 12 2022, 10:42 AM

Added (or (and X, C1), (and (or X, Y), C2)) -> (or (and X, C1|C2), (and Y, C2)) fold to try to reduce the SystemZ regression
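A rough sketch of what such a fold might look like inside DAGCombiner::visitOR (illustrative only; commuted operands and one-use checks are omitted, and this is not necessarily the code in the diff):

// Fold (or (and X, C1), (and (or X, Y), C2))
//   -> (or (and X, C1|C2), (and Y, C2))
// because (X|Y)&C2 == (X&C2)|(Y&C2) and (X&C1)|(X&C2) == X&(C1|C2).
SDValue N0 = N->getOperand(0), N1 = N->getOperand(1);
ConstantSDNode *C1, *C2;
if (N0.getOpcode() == ISD::AND && N1.getOpcode() == ISD::AND &&
    (C1 = isConstOrConstSplat(N0.getOperand(1))) &&
    (C2 = isConstOrConstSplat(N1.getOperand(1))) &&
    N1.getOperand(0).getOpcode() == ISD::OR &&
    N1.getOperand(0).getOperand(0) == N0.getOperand(0)) {
  SDValue X = N0.getOperand(0);
  SDValue Y = N1.getOperand(0).getOperand(1);
  EVT VT = N->getValueType(0);
  SDLoc DL(N);
  SDValue NewMask =
      DAG.getConstant(C1->getAPIntValue() | C2->getAPIntValue(), DL, VT);
  SDValue AndX = DAG.getNode(ISD::AND, DL, VT, X, NewMask);
  SDValue AndY = DAG.getNode(ISD::AND, DL, VT, Y, N1.getOperand(1));
  return DAG.getNode(ISD::OR, DL, VT, AndX, AndY);
}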

spatel added inline comments.Jul 12 2022, 1:25 PM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
6871 ↗(On Diff #444009)

This could be a preliminary patch. I don't think we'd get that in IR either (even without extra uses):
https://alive2.llvm.org/ce/z/g61VRe

spatel added inline comments.Jul 12 2022, 1:50 PM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
6871 ↗(On Diff #444009)

If I'm reading the SystemZ debug spew correctly, we should have gotten this transform to fire twice, so it would do this:
https://alive2.llvm.org/ce/z/tUsepa
...but we miss it because we don't revisit the last 'or' node? Is that what D127115 would solve?

RKSimon added inline comments.Jul 17 2022, 11:21 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
6871 ↗(On Diff #444009)

I've confirmed that D127115 solves the SystemZ fun3 regression but not fun2

IMO the fun2 regression probably shouldn't block the patch from being merged. I've looked into the sequences, and actually neither of them is even close to optimal.

Looking at the semantics, we have 8 x i32 inputs, which need to be truncated to i31, concatenated, and then stored, occupying 31 bytes of memory. Memory is written via three 8-byte stores, followed by a 4-byte, a 2-byte, and a 1-byte store, which does look optimal to me. However, the computation of the 64-bit values to be stored is not.

The first of these should be the value

(A << 33) | ((B << 2) & 0x1fffffffc) | ((C >> 29) & 3)

where A, B, and C are the first three i32 inputs.

However, the computation being performed is more like

((A << 25) | ((B >> 6) & 0x01ffffff)) << 8
| ((B << 58) | ((C & 0x7fffffff) << 27)) >> 56

which gets the correct result, but in about double the number of instructions or cycles that should be required.

While the variant with this PR is even slightly worse than the variant before, that's probably not really relevant given the fact both sequences are rather inefficient. Ideally, we could fix this to get (close to) an optimal sequence, but that would be a different issue. (I'm not even sure yet whether the current inefficiency is due to the middle end or the back end.)
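A quick standalone check (assuming the two expressions quoted above are transcribed correctly) that they really do compute the same 64-bit word, just with very different instruction counts:

#include <cassert>
#include <cstdint>
#include <cstdio>

// The "optimal" formulation quoted above.
static uint64_t packedOptimal(uint32_t A, uint32_t B, uint32_t C) {
  return ((uint64_t)A << 33) | (((uint64_t)B << 2) & 0x1fffffffcULL) |
         (((uint64_t)C >> 29) & 3);
}

// The formulation the backend currently emits, as quoted above.
static uint64_t packedGenerated(uint32_t A, uint32_t B, uint32_t C) {
  uint64_t Hi = ((uint64_t)A << 25) | ((B >> 6) & 0x01ffffffu);
  uint64_t Lo = ((uint64_t)B << 58) | ((uint64_t)(C & 0x7fffffffu) << 27);
  return (Hi << 8) | (Lo >> 56);
}

int main() {
  // A few spot checks that both expressions agree.
  const uint32_t Vals[] = {0u, 1u, 0x7fffffffu, 0xffffffffu, 0x12345678u};
  for (uint32_t A : Vals)
    for (uint32_t B : Vals)
      for (uint32_t C : Vals)
        assert(packedOptimal(A, B, C) == packedGenerated(A, B, C));
  puts("formulations agree");
  return 0;
}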

Thanks - I have a lot of individual DAG / SimplifyDemanded patches in progress atm, plus we're now getting closer to completing D127115.

A few patches still have minor regressions that I'm addressing, but for this one in particular I've been wondering how much of a real-world issue illegal-type copies like this actually are. If we were further away from the 15.x branch I'd ask to get this in and we'd make sure to address it once all the patches are in, but given how close we are I'm going to wait for now.

RKSimon updated this revision to Diff 447979.Jul 27 2022, 3:19 AM
RKSimon retitled this revision from [DAG] Enable ISD::SRL SimplifyMultipleUseDemandedBits handling inside SimplifyDemandedBits (WIP) to [DAG] Enable ISD::SRL SimplifyMultipleUseDemandedBits handling inside SimplifyDemandedBits.
RKSimon edited the summary of this revision.

I think I've covered all the remaining regressions now - D129765 has cleaned up a number of annoying cases - including the SystemZ v3i31 copy test!

I think I've covered all the remaining regressions now - D129765 has cleaned up a number of annoying cases - including the SystemZ v3i31 copy test!

Thanks! SystemZ changes LGTM now as discussed above.

I think this patch is good to go now - any more comments?

foad added a comment.Jul 28 2022, 3:43 AM

AMDGPU changes still LGTM.

spatel accepted this revision.Jul 28 2022, 5:56 AM

x86 diffs LGTM

llvm/test/CodeGen/X86/ins_subreg_coalesce-1.ll
8–10

Not sure if this test still models some situation that we care about, but you could put a TODO note on it (don't need to copy to %ecx?).

This revision is now accepted and ready to land.Jul 28 2022, 5:56 AM
This revision was landed with ongoing or failed builds.Jul 28 2022, 6:11 AM
This revision was automatically updated to reflect the committed changes.

Hi, we found a regression in some BPF code with this patch. The following shows the problem:

[$ ~/tmp] cat run.sh
/home/yhs/work/llvm-project/llvm/build.cur/install/bin/clang -target bpf -O2 -g -c t.c
[$ ~/tmp] cat t.c
typedef unsigned char u8;
struct event {
  u8 tag;
  u8 hostname[84];
};

void *g;
void bar(void *);

int foo() {
  struct event event = {};

  event.tag = 1;
  __builtin_memcpy(&event.hostname, g, 84);
  bar(&event);
  return 0;
}
[$ ~/tmp] ./run.sh
t.c:14:3: error: Looks like the BPF stack limit of 512 bytes is exceeded. Please move large on stack variables into BPF per-cpu array map.

  __builtin_memcpy(&event.hostname, g, 84);
  ^
t.c:14:3: error: Looks like the BPF stack limit of 512 bytes is exceeded. Please move large on stack variables into BPF per-cpu array map.

2 errors generated.
[$ ~/tmp]

BPF enforces a stack size limit of 512 bytes per program. For the above program, with this patch, the code after DAG instruction selection is worse, and eventually, in the register allocation stage, the stack size exceeds 512 bytes, causing the above error.

To illustrate the problem in more detail: without this patch, the lowered machine code looks like

  STB killed %7:gpr, %stack.1.event.i, 0, debug-location !21355 :: (store (s8) into %ir.event.i, align 8, !tbaa !21356); tracecon/src/bpf/tracecon.bpf.c:78:12 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %8:gpr = LDB %6:gpr, 7, debug-location !21358 :: (load (s8) from %ir.call1.i + 7); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  STB killed %8:gpr, %stack.1.event.i, 12, debug-location !21358 :: (store (s8) into %ir.hostname.i + 7); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %9:gpr = LDB %6:gpr, 6, debug-location !21358 :: (load (s8) from %ir.call1.i + 6); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  STB killed %9:gpr, %stack.1.event.i, 11, debug-location !21358 :: (store (s8) into %ir.hostname.i + 6); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %10:gpr = LDB %6:gpr, 5, debug-location !21358 :: (load (s8) from %ir.call1.i + 5); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  STB killed %10:gpr, %stack.1.event.i, 10, debug-location !21358 :: (store (s8) into %ir.hostname.i + 5); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %11:gpr = LDB %6:gpr, 4, debug-location !21358 :: (load (s8) from %ir.call1.i + 4); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  STB killed %11:gpr, %stack.1.event.i, 9, debug-location !21358 :: (store (s8) into %ir.hostname.i + 4); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %12:gpr = LDB %6:gpr, 3, debug-location !21358 :: (load (s8) from %ir.call1.i + 3); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  STB killed %12:gpr, %stack.1.event.i, 8, debug-location !21358 :: (store (s8) into %ir.hostname.i + 3); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %13:gpr = LDB %6:gpr, 2, debug-location !21358 :: (load (s8) from %ir.call1.i + 2); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  STB killed %13:gpr, %stack.1.event.i, 7, debug-location !21358 :: (store (s8) into %ir.hostname.i + 2); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %14:gpr = LDB %6:gpr, 1, debug-location !21358 :: (load (s8) from %ir.call1.i + 1); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  STB killed %14:gpr, %stack.1.event.i, 6, debug-location !21358 :: (store (s8) into %ir.hostname.i + 1); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %15:gpr = LDB %6:gpr, 0, debug-location !21358 :: (load (s8) from %ir.call1.i); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  STB killed %15:gpr, %stack.1.event.i, 5, debug-location !21358 :: (store (s8) into %ir.hostname.i); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %16:gpr = LDB %6:gpr, 15, debug-location !21358 :: (load (s8) from %ir.call1.i + 15); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  STB killed %16:gpr, %stack.1.event.i, 20, debug-location !21358 :: (store (s8) into %ir.hostname.i + 15); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %17:gpr = LDB %6:gpr, 14, debug-location !21358 :: (load (s8) from %ir.call1.i + 14); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  STB killed %17:gpr, %stack.1.event.i, 19, debug-location !21358 :: (store (s8) into %ir.hostname.i + 14); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
...
  %88:gpr = LDB %6:gpr, 83, debug-location !21358 :: (load (s8) from %ir.call1.i + 83); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  STB killed %88:gpr, %stack.1.event.i, 88, debug-location !21358 :: (store (s8) into %ir.hostname.i + 83); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %89:gpr = LDB %6:gpr, 82, debug-location !21358 :: (load (s8) from %ir.call1.i + 82); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  STB killed %89:gpr, %stack.1.event.i, 87, debug-location !21358 :: (store (s8) into %ir.hostname.i + 82); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %90:gpr = LDB %6:gpr, 81, debug-location !21358 :: (load (s8) from %ir.call1.i + 81); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  STB killed %90:gpr, %stack.1.event.i, 86, debug-location !21358 :: (store (s8) into %ir.hostname.i + 81); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %91:gpr = LDB %6:gpr, 80, debug-location !21358 :: (load (s8) from %ir.call1.i + 80); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  STB killed %91:gpr, %stack.1.event.i, 85, debug-location !21358 :: (store (s8) into %ir.hostname.i + 80); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]

Pretty straightforward byte loads and stores, and the corresponding stack after register allocation:

# *** IR Dump After Greedy Register Allocator (greedy) ***:
# Machine code for function tcp_v4_connect_exit: NoPHIs, TracksLiveness, TiedOpsRewritten, TracksDebugUserValues
Frame Objects:
  fi#0: size=4, align=4, at location [SP]
  fi#1: size=89, align=8, at location [SP]
  fi#2: size=4, align=4, at location [SP]
Function Live Ins: $r1 in %0

But with this patch, the code becomes very complex:

  %8:gpr = LDB %6:gpr, 71, debug-location !21358 :: (load (s8) from %ir.call1.i + 71); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %9:gpr = SLL_ri %8:gpr(tied-def 0), 8, debug-location !21358; tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %10:gpr = LDB %6:gpr, 70, debug-location !21358 :: (load (s8) from %ir.call1.i + 70); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %11:gpr = OR_rr %9:gpr(tied-def 0), killed %10:gpr, debug-location !21358; tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %12:gpr = LDB %6:gpr, 15, debug-location !21358 :: (load (s8) from %ir.call1.i + 15); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %13:gpr = SLL_ri %12:gpr(tied-def 0), 8, debug-location !21358; tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %14:gpr = LDB %6:gpr, 14, debug-location !21358 :: (load (s8) from %ir.call1.i + 14); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %15:gpr = OR_rr %13:gpr(tied-def 0), killed %14:gpr, debug-location !21358; tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
...
  %71:gpr = OR_rr %69:gpr(tied-def 0), killed %70:gpr, debug-location !21358; tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %72:gpr = LDB %6:gpr, 77, debug-location !21358 :: (load (s8) from %ir.call1.i + 77); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %73:gpr = SLL_ri %72:gpr(tied-def 0), 8, debug-location !21358; tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %74:gpr = LDB %6:gpr, 76, debug-location !21358 :: (load (s8) from %ir.call1.i + 76); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %75:gpr = OR_rr %73:gpr(tied-def 0), killed %74:gpr, debug-location !21358; tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %76:gpr = LDB %6:gpr, 79, debug-location !21358 :: (load (s8) from %ir.call1.i + 79); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %77:gpr = SLL_ri %76:gpr(tied-def 0), 8, debug-location !21358; tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %78:gpr = LDB %6:gpr, 78, debug-location !21358 :: (load (s8) from %ir.call1.i + 78); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %79:gpr = OR_rr %77:gpr(tied-def 0), killed %78:gpr, debug-location !21358; tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %80:gpr = SLL_ri %27:gpr(tied-def 0), 16, debug-location !21358; tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %81:gpr = OR_rr %80:gpr(tied-def 0), killed %23:gpr, debug-location !21358; tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %82:gpr = SLL_ri %19:gpr(tied-def 0), 16, debug-location !21358; tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %83:gpr = OR_rr %82:gpr(tied-def 0), killed %71:gpr, debug-location !21358; tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %84:gpr = SLL_ri %15:gpr(tied-def 0), 16, debug-location !21358; tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %85:gpr = OR_rr %84:gpr(tied-def 0), killed %67:gpr, debug-location !21358; tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %86:gpr = SLL_ri %11:gpr(tied-def 0), 16, debug-location !21358; tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %87:gpr = OR_rr %86:gpr(tied-def 0), killed %63:gpr, debug-location !21358; tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %88:gpr = SLL_ri %43:gpr(tied-def 0), 16, debug-location !21358; tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
...
  %100:gpr = LDB %6:gpr, 74, debug-location !21358 :: (load (s8) from %ir.call1.i + 74); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %101:gpr = LDB %6:gpr, 75, debug-location !21358 :: (load (s8) from %ir.call1.i + 75); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %102:gpr = SLL_ri %101:gpr(tied-def 0), 8, debug-location !21358; tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %103:gpr = OR_rr %102:gpr(tied-def 0), %100:gpr, debug-location !21358; tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %104:gpr = SLL_ri %103:gpr(tied-def 0), 16, debug-location !21358; tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %105:gpr = OR_rr %104:gpr(tied-def 0), killed %99:gpr, debug-location !21358; tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %106:gpr = SLL_ri %79:gpr(tied-def 0), 16, debug-location !21358; tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %107:gpr = OR_rr %106:gpr(tied-def 0), killed %75:gpr, debug-location !21358; tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %108:gpr = LDB %6:gpr, 64, debug-location !21358 :: (load (s8) from %ir.call1.i + 64); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %109:gpr = LDB %6:gpr, 65, debug-location !21358 :: (load (s8) from %ir.call1.i + 65); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %110:gpr = SLL_ri %109:gpr(tied-def 0), 8, debug-location !21358; tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %111:gpr = OR_rr %110:gpr(tied-def 0), %108:gpr, debug-location !21358; tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %112:gpr = LDB %6:gpr, 66, debug-location !21358 :: (load (s8) from %ir.call1.i + 66); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %113:gpr = LDB %6:gpr, 67, debug-location !21358 :: (load (s8) from %ir.call1.i + 67); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %114:gpr = SLL_ri %113:gpr(tied-def 0), 8, debug-location !21358; tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %115:gpr = OR_rr %114:gpr(tied-def 0), %112:gpr, debug-location !21358; tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %116:gpr = SLL_ri %115:gpr(tied-def 0), 16, debug-location !21358; tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %117:gpr = OR_rr %116:gpr(tied-def 0), killed %111:gpr, debug-location !21358; tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
...
  %225:gpr = OR_rr %224:gpr(tied-def 0), killed %117:gpr, debug-location !21358; tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %226:gpr = SLL_ri %107:gpr(tied-def 0), 32, debug-location !21358; tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  %227:gpr = OR_rr %226:gpr(tied-def 0), killed %105:gpr, debug-location !21358; tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  STB %193:gpr, %stack.1.event.i, 8, debug-location !21358 :: (store (s8) into %ir.hostname.i + 3); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  STB %192:gpr, %stack.1.event.i, 7, debug-location !21358 :: (store (s8) into %ir.hostname.i + 2); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  STB %189:gpr, %stack.1.event.i, 6, debug-location !21358 :: (store (s8) into %ir.hostname.i + 1); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  STB %188:gpr, %stack.1.event.i, 5, debug-location !21358 :: (store (s8) into %ir.hostname.i); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
  STB %183:gpr, %stack.1.event.i, 16, debug-location !21358 :: (store (s8) into %ir.hostname.i + 11); tracecon/src/bpf/tracecon.bpf.c:79:2 @[ tracecon/src/bpf/tracecon.bpf.c:68:5 ]
...

The code becomes very complex and inefficient, which in turn leads to a larger stack size:

# *** IR Dump After Greedy Register Allocator (greedy) ***:
# Machine code for function tcp_v4_connect_exit: NoPHIs, TracksLiveness, TiedOpsRewritten, TracksDebugUserValues
Frame Objects:
  fi#0: size=4, align=4, at location [SP]
  fi#1: size=89, align=8, at location [SP]
  fi#2: size=4, align=4, at location [SP]
  fi#3: size=8, align=8, at location [SP]
...
  fi#57: size=8, align=8, at location [SP]
  fi#58: size=8, align=8, at location [SP]
  fi#59: size=8, align=8, at location [SP]
  fi#60: size=8, align=8, at location [SP]
Function Live Ins: $r1 in %0

Could you help take a look at this problem and suggest how to fix it?

@yonghong-song Please can you raise this as an issue and include the IR as well? AFAICT this is a perf regression, and not an actual bug

@yonghong-song Please can you raise this as an issue and include the IR as well? AFAICT this is a perf regression, and not an actual bug

Thanks @RKSimon, I've just created an llvm-project issue: https://github.com/llvm/llvm-project/issues/57872. Thanks for taking care of this!