This is an archive of the discontinued LLVM Phabricator instance.

[BPF] BPFContextAccessMarkerPass added to avoid verifier unfriendly context access
Abandoned · Public · Draft

Authored by eddyz87 on Aug 10 2022, 4:51 PM.

Details

Reviewers
None
Summary

This commit adds a new BPF-specific pass named
BPFContextAccessMarkerPass. The goal of this pass is to restrain
the SimplifyCFGPass so that it does not generate context access
patterns unfriendly to the kernel BPF verifier.

Context access means an access to the context register, R1, the first
parameter of a BPF program. This patch assumes that context parameters
are marked using the following attribute:

__attribute__((btf_decl_tag("ctx")))

The kernel BPF verifier only allows context accesses of the form
BASE + static-offset, where BASE is the context register (or an alias of it).

For accesses to function arguments, the Clang frontend generates IR
that is translated in a conforming way. However, the SimplifyCFGPass
might move some instructions and add intermediate values that lead to
dynamic base offset computation. This can be illustrated by the
following example:

struct bpf_sock { ... };
struct bpf_sockopt { ... };
extern int f(int x);

int _getsockopt(struct bpf_sockopt *ctx __attribute__((btf_decl_tag("ctx"))))
{
  unsigned g = 0;
  switch (ctx->level) {
  case 10:
    g = f(ctx->sk->family);
    break;
  case 20:
    g = f(ctx->optlen);
    break;
  }
  return g % 2;
}

The following (simplified) IR is generated for function _getsockopt:

define dso_local i32 @_getsockopt(ptr noundef %ctx)
  ...
sw.bb:
  %1 = load ptr, ptr %ctx                  ;; access
  %family = getelementptr                  ;; to ctx->sk->family
    inbounds %struct.bpf_sock, ptr %1      ;; (a)
  %2 = load i32, ptr %family               ;;
  %call = call i32 @f(i32 noundef %2)
  br label %sw.epilog

sw.bb1:
  %optlen = getelementptr                  ;; access
    inbounds %struct.bpf_sockopt, ptr %ctx ;; to ctx->optlen
  %3 = load i32, ptr %optlen               ;; (b)
  %call2 = call i32 @f(i32 noundef %3)
  br label %sw.epilog

sw.epilog:
  ...

Without the SimplifyCFGPass, the machine code for (a) and (b) looks as follows:

$r1 = LDW $r1, 4  ;; for ctx->sk->family
$r1 = LDW $r1, 12 ;; for ctx->optlen

This is allowed by the BPF verifier. However, if the SimplifyCFGPass is
executed, the code is transformed as follows:

  ...
sw.bb:
  %1 = load ptr, ptr %ctx
  %family = getelementptr inbounds %struct.bpf_sock, ptr %1
  br label %sw.epilog.sink.split

sw.bb1:
  %optlen = getelementptr inbounds %struct.bpf_sockopt, ptr %ctx
  br label %sw.epilog.sink.split

sw.epilog.sink.split:
  %optlen.sink = phi
    ptr [ %optlen, %sw.bb1 ], [ %family, %sw.bb ]
  %2 = load i32, ptr %optlen.sink                 ;; (c)
  %call2 = call fastcc i32 @f(i32 noundef %2)
  br label %sw.epilog

sw.epilog:
  ...

Note that load instructions (a) and (b) are replaced by a single load
instruction (c) that gets its value from a PHI node. This is done by
the code sinking part of the SimplifyCFGPass. This leads to the
following machine code:

bb.2.sw.bb:
  $r1 = LDD $r1, 0
  $r1 = ADD_ri $r1, 4
  JMP %bb.4

bb.3.sw.bb1:
  $r1 = ADD_ri $r1, 12

bb.4.sw.epilog.sink.split:
  $r1 = LDW $r1, 0
  JAL @f

Here the offset is dynamically added to r1 (the context register); this
access pattern is not allowed by the BPF verifier.

To prevent the undesired code motion, the BPFContextAccessMarkerPass
inserts a call to the llvm.bpf.passthrough function after each load of a
value from a location that aliases the context parameter.

llvm.bpf.passthrough accepts a unique integer constant as one of its
parameters, thus preventing common-code sinking past a certain
position.

The IR from above is transformed as follows:

sw.bb:
  %2 = load ptr, ptr %ctx, align 8
  %family = getelementptr inbounds %struct.bpf_sock, ptr %2
  %3 = load i32, ptr %family, align 4
  %4 = call i32 @llvm.bpf.passthrough.i32.i32(i32 2, i32 %3)  ;; <-- added call
  %call = call i32 @f(i32 noundef %4)
  br label %sw.epilog

sw.bb1:
  %optlen = getelementptr inbounds %struct.bpf_sockopt, ptr %ctx
  %5 = load i32, ptr %optlen
  %6 = call i32 @llvm.bpf.passthrough.i32.i32(i32 3, i32 %5)  ;; <-- added call
  %call2 = call i32 @f(i32 noundef %6)
  br label %sw.epilog

sw.epilog:
  ...

This addresses the issue reported in the following thread:

https://lore.kernel.org/bpf/CAA-VZPmxh8o8EBcJ=m-DH4ytcxDFmo0JKsm1p1gf40kS0CE3NQ@mail.gmail.com/T/#m4b9ce2ce73b34f34172328f975235fc6f19841b6

Diff Detail

Event Timeline

eddyz87 created this revision. Aug 10 2022, 4:51 PM
Herald added a project: Restricted Project. Aug 10 2022, 4:51 PM

The current implementation depends on alias analysis to identify all
aliases of parameters marked with btf_decl_tag("ctx"). The calls to
llvm.bpf.passthrough are inserted after loads from these aliases.

However, during testing I found that the default alias analysis pass
is a bit too conservative for this purpose. For example, unnecessary
passthrough calls are inserted in the following cases:

#define __CTX__ __attribute__((btf_decl_tag("ctx")))

struct outer { ..., int last; };

extern struct outer *magic();

int no_marker_expected_2(struct outer *ctx __CTX__) {
  struct outer *x = magic();
  return x->last; // passthrough is added but not needed
}

int no_marker_expected_3(struct outer *ctx __CTX__, struct outer *non_ctx) {
  return non_ctx->last; // passthrough is added but not needed
}

I want to experiment with a custom algorithm instead of alias
analysis. As far as I understand, only a limited set of instructions
has to be handled; all of them (except PHI) are present in the snippet
below:

define dso_local i32 @h(ptr noundef %i) #1 section "sockops" !dbg !15 {
entry:
  %i.addr = alloca ptr, align 8
  store ptr %i, ptr %i.addr, align 8, !tbaa !23
  %0 = load ptr, ptr %i.addr, align 8, !dbg !28, !tbaa !23
  %1 = load i64, ptr @"llvm.bpf_sock_ops:0:0$0:0", align 8
  %2 = bitcast ptr %0 to ptr
  %3 = getelementptr i8, ptr %2, i64 %1
  %4 = bitcast ptr %3 to ptr
  %5 = call ptr @llvm.bpf.passthrough.p0.p0(i32 1, ptr %4)

The custom replacement for alias analysis might look as follows (a rough C++ sketch is given after the list):

  • EC = EquivalenceClasses<Value*, ...>()
  • For each instruction in the function match one of the patterns:
    • case store ptr %x, ptr %y -> EC.unionSets(%x, %y)
    • case %x = load ptr, ptr %y -> EC.unionSets(%x, %y)
    • case %x = bitcast %y -> EC.unionSets(%x, %y)
    • case %x = getelementptr _, ptr %y, _ -> EC.unionSets(%x, %y)
    • case %x = call ptr @llvm.bpf.passthrough.p0.p0(_, ptr %y) -> EC.unionSets(%x, %y)
    • case %x = phi [%y, _], [%z, _] -> EC.unionSets(%x, %y), EC.unionSets(%x, %z)
  • Use the resulting equivalence classes instead of alias analysis results, proceed as in the current draft.
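
A rough C++ sketch of the matching loop described above, assuming LLVM's
EquivalenceClasses and the existing llvm.bpf.passthrough intrinsic are used;
the helper name collectAliases is mine, not code from this patch:

#include "llvm/ADT/EquivalenceClasses.h"
#include "llvm/IR/InstIterator.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/IntrinsicInst.h"
#include "llvm/IR/IntrinsicsBPF.h"

using namespace llvm;

// Union values that may denote the same pointer, per the patterns above.
static void collectAliases(Function &F, EquivalenceClasses<Value *> &EC) {
  for (Instruction &I : instructions(F)) {
    if (auto *SI = dyn_cast<StoreInst>(&I)) {
      // store ptr %x, ptr %y
      if (SI->getValueOperand()->getType()->isPointerTy())
        EC.unionSets(SI->getValueOperand(), SI->getPointerOperand());
    } else if (auto *LI = dyn_cast<LoadInst>(&I)) {
      // %x = load ptr, ptr %y
      if (LI->getType()->isPointerTy())
        EC.unionSets(LI, LI->getPointerOperand());
    } else if (auto *BC = dyn_cast<BitCastInst>(&I)) {
      // %x = bitcast %y
      EC.unionSets(BC, BC->getOperand(0));
    } else if (auto *GEP = dyn_cast<GetElementPtrInst>(&I)) {
      // %x = getelementptr _, ptr %y, _
      EC.unionSets(GEP, GEP->getPointerOperand());
    } else if (auto *II = dyn_cast<IntrinsicInst>(&I)) {
      // %x = call ptr @llvm.bpf.passthrough.p0.p0(_, ptr %y)
      if (II->getIntrinsicID() == Intrinsic::bpf_passthrough)
        EC.unionSets(II, II->getArgOperand(1));
    } else if (auto *PN = dyn_cast<PHINode>(&I)) {
      // %x = phi [%y, _], [%z, _] -> union %x with every incoming value
      if (PN->getType()->isPointerTy())
        for (Value *In : PN->incoming_values())
          EC.unionSets(PN, In);
    }
  }
}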

Do we need alias analysis? The hook is at a very early stage of optimization, where the IR still has lots of stack operations. Not sure whether alias analysis can do a good job here. Also, in typical BPF programs, ctx is used directly and not really aliased to other variables, due to the special rewrite requirements for ctx references. You can check the BPF selftests and the Cilium BPF code (https://github.com/cilium/cilium.git, bpf directory) for some examples.

Also, it is not necessary to add the passthrough builtin after every ctx load. Ideally, we only want to add the passthrough builtin in cases where SimplifyCFG may do a "harmful" optimization. But it is possible that adding it after every ctx load is okay from the final-code point of view. Some analysis is needed here to ensure we don't cause potential performance degradation, as the "harmful" SimplifyCFG optimization is actually not that common.

eddyz87 added a comment. Edited Aug 14 2022, 3:01 PM

Patch for Linux Kernel BPF tests.

For testing purposes I've updated the BPF tests in tools/testing/selftests/bpf/progs with the btf_decl_tag("ctx") attribute. When the tests are compiled with this modification, I get the following statistics (program == BPF .o file):

  • 363 programs are patched;
  • 51 programs differ when compiled with and without btf_decl_tag("ctx");
  • 13 programs increased in size;
  • 26 programs decreased in size;
  • 12 programs have the same size but some changes in code gen;
  • the total number of instructions added to the increased programs is 215;
  • the total number of instructions removed from the decreased programs is 91.

I'm still looking through the disassembly diff to assess changes in the code generation.

When the tests are executed with the btf_decl_tag("ctx") attribute, a single test fails. The test name is test_sk_lookup.c. Here is a part of this test:

int access_ctx_sk(struct bpf_sk_lookup *ctx __CTX__)
{
	struct bpf_sock *sk1 = NULL, *sk2 = NULL;
	int err, ret;

	ret = SK_DROP;

	/* Try accessing unassigned (NULL) ctx->sk field */
	if (ctx->sk && ctx->sk->family != AF_INET)
		goto out;

	/* Assign a value to ctx->sk */
	sk1 = bpf_map_lookup_elem(&redir_map, &KEY_SERVER_A);
	if (!sk1)
		goto out;
	err = bpf_sk_assign(ctx, sk1, 0);
	if (err)
		goto out;
	if (ctx->sk != sk1)
		goto out;

	/* Access ctx->sk fields */
	if (ctx->sk->family != AF_INET ||    // <--------- error here because of access to ->family
	    ctx->sk->type != SOCK_STREAM ||
	    ctx->sk->state != BPF_TCP_LISTEN)
		goto out;
        // ...
}

The error message from the BPF verifier looks as follows:

libbpf: prog 'access_ctx_sk': BPF program load failed: Permission denied
libbpf: prog 'access_ctx_sk': -- BEGIN PROG LOAD LOG --
0: R1=ctx(off=0,imm=0) R10=fp0
; int access_ctx_sk(struct bpf_sk_lookup *ctx __CTX__)
0: (bf) r8 = r1                       ; R1=ctx(off=0,imm=0) R8_w=ctx(off=0,imm=0)
; if (ctx->sk && ctx->sk->family != AF_INET)
1: (79) r1 = *(u64 *)(r8 +0)          ; R1_w=sock_or_null(id=1,off=0,imm=0) R8_w=ctx(off=0,imm=0)
; if (ctx->sk && ctx->sk->family != AF_INET)
2: (15) if r1 == 0x0 goto pc+3        ; R1_w=sock(off=0,imm=0)
3: (b4) w7 = 0                        ; R7_w=0
; if (ctx->sk && ctx->sk->family != AF_INET)
4: (61) r1 = *(u32 *)(r1 +4)          ; R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
; if (ctx->sk && ctx->sk->family != AF_INET)
5: (56) if w1 != 0x2 goto pc+63       ; R1_w=2
; sk1 = bpf_map_lookup_elem(&redir_map, &KEY_SERVER_A);
6: (18) r1 = 0xffff8881037d4c00       ; R1_w=map_ptr(off=0,ks=4,vs=8,imm=0)
8: (18) r2 = 0xffffc90000322000       ; R2_w=map_value(off=0,ks=4,vs=200,imm=0)
10: (85) call bpf_map_lookup_elem#1   ; R0=map_value_or_null(id=3,off=0,ks=4,vs=8,imm=0) refs=3
11: (bf) r6 = r0                      ; R0=map_value_or_null(id=3,off=0,ks=4,vs=8,imm=0) R6_w=map_value_or_null(id=3,off=0,ks=4,vs=8,imm=0) refs=3
12: (b4) w7 = 0                       ; R7_w=0 refs=3
; if (!sk1)
13: (15) if r6 == 0x0 goto pc+55      ; R6_w=sock(ref_obj_id=3,off=0,imm=0) refs=3
; err = bpf_sk_assign(ctx, sk1, 0);
14: (bf) r1 = r8                      ; R1_w=ctx(off=0,imm=0) R8=ctx(off=0,imm=0) refs=3
15: (bf) r2 = r6                      ; R2_w=sock(ref_obj_id=3,off=0,imm=0) R6_w=sock(ref_obj_id=3,off=0,imm=0) refs=3
16: (b7) r3 = 0                       ; R3_w=0 refs=3
17: (85) call bpf_sk_assign#124       ; R0_w=scalar() refs=3
18: (b4) w7 = 0                       ; R7_w=0 refs=3
19: (bf) r9 = r6                      ; R6=sock(ref_obj_id=3,off=0,imm=0) R9=sock(ref_obj_id=3,off=0,imm=0) refs=3
; if (err)
20: (56) if w0 != 0x0 goto pc+46      ; R0=scalar(smax=9223372032559808512,umax=18446744069414584320,var_off=(0x0; 0xffffffff00000000),s32_min=0,s32_max=0,u32_max=0) refs=3
; if (ctx->sk != sk1)
21: (79) r1 = *(u64 *)(r8 +0)         ; R1_w=sock_or_null(id=4,off=0,imm=0) R8=ctx(off=0,imm=0) refs=3
22: (bf) r9 = r6                      ; R6=sock(ref_obj_id=3,off=0,imm=0) R9_w=sock(ref_obj_id=3,off=0,imm=0) refs=3
; if (ctx->sk != sk1)
23: (5d) if r1 != r6 goto pc+43       ; R1_w=sock_or_null(id=4,off=0,imm=0) R6=sock(ref_obj_id=3,off=0,imm=0) refs=3
; if (ctx->sk->family != AF_INET ||
24: (61) r2 = *(u32 *)(r1 +4)
R1 invalid mem access 'sock_or_null'
processed 23 insns (limit 1000000) max_states_per_insn 0 total_states 2 peak_states 2 mark_read 1
-- END PROG LOAD LOG --
libbpf: prog 'access_ctx_sk': failed to load: -13
libbpf: failed to load object 'test_sk_lookup'
libbpf: failed to load BPF skeleton 'test_sk_lookup': -13
test_sk_lookup:FAIL:skel open_and_load failed

The verifier failure is caused by a small change in the code generation for the ctx->sk->family access. Code generated without btf_decl_tag("ctx") looks as follows (r6 contains the result of the call to bpf_map_lookup_elem, effectively sk1):

293:	5d 61 2b 00 00 00 00 00	if r1 != r6 goto +43 <LBB14_18>
294:	61 61 04 00 00 00 00 00	r1 = *(u32 *)(r6 + 4)

Code generated with btf_decl_tag("ctx") looks as follows (r6 has the same meaning):

293:	5d 61 2b 00 00 00 00 00	if r1 != r6 goto +43 <LBB14_18>
294:	61 12 04 00 00 00 00 00	r2 = *(u32 *)(r1 + 4)

Note the access through r1 instead of r6.

This change in code generation is caused by interference between the GVNPass transformation and the call to @llvm.bpf.passthrough.p0.p0. Without the ctx attribute, the IR before/after the GVNPass looks as follows:

;;; IR Dump Before GVNPass on access_ctx_sk

if.end:                                           ; preds = %land.lhs.true, %entry
  %call = tail call ptr inttoptr (i64 1 to ptr)
    (ptr noundef nonnull @redir_map, ptr noundef nonnull @KEY_SERVER_A) #8, !dbg !588
; ...
if.end7:                                          ; preds = %if.end3
  %3 = load ptr, ptr %ctx, align 8, !dbg !596, !tbaa !550
  %cmp8.not = icmp eq ptr %3, %call, !dbg !598
  br i1 %cmp8.not, label %if.end11, label %if.end59.thread103, !dbg !599

if.end11:                                         ; preds = %if.end7
  %family12 = getelementptr inbounds %struct.bpf_sock, ptr %3, i64 0, i32 1, !dbg !600
; ...

;;; IR Dump After GVNPass on access_ctx_sk

if.end7:                                          ; preds = %if.end3
  %3 = load ptr, ptr %ctx, align 8, !dbg !596, !tbaa !550
  %cmp8.not = icmp eq ptr %3, %call, !dbg !598
  br i1 %cmp8.not, label %if.end11, label %if.end59.thread103, !dbg !599

if.end11:                                         ; preds = %if.end7
  %family12 = getelementptr inbounds %struct.bpf_sock, ptr %call, i64 0, i32 1, !dbg !600
                                                           ^^^^^
                                                           %3 replaced with %call
; ...

With the ctx attribute, this change does not happen, and the IR after the GVNPass looks as follows:

if.end7:                                          ; preds = %if.end3
  %5 = load ptr, ptr %ctx, align 8, !dbg !596, !tbaa !550
  %6 = tail call ptr @llvm.bpf.passthrough.p0.p0(i32 16, ptr %5)
  %cmp8.not = icmp eq ptr %6, %call, !dbg !598
  br i1 %cmp8.not, label %if.end11, label %if.end59.thread101, !dbg !599

if.end11:                                         ; preds = %if.end7
  %family12 = getelementptr inbounds %struct.bpf_sock, ptr %5, i64 0, i32 1, !dbg !600
                                                           ^^
                                                           no replacement

This test depends on a particular behavior of the GVNPass transformation, namely the replacement of the access to ctx->sk by an access to sk1 in the else branch of the check ctx->sk != sk1. This behavior is not guaranteed. Additionally, the verifier does not propagate register equivalence information from equality checks. Thus I think that either the test or the verifier should be updated:

  • An additional NULL check could be added to the test to satisfy the verifier
  • The verifier could be updated to propagate register equivalence information from equality checks. At first glance this seems to be an easy modification: the verifier already has some logic to process comparisons with scalar 0, and it could be extended to register comparisons.

The same verifier failure can be demonstrated by the following kernel test case:

{
	"sk_fullsock(skb->sk): sk->type [fullsock field], indirect null check",
	.insns = {
	/* This is equivalent to the following program:
	 *
	 *   r6 = skb->sk;
	 *   r7 = sk_fullsock(r6);
	 *   r0 = sk_fullsock(r6);
	 *   if (r0 == 0) return 0;    (a)
	 *   if (r0 != r7) return 0;   (b)
	 *   *r7->type;                (c)
	 *   return 0;
	 *
	 * It is safe to dereference r7 at point (c), because of (a) and (b).
	 * The test passes if relation r0 == r7 is propagated from (b) to (c).
         */

	/* read skb->sk, check null */
	BPF_LDX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1, offsetof(struct __sk_buff, sk)),
	BPF_JMP_IMM(BPF_JNE, BPF_REG_6, 0, 2),
	BPF_MOV64_IMM(BPF_REG_0, 0),
	BPF_EXIT_INSN(),
	/* r7 = sk_fullsock(skb) */
	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
	BPF_MOV64_REG(BPF_REG_7, BPF_REG_0),
	/* r0 = sk_fullsock(skb) */
	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
	/* if r0 == null then exit */
	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2),
	BPF_MOV64_IMM(BPF_REG_0, 0),
	BPF_EXIT_INSN(),
	/* if r0 != r7 then exit else read r7->type, use BPF_JNE on purpose */
	BPF_JMP_REG(BPF_JNE, BPF_REG_0, BPF_REG_7, 1),
	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_7, offsetof(struct bpf_sock, type)),
	/* return 0 */
	BPF_MOV64_IMM(BPF_REG_0, 0),
	BPF_EXIT_INSN(),
	},
	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
	.result = ACCEPT,
}

Currently this test case fails the verifier with the same error message.

eddyz87 updated this revision to Diff 452764. Aug 15 2022, 11:54 AM

Rewrite of the BPFContextAccessMarkerPass to use simple
usage tracking instead of alias analysis.

With this update the pass looks for access chains that read fields
of context parameters and puts a call to llvm.bpf.passthrough at
the last load of each chain.

Access chains are recognized as sequences of interdependent load,
getelementptr, and bitcast instructions and passthrough calls.

An access chain starts with a load from a known store location
of the context parameter.

E.g. this is an access chain:

;; %i metadata contains btf_decl_tag("ctx") annotation
define dso_local i32 @h(ptr noundef %i)
  %i.addr = alloca ptr, align 8
  store ptr %i, ptr %i.addr, align 8  ;; %i.addr store location is recorded
  %0 = load ptr, ptr %i.addr, align 8                      ;; chain starts
  %1 = load i64, ptr @"llvm.bpf_sock_ops:0:0$0:0"
  %2 = bitcast ptr %0 to ptr                               ;; chain continues
  %3 = getelementptr i8, ptr %2, i64 %1                    ;; chain continues
  %4 = bitcast ptr %3 to ptr                               ;; chain continues
  %5 = call ptr @llvm.bpf.passthrough.p0.p0(i32 1, ptr %4) ;; chain continues
  %6 = load i64, ptr %5, align 8                           ;; last chain load

The updated statistics for the kernel BPF tests look as follows:

  • 47 programs differ when compiled with and without btf_decl_tag("ctx");
  • 17 programs increased in size;
  • 18 programs decreased in size;
  • 12 programs have the same size but some changes in code gen;
  • the total number of instructions added to the increased programs is 85;
  • the total number of instructions removed from the decreased programs is 45.

Hi Yonghong,

I've tried to move the transformation to a later point in the pipeline to avoid dealing with stack load/store operations. This point has to be somewhere inside PassBuilder::buildFunctionSimplificationPipeline. At first glance, it has to be placed under the following constraints (illustrated by the sketch after the list):

  • after SROAPass (removes unnecessary load, store, and alloca instructions);
  • after EarlyCSEPass (removes unnecessary bitcasts; a soft constraint);
  • before SimplifyCFG with the sinking option enabled (the very end of the pipeline).
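
A sketch of these ordering constraints against the function simplification
pipeline; this shows relative order only and is not the actual pipeline
construction code:

#include "llvm/IR/PassManager.h"
#include "llvm/Transforms/InstCombine/InstCombine.h"
#include "llvm/Transforms/Scalar/EarlyCSE.h"
#include "llvm/Transforms/Scalar/SROA.h"
#include "llvm/Transforms/Scalar/SimplifyCFG.h"

using namespace llvm;

static FunctionPassManager buildOrderingSketch() {
  FunctionPassManager FPM;
  FPM.addPass(SROAPass());     // constraint: marker pass must run after this
  FPM.addPass(EarlyCSEPass()); // soft constraint: and after this
  // <-- desired registration point for BPFContextAccessMarkerPass
  FPM.addPass(InstCombinePass()); // must also run before this, see below
  // ... the rest of the function simplification pipeline ...
  FPM.addPass(SimplifyCFGPass(
      SimplifyCFGOptions().sinkCommonInsts(true))); // and before this
  return FPM;
}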

For testing purposes I used PeepholeEPCallbacks.

However, I found that InstCombinePass can do some unwanted code movement as well. Thus the new registration point has to be between SROAPass and InstCombinePass. Unfortunately, there are no extension points in this range.

Here is an example transformed by InstCombinePass:

int one_marker_in_each_branch(struct outer *ctx __CTX__) {
  if (ctx->first)
    return ctx->butlast;
  else
    return ctx->last;
}

Here it is before InstCombinePass:

...
if.then:
  %butlast = getelementptr inbounds %struct.outer, ptr %ctx, i32 0, i32 3
  %1 = load i32, ptr %butlast, align 8
  br label %return

if.else:
  %last = getelementptr inbounds %struct.outer, ptr %ctx, i32 0, i32 4
  %2 = load i32, ptr %last, align 4
  br label %return

return:
  %retval.0 = phi i32 [ %1, %if.then ], [ %2, %if.else ]
  ret i32 %retval.0

And here it is after InstCombinePass:

...
if.then:
  %butlast = getelementptr inbounds %struct.outer, ptr %ctx, i64 0, i32 3
  br label %return

if.else:
  %last = getelementptr inbounds %struct.outer, ptr %ctx, i64 0, i32 4
  br label %return

return:
  %retval.0.in = phi ptr [ %butlast, %if.then ], [ %last, %if.else ]
  %retval.0 = load i32, ptr %retval.0.in, align 4 ;; <---------------- load moved
  ret i32 %retval.0

Note the sunk load.

The transformation itself is quite trivial, and the removal of stack handling does not have much impact on the overall code size:

llvm/lib/Target/BPF/BPFContextAccessMarkerPass.cpp | 27 +++++++++++----------------
llvm/lib/Target/BPF/BPFTargetMachine.cpp           |  4 ++--
2 files changed, 13 insertions(+), 18 deletions(-)

Thus, I don't think that the current registration point (registerPipelineStartEPCallback) should be changed. A sketch of this registration is given below for reference.
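
A minimal sketch of what registration at this point looks like, assuming
BPFContextAccessMarkerPass is a new-pass-manager function pass (the exact
wiring in the diff may differ):

#include "llvm/Passes/PassBuilder.h"

using namespace llvm;

void BPFTargetMachine::registerPassBuilderCallbacks(PassBuilder &PB) {
  PB.registerPipelineStartEPCallback(
      [](ModulePassManager &MPM, OptimizationLevel) {
        // Run the marker pass before the optimization pipeline starts, so
        // the passthrough calls are in place before SimplifyCFG gets a
        // chance to sink context loads.
        FunctionPassManager FPM;
        FPM.addPass(BPFContextAccessMarkerPass());
        MPM.addPass(createModuleToFunctionPassAdaptor(std::move(FPM)));
      });
}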

I'll continue working on the remaining action points.

Hi Yonghong,

As discussed, I've updated the SimplifyCFG and InstCombine passes to obtain a baseline for the amount of change in the modified BPF test cases. I attach the patch to this revision for archaeological purposes.

I get the following statistics for the baseline:

Same   : 355 
Differ : 8 
insn count inc : 2 
insn count dec : 1 
# of programs with more insn : 1
# of programs with less insn : 1
# of programs with diff insn but same size : 6

bpf_iter_bpf_map.o                39 vs 38
bpf_iter_ksym.o                  150 vs 152
bpf_iter_task_btf.o               37 vs 37
fexit_bpf2bpf.o                  128 vs 128
pyperf600_bpf_loop.o            1073 vs 1073
sockopt_inherit.o                 61 vs 61
test_core_reloc_type_id.o         44 vs 44
test_tcpnotify_kern.o             57 vs 57

While examining the reasons for the difference between the baseline and the marker transformation, I identified interference with the GVN and CSE passes as the main reason. To mitigate it, I've implemented two adjustments to the algorithm:

  • access chains are only marked with passthrough if access chain length is more than 1 (should have been done in the first version...);
  • passthrough sequence numbers are reused for identical access chains (a rough sketch of this reuse follows the list).
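
A hypothetical sketch of the second adjustment; the key construction here is
illustrative only and may differ from the patch's actual code:

#include "llvm/ADT/StringMap.h"
#include "llvm/ADT/StringRef.h"

// Identical access chains (e.g. the same sequence of getelementptr offsets
// starting from the same kind of context parameter) are keyed by a string
// and mapped to the same passthrough sequence number, so GVN/CSE can still
// merge the marked loads.
static llvm::StringMap<unsigned> ChainToSeqNum;
static unsigned NextSeqNum = 0;

static unsigned seqNumForChain(llvm::StringRef ChainKey) {
  auto It = ChainToSeqNum.try_emplace(ChainKey, NextSeqNum);
  if (It.second) // first time this chain shape is seen
    ++NextSeqNum;
  return It.first->second;
}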

After these modifications I get the following stats:

Same   : 343
Differ : 20
insn count inc : 16
insn count dec : 16
# of programs with more insn : 1
# of programs with less insn : 15
# of programs with diff insn but same size : 4

bind4_prog.o                          244 vs 244
bind6_prog.o                          306 vs 306
bpf_iter_bpf_map.o                     39 vs 38
bpf_iter_netlink.o                    101 vs 100
bpf_iter_task_btf.o                    37 vs 36
bpf_iter_task_file.o                   56 vs 55
bpf_iter_task.o                       135 vs 134
bpf_iter_tcp4.o                       401 vs 400
bpf_iter_tcp6.o                       457 vs 456
bpf_iter_test_kern4.o                  61 vs 60
bpf_iter_udp4.o                       100 vs 99
bpf_iter_udp6.o                       119 vs 118
connect4_prog.o                       343 vs 342
lsm.o                                 157 vs 155
sendmsg6_prog.o                        66 vs 65
tcp_ca_write_sk_pacing.o               32 vs 32
test_core_reloc_type_id.o              44 vs 44
test_sk_lookup.o                      708 vs 724   # <-- the only increase
test_sockmap_invalid_update.o          13 vs 12
xdping_kern.o                         193 vs 192

The reason for the size increase in test_sk_lookup is that the sequence of GVNPass, InstCombinePass, and SimplifyCFGPass transformations can remove several if branches when passthrough calls are not inserted.

I want to polish the implementation a bit before committing; I will update the revision tomorrow.

I've spent some time today trying to figure out what to do about test_sk_lookup and will continue tomorrow as well.

eddyz87 updated this revision to Diff 453840. Edited Aug 18 2022, 5:20 PM

As noted in the previous comment, while examining the reasons for the difference between the baseline and the marker transformation, interference with the GVN and CSE passes was identified as the main reason for the size increase of the test programs. This version includes two updates:

  • access chains are only marked with passthrough if access chain length is more than 1 (should have been done in the first version...);
  • passthrough sequence numbers are reused for identical access chains.

After these changes, the stats for the test programs are as follows:

Same   : 345
Differ : 18
insn count inc : 6
insn count dec : 14
# of programs with more insn : 1
# of programs with less insn : 13
# of programs with diff insn but same size : 4

bind4_prog.o                  244 vs 244
bind6_prog.o                  306 vs 306
bpf_iter_netlink.o            101 vs 100
bpf_iter_task_btf.o            37 vs 36
bpf_iter_task_file.o           56 vs 55
bpf_iter_task.o               135 vs 134
bpf_iter_tcp4.o               401 vs 400
bpf_iter_tcp6.o               457 vs 456
bpf_iter_test_kern4.o          61 vs 60
bpf_iter_udp4.o               100 vs 99
bpf_iter_udp6.o               119 vs 118
connect4_prog.o               343 vs 342
lsm.o                         157 vs 155
tcp_ca_write_sk_pacing.o       32 vs 32
test_core_reloc_type_id.o      44 vs 44
test_sk_lookup.o              708 vs 714
test_sockmap_invalid_update.o  13 vs 12
xdping_kern.o                 193 vs 192

The only program that increased in size is test_sk_lookup; a few branches in this program can't be removed when __CTX__ is defined, due to the unique sequence numbers in passthrough calls limiting the GVN pass.

The reason for the size _decrease_ of the other programs is the following pattern:

; without __CTX__
  %0 = load i64, ptr @"llvm.bpf_sock_ops:0:184$0:35:0"
  %1 = getelementptr i8, ptr %skops, i64 %0
  %2 = call ptr @llvm.bpf.passthrough.p0.p0(i32 0, ptr %1) ;; (a)
  %3 = load ptr, ptr %2                                    ;; (b)
; with __CTX__
  %0 = load i64, ptr @"llvm.bpf_sock_ops:0:184$0:35:0"
  %1 = getelementptr i8, ptr %skops, i64 %0
  %2 = load ptr, ptr %1                                    ;; (b)
  %3 = call ptr @llvm.bpf.passthrough.p0.p0(i32 2, ptr %2) ;; (a)

The BPFContextAccessMarkerPass pushes the passthrough call down, relative to the load instruction. This enables the EarlyCSE pass to reuse the result of the load.

Kernel BPF test cases pass, except for access_ctx_sk, as described in https://reviews.llvm.org/D131633#3722231 .

eddyz87 abandoned this revision. Sep 28 2022, 6:11 AM
Herald added projects: Restricted Project, Restricted Project. Sep 28 2022, 6:11 AM