This is an archive of the discontinued LLVM Phabricator instance.

[BPF] add new intrinsics preserve_{array,union,struct}_access_index
AbandonedPublic

Authored by yonghong-song on May 10 2019, 3:38 PM.

Details

Summary

For background on the BPF CO-RE project, please refer to

http://vger.kernel.org/bpfconf2019.html

In summary, BPF CO-RE intends to compile BPF programs so that they
remain adjustable to struct/union layout changes, allowing the same
program to run on multiple kernels after a pre-load adjustment
based on the native kernel's structures.

In order to do this, we need to keep track of the debuginfo types of the
base and result of each GEP (getelementptr) instruction, so we
can adjust the access on the host based on kernel BTF info.
Capturing such information after IR optimization is hard,
as various optimizations may have tweaked the GEPs; also, because
unions are lowered to structures in IR, it is impossible to track
the field index for union member accesses.

Three intrinsic functions, preserve_{array,union,struct}_access_index,
are introduced.

addr = preserve_array_access_index(base, index, dimension)
addr = preserve_union_access_index(base, di_index)
addr = preserve_struct_access_index(base, gep_index, di_index)

where:

base: the base pointer for the array/union/struct access.
index: the last access index for array, the same for IR/DebugInfo layout.
dimension: the array dimension.
gep_index: the access index based on IR layout.
di_index: the access index based on user/debuginfo types.

For example, consider the following program:

$ cat test.c
struct sk_buff {
   int i;
   int b1:1;
   int b2:2;
   union {
     struct {
       int o1;
       int o2;
     } o;
     struct {
       char flags;
       char dev_id;
     } dev;
     int netid;
   } u[10];
};  

static int (*bpf_probe_read)(void *dst, int size, const void *unsafe_ptr)
    = (void *) 4;

#define _(x) (__builtin_preserve_access_index(x))

int bpf_prog(struct sk_buff *ctx) {
  char dev_id;
  bpf_probe_read(&dev_id, sizeof(char), _(&ctx->u[5].dev.dev_id));
  return dev_id;
}
$ clang -target bpf -O2 -g -emit-llvm -S -mllvm -print-before-all \
  test.c >& log

The generated IR looks like below:

...
define dso_local i32 @bpf_prog(%struct.sk_buff*) #0 !dbg !15 {
  %2 = alloca %struct.sk_buff*, align 8
  %3 = alloca i8, align 1
  store %struct.sk_buff* %0, %struct.sk_buff** %2, align 8, !tbaa !45
  call void @llvm.dbg.declare(metadata %struct.sk_buff** %2, metadata !43, metadata !DIExpression()), !dbg !49
  call void @llvm.lifetime.start.p0i8(i64 1, i8* %3) #4, !dbg !50
  call void @llvm.dbg.declare(metadata i8* %3, metadata !44, metadata !DIExpression()), !dbg !51
  %4 = load i32 (i8*, i32, i8*)*, i32 (i8*, i32, i8*)** @bpf_probe_read, align 8, !dbg !52, !tbaa !45
  %5 = load %struct.sk_buff*, %struct.sk_buff** %2, align 8, !dbg !53, !tbaa !45
  %6 = call [10 x %union.anon]* @llvm.preserve.struct.access.index.p0a10s_union.anons.p0s_struct.sk_buffs(
       %struct.sk_buff* %5, i32 2, i32 3), !dbg !53, !llvm.preserve.access.index !19
  %7 = call %union.anon* @llvm.preserve.array.access.index.p0s_union.anons.p0a10s_union.anons(
       [10 x %union.anon]* %6, i32 1, i32 5), !dbg !53
  %8 = call %union.anon* @llvm.preserve.union.access.index.p0s_union.anons.p0s_union.anons(
       %union.anon* %7, i32 1), !dbg !53, !llvm.preserve.access.index !26
  %9 = bitcast %union.anon* %8 to %struct.anon.0*, !dbg !53
  %10 = call i8* @llvm.preserve.struct.access.index.p0i8.p0s_struct.anon.0s(
       %struct.anon.0* %9, i32 1, i32 1), !dbg !53, !llvm.preserve.access.index !34
  %11 = call i32 %4(i8* %3, i32 1, i8* %10), !dbg !52
  %12 = load i8, i8* %3, align 1, !dbg !54, !tbaa !55
  %13 = sext i8 %12 to i32, !dbg !54
  call void @llvm.lifetime.end.p0i8(i64 1, i8* %3) #4, !dbg !56
  ret i32 %13, !dbg !57
}

!19 = distinct !DICompositeType(tag: DW_TAG_structure_type, name: "sk_buff", file: !3, line: 1, size: 704, elements: !20)
!26 = distinct !DICompositeType(tag: DW_TAG_union_type, scope: !19, file: !3, line: 5, size: 64, elements: !27)
!34 = distinct !DICompositeType(tag: DW_TAG_structure_type, scope: !26, file: !3, line: 10, size: 16, elements: !35)

Note that the @llvm.preserve.{struct,union}.access.index calls have the metadata llvm.preserve.access.index
attached to provide struct/union debuginfo type information.

For &ctx->u[5].dev.dev_id,

. The "%6 = ..." represents struct member "u" with index 2 for IR layout and index 3 for DI layout.
. The "%7 = ..." represents array subscript "5".
. The "%8 = ..." represents union member "dev" with index 1 for DI layout.
. The "%10 = ..." represents struct member "dev_id" with index 1 for both IR and DI layout.

Basically, by traversing the use-def chain recursively for the 3rd argument of bpf_probe_read() and
examining all preserve_*_access_index calls, the debuginfo struct/union/array access indices
can be recovered.

The intrinsics also contain enough information to regenerate code for the IR layout.
For the array and struct intrinsics, the proper GEP can be constructed.
For the union intrinsic, replacing all uses of "addr" with "base" is sufficient.

Diff Detail

Repository
rL LLVM

Event Timeline

yonghong-song created this revision.May 10 2019, 3:38 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 10 2019, 3:38 PM

Missing langref changes.
What about other targets, does this need some default expansion?

@lebedev.ri Thanks for looking at this patch. The following two other patches are related, so I just added them here.

https://reviews.llvm.org/D61809
https://reviews.llvm.org/D61524

Could you be more specific about where to add LangRef documentation for these intrinsics?

Currently, I am guarding the use of the intrinsics in clang behind the bpf target, so other targets won't be able to use them.
If there is demand, we can implement a callback in TargetInfo so different targets can decide whether
they want to use these intrinsics.

We can implement the default expansion; do you know the proper place to do that?

yonghong-song retitled this revision from [BPF] add a new intrinsic preserve_di_access_index to [BPF] add new intrinsics preserve_{array,union,struct}_access_index.
yonghong-song edited the summary of this revision. (Show Details)
yonghong-song added a reviewer: lebedev.ri.

split intrinsics into three, and add langdef.

@lebedev.ri Regarding the comment "What about other targets, does this need some default expansion?": currently these intrinsics are only used by bpf and are lowered to GEPs in an IR pass implemented by the bpf backend. I can certainly implement some default expansion if it is desirable. Is SelectionDAG the proper place to implement the default lowering?

efriedma added inline comments.
docs/LangRef.rst
16961

inst_name seems unnecessary.

Don't you need some argument to indicate the type of the array elements?

16992

How do you plan to use this?

17025

I'm not sure about depending on digging the necessary information out of debug info, as opposed to generating some dedicated data structure. It feels more complicated, since it's not clear what parts of the debug info, exactly, you're depending on.

Looking up the struct type by name seems error-prone; if you're depending on the debug info anyway, could you pass the debug info for the type directly as a metadata argument?

include/llvm/IR/IRBuilder.h
2270

Don't pass StringRefs by const reference (see https://llvm.org/docs/ProgrammersManual.html#the-stringref-class ).

yonghong-song marked an inline comment as done.May 16 2019, 1:37 PM

@eli.friedman thanks for the review. I just did some experiments with some performance-sensitive bpf programs. It looks like the approach I am taking here causes, in the worst case, 30% more instructions to be generated by llvm at -O2.
So the current approach, blindly converting to intrinsics and later converting the unnecessary intrinsics back to regular GEPs and the necessary ones to relocatable GEPs, won't work.
The main reasons for the performance degradation are:

  • The intrinsics cause less optimal code, which may contribute a 5% performance loss for relatively small routines; for large routines, I have seen losses of more than 20%.
  • The intrinsics affect inlining decisions, and less inlining has a negative impact on performance.

Therefore, instead of blindly converting GEPs to intrinsics, I will try a different approach: design some bpf-specific intrinsics to help the bpf backend generate relocatable GEPs. Considering the potential performance impact,
this approach will only be used when people are willing to trade some performance for portability, in cases where bpf performance is not super critical.

include/llvm/IR/IRBuilder.h
2270

Thanks for letting me know. Will correct this in all other places as well.

The intrinsics cause less optimal code, which may contribute a 5% performance loss for relatively small routines; for large routines, I have seen losses of more than 20%.

Could you clarify what, exactly, the issue is here? Is the problem extra arithmetic? Extra memory operations? Something else? You could probably mitigate the impact here with a few small changes to optimizations. Or maybe you could make the intrinsic return the relevant offset, rather than actually performing the GEP itself. (The key here is that you have to make sure the offset is opaque to the optimizer, so it doesn't make illegal transforms.)

There's probably some fundamental performance loss involved in making the struct offsets opaque values, no matter how that is represented in the IR.

intrinsics impact inline decision and less inlining has negative impacts on the performance.

This is something you can probably fix with a trivial tweak to the inlining heuristics. The inliner asks the target for the cost of a call using TargetTransformInfo.

yonghong-song marked an inline comment as done.
yonghong-song edited the summary of this revision. (Show Details)

change new intrinsic format, add num_of_zeros for array, remove inst_name

This update of the three patches contains mostly a few intrinsic format changes (array: +num_of_zeros for the gep; array/struct: -inst_name) and corresponding adjustments to fix bugs.
This will unblock other kernel bpf developers, who can work in parallel on the kernel side.

I have not looked at the performance aspect yet. That is what I will do next: find where the performance loss comes from and how to avoid it.
I have not explored the approach of adding metadata to instructions/intrinsics yet. I will do that once I have done some performance analysis to ensure the
current approach does not cause performance degradation for perf-critical bpf applications.

docs/LangRef.rst
16961

I included inst_name only to generate a GEP named similarly to what clang generates, but I agree that inst_name is unnecessary. The <type2> above is the pointer to the array element type.

I did some analysis on the performance side, specifically, # of instructions generated.
My example is cilium at https://github.com/cilium/cilium/tree/master/bpf, bpf program bpf_lxc.c.

(1). First, I made the following change:
-bash-4.4$ git diff
diff --git a/include/llvm/Analysis/TargetTransformInfoImpl.h b/include/llvm/Analysis/TargetTransformInfoImpl.h
index a1e1f9b07aa..544bbc93a4f 100644
--- a/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -781,6 +781,9 @@ public:
     case Intrinsic::coro_suspend:
     case Intrinsic::coro_param:
     case Intrinsic::coro_subfn_addr:
+    case Intrinsic::preserve_array_access_index:
+    case Intrinsic::preserve_union_access_index:
+    case Intrinsic::preserve_struct_access_index:
       // These intrinsics don't actually represent code after lowering.
       return TTI::TCC_Free;
     }
-bash-4.4$
so that all previously inlined functions remain inlined. There is probably a better way to do this than
just TTI::TCC_Free; the change here is only for demonstration purposes.

(2). Make the following change to cilium so the data array is aligned to 8 and
double-word stores are always used.

diff --git a/bpf/lib/icmp6.h b/bpf/lib/icmp6.h
index fcae3d3ed..bb397685d 100644
--- a/bpf/lib/icmp6.h
+++ b/bpf/lib/icmp6.h
@@ -224,7 +224,7 @@ static inline __be32 compute_icmp6_csum(char data[80], __u16 payload_len,
 static inline int icmp6_send_time_exceeded(struct sk_buff *skb, int nh_off)
 {
 	/* FIXME: Fix code below to not require this init */
-	char data[80] = {};
+	char data[80] __attribute__((aligned(8))) = {};
 	struct icmp6hdr *icmp6hoplim;
 	struct ipv6hdr *ipv6hdr;
 	char *upper; /* icmp6 or tcp or udp */

(3).
Compile bpf_lxc.c with and without offset relocation, and compare the number of instructions.
Note that bpf_lxc.c does not use bpf_probe_read, so here we purely measure the cost
of preserve_*_access_index in the generated code.

The function sizes without offset reloc:
-bash-4.4$ llvm-readelf -s bpf_lxc.o | grep FUNC

826: 0000000000000000   336 FUNC    GLOBAL DEFAULT    3 __send_drop_notify
829: 0000000000000000   656 FUNC    GLOBAL DEFAULT   27 handle_policy
830: 0000000000000000   944 FUNC    GLOBAL DEFAULT   29 handle_to_container
831: 0000000000000000   720 FUNC    GLOBAL DEFAULT   17 handle_xgress
832: 0000000000000000  1880 FUNC    GLOBAL DEFAULT   15 tail_handle_arp
833: 0000000000000000 28264 FUNC    GLOBAL DEFAULT   13 tail_handle_ipv4
834: 0000000000000000 29800 FUNC    GLOBAL DEFAULT   11 tail_handle_ipv6
835: 0000000000000000  3272 FUNC    GLOBAL DEFAULT    9 tail_icmp6_handle_ns
836: 0000000000000000  2480 FUNC    GLOBAL DEFAULT    5 tail_icmp6_send_echo_reply
837: 0000000000000000  3968 FUNC    GLOBAL DEFAULT    7 tail_icmp6_send_time_exceeded
838: 0000000000000000 11824 FUNC    GLOBAL DEFAULT   23 tail_ipv4_policy
839: 0000000000000000 12568 FUNC    GLOBAL DEFAULT   25 tail_ipv4_to_endpoint
840: 0000000000000000 14144 FUNC    GLOBAL DEFAULT   19 tail_ipv6_policy
841: 0000000000000000 15328 FUNC    GLOBAL DEFAULT   21 tail_ipv6_to_endpoint

The function sizes with offset reloc enabled:
-bash-4.4$ llvm-readelf -s bpf_lxc.o | grep FUNC

851: 0000000000000000   336 FUNC    GLOBAL DEFAULT    3 __send_drop_notify
854: 0000000000000000   680 FUNC    GLOBAL DEFAULT   27 handle_policy
855: 0000000000000000   960 FUNC    GLOBAL DEFAULT   29 handle_to_container
856: 0000000000000000   744 FUNC    GLOBAL DEFAULT   17 handle_xgress
857: 0000000000000000  1904 FUNC    GLOBAL DEFAULT   15 tail_handle_arp
858: 0000000000000000 29048 FUNC    GLOBAL DEFAULT   13 tail_handle_ipv4
859: 0000000000000000 31808 FUNC    GLOBAL DEFAULT   11 tail_handle_ipv6
860: 0000000000000000  3280 FUNC    GLOBAL DEFAULT    9 tail_icmp6_handle_ns
861: 0000000000000000  2528 FUNC    GLOBAL DEFAULT    5 tail_icmp6_send_echo_reply
862: 0000000000000000  4056 FUNC    GLOBAL DEFAULT    7 tail_icmp6_send_time_exceeded
863: 0000000000000000 12224 FUNC    GLOBAL DEFAULT   23 tail_ipv4_policy
864: 0000000000000000 13088 FUNC    GLOBAL DEFAULT   25 tail_ipv4_to_endpoint
865: 0000000000000000 14792 FUNC    GLOBAL DEFAULT   19 tail_ipv6_policy
866: 0000000000000000 15992 FUNC    GLOBAL DEFAULT   21 tail_ipv6_to_endpoint

In summary, with relocatable GEPs, we get a 0-7% code size regression.

Look at function handle_policy(),
good code:

44:       b7 02 00 00 00 00 00 00 r2 = 0
45:       7b 2a f8 ff 00 00 00 00 *(u64 *)(r10 - 8) = r2
46:       7b 2a f0 ff 00 00 00 00 *(u64 *)(r10 - 16) = r2
47:       b7 02 00 00 00 01 00 00 r2 = 256
48:       7b 2a e8 ff 00 00 00 00 *(u64 *)(r10 - 24) = r2
49:       87 01 00 00 00 00 00 00 r1 = -r1
50:       73 1a e8 ff 00 00 00 00 *(u8 *)(r10 - 24) = r1
51:       bf a2 00 00 00 00 00 00 r2 = r10
52:       07 02 00 00 e8 ff ff ff r2 += -24
53:       18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0 ll
55:       85 00 00 00 01 00 00 00 call 1

regression:

44:       b7 02 00 00 00 00 00 00 r2 = 0
45:       7b 2a f8 ff 00 00 00 00 *(u64 *)(r10 - 8) = r2
46:       7b 2a f0 ff 00 00 00 00 *(u64 *)(r10 - 16) = r2
47:       7b 2a e8 ff 00 00 00 00 *(u64 *)(r10 - 24) = r2
48:       87 01 00 00 00 00 00 00 r1 = -r1
49:       73 1a e8 ff 00 00 00 00 *(u8 *)(r10 - 24) = r1
50:       71 a1 e9 ff 00 00 00 00 r1 = *(u8 *)(r10 - 23)
51:       57 01 00 00 fc 00 00 00 r1 &= 252
52:       47 01 00 00 01 00 00 00 r1 |= 1
53:       73 1a e9 ff 00 00 00 00 *(u8 *)(r10 - 23) = r1
54:       bf a2 00 00 00 00 00 00 r2 = r10
55:       07 02 00 00 e8 ff ff ff r2 += -24
56:       18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0 ll
58:       85 00 00 00 01 00 00 00 call 1

Look at function tail_handle_ipv6(),
good code:

49:       85 00 00 00 19 00 00 00 call 25
50:       15 07 0e 00 80 00 00 00 if r7 == 128 goto +14 <LBB4_7>
51:       55 07 54 00 87 00 00 00 if r7 != 135 goto +84 <LBB4_10>
52:       b7 01 00 00 02 00 00 00 r1 = 2
53:       63 16 34 00 00 00 00 00 *(u32 *)(r6 + 52) = r1
54:       b7 01 00 00 0e 00 00 00 r1 = 14
55:       63 16 30 00 00 00 00 00 *(u32 *)(r6 + 48) = r1
56:       bf 61 00 00 00 00 00 00 r1 = r6
57:       18 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r2 = 0 ll
59:       b7 03 00 00 04 00 00 00 r3 = 4

regression:

49:       85 00 00 00 19 00 00 00 call 25
50:       18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0 ll
52:       63 1a a4 ff 00 00 00 00 *(u32 *)(r10 - 92) = r1
53:       18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0 ll
55:       63 1a a0 ff 00 00 00 00 *(u32 *)(r10 - 96) = r1
56:       18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0 ll
58:       63 1a 9c ff 00 00 00 00 *(u32 *)(r10 - 100) = r1
59:       18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0 ll
61:       63 1a 98 ff 00 00 00 00 *(u32 *)(r10 - 104) = r1
62:       15 08 0e 00 80 00 00 00 if r8 == 128 goto +14 <LBB4_6>
63:       55 08 46 00 87 00 00 00 if r8 != 135 goto +70 <LBB4_10>
64:       b7 01 00 00 02 00 00 00 r1 = 2
65:       63 16 34 00 00 00 00 00 *(u32 *)(r6 + 52) = r1
66:       b7 01 00 00 0e 00 00 00 r1 = 14
67:       63 16 30 00 00 00 00 00 *(u32 *)(r6 + 48) = r1
68:       bf 61 00 00 00 00 00 00 r1 = r6
69:       18 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r2 = 0 ll
71:       b7 03 00 00 04 00 00 00 r3 = 4

Maybe some passes make conservative decisions in the presence of
intrinsic calls, even when the intrinsic is marked as not accessing memory.

I will proceed to do the following:

. Consider how to limit the scope of using the intrinsics. We may introduce
  another, explicit buy-in intrinsic to identify the places where
  preserve_*_access_index is needed. Note that for a typical bpf program,
  in most places, preserve_*_access_index intrinsics are not needed.
. If we proceed with this approach, I will also look at the # of insn regressions
  so we can still have reasonable performance even when the intrinsics are used.
yonghong-song edited the summary of this revision. (Show Details)

removed typename from intrinsics, using metadata (ditype) instead

@eli.friedman @rsmith @lebedev.ri I just uploaded a new set of patches. Notable changes include:

  • introduce the clang builtin base = __builtin_preserve_access_index(base), which is used to identify relocatable GEPs. This is a buy-in approach from the user, so we do not pay the performance penalty of blindly transforming GEPs to intrinsics and later back to GEPs.
  • for the IR intrinsics preserve_*_access_index, struct/union names are removed from the argument list. Instead, a new metadata kind, MD_preserve_access_index, is introduced. This metadata carries the struct/union debuginfo type and is attached to the preserve_*_access_index calls.

The notable changes in the BPF backend:

  • During IR processing, we create a global variable to represent the relocatable GEP. This global also has the MD_preserve_access_index metadata attached, which is available later in the AsmPrinter, to avoid any type-name comparison.
  • A BPF subtarget feature, checkoffsetreloc, is implemented to warn users of any bpf_probe_read() without a relocatable GEP. This is mostly for debugging.

Please let me know what you think. Thanks!

I don't think I will be of any help here.

docs/LangRef.rst
16960

what's anons ?

lebedev.ri resigned from this revision.May 23 2019, 10:28 AM
yonghong-song edited the summary of this revision. (Show Details)Jul 3 2019, 9:38 AM

@eli.friedman has reviewed a related patch (https://reviews.llvm.org/D61809) and is okay with the whole approach. @ast is also okay with the patch, so I will start to land it soon. Thanks!

This revision was not accepted when it landed; it landed in state Needs Review.Jul 8 2019, 10:11 AM
This revision was automatically updated to reflect the committed changes.
jdoerfert reopened this revision.Jul 10 2019, 6:31 PM

This patch lacks tests for various parts (intrinsics, metadata, ...), and descriptions are missing.

@efriedma @eli.friedman

llvm/trunk/docs/LangRef.rst
17348 ↗(On Diff #208452)

I fail to see what llvm.preserve.array.access.index preserves, at least given this description.

17386 ↗(On Diff #208452)

Why the 2 in type2? (Also above.)

17397 ↗(On Diff #208452)

Where is the description of this metadata?

@jdoerfert Thanks. I will address your comments and submit standalone test cases for these intrinsics. I do have some BPF unit tests using these intrinsics, but I agree that standalone unit tests are also needed.

yonghong-song abandoned this revision.Aug 1 2019, 3:51 PM

Abandoning this revision for now, as one of its variants has been merged.