This is an archive of the discontinued LLVM Phabricator instance.

[Bitfield] Add more cases to making the bitfield a separate location
ClosedPublic

Authored by spetrovic on Oct 18 2017, 6:57 AM.

Details

Summary

This patch allows bitfield runs to be split even when the current field is not a legal integer type. For example:

struct S3 {
  unsigned long a1:28;
  unsigned long a2:4;
  unsigned long a3:12;
};

struct S4 {
  unsigned long long b1:61;
  unsigned long b2:3;
  unsigned long b3:3;
};

At the moment, S3 is an i48 type and S4 an i72 type; with this change, S3 is treated as i32 + i16, and S4 as i32 + i32 + i8.
With this patch we avoid unaligned loads and stores, at least on the MIPS architecture.
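
As a rough sketch of the layout change described above, in LLVM IR terms (the *.split type names are purely illustrative, not something the patch emits):

; Current layout: one wide, non-legal-width storage unit per bitfield run.
%struct.S3 = type { i48 }
%struct.S4 = type { i72 }

; With the patch (sketch): the same runs split into legal integer units.
%struct.S3.split = type { i32, i16 }
%struct.S4.split = type { i32, i32, i8 }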

Diff Detail

Event Timeline

spetrovic created this revision.Oct 18 2017, 6:57 AM
hfinkel edited edge metadata.Oct 23 2017, 5:41 PM

With this patch we avoid unaligned loads and stores, at least on the MIPS architecture.

Can you please explain the nature of the problematic situations?

Also, this patch does not update the comments that describe the splitting algorithm; those should be updated as well.

If this is just addressing the classic problem that, once we narrow a load, we can't re-widen it, it might be best to solve this another way (e.g., by adding metadata noting the dereferenceable range).

Well, basically I'm just extending the existing algorithm: why should we split fields only when the current field is a legal integer type? I'm not solving a MIPS-specific problem with unaligned loads/stores.

In this example:

typedef struct {
  unsigned int f1 : 28;
  unsigned int f2 : 4;
  unsigned int f3 : 12;
} S5;

S5 *cmd;

void foo() {
  cmd->f3 = 5;
}

With this patch, code size improves not only on MIPS but also on X86 and ARM. If structure S5 is treated as an i48 type, extra instructions are needed to read it, and not just on MIPS. Take the MIPS results as an example:

Output without the patch:

0000000000000000 <foo>:

 0:	3c010000 	lui	at,0x0
 4:	0039082d 	daddu	at,at,t9
 8:	64210000 	daddiu	at,at,0
 c:	dc210000 	ld	at,0(at)
10:	dc210000 	ld	at,0(at)
14:	68220000 	ldl	v0,0(at)
18:	6c220007 	ldr	v0,7(at)
1c:	64030005 	daddiu	v1,zero,5
20:	7c62fd07 	dins	v0,v1,0x14,0xc
24:	b0220000 	sdl	v0,0(at)
28:	03e00008 	jr	ra
2c:	b4220007 	sdr	v0,7(at)

Output with the patch:

0000000000000000 <foo>:

 0:	3c010000 	lui	at,0x0
 4:	0039082d 	daddu	at,at,t9
 8:	64210000 	daddiu	at,at,0
 c:	dc210000 	ld	at,0(at)
10:	dc210000 	ld	at,0(at)
14:	94220004 	lhu	v0,4(at)
18:	24030005 	li	v1,5
1c:	7c62f904 	ins	v0,v1,0x4,0x1c
20:	03e00008 	jr	ra
24:	a4220004 	sh	v0,4(at)

This is a simple example; in more complicated examples the improvement is larger.

I think this is part of the slippery slope we didn't want to go down. We introduced this mode in the first place only to resolve a store-to-load forwarding problem that is theoretically unsolvable by any local lowering decisions. This, on the other hand, looks like a local code-generation problem that we can/should fix in the backend. If I'm wrong, we should consider this as well.

I was looking into whether I can resolve this problem in the backend. Example:

C code:

typedef struct {
  unsigned int f1 : 28;
  unsigned int f2 : 4;
  unsigned int f3 : 12;
} S5;

void foo(S5 *cmd) {
  cmd->f3 = 5;
}

.ll file (without the patch):

%struct.S5 = type { i48 }

define void @foo(%struct.S5* nocapture %cmd) local_unnamed_addr #0 {
entry:
  %0 = bitcast %struct.S5* %cmd to i64*
  %bf.load = load i64, i64* %0, align 4
  %bf.clear = and i64 %bf.load, -4293918721
  %bf.set = or i64 %bf.clear, 5242880
  store i64 %bf.set, i64* %0, align 4
  ret void
}

The load (%bf.load = load i64, i64* %0, align 4) is NON_EXTLOAD, and the backend later handles it as an unaligned load. So one possible backend solution would be for each architecture to try to match the node pattern and replace unaligned loads with normal loads in specific cases, but I'm not sure how feasible that is. At the moment, the frontend solution seems better to me. What do you think?
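
For comparison, here is a rough sketch of the IR shape the patch aims for on the same function (the split layout, bit placement, and mask constants below are illustrative assumptions, not the patch's actual output):

; Sketch: assumes f1+f2 end up in an i32 unit and f3 alone in an i16 unit,
; with f3 in the low 12 bits of that unit (little-endian layout).
%struct.S5.split = type { i32, i16 }

define void @foo(%struct.S5.split* nocapture %cmd) local_unnamed_addr {
entry:
  ; Only the i16 storage unit holding f3 is touched.
  %f3.addr = getelementptr inbounds %struct.S5.split, %struct.S5.split* %cmd, i64 0, i32 1
  %bf.load = load i16, i16* %f3.addr, align 4
  %bf.clear = and i16 %bf.load, -4096   ; clear the 12 bits of f3
  %bf.set = or i16 %bf.clear, 5         ; insert the new value 5
  store i16 %bf.set, i16* %f3.addr, align 4
  ret void
}

The narrower load/store is what lets the MIPS backend select lhu/sh instead of the ldl/ldr/sdl/sdr sequences shown earlier.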

wmi edited edge metadata.Nov 13 2017, 11:16 AM

I think it may be hard to fix the problem in the backend. It will face the same store-to-load forwarding issue if the transformation happens in some places but somehow not in others.

But I am not sure whether it is acceptable to add more fine-grained bitfield access transformations in the frontend. It is a tradeoff. From my experience trying your patch on x86-64, I saw some instructions saved, because with the transformation we now use shorter immediate constants, which can be folded into other instructions instead of being loaded in separate moves. But the bit operations are not saved (I may be wrong), so I have no idea whether it is critical for performance. (To provide some background, we introduced fine-grained bitfield access because the original problem introduced a serious performance issue in some critical libraries. But doing this kind of bitfield transformation in the frontend can potentially harm other applications. That is why it is off by default, and Hal had concerns about adding more such transformations.) Do you have performance numbers on some benchmarks to justify its importance? That may help folks make a better tradeoff here.
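
To make the immediate-constant point concrete, a minimal sketch (the x86-64 encoding consequence stated in the comments is an assumption about typical codegen, not measured output):

; Wide storage unit: the 64-bit mask 0xFFFFFFFF000FFFFF (-4293918721) does not
; fit in a sign-extended 32-bit immediate, so x86-64 typically has to
; materialize it with a separate movabs before the and.
define i64 @clear_wide(i64 %bf.load) {
  %bf.clear = and i64 %bf.load, -4293918721
  ret i64 %bf.clear
}

; Narrow storage unit: the 16-bit mask 0xF000 (-4096) is an ordinary immediate,
; so the constant folds directly into the and and no extra move is needed.
define i16 @clear_narrow(i16 %bf.load) {
  %bf.clear = and i16 %bf.load, -4096
  ret i16 %bf.clear
}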

I tried compiling some important libraries within Chromium for X86 and MIPS64 with clang/llvm, comparing LLVM trunk against LLVM trunk with my patch. There is a code size improvement in many libraries; here are some results:

  • X86

libframe ~1.5%
librendering ~1.2%
liblayout_2 ~0.7%
libpaint_1 ~0.65%
libglslang ~0.5%
libediting_3 ~0.5%
libcore_generated ~0.5%
libanimation_1 ~0.5%

  • Mips64

libglslang ~3.2%
libframe ~1.5%
liblayout_0 ~1.2%
libGLESv2 ~1%
librendering ~0.7%
liblayout_1 ~0.5%
libcss_9 ~0.5%

Those are just some of the libraries where we see a code size improvement with this patch. I have also tried building LLVM with LLVM (with my patch) for x86 and MIPS64, and there is a code size improvement on both architectures as well. For example, within the clang executable for MIPS64, ~350 unaligned loads are removed. So, what do you think about it?

This sounds like a valid improvement. Can we have this code committed?

wmi added a comment.Jan 22 2018, 10:44 PM

Thanks for the size evaluation. I regard the change as a natural and limited extension to the current fine-grained bitfield access mode, so I am OK with it. Hal, what do you think?

petarj added a comment.Mar 8 2018, 8:16 AM

Is everyone OK with the patch now?

mgrang added a subscriber: mgrang.

Added @apazos and myself as reviewers, since this is something we would be interested in getting to work.

With this patch we get a ~1800-byte improvement on one of our internal codebases. I also ran SPEC2000/SPEC2006 validations (for RISC-V) and there were no regressions. There were also no code size improvements seen in SPEC.

asb added a subscriber: asb.Apr 19 2018, 4:38 AM
asb added a comment.Apr 19 2018, 6:34 AM

Hi @spetrovic - I think Hal Finkel's earlier request to update the comments of IsBetterAsSingleFieldRun remains unaddressed. It looks like at least the comment string immediately before auto IsBetterAsSingleFieldRun needs to be updated to reflect the changed behaviour.

The documentation for the -ffine-grained-bitfield-accesses option may also need to be updated. It currently reads "Use separate accesses for bitfields with legal widths and alignments." I think "Use separate accesses for consecutive bitfield runs with legal widths and alignments" might better reflect the behaviour after this patch. Do you agree?

Here is a test case which improves with this patch (for the RISCV target). The compiler is now able to use halfword load/store instructions for the 16-bit bitfields.

struct st {
  int a:1;
  int b:8;
  int c:11;
  int d:12;
  int e:16;
  int f:16;
  int g:16;
} S;

void foo(int x) {
  S.e = x;
}

GCC:

00000000 <foo>:
   0:	000007b7          	lui	x15,0x0
   4:	00a79223          	sh	x10,4(x15) # 4 <foo+0x4>
   8:	00008067          	jalr	x0,0(x1)

LLVM without this patch:

00000000 <foo>:
   0:	000105b7          	lui	x11,0x10
   4:	fff58593          	addi	x11,x11,-1 # ffff <foo+0xffff>
   8:	00b57533          	and	x10,x10,x11
   c:	000005b7          	lui	x11,0x0
  10:	00058593          	addi	x11,x11,0 # 0 <foo>
  14:	0045a603          	lw	x12,4(x11)
  18:	ffff06b7          	lui	x13,0xffff0
  1c:	00d67633          	and	x12,x12,x13
  20:	00a66533          	or	x10,x12,x10
  24:	00a5a223          	sw	x10,4(x11)
  28:	00008067          	jalr	x0,0(x1)

LLVM with this patch:

00000000 <foo>:
   0:	000005b7          	lui	x11,0x0
   4:	00058593          	addi	x11,x11,0 # 0 <foo>
   8:	00a59223          	sh	x10,4(x11)
   c:	00008067          	jalr	x0,0(x1)
spetrovic updated this revision to Diff 143918.Apr 25 2018, 6:45 AM

Comments addressed

apazos added a comment.May 9 2018, 8:59 PM

Thanks for updating the patch, @spetrovic.
Can we have this committed?
This patch has been shown to produce code size improvements for a number of targets (MIPS, X86, ARM, RISC-V).

asb added a comment.May 9 2018, 9:06 PM

Just wanted to explicitly say that I'm happy the updated patch reflects the changes to docs and comments I requested. @hfinkel - are you happy for this to land now?

In D39053#1093977, @asb wrote:

Just wanted to explicitly say that I'm happy the updated patch reflects the changes to docs and comments I requested. @hfinkel - are you happy for this to land now?

Yes, go ahead.

Some of these benefits might still be capturable using the regular method and backend improvements, but with this under the option it should be easier to identify where those opportunities are.

This revision was not accepted when it landed; it landed in state Needs Review.May 10 2018, 5:35 AM
This revision was automatically updated to reflect the committed changes.