Page MenuHomePhabricator

[DemandedBits] Improve accuracy of Add propagator
Needs RevisionPublic

Authored by rrika on Jan 8 2020, 3:33 PM.

Details

Summary

The current demand propagator for addition will mark all input bits at and right of the alive output bit as alive. But carry won't propagate beyond a bit for which both operands are zero (or one/zero in the case of subtraction) so a more accurate answer is possible given known bits.

I derived a propagator by working through truth tables and using a bit-reversed addition to make demand ripple to the right, but I'm not sure how to make a convincing argument for its correctness in the comments yet. Nevertheless, here's a minimal implementation and test to get feedback.

This would help in a situation where, for example, four bytes (<128) packed into an int are added with four others SIMD-style but only one of the four results is actually read.

Known A:     0_______0_______0_______0_______
Known B:     0_______0_______0_______0_______
AOut:        00000000001000000000000000000000
AB, current: 00000000001111111111111111111111
AB, patch:   00000000001111111000000000000000

Diff Detail

Event Timeline

rrika created this revision.Jan 8 2020, 3:33 PM

Can an exhaustive test be added for this?

rrika added a comment.EditedJan 9 2020, 2:19 AM

By exhaustive test you mean an .ll file, or a program looping over all values? In the process of writing this I found a mistake (the final '+1' only applies to additions, not subtractions) so I'll edit the patch in a bit.

UPDATE: The original propagator was fine. It was the source below that had a mistake. I marked the changes below with '// EDIT'

// exhaustive test for add demand propagator on 4 bit integers

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
typedef uint8_t u8;
typedef uint8_t u16;
typedef uint32_t u32;

u8 flip_nybble[256] = {
	0x0, 0x8, 0x4, 0xC, 0x2, 0xA, 0x6, 0xE, 0x1, 0x9, 0x5, 0xD, 0x3, 0xB, 0x7, 0xF
};

static int mode; // 0=add 1=sub
static u8 a, b, ka, kb, aout;
static u8 aa, ab;

void fail (int x, int y) {
	u8 ab = 0xf & (mode ? a-b : a+b);
	u8 xy = 0xf & (mode ? x-y : x+y);
	char c = mode ? '-' : '+';
	printf(
		"propagator failed:\n"
		"  a=%01x ka=%01x aa=%01x   x=%01x\n"
		"  b=%01x kb=%01x ab=%01x   y=%01x\n"
		"a%cb=%01x    aout=%01x x%cy=%01x\n",
		a, ka, aa, x,
		b, kb, ab, y,
		c, ab, aout, c, xy
	);
	exit(1);
}

void confirm_add (u8 x, u8 y) {
	x = (x & ~aa) | a; // a and x match in alive bits
	y = (y & ~ab) | b; // b and y match in alive bits
	int ok = (((a+b) ^ (x+y)) & aout) == 0; // then a+b and x+y must match in alive bits
	if (!ok)
		fail(x, y);
}

void confirm_sub (u8 x, u8 y) {
	x = (x & ~aa) | a; // a and x match in alive bits
	y = (y & ~ab) | b; // b and y match in alive bits
	int ok = (((a-b) ^ (x-y)) & aout) == 0; // then a-b and x-y must match in alive bits
	if (!ok)
		fail(x, y);
}

void confirm_adds () {
	for (u32 xy=0; xy < 256; xy++) {
		u8 x = (xy >> 0) & 0xf;
		u8 y = (xy >> 4) & 0xf;
		confirm_add(x, y);
	}
}

void confirm_subs () {
	for (u32 xy=0; xy < 256; xy++) {
		u8 x = (xy >> 0) & 0xf;
		u8 y = (xy >> 4) & 0xf;
		confirm_sub(x, y);
	}
}


int main (int argc, char **argv) {
	(void) argc;
	(void) argv;
	for (mode=0; mode<2; mode++) // 0 = add, 1 = sub
	for (u32 t=0; t < 0x100000; t++) {
		a    = (t >>  0) & 0xf;
		b    = (t >>  4) & 0xf;
		ka   = (t >>  8) & 0xf;
		kb   = (t >> 12) & 0xf;
		aout = (t >> 16) & 0xf;

		u8 a0 = ~a & ka;
		u8 a1 =  a & ka;
		u8 b0 = ~b & kb;
		u8 b1 =  b & kb;

		// now the propagator
		// EDIT:
		u8 bound1 = mode ? a1 : a0;
		u8 bound2 = b0;
		// u8 bound1 = a0;
		// u8 bound2 = mode ? b1 : b0;
		u8 bound = bound1 & bound2;
		u8 rbound = flip_nybble[bound];
		u8 raout = flip_nybble[aout];
		u8 rprop = raout + (raout | ~rbound);
		u8 rq = (rprop ^ ~(raout | rbound));
		rq &= 0xf;

		u8 q = flip_nybble[rq];
		u8 ua = bound1 | ~bound2;
		u8 ub = bound2 | ~bound1;
		// EDIT:
		aa = aout | (q & (ua | (bound1 + bound2 + 1)));
		ab = aout | (q & (ub | (bound1 + bound2 + 1)));
		// aa = aout | (q & (ua | (bound1 + bound2 + (mode?0:1))));
		// ab = aout | (q & (ub | (bound1 + bound2 + (mode?0:1))));

		aa &= 0xf;
		ab &= 0xf;

		if (mode == 0)
			confirm_adds(a, b, aa, ab, aout);
		else
			confirm_subs(a, b, aa, ab, aout);
	}
	return 0;
}

By exhaustive test you mean an .ll file, or a program looping over all values?

No, a unit test.
See llvm/unittests/IR/ConstantRangeTest.cpp for inspiration.

In the process of writing this I found a mistake (the final '+1' only applies to additions, not subtractions) so I'll edit the patch in a bit.

(That's why such a test should exist)

nikic added a subscriber: nikic.Jan 9 2020, 2:50 AM
rrika updated this revision to Diff 237532.Jan 12 2020, 5:22 AM

Here's an update to confirm that the unit test is going in the right direction. DemandedBits::determineLiveOperandBits is private, so right now it seems to be necessary to build up all this IR before each query. I have yet to read up on what allocation responsibilities are, because creating new llvm contexts in a hot loop doesn't feel like the way it's meant to be done.

nikic added a comment.Jan 12 2020, 5:43 AM

@rrika Performing these kinds of tests on IR is both quite inefficient and unnecessary complex. It would be better to extract the code logic (which just operates on APInts) into a separate file, which will allow you to easily test it, and also share the logic with other places doing demanded bits analysis, such as InstCombine and SelectionDAG.

lebedev.ri requested changes to this revision.Jan 13 2020, 1:36 AM

@rrika Performing these kinds of tests on IR is both quite inefficient and unnecessary complex. It would be better to extract the code logic (which just operates on APInts) into a separate file, which will allow you to easily test it, and also share the logic with other places doing demanded bits analysis, such as InstCombine and SelectionDAG.

This revision now requires changes to proceed.Jan 13 2020, 1:36 AM
rrika added a comment.Jan 13 2020, 5:03 AM

@rrika Performing these kinds of tests on IR is both quite inefficient and unnecessary complex. It would be better to extract the code logic (which just operates on APInts) into a separate file, which will allow you to easily test it, and also share the logic with other places doing demanded bits analysis, such as InstCombine and SelectionDAG.

Are you suggesting factoring out the logic for addition/subtraction only or all opcodes? Where should this new function live?

nikic added a comment.Jan 13 2020, 5:25 AM

@rrika Performing these kinds of tests on IR is both quite inefficient and unnecessary complex. It would be better to extract the code logic (which just operates on APInts) into a separate file, which will allow you to easily test it, and also share the logic with other places doing demanded bits analysis, such as InstCombine and SelectionDAG.

Are you suggesting factoring out the logic for addition/subtraction only or all opcodes? Where should this new function live?

At least to start with, only addition/subtraction is fine. I'm not totally sure where it would live, but as we have KnownBits in Support, possibly there?

rrika updated this revision to Diff 252731.Mar 25 2020, 8:25 PM

Changes:

  • The test no longer builds a temporary IR. The propagator is instead exposed as a static member function on llvm::DemandedBits.
  • The propagator now also recognizes a pair of one bits as blocking carry.
RKSimon added a subscriber: regehr.

Adding @regehr and @nikic as they might have some thoughts on this

If possible I'd like to see some tests in InstCombine/InstSimplify that demonstrates the benefit of this.

Does anyone have comments about the exhaustive tests? 4 bits shouldn't take very long to complete but I'm not sure about general policy on this approach.

llvm/include/llvm/Analysis/DemandedBits.h
76

Would it make sense to move this into KnownBits so DAG/GlobalIsel can reuse it more easily?

This seems okay to me in general, assuming i'm not misreading the tests and they are actually exhaustive.

Adding @regehr and @nikic as they might have some thoughts on this

If possible I'd like to see some tests in InstCombine/InstSimplify that demonstrates the benefit of this.

I too would like to see some such test.

Does anyone have comments about the exhaustive tests? 4 bits shouldn't take very long to complete but I'm not sure about general policy on this approach.

That is fine by me. This is what is being done for all the good ConstantRange tests already anyway.

llvm/include/llvm/Analysis/DemandedBits.h
64–66

I find the naming scheme confusing. Why [a]live?
What we are deducing here, is what bits of operand OperandNo
are actually *demanded* given the demanded bits of result of operation, no?

llvm/lib/Analysis/DemandedBits.cpp
482–483

Assert only one carry is selected?

llvm/unittests/IR/DemandedBitsTest.cpp
17–38

I think there's some intersection with helpers defined in KnownBitsTest.cpp
It would be best to move them into one common header (KnownBitsTest.h?) and use them consistently.

18

I think this is just ForeachKnownBits() from KnownBitsTest.cpp?

31

This looks like to be ForeachNumInKnownBits()?

lebedev.ri requested changes to this revision.Tue, May 5, 11:48 PM

as per my last comment

This revision now requires changes to proceed.Tue, May 5, 11:48 PM