This PR tries to match full multiplication pattern i64 x i64 -> i128 done by 4 i32 x i32 -> i64 multiplication and meshing the results of those.
This pattern has two outputs: high & low parts and it makes the matching a bit difficult especially when you consider this is my first pattern matcher.
Currently high and low parts are mapped independently what result in generation of two multiplications. I have 3 ideas how to fix this, but suggestions welcome:
- Find another pass capable of merging the same multiplications. I tried InstCombine, but instead of merging 2 identical i128 multiplications it rather truncates on of them.
- Separate pattern matching from instruction rewrite. Firstly find all patterns and remember them in a worklist. Later try to map patters for low and high by their arguments.
- When on of the patterns is found, try to find the pattern for the other part by traversing basic block further.
IIUC, this is not a problem with _correctness_, right? We are protected against removing an instruction whose output still has live uses? But we're worried that the intermediate outputs will all have so many uses that we'll end up generating our MUL *and* keeping all those intermediate instructions, and so the codegen will be bigger than if we'd left it alone.
If I've understood the problem correctly, then I think @chfast's proposed solution is correct: you should do this optimization only if every intermediate result is completely dead (or can be replaced by a corresponding intermediate result of the new code). The vast majority of cases where we want this optimization to fire will be cases where all the intermediate results are dead.