tensor.pad is lowered to tensor.generate + tensor.insert_slice during bufferization. For best performance with constant padding values, users should vectorize the IR before bufferizing it.
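To illustrate the lowering described above, here is a minimal sketch with static shapes. The function names, value names, and shapes are hypothetical and chosen only for illustration. The IR actually produced may differ in detail; for example, dynamic shapes would need the result sizes to be computed from the source sizes and the padding amounts.

```mlir
// Before: pad a 2x3 tensor by one element on every side with a constant.
func.func @pad_before(%src: tensor<2x3xf32>) -> tensor<4x5xf32> {
  %cst = arith.constant 0.0 : f32
  %padded = tensor.pad %src low[1, 1] high[1, 1] {
  ^bb0(%i: index, %j: index):
    tensor.yield %cst : f32
  } : tensor<2x3xf32> to tensor<4x5xf32>
  return %padded : tensor<4x5xf32>
}

// After (conceptually): tensor.generate fills the whole result with the
// padding value, then tensor.insert_slice copies the source into place.
func.func @pad_after(%src: tensor<2x3xf32>) -> tensor<4x5xf32> {
  %cst = arith.constant 0.0 : f32
  %gen = tensor.generate {
  ^bb0(%i: index, %j: index):
    tensor.yield %cst : f32
  } : tensor<4x5xf32>
  %res = tensor.insert_slice %src into %gen[1, 1] [2, 3] [1, 1]
      : tensor<2x3xf32> into tensor<4x5xf32>
  return %res : tensor<4x5xf32>
}
```

Because tensor.generate bufferizes to loops, the fill in the second form is element-wise, which is why vectorizing before bufferization is recommended for constant padding values.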
This change also relaxes the restriction that no new ops that bufferize to a memory write may be added during bufferization. Since bufferization was split into two steps a while ago (tensor copy insertion + bufferization), it is reasonable to allow this now.
This change also adds comments and warnings noting that tensor.pad bufferizes to tensor.generate, which in turn jumps the abstraction gap and bufferizes to loops.
It also adds a TODO to reevaluate this once there is a higher-level representation of generate on buffers, so that the jump can be avoided.