Folding of the tosa.transpose operation is both time and memory
intensive as the underlying ElementsAttr is processed as a sequence of
Attributes. This change attempts to operate on the underlying raw data of
the ElementsAttr.
In an example resnet50 network, this change reduces the time spent in
folding transpose ops from 35s to 1.5s.
I haven't looked at Compiler Explorer, but I'd be willing to bet that mixing signed and unsigned arithmetic in the indexing is not generating the best code. It may be worth going completely signed (unless this gets refactored so it is no longer hot, per below).
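
For illustration, here is a rough standalone sketch of what an all-signed transpose over raw data could look like. It is not the code from this patch: `transposeRaw` and its parameters are made-up names, it uses only the standard library rather than MLIR types, and I haven't benchmarked it. It assumes a dense row-major buffer and the tosa.transpose convention that output dimension i is input dimension perms[i].

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper, not part of MLIR or of this patch.
// Transposes a dense row-major buffer of fixed-size elements.
// Output dimension i corresponds to input dimension perms[i].
// All index arithmetic is done in int64_t, so signed and unsigned never mix.
std::vector<char> transposeRaw(const std::vector<char> &input,
                               const std::vector<int64_t> &inShape,
                               const std::vector<int64_t> &perms,
                               int64_t elemSize) {
  const int64_t rank = static_cast<int64_t>(inShape.size());

  // Row-major strides of the input, measured in elements.
  std::vector<int64_t> inStrides(rank, 1);
  for (int64_t i = rank - 2; i >= 0; --i)
    inStrides[i] = inStrides[i + 1] * inShape[i + 1];

  // Output shape, plus the input stride to use for each output dimension.
  std::vector<int64_t> outShape(rank), permStrides(rank);
  int64_t numElements = 1;
  for (int64_t i = 0; i < rank; ++i) {
    outShape[i] = inShape[perms[i]];
    permStrides[i] = inStrides[perms[i]];
    numElements *= outShape[i];
  }

  std::vector<char> output(input.size());
  std::vector<int64_t> index(rank, 0); // current output multi-index
  int64_t src = 0;                     // input element offset for `index`
  for (int64_t dst = 0; dst < numElements; ++dst) {
    std::memcpy(output.data() + dst * elemSize,
                input.data() + src * elemSize,
                static_cast<std::size_t>(elemSize));
    // Advance the multi-index like an odometer and update src incrementally,
    // which avoids a full index computation per element.
    for (int64_t d = rank - 1; d >= 0; --d) {
      src += permStrides[d];
      if (++index[d] < outShape[d])
        break;
      src -= permStrides[d] * outShape[d];
      index[d] = 0;
    }
  }
  return output;
}
```

The inner loop only does additions and one compare in the common case, so whether the signedness matters much after a change like this would need to be measured.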