The outer parallel loops of a linalg operation are lowered to
loop.parallel, with the other loops lowered to loop.for. This brings the
lowering to loop.parallel on par with the loop.for lowering. In the
future, the reduction loop could also be lowered to loop.parallel.
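
As a rough illustration of the split described above (not the actual
C++ implementation), the sketch below assumes the loops are described by
a list of iterator-type strings and that "outer parallel loops" means the
leading run of "parallel" entries. The function name classify_loops is
hypothetical.

```python
from itertools import takewhile


def classify_loops(iterator_types):
    """Map each loop to the op it would be lowered to (illustrative only).

    Assumption: only the leading run of "parallel" iterators counts as the
    outer parallel loops; every loop after that becomes a loop.for.
    """
    # Count the parallel iterators that appear before any other kind.
    num_outer_parallel = len(
        list(takewhile(lambda kind: kind == "parallel", iterator_types))
    )
    return [
        "loop.parallel" if index < num_outer_parallel else "loop.for"
        for index in range(len(iterator_types))
    ]


# A matmul-like op: two parallel loops followed by one reduction loop.
print(classify_loops(["parallel", "parallel", "reduction"]))
```

Under this assumption, a parallel loop nested inside a reduction loop
would still be lowered to loop.for.
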
Also add a utility function that returns the loops that are
created. This requires a change to the EDSC builders so that they
return the created ops.
Is there a need to match all of the trailing 'step %{{.*}}'? You always print the step, right?