If a user only cares about the overall latency, the quickest approach is to change Pipeline::run() so that it returns the total number of cycles to the caller.
When the simulation pipeline is run, the number of cycles (or an error) is returned from method Pipeline::run().
The advantage is that no hardware event listener is needed to compute that latency, so the whole process should be faster and simpler for that particular use case.
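
As a rough illustration of the "cycles or an error" contract described above, here is a minimal, self-contained sketch. It is not the actual llvm-mca code: ToyPipeline, SimError and CyclesOrError are hypothetical names, the latency model (fixed per-instruction latencies, executed strictly in order) is invented for the example, and std::variant stands in for whatever error-carrying return type the real Pipeline::run() uses.

```cpp
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical error type for this sketch; not an LLVM class.
struct SimError {
  std::string Message;
};

// Either the total number of simulated cycles, or an error.
using CyclesOrError = std::variant<unsigned, SimError>;

// Toy stand-in for a simulation pipeline (assumption: each instruction
// has a fixed latency and instructions never overlap).
class ToyPipeline {
  std::vector<unsigned> Latencies;

public:
  explicit ToyPipeline(std::vector<unsigned> L) : Latencies(std::move(L)) {}

  // Runs the whole simulation and hands the cycle count straight back to
  // the caller, so no event listener is needed to observe the latency.
  CyclesOrError run() const {
    unsigned Cycles = 0;
    for (unsigned Lat : Latencies) {
      // Treat a zero latency as invalid input for this toy model.
      if (Lat == 0)
        return SimError{"instruction with zero latency"};
      Cycles += Lat;
    }
    return Cycles;
  }
};
```

A caller would check which alternative is held (for example with std::holds_alternative<unsigned>) before reading the cycle count. In LLVM itself, the natural type for "a value or an error" would be llvm::Expected<unsigned>; the sketch uses std::variant only to stay self-contained.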