implicit_task_end callbacks in nested parallel regions did not always report the correct thread_num, because the inner parallel region may already have been finalized by the time the callback is invoked.
The thread_num is now stored when the implicit task begins and retrieved at its end whenever necessary.
A test case was added as well.
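
The store-at-begin, read-at-end idea can be sketched roughly as follows. This is only an illustration of the pattern, not the runtime's code: team_t, task_info_t, run_demo and the helper functions are hypothetical stand-ins, and the real data structures and callback dispatch differ.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins for runtime structures; not the actual libomp types. */
typedef struct team {
  int nthreads;
} team_t;

typedef struct task_info {
  team_t *team;   /* may already be gone when the end callback is invoked */
  int thread_num; /* cached when the implicit task begins */
} task_info_t;

static int last_reported_thread_num = -1;

/* Stand-in for dispatching the implicit_task_end callback to a tool. */
static void report_implicit_task_end(int thread_num) {
  last_reported_thread_num = thread_num;
  printf("implicit_task_end: thread_num=%d\n", thread_num);
}

/* At the beginning of the implicit task the team is still valid,
   so the thread_num is recorded in the task's own bookkeeping. */
static void implicit_task_begin(task_info_t *ti, team_t *team, int thread_num) {
  ti->team = team;
  ti->thread_num = thread_num;
}

/* At the end, only the cached value is used; the team is not consulted. */
static void implicit_task_end(task_info_t *ti) {
  report_implicit_task_end(ti->thread_num);
}

/* Simulates an inner region being finalized before the end callback. */
int run_demo(void) {
  team_t *inner = malloc(sizeof *inner);
  if (inner == NULL)
    return -1;
  inner->nthreads = 4;

  task_info_t ti;
  implicit_task_begin(&ti, inner, 3);

  free(inner); /* inner parallel region finalized early */
  ti.team = NULL;

  implicit_task_end(&ti);
  return last_reported_thread_num;
}
```

Calling run_demo() reports the thread_num recorded at begin even though the simulated inner team has already been freed, which is the situation the fix addresses.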