
Per Thread Fast Forwarding #138

Open
vijay4454 opened this issue Sep 27, 2016 · 13 comments

Comments

@vijay4454

Hello

I am wondering if there is an easy way to do fast-forwarding for a specific thread (thread 0) in a single-process, multi-threaded simulation. I have a pthread program that I need to simulate on a large core-count system. One of the cores is OoO while the others are simple in-order cores. I need ZSim to ignore (not count) the cycles spent by the main thread (which runs on the OoO core) in specific functions.

I tried implementing this feature in ZSim by instrumenting the binary and placing handlers before and after those specific functions (whose names I pass to the Pin tool through pin_cmd.cpp). Inside the handler code (which takes the thread ID as an argument), I invoke EnterFastForward() or ExitFastForward() as appropriate. However, I realized that this fast-forwards the entire process, which means the other threads besides thread 0 are fast-forwarded as well.
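For reference, a minimal sketch of that kind of routine-level instrumentation using Pin's RTN API; the handler names and the function name "my_api_call" are placeholders, not the actual handlers or names passed through pin_cmd.cpp:

```cpp
#include "pin.H"

// Placeholder analysis routines; in the setup described above they would call
// EnterFastForward()/ExitFastForward(), which (as noted) act on the whole process.
VOID OnFuncEntry(THREADID tid) { /* EnterFastForward(); */ }
VOID OnFuncExit(THREADID tid)  { /* ExitFastForward();  */ }

VOID InstrumentImage(IMG img, VOID* v) {
    // "my_api_call" stands in for one of the function names passed via pin_cmd.cpp.
    RTN rtn = RTN_FindByName(img, "my_api_call");
    if (RTN_Valid(rtn)) {
        RTN_Open(rtn);
        RTN_InsertCall(rtn, IPOINT_BEFORE, (AFUNPTR)OnFuncEntry, IARG_THREAD_ID, IARG_END);
        // IPOINT_AFTER is only reached on a normal return path.
        RTN_InsertCall(rtn, IPOINT_AFTER,  (AFUNPTR)OnFuncExit,  IARG_THREAD_ID, IARG_END);
        RTN_Close(rtn);
    }
}
// Registered in the tool's main() with IMG_AddInstrumentFunction(InstrumentImage, 0).
```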

Is there an easy way to get around this problem and fast-forward just thread 0? If not, what would you recommend as the least intrusive/easiest way to do this?

Thanks

@gaomy3832
Contributor

The easiest way is to change your code. Normally I would imagine the main thread spawns a bunch of worker threads and then waits idle until all workers finish. You can reorganize your code to move the main thread's work before or after the parallel section; fast-forwarding that part then has no effect on the worker threads. If the work in the main thread has to happen in parallel with the workers (due to communication, synchronization, etc.), then you probably should not fast-forward it, since doing so would affect the performance of your region of interest.

@hlitz

hlitz commented Sep 27, 2016

Define a new magic op that, when called (by each thread individually), reads out the per-thread cycle count so it can be subtracted later.
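For context, a minimal sketch of what the application side of such a magic op could look like, following the xchg %rcx, %rcx convention used by ZSim's hooks in misc/hooks/zsim_hooks.h; the op codes and the described handler behavior are hypothetical additions, not existing ZSim ops:

```cpp
#include <stdint.h>

// Hypothetical op codes; the existing ops (ROI begin/end, etc.) are defined in
// misc/hooks/zsim_hooks.h and dispatched by the magic-op handler in zsim.cpp.
#define MAGIC_OP_SKIP_BEGIN 1100
#define MAGIC_OP_SKIP_END   1101

// ZSim recognizes "xchg %rcx, %rcx" with the op code in rcx as a magic op.
static inline void zsim_magic_op(uint64_t op) {
    __asm__ __volatile__("xchg %%rcx, %%rcx;" : : "c"(op));
}

// Usage in the application, per thread:
//   zsim_magic_op(MAGIC_OP_SKIP_BEGIN);  // handler records this thread's current cycle count
//   uninteresting_api_call();
//   zsim_magic_op(MAGIC_OP_SKIP_END);    // handler accumulates the delta for later subtraction
```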


@vijay4454
Author

vijay4454 commented Sep 28, 2016

I am developing a simulator for a new architecture with a new programming model. First, I don't want to impose too many constraints on how the programs should be written. Second, there is a specific reason why I want to ignore cycles spent inside specific functions: cycles spent inside the current implementation of these functions have no real-world deployment significance.

@gaomy3832
Contributor

As Heiner suggested above, you can define a new magic op. The current fast-forward is deferred to the end of the phase so it can be synced. Take a look at the logic in Join() and TakeBarrier() in zsim.cpp to see how to do an immediate join and leave for a thread (set cids, call sched->join()/leave(), set fPtrs, etc.).
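A rough sketch of that immediate leave/join, assuming the bookkeeping in zsim.cpp (the cids and fPtrs arrays, the nopPtrs/joinPtrs pointer sets) and the Scheduler leave()/join() signatures; check the exact names against your tree, since this is only meant to show the shape of the change:

```cpp
// Hypothetical per-thread fast-forward toggles, modeled on Join()/TakeBarrier() in zsim.cpp.
void EnterPerThreadFF(uint32_t tid) {
    zinfo->sched->leave(procIdx, tid, cids[tid]);  // free the core immediately
    cids[tid] = UNINITIALIZED_CID;                 // or whatever "not scheduled" sentinel zsim.cpp uses
    fPtrs[tid] = nopPtrs;                          // stop feeding loads/stores/bbls for this thread
}

void ExitPerThreadFF(uint32_t tid) {
    cids[tid] = zinfo->sched->join(procIdx, tid);  // get a core back
    fPtrs[tid] = joinPtrs;                         // resume normal instrumentation, as in Join()
}
```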

@vijay4454
Author

Thanks hlitz and gaomy. I implemented this by counting the total number of cycles spent inside the API calls for each thread, and then writing out that count as a per-core statistic alongside the regular cycle count. It required changes to simple_core.cpp, ooo_core.cpp, etc., in addition to zsim.cpp and pin_cmd.cpp. I did not add a new magic call. The simulator seems to be working properly with this change.

It was a somewhat intrusive change to the simulator, but I guess that is OK as long as it works without slowing down simulation.
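For anyone replicating this, a purely illustrative sketch of the bookkeeping; getCoreCycles() and the array bounds are hypothetical stand-ins rather than ZSim APIs, and the real change touched simple_core.cpp, ooo_core.cpp, zsim.cpp, and pin_cmd.cpp:

```cpp
#include <cstdint>

static const uint32_t MAX_THREADS = 1024;  // illustrative bounds
static const uint32_t MAX_CORES   = 1024;

// getCoreCycles(cid) stands in for however the core's current cycle count is read.
extern uint64_t getCoreCycles(uint32_t cid);

static uint64_t apiEntryCycles[MAX_THREADS];  // cycle count at API-call entry, per thread
static uint64_t skippedCycles[MAX_CORES];     // total cycles inside API calls, per core

void OnApiEntry(uint32_t tid, uint32_t cid) {
    apiEntryCycles[tid] = getCoreCycles(cid);
}

void OnApiExit(uint32_t tid, uint32_t cid) {
    // Accumulated here and dumped next to the regular per-core cycle stat,
    // so it can be subtracted in post-processing.
    skippedCycles[cid] += getCoreCycles(cid) - apiEntryCycles[tid];
}
```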

@benpatsai

If you do that, remember that the whole memory hierarchy still sees what happens during your magic functions. If those functions are very short and/or not memory intensive, then I think it's fine.

If that's not the case, I would suggest you leverage the NullCore, a perfect IPC=1 core, to better model what you want. Basically, add one NullCore to your system. When a thread encounters a magic function, schedule it onto the NullCore; once it finishes those functions, schedule it back onto the OOO core.
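For reference, a core setup along these lines might look roughly like the following config fragment (a sketch only: the group names and core counts are illustrative, and the "Null" core type should be checked against init.cpp in your ZSim version):

```
sys = {
    cores = {
        ooo    = { type = "OOO";    cores = 1;  };  // main thread's normal core
        null   = { type = "Null";   cores = 1;  };  // IPC=1 core used while inside the magic functions
        simple = { type = "Simple"; cores = 14; };  // in-order cores for the worker threads
    };
};
```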

@vijay4454
Author

vijay4454 commented Sep 28, 2016

@benpatsai. Thanks, that's a very good point; I had overlooked it.

I had tried something similar to your suggestion: I fast-forwarded the thread (the main thread in the code) on entering each magic function and exited fast-forward on leaving it. The problem I faced was that a different thread then gets scheduled on the core on which thread 0 was initially running. I want thread 0 to ALWAYS run on the first configured core (which I configure as OoO) and the rest of the threads to run on the other cores, which are configured as SimpleCores. This happens because thread 1 is created (using pthread_create) after thread 0 encounters the magic call and goes into fast-forward, leaving core 0 free for thread 1 to take.

Won't I face the same issue if I follow your NullCore suggestion?

@benpatsai

So this boils down to how to move a particular thread from one core to another. One example implementation could be (a rough sketch of step 4 follows below):

  1. Implement a magic op to distinguish the main thread from the other threads (like register thread).
  2. Give the process multiple core masks: one for the OOO core + the NullCore, and one for the in-order cores.
  3. For non-main threads, use the in-order core mask; for the main thread, use the other mask. You can achieve this by setting the mask vector in the ThreadInfo for a thread.
  4. When running into the magic functions, schedule the main thread between those two cores within its mask.
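A very rough sketch of what step 4 could look like as a magic-op handler; setThreadMask() and the mask constants are hypothetical glue around the existing Scheduler leave()/join() calls, not code that exists in ZSim:

```cpp
// Hypothetical handler: move the registered main thread between the OOO core and
// the NullCore, both of which are in its per-thread mask (ThreadInfo.mask).
void SwitchMainThreadCore(uint32_t tid, bool enteringMagicFunction) {
    zinfo->sched->leave(procIdx, tid, cids[tid]);   // leave the core it currently holds
    // Restrict the per-thread mask to the target core (hypothetical helper that
    // rewrites ThreadInfo.mask for this thread).
    setThreadMask(tid, enteringMagicFunction ? NULL_CORE_ONLY : OOO_CORE_ONLY);
    cids[tid] = zinfo->sched->join(procIdx, tid);   // re-join so it lands on the other core
}
```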

@vijay4454
Author

@benpatsai. Thanks a lot for your suggestion. However, I think the approach of just measuring and subtracting the cycle count in certain functions should work fine for my use case. The functions' code should not alter caching effects by much.

@vijay4454
Author

vijay4454 commented Oct 21, 2016

@benpatsai. I am trying to implement your suggestion of having two core masks and scheduling the main thread between the two cores in its mask. I am not quite sure how to implement point 4 in your answer above. Can you direct me to the relevant code in the simulator that would help me figure it out?

@benpatsai

@vijay4454, you can look at process_tree.{cpp,h} to see how the mask is parsed from the config file. By tracing down ProcessTreeNode.mask, you should be able to learn how and when the scheduler uses it to schedule threads onto a set of cores.

@gaomy3832
Contributor

gaomy3832 commented Nov 4, 2016

@vijay4454, you may want to take a look at my pending pull request #114, which implements this kind of affinity scheduling. I did it through the standard sched_getaffinity/sched_setaffinity syscalls. You can reuse the internal logic with whatever interface you want to use.
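With that pull request applied, the placement can be expressed from the application with the standard Linux affinity API; a small example (core indices are illustrative):

```cpp
#include <sched.h>
#include <cstdio>

// Pin the calling thread to one simulated core. With affinity scheduling in ZSim,
// the sched_setaffinity syscall is intercepted and the scheduler restricts the
// thread to that core.
static void pinToCore(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {  // pid 0 == calling thread
        perror("sched_setaffinity");
    }
}

int main() {
    pinToCore(0);  // keep the main thread on the first (OoO) core
    // ... spawn worker threads, each pinning itself to one of the in-order cores ...
    return 0;
}
```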

@vijay4454
Author

Thanks benpatsai & gaomy3832! I have been able to implement this and get it working.
