Server freeze - channel timeout

BesoC
BesoC ✭✭
edited April 2019 in Photon Server
Situation:
We have a standalone, on-premise Photon server serving PvP games. Each game is played by 2 peers and has its own PoolFiber. Each action from a peer is directed to the corresponding game using a RequestReplyChannel.

Under a load of some 700 simultaneous users and 350 games we experience a 'freeze' situation: a request received from a peer goes into the channel and then just times out.

The strange thing is that we log requests on the game side, in the channel subscriber function. That log is empty, as if the subscriber was not called at all. How is that possible?

The server has 24 cores and 16 GB RAM.

Thanks


Answers

  • @BesoC unfortunately, I cannot say much about the issue. But I have a question: why do you use a channel when it is cheaper to just enqueue the message or action to the room's fiber?

    best,
    ilya
  • BesoC
    BesoC ✭✭
    @Ilya The peer enqueues a message to the match's fiber using a channel because the message has to be executed in the match's fiber, and the peer has to receive the result of the execution.
  • It looks like you are using channels in Go style, or at least as a one-to-one communication channel. If so, it is easier to use the fibers directly, so that you understand better what is going on. But in general, I would assume that for some reason the fiber subscribed to the channel is somehow blocked. Probably you are locking there.

    best,
    ilya
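
    As an illustration of "using fibers directly": a minimal sketch of enqueueing work onto a game's own PoolFiber, assuming the ExitGames.Concurrency.Fibers API (the Game class and HandlePlayerAction method are made-up placeholders):

    using ExitGames.Concurrency.Fibers;

    public class Game
    {
        // The game's own fiber: game state is only touched from actions
        // enqueued here, so no extra locking is needed inside the game.
        private readonly PoolFiber fiber = new PoolFiber();

        public Game()
        {
            this.fiber.Start();
        }

        // Called from the peer side; returns immediately instead of waiting for a reply.
        public void EnqueuePlayerAction(int playerId, byte actionId, object parameter)
        {
            this.fiber.Enqueue(() => this.HandlePlayerAction(playerId, actionId, parameter));
        }

        private void HandlePlayerAction(int playerId, byte actionId, object parameter)
        {
            // Runs on the game fiber.
        }
    }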
  • BesoC
    BesoC ✭✭
    @Ilya Thanks for the answers. They are quite interesting to me and raise 2 questions:
    1. Can you please show a way for a class to send a message to another class' fiber and wait for the result, other than using RequestReplyChannel?
    2. How can a channel be blocked? Are all the threads busy?
  • I'm not sure what your RequestReplyChannel is.

    best,
    ilya
  • BesoC
    BesoC ✭✭
    @Ilya Actually it's not mine :)
    ExitGames.Concurrency.Channels.RequestReplyChannel Class
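
    For context, this is roughly how such a channel is wired up, sketched against the Retlang-style API that ExitGames.Concurrency is derived from (exact member names may differ in the Photon SDK):

    using ExitGames.Concurrency.Channels;
    using ExitGames.Concurrency.Fibers;

    var channel = new RequestReplyChannel<string, string>();

    // Game side: subscribe on the game's fiber and answer each request there.
    var gameFiber = new PoolFiber();
    gameFiber.Start();
    channel.Subscribe(gameFiber, request => request.SendReply("handled: " + request.Request));

    // Peer side: SendRequest returns a reply object; Receive blocks the calling
    // thread until a reply arrives or the timeout expires.
    var reply = channel.SendRequest("attack");
    string result;
    if (!reply.Receive(5000, out result))
    {
        // This is the timeout path described in the opening post.
    }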
  • @BesoC I think I know what is wrong with your code. We had this issue with our HTTP code. When you block a thread, it may happen that the task which would release this thread is enqueued by the scheduler to exactly this thread, and you get a deadlock.

    Just do not block the fiber in your peer. Use states. Also, when you enqueue the action to the fiber, you can pass along everything you need to send the answer back - a reference to the peer or whatever you need, right?

    best,
    ilya
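
    A sketch of the non-blocking variant suggested here, assuming gamePeer is a Photon.SocketServer peer with a PlayerId property and that the game owns a PoolFiber (PlayerActionOperationCode and HandlePlayerAction are hypothetical names):

    // Peer side: enqueue the action together with everything needed to answer,
    // then return immediately - the peer's RequestFiber is never blocked.
    public void EnqueuePlayerAction(GamePeer gamePeer, byte actionId, object parameter, SendParameters sendParameters)
    {
        this.fiber.Enqueue(() =>
        {
            // Runs on the game fiber.
            short returnCode = this.HandlePlayerAction(gamePeer.PlayerId, actionId, parameter);

            var response = new OperationResponse(PlayerActionOperationCode) { ReturnCode = returnCode };
            gamePeer.SendOperationResponse(response, sendParameters);
        });
    }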
  • BesoC
    BesoC ✭✭
    edited April 2019
    @Ilya OMG, I hope this is not the cause - we have a whole bunch of tested and working code relying on channels :( .

    Please take a look at this code:
    var reply = gamePeer.Game.PlayerActionChannel.SendRequest((gamePeer.PlayerId, playerActionRequest.ActionId, playerActionRequest.ActionParameter));
    Here, gamePeer is a wss ClientPeer. Game is a reference to the game object, which has its own PoolFiber. It does not matter whether the peer is a wss, tcp or udp client. Is the 'http code' issue you mentioned above related to this situation?

    The situation happens only under heavy load - around 700 simultaneous client connections and 300 parallel running games. Rewriting the code that uses channels is possible as a last resort, but will it guarantee that we avoid this kind of deadlock?

    Lastly - is there a patch available addressing this issue, or are you planning an update to Photon Server?
  • @BesoC the issue I've described happens only for plugins using the HttpRequest call in sync mode. In that case, we block the executing thread with the help of a mutex/event, and it stays blocked until the HTTP queue returns a result. The HTTP queue returns the result by enqueueing an action to the thread pool. This action can end up in the action queue of the very thread that is blocked right now, so that thread can never be unblocked.
    The more load you have, the more often it happens.

    In your case, if you call reply.Receive, then you will have that issue. But as I said before, blocking is a really bad idea. You should rethink this part and use an async approach. So rewriting is not a last resort, it is the only way to go for you.

    Sorry for the bad news.

    best,
    ilya
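
    The failure mode described here can be reduced to a self-contained sketch (this is not Photon code, just the general pattern: a single worker blocks waiting for a reply whose delivery is queued behind it on that same worker):

    using System;
    using System.Collections.Concurrent;
    using System.Threading;

    class SingleThreadScheduler
    {
        private readonly BlockingCollection<Action> queue = new BlockingCollection<Action>();

        public SingleThreadScheduler()
        {
            new Thread(() => { foreach (var a in this.queue.GetConsumingEnumerable()) a(); }) { IsBackground = true }.Start();
        }

        public void Enqueue(Action action)
        {
            this.queue.Add(action);
        }
    }

    class Program
    {
        static void Main()
        {
            var scheduler = new SingleThreadScheduler();
            var replyArrived = new ManualResetEventSlim();

            scheduler.Enqueue(() =>
            {
                // The "reply" is enqueued to the same worker that is about to block on it.
                scheduler.Enqueue(() => replyArrived.Set());

                // Blocks the only worker thread, so the reply can never be delivered
                // before the wait gives up - observed from outside as a timeout/freeze.
                bool ok = replyArrived.Wait(TimeSpan.FromSeconds(5));
                Console.WriteLine(ok ? "got reply" : "timed out while holding the worker");
            });

            Console.ReadLine();
        }
    }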

  • BesoC
    BesoC ✭✭
    edited April 2019
    @chvetsov As I understand it, this issue happens only if a plugin is used. Is the websocket Photon.SocketServer.ClientPeer class affected by it?
    chvetsov said:

    In your case, if you call reply.Receive, then you will have that issue.

    I don't quite understand this part. Can you please explain it?

  • Yes, the issue happens only in plugin-related code. Peers are not affected.

    The reply.Receive method blocks if there is no message yet.

    best,
    ilya
  • BesoC
    BesoC ✭✭
    edited April 2019
    @chvetsov Hmm, then that's not the reason in our case - we're using a websocket peer.

    Yes, we are blocking a peer thread until the game processes its message and returns a response. Can this be a cause of the 'freezes'?

    The whole server 'freezes' - new clients cannot connect, existing clients just 'hang', messages are not delivered through the channels to the games... Just as if someone put the server on 'pause'. Then, after 2-3 minutes, the server simply 'resumes' working.

    Do you think that blocking the peers' channels can cause a disaster of this scale?

    Highly unlikely, but maybe a config file causes trouble?
    <IOPool NumThreads="6"/>
    <ThreadPool InitialThreads="12" MinThreads="12" MaxThreads="12"/>
    <ENetThreadPool InitialThreads="6" MinThreads="6" MaxThreads="6"/>
  • Hmm, if you have this config, then look into your code. There is nothing wrong with the config.

    Check the different locks you have, to see whether you have lock/wait timeouts close to the freeze time. That is all I can say.

    best,
    ilya
  • @BesoC, please keep us informed about the results of your investigation

    best,
    ilya
  • BesoC
    BesoC ✭✭
    edited April 2019
    @chvetsov Can you please advise what exactly is blocked in reply.Receive - just the concrete peer's execution, or the whole thread serving that peer?


    chvetsov said:

    @BesoC, please, keep us informed about the results of your investigations

    We are rewriting our code according to your suggestions. I'll certainly keep you informed
  • @BesoC if reply.Receive blocks the thread, then it blocks one of the thread pool threads, and it also blocks the peer's RequestFiber. That means operations will not be handled by this peer; they will be kept in the queue of the Peer.RequestFiber.

    best,
    ilya.
  • But anyway, you have to check your old code. It may have some other bug unrelated to this issue, and you may face it again later. Add logs, add counters. You have to find out what the reason for the entire server freeze is.

    best,
    ilya
  • BesoC
    BesoC ✭✭
    edited April 2019
    @chvetsov
    chvetsov said:

    if reply.Receive blocks the thread, then it blocks one of the thread pool threads, and it also blocks the peer's RequestFiber. That means operations will not be handled by this peer; they will be kept in the queue of the Peer.RequestFiber.

    While blocking the peer's RequestFiber is totally OK and is the desired outcome, blocking the whole thread is not acceptable for us.



    chvetsov said:

    But anyway, you have to check your old code. It may have some other bug unrelated to this issue, and you may face it again later. Add logs, add counters. You have to find out what the reason for the entire server freeze is.

    Photon's counter set has a lot of counters. Can you please advise which counters might be most useful in our situation?

  • >While blocking the peer's RequestFiber is totally OK and is the desired outcome, blocking the whole thread is not acceptable for us.
    You should know for how long the thread is blocked. If it is not blocked for longer than 1 ms, then I would say this is OK.

    >Photon's counter set has a lot of counters. Can you please advise which counters might be most useful in our situation?

    You need to add your own counters

    best,
    ilya
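
    One way to get such numbers, assuming the reply.Receive(timeoutMs, out result) call from the code above (the reply type, the 1 ms threshold and the logger are placeholders): wrap the blocking call in a Stopwatch and log or count the slow cases.

    var stopwatch = System.Diagnostics.Stopwatch.StartNew();
    PlayerActionReply result;
    bool received = reply.Receive(5000, out result);
    stopwatch.Stop();

    if (!received || stopwatch.ElapsedMilliseconds > 1)
    {
        log.WarnFormat("PlayerAction reply blocked for {0} ms (received={1})",
            stopwatch.ElapsedMilliseconds, received);
    }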
  • BesoC
    BesoC ✭✭
    @chvetsov I was under the impression that only the RequestFiber is blocked, while the thread itself can continue and execute some other peer's RequestFiber. Is that correct?
  • No, it is not correct. A fiber will not continue executing tasks if one of its actions waits on some sync object. If that were not so, we/you/all would have concurrency issues.

    best,
    ilya
  • BesoC
    BesoC ✭✭
    @chvetsov I agree, the Fiber will stop and remain blocked. I was asking about the underlying thread that runs that fiber. In my understanding, while the Fiber remains blocked, this thread can switch to executing some other peer's Fiber. When the first peer's fiber unblocks, that thread, or some other one, will continue the Fiber's execution.
  • BesoC
    BesoC ✭✭
    edited April 2019
    @chvetsov
    chvetsov said:

    the fiber's task/action is the unit of execution. A thread cannot switch to another fiber during a fiber's action execution, only after the current batch of tasks is exhausted

    Thanks for clearing that up.



    Is the Peer's OnDisconnect() executed inside the Peer's Fiber?
  • BesoC
    BesoC ✭✭
    edited April 2019
    @chvetsov
    We have rewritten our code so that we are not using channels anymore, and we use async/await wherever possible, including database access. We are using Npgsql to work with the database. Now, under load, the server froze; we restarted the server, and in the error log we have the following:

    10324: 13:38:38.433 - Service is stopped...
    10324: 13:38:38.433 - Server shutting down...
    10324: 13:38:38.433 - Shutdown monitoring enabled, 30000ms before process abort
    10324: 13:38:38.433 - Notifying CLR applications of shutdown...
    10324: 13:38:38.433 - RequestApplicationStop: 2
    10324: 13:38:38.433 - RequestStop: defalt app domain
    10324: 13:38:38.449 - Shutting down WebSocket Outbound Connection Manager...
    10324: 13:38:38.449 - Shutting down TCP Outbound Connection Manager...
    10324: 13:38:38.449 - Shutting down ENet Outbound Connection Manager...
    10324: 13:38:38.449 - Disconnecting all outbound peers...
    10324: 13:38:38.449 - Shutting down ENet host...
    10324: 13:38:38.449 - Disconnecting all peers...
    10324: 13:38:38.449 - Shutting down socket servers...
    10324: 13:38:38.465 - Shutting down ENet thread pool...
    10324: 13:38:38.465 - Shutting down TCP inactivity timers...
    10324: 13:38:38.465 - WaitingForShutdownToComplete: WebSocket:0.0.0.0:12030
    10324: 13:38:38.465 - OnShutdownCompleted: WebSocket:0.0.0.0:12030
    10324: 13:38:38.465 - WaitingForShutdownToComplete: WebSocket:0.0.0.0:12031
    10324: 13:38:38.465 - OnShutdownCompleted: WebSocket:0.0.0.0:12031
    10324: 13:38:38.465 - Shutting down CLR applications...
    10324: 13:38:38.480 - CManagedHost::OnStackOverflow()
    2916: 13:38:38.480 - CManagedHost::OnDefaultAction() - OPR_ThreadRudeAbortInNonCriticalRegion - eRudeAbortThread
    2916: 13:38:38.480 - CManagedHost::OnDefaultAction() - OPR_AppDomainRudeUnload - eRudeUnloadAppDomain
    10324: 13:38:38.480 - Shutting down business logic thread pool...
    10324: 13:38:38.480 - Shutting down I/O thread pool...
    10324: 13:38:38.480 - Destroying servers
    10324: 13:38:38.496 - Destroying WebSocket Outbound Connection Manager
    10324: 13:38:38.496 - Destroying TCP Outbound Connection Manager
    10324: 13:38:38.496 - Destroying TCP inactivity timer
    10324: 13:38:38.496 - Destroying ENet timer wheel
    10324: 13:38:38.496 - Destroying ENet host
    10324: 13:38:38.496 - Destroying ENet outbound host
    10324: 13:38:38.496 - Destroying HTTP inactivity timer
    10324: 13:38:38.496 - Destroying HTTP host
    10324: 13:38:38.496 - Destroying HTTP Chunk host
    10324: 13:38:38.496 - Flushing ENET buffer allocator...
    10324: 13:38:38.496 - Flushing TCP buffer handle allocator...
    10324: 13:38:38.496 - Flushing stream socket allocator...
    6996: 13:38:38.496 - Unloading Domain: ApplicationName = 'DominoGameServer', DomainId='2'
    10324: 13:38:38.496 - Flushing inbound datagram socket allocator...
    10324: 13:38:38.496 - Flushing outbound datagram socket allocator...
    10324: 13:38:38.496 - Destroying CLR dispatcher
    10324: 13:38:38.496 - Destroying CLR applications
    10324: 13:38:38.496 - ~CCLRApplicationCollection() - Waiting up to 10 seconds for the CLR to release all of our objects
    13524: 13:38:39.060 - CManagedHost::OnStackOverflow()
    13524: 13:38:39.075 - CManagedHost::OnStackOverflow()
    13524: 13:38:39.107 - CManagedHost::OnStackOverflow()
    13524: 13:38:39.122 - CManagedHost::OnStackOverflow()
    13524: 13:38:39.138 - CManagedHost::OnStackOverflow()
    13888: 13:38:40.062 - CManagedHost::OnFailure() - FAIL_OrphanedLock - eNoAction
    14316: 13:38:40.078 - CManagedHost::OnFailure() - FAIL_OrphanedLock - eNoAction
    14232: 13:38:40.078 - CManagedHost::OnFailure() - FAIL_OrphanedLock - eNoAction
    5912: 13:38:40.078 - CManagedHost::OnFailure() - FAIL_OrphanedLock - eNoAction
    13744: 13:38:40.078 - CManagedHost::OnFailure() - FAIL_OrphanedLock - eNoAction
    14172: 13:38:40.078 - CManagedHost::OnFailure() - FAIL_OrphanedLock - eNoAction
    4636: 13:38:40.078 - CManagedHost::OnFailure() - FAIL_OrphanedLock - eNoAction
    12872: 13:38:40.078 - CManagedHost::OnFailure() - FAIL_OrphanedLock - eNoAction
    1144: 13:38:40.078 - CManagedHost::OnFailure() - FAIL_OrphanedLock - eNoAction
    14068: 13:38:40.078 - CManagedHost::OnFailure() - FAIL_OrphanedLock - eNoAction
    13816: 13:38:40.078 - CManagedHost::OnFailure() - FAIL_OrphanedLock - eNoAction
    6688: 13:38:40.078 - CManagedHost::OnFailure() - FAIL_OrphanedLock - eNoAction
    There are a lot of CManagedHost::OnFailure() - FAIL_OrphanedLock - eNoAction records.

    Also a lot of

    ERROR PhotonHostRuntime.PhotonDomainManager - UnhandledException: System.AppDomainUnloadedException: Attempted to access an unloaded AppDomain.
    in PhotonCLR.log

    Can it be that there are too many threads for Photon or .NET to handle?

    P.S. We have changed Photon config as follows (we have 24 cores):

    <IOPool NumThreads="8"/>
    <ThreadPool InitialThreads="16" MinThreads="16" MaxThreads="16"/>
    <!-- <ENetThreadPool InitialThreads="6" MinThreads="6" MaxThreads="6"/> -->
    After starting and writing 'server is running' to the log, the Photon server took a considerable time (about 1 minute) before it actually started serving peers.
  • BesoC
    BesoC ✭✭
    edited April 2019
    @chvetsov
    I found an old topic in this forum about CManagedHost::OnFailure() - FAIL_OrphanedLock - eNoAction. It looks like our async calls have created too many additional threads for the Photon server to manage. Perhaps the async/await and Task.Run(async () => await myAsync()).Wait(); patterns should not be used at all inside a PoolFiber?
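
    A sketch of an alternative to Task.Run(...).Wait() inside a fiber: start the async call without blocking and hop back onto the fiber when it completes (SavePlayerActionAsync, ApplyResult, PlayerAction and the log field are hypothetical names):

    private void SavePlayerAction(PlayerAction action)
    {
        // Kick off the async database call; the fiber thread is not blocked.
        this.SavePlayerActionAsync(action).ContinueWith(task =>
        {
            // Re-enqueue the continuation onto the game fiber so game state
            // is still only touched from one logical thread at a time.
            this.fiber.Enqueue(() =>
            {
                if (task.IsFaulted)
                {
                    log.ErrorFormat("SavePlayerActionAsync failed: {0}", task.Exception);
                    return;
                }

                this.ApplyResult(task.Result);
            });
        });
    }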
  • BesoC
    BesoC ✭✭
    edited April 2019
    @chvetsov I have found a rather funny record in the system EventLog :)

    The Photon Socket Server: DominoGameServer service terminated with the following service-specific error: Insert the diskette for drive Photon Socket Server: DominoGameServer.
    I believe you have that diskette, right? :)
  • @BesoC you have missed the most important line in your log.

    10324: 13:38:38.480 - CManagedHost::OnStackOverflow()

    You have a stack overflow exception. Please follow the instructions here: https://doc.photonengine.com/en-us/server/current/app-framework/stackoverflowexception

    best,
    ilya