Server freeze - channel timeout

BesoC
BesoC ✭✭
edited April 2019 in Photon Server
Situation:
We have a standalone, on-premise Photon server serving PvP games. Each game is played by 2 peers and has its own PoolFiber. Each action from a peer is directed to the corresponding game using a RequestReplyChannel.

Under a load of some 700 simultaneous users and 350 games we experience a 'freeze' situation: a request received from a peer goes into the channel and then just times out.

The strange thing is that we log requests on the game side, in the channel subscriber function. That log is empty, as if the subscriber was not called at all. How is that possible?

The server has 24 cores and 16 GB RAM.

Thanks


Answers

  • @BesoC unfortunately, I cannot say much about the issue. But I have a question: why do you use a channel when it is cheaper to just enqueue the message or action to the room's fiber?

    best,
    ilya
  • BesoC
    BesoC ✭✭
    @Ilya The peer enqueues a message to the match's fiber using a channel because the message has to be executed in the match's fiber, and the peer has to receive the result of the execution.
  • It looks like you are using channels in Go style, or at least as a one-to-one communication channel. If so, it is easier to use the fibers directly, so that you understand better what is going on. But in general, I would assume that for some reason the fiber subscribed to the channel is somehow blocked. Probably you are locking there.

    best,
    ilya
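
    As an illustration of "using fibers directly": a minimal sketch of enqueueing work onto a game's own PoolFiber, assuming the ExitGames.Concurrency.Fibers API (the Game class and HandlePlayerAction method are made-up placeholders):

    using ExitGames.Concurrency.Fibers;

    public class Game
    {
        // The game's own fiber: game state is only touched from actions
        // enqueued here, so no extra locking is needed inside the game.
        private readonly PoolFiber fiber = new PoolFiber();

        public Game()
        {
            this.fiber.Start();
        }

        // Called from the peer side; returns immediately instead of waiting for a reply.
        public void EnqueuePlayerAction(int playerId, byte actionId, object parameter)
        {
            this.fiber.Enqueue(() => this.HandlePlayerAction(playerId, actionId, parameter));
        }

        private void HandlePlayerAction(int playerId, byte actionId, object parameter)
        {
            // Runs on the game fiber.
        }
    }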
  • BesoC
    BesoC ✭✭
    @Ilya Thanks for the answers. They are quite interesting to me and raise 2 questions:
    1. Can you please show a way for a class to send a message to another class' fiber and wait for the result, other than using RequestReplyChannel?
    2. How can a channel be blocked? Are all the threads busy?
  • I'm not sure what your RequestReplyChannel is.

    best,
    ilya
  • BesoC
    BesoC ✭✭
    @Ilya Actually it's not mine :)
    ExitGames.Concurrency.Channels.RequestReplyChannel Class
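
    For context, this is roughly how such a channel is wired up, sketched against the Retlang-style API that ExitGames.Concurrency is derived from (exact member names may differ in the Photon SDK):

    using ExitGames.Concurrency.Channels;
    using ExitGames.Concurrency.Fibers;

    var channel = new RequestReplyChannel<string, string>();

    // Game side: subscribe on the game's fiber and answer each request there.
    var gameFiber = new PoolFiber();
    gameFiber.Start();
    channel.Subscribe(gameFiber, request => request.SendReply("handled: " + request.Request));

    // Peer side: SendRequest returns a reply object; Receive blocks the calling
    // thread until a reply arrives or the timeout expires.
    var reply = channel.SendRequest("attack");
    string result;
    if (!reply.Receive(5000, out result))
    {
        // This is the timeout path described in the opening post.
    }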
  • @BesoC I think I know what is wrong with your code. We had this issue with our HTTP code. When you block a thread, it may happen that the task which would release this thread is enqueued by the scheduler to exactly this thread, and you get a deadlock.

    Just do not block the fiber in your peer. Use states. Also, when you enqueue the action to the fiber, you can pass along everything you need to send the answer back - a reference to the peer or whatever you need, right?

    best,
    ilya
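
    A sketch of the non-blocking variant suggested here, assuming gamePeer is a Photon.SocketServer peer with a PlayerId property and that the game owns a PoolFiber (PlayerActionOperationCode and HandlePlayerAction are hypothetical names):

    // Peer side: enqueue the action together with everything needed to answer,
    // then return immediately - the peer's RequestFiber is never blocked.
    public void EnqueuePlayerAction(GamePeer gamePeer, byte actionId, object parameter, SendParameters sendParameters)
    {
        this.fiber.Enqueue(() =>
        {
            // Runs on the game fiber.
            short returnCode = this.HandlePlayerAction(gamePeer.PlayerId, actionId, parameter);

            var response = new OperationResponse(PlayerActionOperationCode) { ReturnCode = returnCode };
            gamePeer.SendOperationResponse(response, sendParameters);
        });
    }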
  • BesoC
    BesoC ✭✭
    edited April 2019
    @Ilya OMG, I hope this is not the cause - we have a whole bunch of tested and working code relying on channels :( .

    Please take a look at this code:
    var reply = gamePeer.Game.PlayerActionChannel.SendRequest((gamePeer.PlayerId, playerActionRequest.ActionId, playerActionRequest.ActionParameter));
    Here, gamePeer is a wss ClientPeer. Game is a reference to the game object, which has its own PoolFiber. It does not matter whether the peer is a wss, tcp or udp client. Is the 'http code' issue you mentioned above related to this situation?

    The situation happens only under heavy load - around 700 simultaneous client connections and 300 parallel running games. Rewriting the code that uses channels is possible as a last resort, but will it guarantee that we avoid this kind of deadlock?

    Lastly - is there a patch available addressing this issue, or are you planning an update to Photon Server?
  • @BesoC the issue I've described happens only for plugins using the HttpRequest call in sync mode. In that case, we block the executing thread with the help of a mutex/event, and it stays blocked until the HTTP queue returns a result. The HTTP queue returns the result by enqueueing an action to the thread pool. This action can end up in the action queue of the very thread that is blocked right now, so that thread can never be unblocked.
    The more load you have, the more often it happens.

    In your case, if you call reply.Receive, then you will have that issue. But as I said before, blocking is a really bad idea. You should rethink this part and use an async approach. So rewriting is not a last resort, it is the only way to go for you.

    Sorry for the bad news.

    best,
    ilya
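
    The failure mode described here can be reduced to a self-contained sketch (this is not Photon code, just the general pattern: a single worker blocks waiting for a reply whose delivery is queued behind it on that same worker):

    using System;
    using System.Collections.Concurrent;
    using System.Threading;

    class SingleThreadScheduler
    {
        private readonly BlockingCollection<Action> queue = new BlockingCollection<Action>();

        public SingleThreadScheduler()
        {
            new Thread(() => { foreach (var a in this.queue.GetConsumingEnumerable()) a(); }) { IsBackground = true }.Start();
        }

        public void Enqueue(Action action)
        {
            this.queue.Add(action);
        }
    }

    class Program
    {
        static void Main()
        {
            var scheduler = new SingleThreadScheduler();
            var replyArrived = new ManualResetEventSlim();

            scheduler.Enqueue(() =>
            {
                // The "reply" is enqueued to the same worker that is about to block on it.
                scheduler.Enqueue(() => replyArrived.Set());

                // Blocks the only worker thread, so the reply can never be delivered
                // before the wait gives up - observed from outside as a timeout/freeze.
                bool ok = replyArrived.Wait(TimeSpan.FromSeconds(5));
                Console.WriteLine(ok ? "got reply" : "timed out while holding the worker");
            });

            Console.ReadLine();
        }
    }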

  • BesoC
    BesoC ✭✭
    edited April 2019
    @chvetsov As I understand it, this issue happens only if a plugin is used. Is the websocket Photon.SocketServer.ClientPeer class affected by it?
    chvetsov said:

    In your case, if you call reply.Receive, then you will have that issue.

    I don't quite understand this part. Can you please explain it?

  • Yes, the issue happens only in plugin-related code. Peers are not affected.

    The reply.Receive method blocks if there is no message yet.

    best,
    ilya
  • BesoC
    BesoC ✭✭
    edited April 2019
    @chvetsov Hmm, then that's not the reason in our case - we're using a websocket peer.

    Yes, we are blocking a peer thread until the game processes its message and returns a response. Can this be a cause of the 'freezes'?

    The whole server 'freezes' - new clients cannot connect, existing clients just 'hang', messages are not delivered through the channels to the games... Just as if someone put the server on 'pause'. Then, after 2-3 minutes, the server simply 'resumes' working.

    Do you think that blocking the peers' channels can cause a disaster of this scale?

    Highly unlikely, but maybe a config file causes trouble?
    <IOPool NumThreads="6"/>
    <ThreadPool InitialThreads="12" MinThreads="12" MaxThreads="12"/>
    <ENetThreadPool InitialThreads="6" MinThreads="6" MaxThreads="6"/>
  • Hmm, if you have this config, then look into your code. There is nothing wrong with the config.

    Check the different locks you have, to see whether you have lock/wait timeouts close to the freeze time. That is all I can say.

    best,
    ilya
  • @BesoC, please keep us informed about the results of your investigation

    best,
    ilya
  • BesoC
    BesoC ✭✭
    edited April 2019
    @chvetsov Can you please advise what exactly is blocked in reply.Receive - just the concrete peer's execution, or the whole thread serving that peer?


    chvetsov said:

    @BesoC, please, keep us informed about the results of your investigations

    We are rewriting our code according to your suggestions. I'll certainly keep you informed
  • @BesoC if reply.Receive blocks the thread, then it blocks one of the thread pool threads, and it also blocks the peer's RequestFiber. That means operations will not be handled by this peer; they will be kept in the queue of the Peer.RequestFiber.

    best,
    ilya.
  • But anyway, you have to check your old code. It may have some other bug unrelated to this issue, and you may face it again later. Add logs, add counters. You have to find out what the reason for the entire server freeze is.

    best,
    ilya
  • BesoC
    BesoC ✭✭
    edited April 2019
    @chvetsov
    chvetsov said:

    if reply.Receive blocks the thread, then it blocks one of the thread pool threads, and it also blocks the peer's RequestFiber. That means operations will not be handled by this peer; they will be kept in the queue of the Peer.RequestFiber.

    While blocking the peer's RequestFiber is totally OK and is the desired outcome, blocking the whole thread is not acceptable for us.



    chvetsov said:

    But anyway, you have to check your old code. It may have some other bug unrelated to this issue, and you may face it again later. Add logs, add counters. You have to find out what the reason for the entire server freeze is.

    Photon's counter set has a lot of counters. Can you please advise which counters might be most useful in our situation?

  • >While blocking the peer's RequestFiber is totally OK and is the desired outcome, blocking the whole thread is not acceptable for us.
    You should know for how long the thread is blocked. If it is not blocked for longer than 1 ms, then I would say this is OK.

    >Photon's counter set has a lot of counters. Can you please advise which counters might be most useful in our situation?

    You need to add your own counters

    best,
    ilya
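
    One way to get such numbers, assuming the reply.Receive(timeoutMs, out result) call from the code above (the reply type, the 1 ms threshold and the logger are placeholders): wrap the blocking call in a Stopwatch and log or count the slow cases.

    var stopwatch = System.Diagnostics.Stopwatch.StartNew();
    PlayerActionReply result;
    bool received = reply.Receive(5000, out result);
    stopwatch.Stop();

    if (!received || stopwatch.ElapsedMilliseconds > 1)
    {
        log.WarnFormat("PlayerAction reply blocked for {0} ms (received={1})",
            stopwatch.ElapsedMilliseconds, received);
    }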
  • BesoC
    BesoC ✭✭
    @chvetsov I was under the impression that only the RequestFiber is blocked, while the thread itself can continue and execute some other peer's RequestFiber. Is that correct?
  • No, it is not correct. A fiber will not continue executing tasks if one of its actions waits on some sync object. If that were not so, we/you/all would have concurrency issues.

    best,
    ilya
  • BesoC
    BesoC ✭✭
    @chvetsov I agree, the Fiber will stop and remain blocked. I was asking about the underlying thread that runs that fiber. In my understanding, while the Fiber remains blocked, this thread can switch to executing some other peer's Fiber. When the first peer's fiber unblocks, that thread, or some other one, will continue the Fiber's execution.
  • BesoC
    BesoC ✭✭
    edited April 2019
    @chvetsov
    chvetsov said:

    the fiber's task/action is the unit of execution. A thread cannot switch to another fiber during a fiber's action execution, only after the current batch of tasks is exhausted

    Thanks for clearing that up.



    Is the Peer's OnDisconnect() executed inside the Peer's Fiber?
  • BesoC
    BesoC ✭✭
    edited April 2019
    @chvetsov
    We have rewritten our code so that we are not using channels anymore, and we use async/await wherever possible, including database access. We are using Npgsql to work with the database. Now, under load, the server froze; we restarted the server, and in the error log we have the following:

    10324: 13:38:38.433 - Service is stopped...
    10324: 13:38:38.433 - Server shutting down...
    10324: 13:38:38.433 - Shutdown monitoring enabled, 30000ms before process abort
    10324: 13:38:38.433 - Notifying CLR applications of shutdown...
    10324: 13:38:38.433 - RequestApplicationStop: 2
    10324: 13:38:38.433 - RequestStop: defalt app domain
    10324: 13:38:38.449 - Shutting down WebSocket Outbound Connection Manager...
    10324: 13:38:38.449 - Shutting down TCP Outbound Connection Manager...
    10324: 13:38:38.449 - Shutting down ENet Outbound Connection Manager...
    10324: 13:38:38.449 - Disconnecting all outbound peers...
    10324: 13:38:38.449 - Shutting down ENet host...
    10324: 13:38:38.449 - Disconnecting all peers...
    10324: 13:38:38.449 - Shutting down socket servers...
    10324: 13:38:38.465 - Shutting down ENet thread pool...
    10324: 13:38:38.465 - Shutting down TCP inactivity timers...
    10324: 13:38:38.465 - WaitingForShutdownToComplete: WebSocket:0.0.0.0:12030
    10324: 13:38:38.465 - OnShutdownCompleted: WebSocket:0.0.0.0:12030
    10324: 13:38:38.465 - WaitingForShutdownToComplete: WebSocket:0.0.0.0:12031
    10324: 13:38:38.465 - OnShutdownCompleted: WebSocket:0.0.0.0:12031
    10324: 13:38:38.465 - Shutting down CLR applications...
    10324: 13:38:38.480 - CManagedHost::OnStackOverflow()
    2916: 13:38:38.480 - CManagedHost::OnDefaultAction() - OPR_ThreadRudeAbortInNonCriticalRegion - eRudeAbortThread
    2916: 13:38:38.480 - CManagedHost::OnDefaultAction() - OPR_AppDomainRudeUnload - eRudeUnloadAppDomain
    10324: 13:38:38.480 - Shutting down business logic thread pool...
    10324: 13:38:38.480 - Shutting down I/O thread pool...
    10324: 13:38:38.480 - Destroying servers
    10324: 13:38:38.496 - Destroying WebSocket Outbound Connection Manager
    10324: 13:38:38.496 - Destroying TCP Outbound Connection Manager
    10324: 13:38:38.496 - Destroying TCP inactivity timer
    10324: 13:38:38.496 - Destroying ENet timer wheel
    10324: 13:38:38.496 - Destroying ENet host
    10324: 13:38:38.496 - Destroying ENet outbound host
    10324: 13:38:38.496 - Destroying HTTP inactivity timer
    10324: 13:38:38.496 - Destroying HTTP host
    10324: 13:38:38.496 - Destroying HTTP Chunk host
    10324: 13:38:38.496 - Flushing ENET buffer allocator...
    10324: 13:38:38.496 - Flushing TCP buffer handle allocator...
    10324: 13:38:38.496 - Flushing stream socket allocator...
    6996: 13:38:38.496 - Unloading Domain: ApplicationName = 'DominoGameServer', DomainId='2'
    10324: 13:38:38.496 - Flushing inbound datagram socket allocator...
    10324: 13:38:38.496 - Flushing outbound datagram socket allocator...
    10324: 13:38:38.496 - Destroying CLR dispatcher
    10324: 13:38:38.496 - Destroying CLR applications
    10324: 13:38:38.496 - ~CCLRApplicationCollection() - Waiting up to 10 seconds for the CLR to release all of our objects
    13524: 13:38:39.060 - CManagedHost::OnStackOverflow()
    13524: 13:38:39.075 - CManagedHost::OnStackOverflow()
    13524: 13:38:39.107 - CManagedHost::OnStackOverflow()
    13524: 13:38:39.122 - CManagedHost::OnStackOverflow()
    13524: 13:38:39.138 - CManagedHost::OnStackOverflow()
    13888: 13:38:40.062 - CManagedHost::OnFailure() - FAIL_OrphanedLock - eNoAction
    14316: 13:38:40.078 - CManagedHost::OnFailure() - FAIL_OrphanedLock - eNoAction
    14232: 13:38:40.078 - CManagedHost::OnFailure() - FAIL_OrphanedLock - eNoAction
    5912: 13:38:40.078 - CManagedHost::OnFailure() - FAIL_OrphanedLock - eNoAction
    13744: 13:38:40.078 - CManagedHost::OnFailure() - FAIL_OrphanedLock - eNoAction
    14172: 13:38:40.078 - CManagedHost::OnFailure() - FAIL_OrphanedLock - eNoAction
    4636: 13:38:40.078 - CManagedHost::OnFailure() - FAIL_OrphanedLock - eNoAction
    12872: 13:38:40.078 - CManagedHost::OnFailure() - FAIL_OrphanedLock - eNoAction
    1144: 13:38:40.078 - CManagedHost::OnFailure() - FAIL_OrphanedLock - eNoAction
    14068: 13:38:40.078 - CManagedHost::OnFailure() - FAIL_OrphanedLock - eNoAction
    13816: 13:38:40.078 - CManagedHost::OnFailure() - FAIL_OrphanedLock - eNoAction
    6688: 13:38:40.078 - CManagedHost::OnFailure() - FAIL_OrphanedLock - eNoAction
    There are a lot of CManagedHost::OnFailure() - FAIL_OrphanedLock - eNoAction records.

    Also a lot of

    ERROR PhotonHostRuntime.PhotonDomainManager - UnhandledException: System.AppDomainUnloadedException: Attempted to access an unloaded AppDomain.
    in PhotonCLR.log

    Can it be that there are too many threads for Photon or .NET to handle?

    P.S. We have changed Photon config as follows (we have 24 cores):

    <IOPool NumThreads="8"/>
    <ThreadPool InitialThreads="16" MinThreads="16" MaxThreads="16"/>
    <!-- <ENetThreadPool InitialThreads="6" MinThreads="6" MaxThreads="6"/> -->
    After starting and writing 'server is running' to the log, the Photon server took a considerable time (about 1 minute) before it actually started serving peers.
  • BesoC
    BesoC ✭✭
    edited April 2019
    @chvetsov
    I found an old topic in this forum about CManagedHost::OnFailure() - FAIL_OrphanedLock - eNoAction. It looks like our async calls have created too many additional threads for the Photon server to manage. Perhaps the async/await and Task.Run(async () => await myAsync()).Wait(); patterns should not be used at all inside a PoolFiber?
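
    A sketch of an alternative to Task.Run(...).Wait() inside a fiber: start the async call without blocking and hop back onto the fiber when it completes (SavePlayerActionAsync, ApplyResult, PlayerAction and the log field are hypothetical names):

    private void SavePlayerAction(PlayerAction action)
    {
        // Kick off the async database call; the fiber thread is not blocked.
        this.SavePlayerActionAsync(action).ContinueWith(task =>
        {
            // Re-enqueue the continuation onto the game fiber so game state
            // is still only touched from one logical thread at a time.
            this.fiber.Enqueue(() =>
            {
                if (task.IsFaulted)
                {
                    log.ErrorFormat("SavePlayerActionAsync failed: {0}", task.Exception);
                    return;
                }

                this.ApplyResult(task.Result);
            });
        });
    }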
  • BesoC
    BesoC ✭✭
    edited April 2019
    @chvetsov I have found a rather funny record in the system EventLog :)

    The Photon Socket Server: DominoGameServer service terminated with the following service-specific error: Insert the diskette for drive Photon Socket Server: DominoGameServer.
    I believe you have that diskette, right? :)
  • @BesoC you have missed the most important line in your log.

    10324: 13:38:38.480 - CManagedHost::OnStackOverflow()

    You have a stack overflow exception. Please follow the instructions here: https://doc.photonengine.com/en-us/server/current/app-framework/stackoverflowexception

    best,
    ilya