Running mesos agents over an unreliable network

I have mesos agents located in three datacenters with a usually reliable WAN connection. Occasionally though all the running tasks in a DC get killed and it gets traced back to a WAN connection interruption.

This hasn’t been a big problem until recently when a fail over link had high enough latency that the agents would disconnect and kill all running tasks approx every half hour for about 12 hours. I tried to figure out which configuration options need to be tweaked for the master and agents to wait longer before killing tasks and this is what I came up with.

Current setup:

Three DC: DC1 (central), DC2 and DC3. Mesos 0.27.2 with custom python scheduler 3 node Zookeeper 3.4.5 cluster in DC1 with 3 HA mesos masters. Zookeeper observer nodes in DC2 and DC3 Agents in DC2 connect to Zookeeper observer in DC2

From my research there are several timeouts that are at play here:

1) Zookeeper ticktime and synclimit. Unfortunately the zookeeper read-only observer feature is not available yet, so when the observer loses connection it drops connections to the agents. There isn’t an agent zk_session_timeout configuration option, but it looks like the agent force expires the zk session after 10 sec (the master default). If zk reconnects in less than 10sec the session still expires, but the master is detected and everything works.

2) Mesos master agent_ping_timeout and max_agent_ping_timeout. The master shuts down the agent after this timeout (75 sec by default). This causes the slave to restart and kill all running tasks.

3) agent_reregister_timeout and max_agent_reregister_timeout. If there was a master failover during a WAN outage, then this timeout may be triggered. But the default is 10min so that shouldn’t be a problem.

Here are my conclusions for my setup. Please let me know if I missed anything.

1) Since the ZK observers in DC2 and DC3 do not affect main ZK cluster when disconnected, changing ticktime or synclimit is not necessary.

2) Increase max_agent_ping_timeout on masters so that (agent_ping_timeout * max_agent_ping_timeout) is longer than most WAN outages. In my case most outages are less than 10 mins so I am trying max_agent_ping_timeout = 40. This means I do not need to increase reregister timeout. Unfortunately max_agent_ping_timeout is a global configuration and I cannot set this value differently for agents in the different DCs.

Running mesos agents over an unreliable network

Pages