Konrad Scherer
MONDAY, 24 OCTOBER 2016

Book Review: 'Money: Master the Game' by Tony Robbins

After listening to a Tim Ferriss podcast with Tony Robbins, I decided to read his latest book “Money: Master the Game”. I was skeptical of Tony Robbins and his style of motivational speaking. The book has a very informal speaking style with lots of bold text where you feel Tony is waving his hands frantically. But the content of the book is superb, if a little long-winded. I have read a few books about finance and investing, but this was the first to take a full lifetime look at saving, investing and retiring.

The first part is about saving regularly, starting early and avoiding excessive fees which are usually hidden. I feel my family is doing well here, but making saving automatic is a good reminder that saving is against our basic nature. Money finds ways to get spent.

The next part made the case that we probably require less money than we think to sustain the retirement lifestyle we think we want. There were three levels of lifestyle and each had a worksheet that required estimates of how much we currently spend. I should have filled out these worksheets, but I did not because I have this fantasy of managing our finances with something like hledger which will provide these answers. When I imagine my retirement, it isn’t filled with high expense activities like travelling the world on a yacht or having a private jet. Imagining retirement is something I want to do more of with my family. It will make this kind of planning easier.

The part about investing had some real gems. I was aware of the importance of asset allocation and having investments that are not correlated, but the Ray Dalio “all weather” portfolio was fascinating. It was the first time I had heard of a portfolio that had so little downside risk with such substantial upside. I always assumed that any investment with high return required accepting extra risk. This is an investment strategy that outperformed the “market” or S&P 500 over decades with almost no loss in capital (maximum loss was less than 4%). The “secret” is a large allocation of long term bonds with a small allocation in gold and commodities. The logic is that the economy has four “seasons”, which are the combinations of growth and inflation. Having assets that do well in each “season” has finally shown me what proper diversification looks like. Since stocks and bonds are correlated, the classic advice of stock and bond diversification is problematic. Right now the world economy is in a period of low inflation and low growth. When it switches (and it will) to higher inflation and “negative” growth (what a silly term), the standard advice for asset allocation will cause big problems.

For me the action item from this has been to look very carefully at the current asset allocation of my portfolio. Right now I am following a very contrarian style and my largest holdings are real return bonds, short US equities and long commodities like gold and energy. But this is very short term focused and hopefully a longer focus will expose me to less risk and volatility.

The next part focused on what Tony referred to as the “back of investment mountain”. I have spent a lot of time thinking about how to invest, but not what to do with that investment. I assumed I would retire at some point and spend the rest of my retirement managing my pile of investments. The book again showed me options that I was not aware of. Apparently there are “hybrid” annuities that provide payments for life while growing with the equity market but with full capital preservation! Frankly it sounds too good to be true, but I have made a note to investigate this further. The possibility of “getting out of the game” by having an income without having to worry about it is very appealing. I am skeptical because I don’t understand why an insurance company would take on this much long term risk. At least I think the premium must be very high to offset the risk, but Tony insists this plan is now available to all US citizens and I intend to see if I can find something similar in Canada.

There is a section with interviews of some of the greatest investors ever like Charles Schwab, Ray Dalio, Warren Buffett, etc. This part was nice, but did not contain helpful specific advice that wasn’t mentioned in other parts of the book. There was one brief mention of technical trading which I found strange because it actually goes against most of the advice in the book. Technical trading assumes the past stock price behavior can be used to predict the future stock price. It is true that some people have become very rich that way, but to me it is too much like gambling without any acknowledgment that the stock represents a company or group of companies with assets and revenue and people. On the other hand technical trading increases volatility which can be useful to contrarian investors like myself.

The last chapter is about the power of giving. I am very motivated to give my time and energy to my family and friends but the giving of money is complicated. All things being equal, I would like any donations to do the most good possible. Even defining what I mean by good is difficult: less suffering? more education? more opportunity? more equality? less disease? Maybe whatever makes me feel the happiest is the simplest approach and I need to accept that it will probably not be the most efficient. If my money isn’t making me happy, why even bother working so hard to accumulate it in the first place? My action item is to manage our finances better and look for ways to give more money in ways that will make me and my family happier.

I learned a lot from this book. It has also changed my opinion of Tony Robbins. The book was a real gift to me and I am planning my future differently because of it.

Rating: Highly Recommended




TUESDAY, 12 JULY 2016

Running mesos agents over an unreliable network

I have mesos agents located in three datacenters with a usually reliable WAN connection. Occasionally, though, all the running tasks in a DC get killed, and it gets traced back to a WAN connection interruption.

This hasn’t been a big problem until recently, when a failover link had high enough latency that the agents would disconnect and kill all running tasks approximately every half hour for about 12 hours. I tried to figure out which configuration options need to be tweaked for the master and agents to wait longer before killing tasks and this is what I came up with.

Current setup:

Three DCs: DC1 (central), DC2 and DC3
Mesos 0.27.2 with custom python scheduler
3 node Zookeeper 3.4.5 cluster in DC1 with 3 HA mesos masters
Zookeeper observer nodes in DC2 and DC3
Agents in DC2 connect to Zookeeper observer in DC2

From my research there are several timeouts that are at play here:

1) Zookeeper tickTime and syncLimit. Unfortunately the zookeeper read-only observer feature is not available yet, so when the observer loses connection it drops connections to the agents. There isn’t an agent zk_session_timeout configuration option, but it looks like the agent force expires the zk session after 10 sec (the master default). If zk reconnects in less than 10 sec the session still expires, but the master is detected and everything works.

2) Mesos master agent_ping_timeout and max_agent_ping_timeout. The master shuts down the agent after this timeout (75 sec by default). This causes the agent to restart and kill all running tasks.

3) agent_reregister_timeout and max_agent_reregister_timeout. If there was a master failover during a WAN outage, then this timeout may be triggered. But the default is 10min so that shouldn’t be a problem.

Here are my conclusions for my setup. Please let me know if I missed anything.

1) Since the ZK observers in DC2 and DC3 do not affect the main ZK cluster when disconnected, changing tickTime or syncLimit is not necessary.

2) Increase max_agent_ping_timeout on masters so that (agent_ping_timeout * max_agent_ping_timeout) is longer than most WAN outages. In my case most outages are less than 10 mins so I am trying max_agent_ping_timeout = 40. This means I do not need to increase the reregister timeout. Unfortunately max_agent_ping_timeout is a global configuration and I cannot set this value differently for agents in the different DCs.
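
The arithmetic behind that choice can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this post, and the 15 sec per-ping value is inferred from the figures above (75 sec default window, and 40 pings to cover 10 mins), not read from a running master.

```python
import math


def tolerated_outage_seconds(ping_timeout_secs: float, max_ping_timeouts: int) -> float:
    """Total time the master waits before shutting down an unresponsive agent."""
    return ping_timeout_secs * max_ping_timeouts


def pings_needed(outage_secs: float, ping_timeout_secs: float) -> int:
    """Smallest ping count whose total window is at least as long as the outage."""
    return math.ceil(outage_secs / ping_timeout_secs)


# Assumed per-ping timeout: 600 sec / 40 pings = 15 sec, which also matches
# the 75 sec default window if the default ping count is 5.
PING_TIMEOUT_SECS = 15

default_window = tolerated_outage_seconds(PING_TIMEOUT_SECS, 5)   # 75 sec
tuned_window = tolerated_outage_seconds(PING_TIMEOUT_SECS, 40)    # 600 sec = 10 mins
pings_for_ten_mins = pings_needed(10 * 60, PING_TIMEOUT_SECS)     # 40
```

The tuned window is still below the 10min reregister default mentioned above only by equality, so a longer target outage would mean revisiting that timeout as well.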




TUESDAY, 21 JUNE 2016

Dell FX2 and Intel X710 nics

What follows is an attempt to document a 6 month long debugging odyssey. This is easily the strangest computer behavior I have ever debugged or tried to understand.

Background

I manage a cluster of bare metal servers used for coverage build testing of Wind River Linux. The collection of git repos alone is over 15GB and the resulting IO traffic is high enough that using a public cloud is not cost effective. The current sweet spot for price to build performance to rack space is a chassis that squeezes 4 blade servers into a 2U chassis. We have a bunch of the Dell C6220 series servers and then I decided to try the Dell FX2 chassis. The selling points for me were the full Dell iDRAC and the network IO aggregator system. The IO module theoretically would allow me to network 4 x 10GbE per system (160 GbE total) to a redundant switch pair providing 80 GbE uplink capability. We already had a good experience with the M-8024K module for the M1000e chassis.

Hardware setup

The first chassis was installed and networked. The IO modules were set up in the same way as the M-8024K, which is as a VLAN access port. The network was configured as VLAN 105, but with the access port this detail is hidden from the systems. The main problem is that the RedHat and Debian installers do not support VLAN configuration of the network devices so all my machines have this configuration which allows me to use Foreman for automated PXE installs.

Using newest Ubuntu installer

Things were finally ready for me in January 2016. I started the PXE install of Ubuntu 14.04. This failed because the kernel drivers for the X710 nic were only integrated in Linux 4.2 and the Ubuntu installer with the 3.13 kernel could not detect the X710 nic.

Luckily Ubuntu rebuilds the 14.04 installer image with the 15.10 kernel. I switched to the newest version of the installer and the installer was able to detect the X710 nic.

The first hiccup

This time the kernel and initrd were downloaded, the initial DHCP succeeded but then DNS lookup to download the preseed failed. This was strange but not unheard of. It had happened a long time ago but I hadn’t seen it in years. I was quick to blame our Microsoft DNS servers and replaced all DNS names in the preseed with IP addresses. This allowed the installation to complete and the machine booted Ubuntu as usual. Then things started to get really strange. DHCP on boot would occasionally fail and then I noticed that DNS would occasionally time out and then succeed right afterwards. This made using programs like Puppet impossible.

I completed the install of the other 3 servers and noticed that DHCP would occasionally fail during the install process. This was mystifying to me because the PXE boot process uses the same DHCP setup to download the kernel and initrd and I never saw it fail.

I checked the DHCP server and the server logs showed that the DHCP server received the request and was sending the offer back to the blade, but that offer was never received. Running ethtool did not show any dropped or corrupted packets reported by the nic.

After the installation of the remaining 3 servers in the chassis was complete, I opened a support case with Dell.

Approx two weeks:

TOR access port config, IOA VLAN 1 untagged, Hosts untagged = problem

Round one - TOR switch config

The configuration of the TOR Cisco switch connecting to the IOA was the subject of the first round of debugging. The IOA comes by default in a “no touch” default configuration and it made sense to verify the setup of the TOR switch. It took a few weeks to get together all the people involved: myself, on site IT, IT networking, Dell tech support and a Dell networking specialist. After many hours, the TOR switch was changed from an access port to a VLAN 105 tagged port. This resulted in all traffic being dropped until the IOA was changed to make the 105 VLAN untagged. But the DNS/DHCP problem persisted.

Round One - Approx one month

TOR VLAN 105, IOA VLAN 105 untagged, Hosts untagged = problem

Round Two - Internal reproducer

Moving up levels of Dell support always takes time. While waiting for networking support to become available I started experimenting. I wanted to see if I could reproduce the problem without involving the TOR switch so I set up dnsmasq on blade #1 as a DNS caching proxy. I then added a fake host entry into /etc/hosts so I could be sure that dnsmasq was being queried and started running nslookup queries on blade #2. To my surprise, I was able to reproduce the problem even with the network traffic completely internal to the FX2 chassis.
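
For anyone wanting to repeat this kind of test, here is a rough Python equivalent of running nslookup in a loop and counting the failures. It is a sketch and not what I actually ran: the function names are invented for illustration, and it only checks that some UDP reply arrives before the timeout without parsing the answer.

```python
import socket
import struct


def build_dns_query(hostname: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for the A record of hostname."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: length-prefixed labels terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE = 1 (A record), QCLASS = 1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def count_lost_queries(server: str, hostname: str, attempts: int = 100,
                       timeout: float = 1.0, port: int = 53) -> int:
    """Send one UDP query per attempt and count those with no reply in time."""
    lost = 0
    for i in range(attempts):
        query = build_dns_query(hostname, query_id=i & 0xFFFF)
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            try:
                sock.sendto(query, (server, port))
                sock.recvfrom(512)
            except OSError:  # includes socket.timeout
                lost += 1
    return lost
```

Calling count_lost_queries with the address of the dnsmasq blade and the fake host name from /etc/hosts gives a loss count that can be compared between network configurations. On a healthy network inside one chassis it should be zero or very close to it.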

Round Two - Approx one month

IOA VLAN 1 untagged, Hosts untagged = problem

Round Three - A solution?

Then I decided to investigate if VLAN tagging at the Linux host level would change things. I PXE booted blade #3 with the IOA configured VLAN 105 untagged and when DHCP failed, I switched the IOA to VLAN 1 untagged and used the secondary install console to change the network config from em1 to em1.105. I was able to complete the install and boot the machine.

Amazingly the DHCP/DNS problems went away! It took some time to fix my Puppet configuration to work with the VLAN tagging and get everything working. I was also able to demonstrate, using blades #1 and #2, that the problem was present with the IOA on VLAN 1 untagged and the hosts untagged, and not present with the IOA on VLAN 1 untagged and the Linux hosts configured for VLAN 105.

Round Three - Approx one month

IOA VLAN 1 untagged, Hosts tagged 105 = no problem!

Round Four - Debugging the IOA

Now the focus turned exclusively to the IOA. With the help of Dell network support, we disabled the outbound ports of the IOA and ran tcpdump on the hosts. We were able to see packets being sent from the DNS “server” and not being received by the client about 25% of the time. About 5-10% of the time the initial DNS query would not even make it to the DNS server.

It was around this time that a second FX2 chassis with identical hardware arrived, but with a newer IOA firmware version. Full of hope I did a PXE install on a blade, but with the exact same problem.

The Dell networking support team attempted to reproduce the problem in their internal lab, but even with an FX2 chassis and an Ubuntu 14.04 install on an FC630 with the X710 nic they were unable to reproduce the problem. To ensure the systems were configured identically we went through the entire BIOS setup line by line to compare. I even tried installs using UEFI and “Legacy BIOS” modes with no change in behavior.

I then got a crash course in F10 network configuration. It took a while to find the proper command line incantations, but we set up counters on the various ports to count incoming and outgoing packets. We set up fixed ARP entries and tried to reduce the network traffic as much as possible. Unfortunately the outgoing port counters did not work, but from the incoming counters it looked like the IOA was not seeing the packets come in the interface.

Round Four - Approx two months

IOA functioning as designed.

Bonus: I learned how to use the Dell iDRAC virtual media feature to transfer files to and from a system without network access.

Round Five - Debugging the X710 nic

Now another Dell Linux support tech was brought in and he confirmed that the Linux config was correct. We then tried a firmware upgrade for the X710 nic. This involved a failed upgrade attempt using an ISO upgrade package (only works with Legacy BIOS mode), a DRAC upgrade with HTML5 support and finally using the iDRAC upgrade functionality to upgrade the nic firmware.

Unfortunately the firmware upgrade made things even worse!! DHCP worked but I could not ping inside the chassis. To make things even more bizarre, ARP would occasionally work but ping would not!

At this point, we decided to replace the Intel X710 nic with the Broadcom BCM57840 nic with a similar feature set to see how/if the problem changed.

Round Five - Approx one month

Several failed firmware upgrades and violations of the laws of networking.

Round Six - Something goes right for a change

A technician swapped out the Intel nics for Broadcom nics. I redid another PXE install (luckily it is completely automated) and everything worked as expected! No DHCP/DNS errors or any hint of strange behavior.

We finally had a solution and the remaining Intel X710 nics were swapped out over a few weeks.

Final setup:

TOR 105 access port, IOA VLAN 1 untagged default config, Linux host untagged.

Recap

  1. Bug not reproducible by support
  2. Intermittent dropping of UDP packets without connection to non Dell hardware
  3. Enabling VLAN tagging on the host “solved” the problem
  4. Incorrect hardware counters
  5. Firmware upgrades make things worse
  6. Debugging requires coordination of at least 3 teams
  7. Root cause never determined
  8. Everyone involved agreed it was one of the strangest problems they have ever debugged

Number of people involved

At Wind River: myself, IT and IT networking

At Dell: 2 tech support, 2 networking support, 1 Linux support, 2 managers

Total time consumed: approx 2-3 man months over 6 months of calendar time.

Conclusion

The reality is that Dell shipped us something broken. The open question is whether testing could have found this problem before the hardware shipped. Dell was unable to reproduce the problem internally and without knowing the root cause of the problem, I can only speculate.

Ideally I would like to know the cause of the problem, know that it was fixed and that no one else will have to suffer through this. But that would be the fairy tale ending and life doesn’t work that way. The case is considered closed and I will get back to all the tasks I had to put on hold for this.
