After listening to a Tim Ferriss podcast with Tony Robbins, I decided
to read Robbins’ latest book “Money: Master the Game”. I was skeptical of
Tony Robbins and his style of motivational speaking. The book has a very
informal speaking style with lots of bold text where you feel Tony is
waving his hands frantically. But the content of the book is superb, if
a little long-winded. I have read a few books about finance and
investing, but this was the first to take a full lifetime look at
saving, investing and retiring.
The first part is about saving regularly, starting early and avoiding
excessive fees which are usually hidden. I feel my family is doing
well here, but making saving automatic is a good reminder that saving
is against our basic nature. Money finds ways to get spent.
The next part was making the case that we probably require less money
than we think to sustain the retirement lifestyle we think we
want. There were three levels of lifestyle and each had a worksheet
that required estimates of how much we currently spend. I should have
filled out these worksheets, but I did not because I have this fantasy
of managing our finances with something like hledger which will
provide these answers. When I imagine my retirement, it isn’t filled
with high expense activities like travelling the world on a yacht or
having a private jet. Imagining retirement is something I want to do
more of with my family. It will make this kind of planning easier.
The part about investing had some real gems. I was aware of the
importance of asset allocation and having investments that are not
correlated, but the Ray Dalio “all weather” portfolio was
fascinating. It was the first time I had heard of a portfolio that had
so little downside risk with such substantial upside. I always assumed
that any investment with high return required accepting extra
risk. This is an investment strategy that outperformed the “market” or
S&P 500 over decades with almost no loss in capital (maximum loss was
less than 4%). The “secret” is a large allocation of long term bonds
with a small allocation in gold and commodities. The logic is that the
economy has four seasons, which are the combinations of growth and
inflation. Having assets that do well in each “season” has finally
shown me what proper diversification looks like. Since stocks and
bonds are correlated, the classic advice of stock and bond
diversification is problematic. Right now the world economy is in a
period of low inflation and low growth. When it switches (and it will)
to higher inflation and “negative” growth (what a silly term), the
standard advice for asset allocation will cause big problems.
For me the action item from this has been to look very carefully at
the current asset allocation of my portfolio. Right now I am following
a very contrarian style and my largest holdings are real return bonds,
short US equities and long commodities like gold and energy. But this
is very short term focused and hopefully a longer focus will expose me
to less risk and volatility.
The next part focused on what Tony referred to as the “back of investment
mountain”. I have spent a lot of time thinking about how to invest,
but not what to do with that investment. I assumed I would retire at
some point and spend the rest of my retirement managing my pile of
investments. The book again showed me options that I was not aware
of. Apparently there are “hybrid” annuities that provide payments for
life while growing with the equity market but with full capital
preservation! Frankly it sounds too good to be true, but I have made a
note to investigate this further. The possibility of “getting out of
the game” by having an income without having to worry about it is very
appealing. I am skeptical because I don’t understand why an insurance
company would take on this much long term risk. At least I think the
premium must be very high to offset the risk, but Tony insists this
plan is now available to all US citizens and I intend to see if I can
find something similar in Canada.
There is a section with interviews of some of the greatest investors
ever like Charles Schwab, Ray Dalio, Warren Buffett, etc. This part was
nice, but did not contain helpful specific advice that wasn’t
mentioned in other parts of the book. There was one brief mention of
technical trading which I found strange because it actually goes
against most of the advice in the book. Technical trading assumes the
past stock price behavior can be used to predict the future stock
price. It is true that some people have become very rich that way, but
to me it is too much like gambling without any acknowledgment that the
stock represents a company or group of companies with assets and
revenue and people. On the other hand technical trading increases
volatility which can be useful to contrarian investors like myself.
The last chapter is about the power of giving. I am very motivated to
give my time and energy to my family and friends but the giving of
money is complicated. All things being equal, I would like any
donations to do the most good possible. Even defining what I mean by
good is difficult: less suffering? more education? more opportunity?
more equality? less disease? Maybe whatever makes me feel the happiest
is the simplest approach and I need to accept that it will probably
not be the most efficient. If my money isn’t making me happy, why even
bother working so hard to accumulate it in the first place? My action
item is to manage our finances better and look for ways to give more
money in ways that will make me and my family happier.
I learned a lot from this book. It has also changed my opinion of Tony
Robbins. The book was a real gift to me and I am planning my future
differently because of it.
Rating: Highly Recommended
I have Mesos agents located in three datacenters with a usually
reliable WAN connection. Occasionally though all the running tasks in
a DC get killed and it gets traced back to a WAN connection
interruption.
This wasn’t a big problem until recently, when a failover link
had high enough latency that the agents would disconnect and kill all
running tasks approximately every half hour for about 12 hours. I tried to
figure out which configuration options need to be tweaked for the
master and agents to wait longer before killing tasks and this is what
I came up with.
Current setup:
Three DCs: DC1 (central), DC2 and DC3.
Mesos 0.27.2 with custom Python scheduler
3 node Zookeeper 3.4.5 cluster in DC1 with 3 HA Mesos masters.
Zookeeper observer nodes in DC2 and DC3
Agents in DC2 connect to Zookeeper observer in DC2
From my research there are several timeouts that are at play here:
1) Zookeeper ticktime and synclimit. Unfortunately the zookeeper
read-only observer feature is not available yet, so when the
observer loses connection it drops connections to the agents. There
isn’t an agent zk_session_timeout configuration option, but it looks
like the agent force expires the zk session after 10 sec (the master
default). If zk reconnects in less than 10sec the session still
expires, but the master is detected and everything works.
2) Mesos master agent_ping_timeout and max_agent_ping_timeout. The
master shuts down the agent after this timeout (75 sec by
default). This causes the slave to restart and kill all running tasks.
3) agent_reregister_timeout and max_agent_reregister_timeout. If there
was a master failover during a WAN outage, then this timeout may be
triggered. But the default is 10min so that shouldn’t be a problem.
Here are my conclusions for my setup. Please let me know if I missed anything.
1) Since the ZK observers in DC2 and DC3 do not affect main ZK cluster
when disconnected, changing ticktime or synclimit is not necessary.
2) Increase max_agent_ping_timeout on masters so that
(agent_ping_timeout * max_agent_ping_timeout) is longer than most
WAN outages. In my case most outages are less than 10 mins so I am
trying max_agent_ping_timeout = 40. This means I do not need to
increase reregister timeout. Unfortunately max_agent_ping_timeout is
a global configuration and I cannot set this value differently for
agents in the different DCs.
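The arithmetic behind that choice can be sketched in a few lines of Python. This is only an illustration: the 15 sec per ping is an assumption inferred from the 75 sec default quoted above (taken as 15 sec x 5 missed pings), and the function name is made up for this example.

```python
from datetime import timedelta


def agent_removal_window(ping_timeout_secs: float, max_ping_timeouts: int) -> timedelta:
    """Time the master waits on missed pings before shutting down an agent."""
    return timedelta(seconds=ping_timeout_secs * max_ping_timeouts)


# Default behaviour: 75 sec in total, assumed here to be 15 sec x 5 pings.
default_window = agent_removal_window(15, 5)

# The value being tried in this post: allow 40 missed pings.
tuned_window = agent_removal_window(15, 40)

print(default_window)  # 0:01:15
print(tuned_window)    # 0:10:00
```

With 40 allowed timeouts the window works out to 10 minutes, which matches the reregister default and covers most of the outages I see.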
What follows is an attempt to document a 6 month long debugging
odyssey. This is easily the strangest computer behavior I have ever
debugged or tried to understand.
Background
I manage a cluster of bare metal servers used for coverage build
testing of Wind River Linux. The collection of git repos alone is over
15GB and the resulting IO traffic is high enough that using a public
cloud is not cost effective. The current sweet spot for price to
build performance to rack space is a chassis that squeezes 4 blade
servers into a 2U chassis. We have a bunch of the Dell C6220 series
servers and then I decided to try the Dell FX2 chassis. The selling
points for me were the full Dell iDRAC and the network IO aggregator
system. The IO module theoretically would allow me to network 4 X
10GbE per system (160 GbE total) to a redundant switch pair providing
80 GbE uplink capability. We already had a good experience with the
M-8024K module for the M1000e chassis.
Hardware setup
The first chassis was installed and networked. The IO modules were
set up the same way as the M-8024K, which is as a VLAN access port. The
network was configured as VLAN 105, but with the access port this
detail is hidden from the systems. The main problem is that the RedHat
and Debian installers do not support VLAN configuration of the network
devices so all my machines have this configuration which allows me to
use Foreman for automated PXE installs.
Using newest Ubuntu installer
Things were finally ready for me in January 2016. I started the PXE
install of Ubuntu 14.04. This failed because the kernel drivers for
the X710 nic were only integrated in Linux 4.2 and the Ubuntu
installer with the 3.13 kernel could not detect the X710 nic.
Luckily Ubuntu rebuilds the 14.04 installer image with the 15.10
kernel. I switched to the newest version of the installer and the
installer was able to detect the X710 nic.
The first hiccup
This time the kernel and initrd were downloaded, the initial DHCP
succeeded but then DNS lookup to download the preseed failed. This was
strange but not unheard of. It had happened a long time ago but I
hadn’t seen it in years. I was quick to blame our Microsoft DNS
servers and replaced all DNS names in the preseed with IP
addresses. This allowed the installation to complete and the machine
booted Ubuntu as usual. Then things started to get really
strange. DHCP on boot would occasionally fail and then I noticed that
DNS would occasionally time out and then succeed right
afterwards. This made using programs like Puppet impossible.
I completed the install of the other 3 servers and noticed that
DHCP would occasionally fail during the install process. This was
mystifying to me because the PXE boot process uses the same DHCP setup
to download the kernel and initrd and I never saw it fail.
I checked the DHCP server and its logs showed that it had
received the request and was sending the offer back to the blade, but
that offer was never received. Running ethtool did not show any
dropped or corrupted packets reported by the nic.
After the installation of the remaining 3 servers in the chassis was
complete, I opened a support case with Dell.
Approx two weeks:
TOR access port config, IOA VLAN 1 untagged, Hosts untagged = problem
Round one - TOR switch config
The configuration of the TOR Cisco switch connecting to the IOA was
the subject of the first round of debugging. The IOA comes by default
in a “no touch” default configuration and it made sense to verify the
setup of the TOR switch. It took a few weeks to get together all the
people involved: myself, on site IT, IT networking, Dell tech support
and Dell networking specialist. After many hours, the TOR switch was
changed from an access port to a VLAN 105 tagged port. This resulted
in all traffic being dropped until the IOA was changed to make the
105 VLAN untagged. But the DNS/DHCP problem persisted.
Round One - Approx one month
TOR VLAN 105, IOA VLAN 105 untagged, Hosts untagged = problem
Round Two - Internal reproducer
Moving up levels of Dell support always takes time. While waiting for
networking support to become available I started experimenting. I
wanted to see if I could reproduce the problem without involving the
TOR switch so I set up dnsmasq on blade #1 as a DNS caching proxy. I
then added a fake host entry into /etc/hosts so I could be sure that
dnsmasq was being queried and started running nslookup queries on
blade #2. To my surprise, I was able to reproduce the problem even
with the network traffic completely internal to the FX2 chassis.
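I ran the queries by hand with nslookup, but the same test can be sketched as a small Python script that sends repeated DNS queries over UDP and counts how many get no reply. This is not what I actually ran. It is a minimal sketch, and the function names, the host name and the server address in the usage note are all hypothetical.

```python
import socket
import struct


def build_dns_query(name: str, txid: int) -> bytes:
    """Build a minimal DNS A-record query for `name`."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # The question name is a run of length-prefixed labels ending in a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A record), QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def probe(server: str, name: str, attempts: int = 100,
          timeout: float = 1.0, port: int = 53) -> int:
    """Send `attempts` UDP queries and return how many got no valid reply."""
    lost = 0
    for i in range(attempts):
        txid = i & 0xFFFF
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            try:
                sock.sendto(build_dns_query(name, txid), (server, port))
                reply, _ = sock.recvfrom(512)
            except OSError:
                # A timeout or any other socket error counts as a lost query.
                lost += 1
                continue
            # A reply with a different transaction id is not ours.
            if len(reply) < 2 or struct.unpack("!H", reply[:2])[0] != txid:
                lost += 1
    return lost
```

Calling something like probe("192.0.2.10", "fakehost.example") from blade #2, with the address of the dnsmasq blade and the fake host entry substituted in, returns the number of lost queries out of 100.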
Round Two - Approx one month
IOA VLAN 1 untagged, Hosts untagged = problem
Round Three - A solution?
Then I decided to investigate if VLAN tagging at the Linux host level
would change things. I PXE booted blade #3 with the IOA configured
VLAN 105 untagged and when DHCP failed, I switched the IOA to VLAN 1
untagged and used the secondary install console to change the network
config from em1 to em1.105. I was able to complete the install and boot the
machine.
Amazingly the DHCP/DNS problems went away! It took some time to fix my
Puppet configuration to work with the VLAN tagging and get everything
working. Using blades #1 and #2, I was also able to demonstrate that
the problem was present with the IOA on VLAN 1 untagged and untagged
hosts, and not present with the IOA on VLAN 1 untagged and the Linux
hosts configured for VLAN 105.
Round Three - Approx one month
IOA VLAN 1 untagged, Hosts tagged 105 = no problem!
Round Four - Debugging the IOA
Now the focus turned exclusively to the IOA. With the help of Dell
network support, we disabled the outbound ports of the IOA and ran
tcpdump on the hosts. We were able to see packets being sent from the
DNS “server” and not being received by the client about 25% of the
time. About 5-10% of the time the initial DNS query would not even
make it to the DNS server.
It was around this time that a second FX2 chassis with identical
hardware arrived, but with a newer IOA firmware version. Full of hope
I did a PXE install on a blade, but with the exact same problem.
The Dell networking support team attempted to reproduce the problem on
their internal lab, but even with an FX2 chassis, Ubuntu 14.04 install
on an FC630 with the X710 nic they were unable to reproduce the
problem. To ensure the systems were configured identically we went
through the entire BIOS setup line by line to compare. I even tried
installs using UEFI and “Legacy BIOS” modes with no change in behavior.
I then got a crash course in F10 network configuration. It took a
while to find the proper command line incantations, but we set up
counters on the various ports to count incoming and outgoing
packets. We set up fixed ARP entries and tried to reduce the network
traffic as much as possible. Unfortunately the outgoing port counters
did not work, but from the incoming counters it looked like the IOA
was not seeing the packets come in the interface.
Round Four - Approx two months
IOA functioning as designed.
Bonus: I learned how to use the Dell iDRAC virtual media feature to
transfer files to and from a system without network access.
Round Five - Debugging the X710 nic
Now another Dell Linux support tech was brought in and he confirmed
that the Linux config was correct. We then tried a firmware upgrade
for the X710 nic. This involved a failed upgrade attempt using an ISO
upgrade package (only works with Legacy BIOS mode), a DRAC upgrade with
HTML5 support and finally using the iDRAC upgrade functionality to
upgrade the NIC firmware.
Unfortunately the firmware upgrade made things even worse!! DHCP
worked but I could not ping inside the chassis. To make things even
more bizarre, ARP would occasionally work but ping would not!
At this point, we decided to replace the Intel X710 nic with the
Broadcom BCM57840 nic with a similar feature set to see how/if the
problem changed.
Round Five - Approx one month
Several failed firmware upgrades and violations of the laws of
networking.
Round Six - Something goes right for a change
A technician swapped out the Intel NICs for Broadcom NICs. I did
another PXE install (luckily it is completely automated) and
everything worked as expected! No DHCP/DNS errors or any hint of
strange behavior.
We finally had a solution and the remaining Intel X710 nics were
swapped out over a few weeks.
Final setup:
TOR 105 access port, IOA VLAN 1 untagged default config, Linux host untagged.
Recap
- Bug not reproducible by support
- Intermittent dropping of UDP packets without connection to non Dell
hardware
- Enabling VLAN tagging on the host “solved” the problem
- Incorrect hardware counters
- Firmware upgrades make things worse
- Debugging requires coordination of at least 3 teams
- Root cause never determined
- Everyone involved agreed it was one of the strangest problems they
have ever debugged
Number of people involved
At Wind River: myself, IT and IT networking
At Dell: 2 tech support, 2 networking support, 1 Linux support, 2
managers
Total time consumed: approx 2-3 man months over 6 months of calendar
time.
Conclusion
The reality is that Dell shipped us something broken. The open
question is whether testing could have found this problem before the
hardware shipped. Dell was unable to reproduce the problem internally
and without knowing the root cause of the problem, I can only
speculate.
Ideally I would like to know the cause of the problem, know that it
was fixed and that no one else will have to suffer through this. But
that would be the fairy tale ending and life doesn’t work that
way. The case is considered closed and I will get back to all the
tasks I had to put on hold for this.