troubleshooting
- Russ's course on OE
- dwell time - time to detect the failure
- resilience doesn't equal redundancy
- failure domain - set of subsystems impacted by a failure in another subsystem
- aggregation - abstracting reachability info
- summarization - abstracting topology info
- contrary to popular belief - tend to larger failure domains - ?
- be conservative in continuous observation (extensive measurement can change how the network behaves, or whether it works at all); be liberal in instrumentation (keep the ability to measure more when needed)
- know what normal looks like
- 2 am rule - if you can't explain the config at 2 am to a non-native speaker, you shouldn't configure it (unless no other solution exists)
- LOC = Loss Of Carrier
- if you haven't found the trade-off, you haven't looked hard enough
- what changed last; what can be checked quickly; measure and half-split (test the midpoint of the path, then repeat on the failing half)
- show commands, logs, debugs, pcaps
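
The half-split step above can be sketched as a binary search along the path. This is only an illustration, not something from the course: the hop names and the `reachable` check are hypothetical stand-ins for whatever test you actually run (ping, show command, pcap), and it assumes that every hop past the first failing one fails too.

```python
from typing import Callable, Optional, Sequence


def half_split(path: Sequence[str],
               reachable: Callable[[str], bool]) -> Optional[str]:
    """Return the first hop in `path` that fails the check, or None if all pass.

    Assumption: hops before the fault pass and every hop from the fault
    onward fails, so each test rules out half of the remaining path.
    """
    lo, hi = 0, len(path)
    while lo < hi:
        mid = (lo + hi) // 2
        if reachable(path[mid]):
            lo = mid + 1   # midpoint works: the fault is further along
        else:
            hi = mid       # midpoint fails: the fault is here or earlier
    return path[lo] if lo < len(path) else None


# Hypothetical path from client to server; names are made up.
path = ["access-sw", "dist-sw", "core-rtr", "fw", "server"]

# Stand-in check: pretend everything from "core-rtr" onward is unreachable.
first_bad = half_split(path, lambda hop: path.index(hop) < 2)
print(first_bad)  # core-rtr
```

On a five-hop path this needs two checks instead of up to five, which is the whole point of splitting in half rather than walking hop by hop.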