Regular readers of this blog probably see me as an extremely clever, flawless hunk who knows a lot, never makes mistakes, and is traditionally handsome but with a modern style. And of course, you’re not wrong. Except, here’s the twist: you’re dead wrong.
We bloggers are always keen to share knowledge we’ve learned via books, learned via real-world configuration, learned via experience. But one thing people often leave out of the narrative is that sometimes, that experience comes from making mistakes. After all, it’s embarrassing to publicly admit that we did something wrong. But we all know that our mistakes often yield the greatest lessons. With that in mind, today I want to literally make history with my fingers, by writing about a thing I learned the hard way.
Fancy a bonus challenge? Read this post slowly, pause as you go, and see if you can think of troubleshooting steps I might have missed, and what the problem might be. If you can honestly figure it out before you get to the end, I’ll send you £10,000,000 in hard cash. Which, thanks to modern-day politics, translates to about $0.85.
On this particular day, my job is very easy: bring up a customer’s new FTTC circuit with a Juniper SRX, and make sure it can access other sites in their MPLS VPN.
Spoiler alert: I am incorrect about this job being easy.
A colleague has pre-configured the device. We have an engineer on site to plug everything in. Once it’s up, all I have to do is confirm that the network functions as expected, and we’re good to go. It should be a nice, quick, elegant, simple job, that gives us an early finish and early trip to the pub.
Let’s say that this network is 192.168.10.0/24. There’s a DHCP pool from .50 to .100, with three machines on the network that have static IPs. The router is 192.168.10.1.
The engineer plugs everything in, and the internet connection comes up. Hooray!
We run a speed test, and the speeds are great. Hooray!
He asks the users on site if they have connectivity, and they do. Hooray!
I check the DHCP binding table, and I see about ten machines in there, so I know that all the machines are getting IPs successfully. Hooray!
The engineer can ping other sites in their MPLS VPN. Hooray!
Their wifi works. Hooray!!
Just before we go, we quickly do a test print. And… no luck. Okay, no problem: it’s happened before, and a reboot cleared any stale ARP entries. So, we reboot, and… aaaand… aaaaaaaaand…. nothing.
The IP address of all the printers at every site ends in the same number. Let’s say that one of them is 192.168.10.5. Usually, I can ping them. This time, no such joy. And let’s be clear: pinging printers *is* a joy. A true joy. One of the truest joys known to humankind.
I check the ARP cache, and I notice that I actually don’t see any machines outside of the DHCP range. I’ve brought about 20 of this customer’s lines online, and I usually see the three static IP machines in the ARP table. And if I don’t, a ping usually fixes that.
This particular SRX is set up to use ge-0/0/5 for the WAN, and then ge-0/0/0 to ge-0/0/4 as the LAN. Each interface is set as an access port, all tied to irb.100.
I type “show ethernet-switching table”, and I notice that there’s actually more machines in this table than there are in the ARP cache. I grab one of the missing MAC addresses, and pop it into a MAC vendor lookup website. Sure enough, the address is owned by the vendor of the printer.
So, our SRX is seeing these static IP machines at layer 2, but not at layer 3. Any ideas yet?
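If you ever want to do that layer 2 versus layer 3 comparison without squinting at two terminal windows, here is a minimal Python sketch. It assumes you have already pulled the MAC addresses out of "show ethernet-switching table" and the ARP cache by whatever means you like; the function name and all the sample addresses are made up for illustration.

```python
def seen_at_l2_only(switching_table_macs, arp_cache):
    """Return MACs present in the switching table but absent from the ARP cache.

    switching_table_macs: iterable of MAC strings learned at layer 2.
    arp_cache: dict mapping IP address -> MAC string learned at layer 3.
    """
    # Normalise case so "AA:BB" and "aa:bb" compare equal.
    arp_macs = {mac.lower() for mac in arp_cache.values()}
    return sorted(
        mac.lower() for mac in switching_table_macs if mac.lower() not in arp_macs
    )


# Hypothetical sample data, not taken from the real customer network.
switching_macs = ["00:11:22:aa:bb:01", "00:11:22:aa:bb:02", "00:de:ad:be:ef:05"]
arp_cache = {
    "192.168.10.50": "00:11:22:aa:bb:01",
    "192.168.10.51": "00:11:22:AA:BB:02",
}

missing = seen_at_l2_only(switching_macs, arp_cache)
print(missing)
```

Anything this prints is a machine the router can hear but has no ARP entry for, which makes it a good candidate for a MAC vendor lookup.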
My mind is whizzing. My first thought is that perhaps we’ve got the LAN range wrong. But no: the customer confirmed that 192.168.10.x is correct.
The engineer on site wondered if there was a problem with their switch, and he went above and beyond to try different combinations of cable, to no avail. He noticed that the cable that plugged the switch into the SRX was a bit broken – the plastic bit that clips the ethernet cable in had come off – but that can’t be it, because DHCP machines have no problems at all pinging other machines on the MPLS VPN network. Everything seemed fine with the physical network – which is lucky, because the printer was heavier than a tank, and plugging it directly into the router wasn’t an option. Unless we could employ Mr T or Terry Crews for the job. And something tells me that their day-rate might be a little outside our budget.
I wondered if their old setup was on a different, tagged VLAN. But nope: we got the config of their old router, and the setup was very simple.
Okay, time to go deeper, and inspect the traffic on the network for clues.
I ran the command “monitor traffic interface ge-0/0/1 no-resolve layer2-headers” to see if I could see anything interesting. What was surprising to me was that I was really only seeing spanning tree stuff. The only ARP requests were from one machine outside the DHCP range. This machine was calling out to find out where the default gateway was. The SRX replied, but the machine didn’t seem to be accepting it: the machine just kept sending the same ARP request out, every second. Perhaps this makes sense: if the problem machines are printers then maybe they won’t actively be calling out to the network, at least not regularly.
I double-check and triple-check the config, and everything seems fine. At my wits’ end, I pass the config over to a colleague – my boss, in fact – just to give it a second pair of eyes. And when he finds the problem in about 20 seconds, I feel… what’s a word that means feeling even more embarrassed than “extremely embarrassed”?
So here’s a lesson I learned that day: it turns out that on Juniper routers, you can actually configure a physical LAN interface with a /32 subnet mask.
You can’t do this on a Cisco router. If you try, you get this error:
That’s right, Cisco: bad mask! Naughty mask! Get back in your bed, mask!!
I must have checked the config about five times over. I checked the mode of the interfaces; I made sure there were no firewall filters; I checked the SRX was in switching mode and not transparent-bridging mode; and of course, I checked the IP addresses so many times. But for some reason, my eyes totally missed that /32. Because of course they did. I mean, if this LAN was truly a /32, how on earth were machines on the LAN getting connectivity out to the rest of the MPLS network? Wouldn’t the return traffic hit the SRX and drop? As far as the SRX is concerned the LAN is only one IP big. So when traffic comes from the MPLS to the SRX, how does it know what to do with it?
It was only when I was in the shower that evening – enjoying some “quality alone time” with my “body” – that it suddenly hit me why this setup works on a Juniper device. And to understand it, we need to talk about something quite unique that Junos does when it gives out an IP by DHCP: it adds a /32 route into the routing table for that particular machine. Yep: the subnet mask on the LAN interface was wrong, but it didn’t matter, because our cheeky SRX was adding routes into the routing table for all the machines it had given an IP address to. What a scamp!!
Check out this output I labbed up. Each of these machines was given its IP address automatically, and they’re all in the routing table.
chris.parker@NetworkFunTimes> show route protocol access-internal

inet.0: 14 destinations, 14 routes (14 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

192.168.10.113/32  *[Access-internal/12] 2d 22:13:14
                    > to 192.168.10.15 via irb.100
192.168.10.114/32  *[Access-internal/12] 2d 12:43:03
                    > to 192.168.10.15 via irb.100
192.168.10.115/32  *[Access-internal/12] 2d 12:02:45
                    > to 192.168.10.15 via irb.100
And as such, when the traffic came back, the SRX had a route to that specific machine.
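To convince myself of the logic, here is a rough Python sketch of the route lookup. It is a deliberately simplified model, not how Junos is implemented: it only knows about the route derived from the LAN interface's mask plus one /32 per DHCP lease, and it ignores default routes and everything else in the real table. The function names and the lease addresses are my own inventions for the example.

```python
import ipaddress


def lan_routes(interface_cidr, dhcp_leases):
    """Build a toy routing table for the LAN side.

    One route comes from the interface address and mask; each DHCP
    lease adds a /32 host route, mimicking the access-internal entries.
    """
    iface = ipaddress.ip_interface(interface_cidr)
    routes = [iface.network]
    routes += [ipaddress.ip_network(f"{ip}/32") for ip in dhcp_leases]
    return routes


def best_route(destination, routes):
    """Return the longest-prefix match for destination, or None."""
    addr = ipaddress.ip_address(destination)
    matches = [route for route in routes if addr in route]
    return max(matches, key=lambda route: route.prefixlen) if matches else None


leases = ["192.168.10.50", "192.168.10.51"]  # hypothetical DHCP clients

broken = lan_routes("192.168.10.1/32", leases)  # the typo
fixed = lan_routes("192.168.10.1/24", leases)   # what was intended

print(best_route("192.168.10.50", broken))  # DHCP client, /32 mask on the LAN
print(best_route("192.168.10.5", broken))   # static printer, /32 mask on the LAN
print(best_route("192.168.10.5", fixed))    # static printer, /24 mask on the LAN
```

With the /32 mask, the DHCP client still matches its own host route, while the printer at .5 matches nothing on the LAN side at all. Put the /24 back and the printer is covered by the connected network again.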
For what it’s worth, Juniper gives this reason for why you’re allowed to use /32s on physical interfaces:
Confession time: based on that short explanation, I’m not entirely sure I understand the advantage. For example, if you’ve got two interfaces that are talking OSPF, and you’re giving each one a /32, isn’t that the same as using a /31? I suppose it gives you more flexibility in the IPs you choose, because if I’m reading that right then the IPs don’t have to be in the same subnet (which is fine for OSPF point-to-points), but I’m not sure that’s a good thing. So perhaps there’s something I’m missing.
Having said that, I thought of another reason to use a /32 on a physical LAN interface: if the SRX adds each DHCP machine to the routing table, that means we can configure a /32 on our LAN interface to prevent anyone from joining the network who hasn’t received an IP from DHCP – just like in the problem I had to troubleshoot! If someone tries giving themselves an IP address manually, they’ll have no connectivity to the outside world. Of course, there’s quite a few other ways of achieving this if you’ve got a fancy switch, and I can’t imagine this “solution” is anyone’s idea of a best practice. So basically, don’t do this. Unless you want to. But even then, probably don’t.
THE LESSONS LEARNED
That was a pretty stressful day. As you can imagine, I’ve definitely got some takeaways from all this, both on a personal-growth level and on a technical level:
— First of all, if you’ve read a config twice and you don’t see a problem, get a friend to look over it. It’s amazing how quickly our eyes and brain become blind to something that we’ve stared at for a while. And hey: if you don’t have a friend nearby, why not ask your enemy’s enemy?
— Second of all, when you’re reading a config/diagram/code to check it’s all okay, and you’re still not seeing anything, try reading it out loud, slowly. Say what you actually see, not what you think you see. This forces your brain to actually look at every single little detail, and in doing so it helps you to avoid skimming over bits that you subconsciously just assume will be correct – the bits that are inevitably the problem.
— Third, if the problem seems unusual, trust nothing. Don’t trust your own work, don’t trust your colleagues’ work, give no-one and nothing the benefit of the doubt – not even yourself.
— Fourth, don’t just learn not to make the same mistake again: think about the ways that your troubleshooting failed you, and think how you can speed that up in the future. For example, maybe get that second pair of eyes to look at the config after five minutes instead of two hours. You’ll save everyone a lot of bother!
— Fifth, just because something isn’t possible on one vendor, doesn’t mean it isn’t possible on another vendor. Keep your mind open, and never, ever say the phrase “Well, the configuration is definitely fine”. Because even if you’re 99.99999% sure it is, there’s a 0.00001% chance it isn’t.
— Sixth, related to the one above: Different vendors may adhere to the same RFCs, but they still operate in different ways. If something is happening that doesn’t match your understanding of the fundamental principles of networking, there’s a good chance there’s some kind of configuration quirk that makes total sense once you know about it. (Equally, there’s a chance there’s just a bug in the firmware!)
— And finally: automation is your friend. Use automation, use it well, and stop mistakes happening in the first place. In fact, I fully plan to spend time learning JSNAPy, so that this entire process can become nice and elegant, and all elements can be tested with just a few key strokes. Embrace our robot overlords, and the reliability and consistency they bring to the automated table.
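I haven't learned JSNAPy yet, so as a stopgap here is a small plain-Python sketch of the kind of pre-flight check that would have caught my mistake. It assumes the config is available in "set" format, and it treats lo0 as the one place a /32 is expected; the function name, the regular expression, and the sample config are all hypothetical, and a real check would need to handle more config styles than this.

```python
import ipaddress
import re

# Matches set-style lines such as:
#   set interfaces irb unit 100 family inet address 192.168.10.1/32
ADDRESS_LINE = re.compile(
    r"^set interfaces (?P<ifd>\S+) unit (?P<unit>\d+) family inet address (?P<cidr>\S+)"
)


def find_host_masks(config_text, ignore=("lo0",)):
    """Return (interface, address) pairs whose mask covers a single host."""
    findings = []
    for line in config_text.splitlines():
        match = ADDRESS_LINE.match(line.strip())
        if not match or match["ifd"] in ignore:
            continue
        iface = ipaddress.ip_interface(match["cidr"])
        # A network containing exactly one address means a /32 mask.
        if iface.network.num_addresses == 1:
            findings.append((f'{match["ifd"]}.{match["unit"]}', match["cidr"]))
    return findings


# Hypothetical config snippet for illustration only.
sample_config = """
set interfaces lo0 unit 0 family inet address 10.255.0.1/32
set interfaces irb unit 100 family inet address 192.168.10.1/32
set interfaces ge-0/0/5 unit 0 family inet address 203.0.113.2/30
"""

suspects = find_host_masks(sample_config)
print(suspects)
```

Run against the sample, this flags irb.100 and leaves the loopback and the WAN link alone. It takes a few seconds to run, which compares rather well with two hours of staring.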
Have you got any stories of mistakes you’ve learned from? If you’ve blogged about it, or if you feel like writing about it below, please do leave a comment. I’d love to hear about it!
By the way, thank you so much for reading my blog. If you enjoyed it, you’d make my day if you shared this post on your social media of choice – the more readers I get, the more I want to write even more cool posts for you. And hey: if you want to find out when I make new posts, follow me on Twitter! Let’s be internet friends, you and I.