Egress IP decision tree updated
I had another one of those weird problems which made me revisit the egress IP decision tree again.
I read a great post on LinkedIn the other day about delivering Anycast DNS in Azure using Infoblox and Azure Virtual WAN. It immediately reminded me of the time I deployed Anycast DNS using Infoblox BloxOne DDI and OSPF in a major retailer's network. As I have been working with Azure Route Server on some Anycast load balancing projects recently, I thought it was about time I tried it out with Infoblox NIOS.
Infoblox have since resolved this and I have tested it successfully.
I wanted to settle down today in a particularly dull meeting and have a go at setting up an Infoblox NIOS instance in Azure using the Azure Marketplace offering. I have used Infoblox in anger before and I know it is a solid product so I was keen to get it up and running in the lab so that I could have a play with the Anycast DNS features with Azure Route Server.
Services like Azure Storage are really great, and they are super secure, but they seem to make infosec people a bit nervous. The idea of data being secured by identity rules only and not behind a firewall feels a bit too open for some people. I am a big fan of the zero trust security model but that puts all the trust into your identity provider and the way you manage identities and that is a big ask for some organisations.
I was out for a long walk over the weekend and I have now got a rather nasty blister on my heel. If you'd configured a health check just to monitor my feet, I'd be marked as dead right now.
One of the weirdest birthday presents I got this year was from Microsoft - Azure Firewall Prescaling. It's a solution to a problem that's been around for a while. And one that quite a lot of people didn't even know existed.
Azure Firewall is a great product, but it's not without its limitations. One of the biggest issues has been around scaling. Sure, Azure Firewall can scale up and down based on demand. But this scaling can take time. In high-demand situations, this delay can lead to dropped packets and degraded performance.
The scale back in can also cause issues with long-lived TCP connections. Why? Because there's been little control over when the scaling events happen. And which instances are terminated.
One of the downsides of private previews is that they are under NDA so you can't really talk about them. However, I can now talk about Azure Private Link Direct Connect because it's in public preview now. It solves one of the problems that has been bugging me for a while with Private Link Services (PLS) which is that you have to use a load balancer or an application gateway in front of the service.
The Mad Men of advertising had a knack for selling ideas, lifestyles, and products with flair and persuasion. But what if they turned their talents to selling networks? How might they approach the task of convincing businesses to invest in robust networking solutions?
We've taken a look at 168.63.129.16, the magic IP address in Azure, and now it's time to explore 169.254.169.254, the instance metadata IP address. This IP address is used by Azure virtual machines to access instance metadata, which provides information about the VM and its environment.
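For anyone who wants to poke at it, the metadata endpoint can be queried from inside an Azure VM with a plain HTTP request. Here's a minimal Python sketch; the mandatory Metadata header and the api-version shown are as documented by Microsoft, though other api-version values exist, and the request must not go via a proxy:

```python
import json
import urllib.request

# The Azure Instance Metadata Service (IMDS) endpoint. It only answers
# from inside an Azure VM; the "Metadata: true" header is mandatory and
# the request must bypass any HTTP proxy.
IMDS_URL = "http://169.254.169.254/metadata/instance?api-version=2021-02-01"

def fetch_instance_metadata() -> dict:
    """Return the full instance metadata document as a dict."""
    req = urllib.request.Request(IMDS_URL, headers={"Metadata": "true"})
    with urllib.request.urlopen(req, timeout=2) as resp:
        return json.loads(resp.read())
```

Called on a VM, `fetch_instance_metadata()["compute"]` gives you things like the region, VM size and resource group without any credentials at all, which is exactly why this address is worth understanding.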
While looking at the magic IP I touched upon the idea of Azure Service Tags. They're supported within NSGs and Azure Firewall rules and are essentially Microsoft-managed IP address groups that represent specific services within the Azure ecosystem.
From time to time I have a conversation about Azure networking and a topic comes up that I need to dig into a little more. Normally the documentation is pretty good but sometimes I just need to bottom out all the behaviour. Today's topic is 168.63.129.16, an IP address that keeps coming up in various places in Azure.
I've been thinking about why network documentation always feels incomplete. You know the feeling - you've got spreadsheets full of device details, beautiful network diagrams, and configuration backups. But when something breaks at 3am, you're still calling Dave from the pub because he's the only one who knows why VLAN 247 exists.
The problem isn't that we don't document things. It's that we're only capturing the bottom layer of what we actually need.
Azure Global Load Balancer is often overlooked in favour of Azure Traffic Manager when it comes to global load balancing. Both are very capable options if all you want is to distribute traffic across multiple regions. However, Azure Global Load Balancer has a few tricks up its sleeve that make it a more interesting choice in some scenarios. The main one is that it uses Anycast for its frontend IP addresses.
I build a lot of labs and demos in Azure, and I often start by creating resources manually in the portal. It's quick and easy to get something up and running. I am also keen to keep my Azure Lab environment costs as low as possible so I try to only run resources when I am using them. With a busy family life, three kids, a spaniel and a rather involved job, I don't have the time to be constantly building and tearing down environments so I use Terraform where I can to define the labs so I can spin them up and down as needed.
I've sat on both sides of countless technical interviews over my years in networking. There's this familiar dance that happens when discussing OSPF: the candidate confidently states "OSPF uses Dijkstra's algorithm for route calculation," and I'll nod approvingly. But here's the thing - in hundreds of these exchanges, I've never once asked a candidate to explain what that actually means, and no one's ever asked me to explain it either.
There are a few things going on with ExpressRoute Gateways and they are related to Public IPs. First of all the retirement of Basic SKU Public IPs for ExpressRoute Gateways is something to be aware of as it has a hard end date and will require a migration to a different SKU. The second one is the HOBO (Hosted On Behalf Of) public IP feature which has an interesting drawback.
Netbox open sourced their STDIO MCP server a while back and I have been playing around with it since then. The installation requires some local dependencies and the setup process was a bit fiddly, but I got it up and running with some trial and error. I wouldn't necessarily trust that the sort of people who would benefit most from having access to it would find it easy to set up, so I wanted to create a more user-friendly installation process by building an MCP server that runs remotely as a proxy to the Netbox API.
I have been working on a comprehensive approach to bringing network automation and documentation into a development style workflow. Rather than replacing the traditional ITSM approach to change management it moves infrastructure towards a CI/CD approach to releases with automation and baked in documentation.
I have been playing around with enforza.io for a while and it's a great solution for low-cost internet egress across AWS and Azure. The platform gives you an easy-to-manage, low-cost NVA which can be scaled out to cloud spokes to give consistent egress policy. As high availability (HA) is crucial for any production environment, I wanted to investigate how easy it was to combine more than one enforza instance to achieve a highly available egress solution.
Adam Stuart has a rather excellent rundown of the various ways you can approach SD-WAN connectivity into Azure cloud, providing comprehensive technical guidance for Azure-based deployments. Much of the same applies in AWS although I have often said that AWS networking is more complex and akin to something dreamt up by a stoned developer who couldn't even spell BGP. One of the legacy options included at the end of Adam's article is the cloud edge topology where you deploy physical hardware into a carrier neutral facility (CNF) like Equinix and use that as an interconnect between your SD-WAN and an ExpressRoute or Direct Connect circuit. This got me thinking about the uncertainty many organisations face when deciding how their overall cloud connectivity should evolve.
This article explores the journey from simple single-site connectivity to sophisticated multi-cloud SD-WAN architectures, examining the trade-offs, and implications of each approach. We'll walk through real-world topologies that organisations I have worked with commonly implement, from basic VPN connections to cloud-native SD-WAN NVA hubs, helping you understand which approach might be right for your organisation's scale and maturity level.
I was reminded today of a post I wrote for my old blog about using a dual AS for BGP ASN migrations and changes. It was written some time ago but the principles still apply today. I thought it was worth reposting here as it got lost in one of my blog's own migrations.
In a recent blog post I wrote: "As network engineers we are used to the declarative model of configuration management and so this fits nicely into that mindset - you declare what you want and Terraform will make it so." But declaring what you want is only half the battle. The real challenge lies in how you structure that declaration to handle the messy reality of business requirements whilst maintaining the automation benefits that drew us to declarative tools in the first place.
There is an excellent Terraform provider for Netbox that allows you to manage your Netbox resources using Terraform. This is particularly useful for automating the management of network devices, IP addresses, and other resources in a consistent and repeatable manner. I have been working through the process of setting this up and have found it to be a powerful tool for a documentation first and a documentation as code approach to network management.
I've had some concerns around Model Context Protocol being a new fad to put another front end on poorly managed data, like the search appliances Google sold a decade and a half ago, but I had a play with the MCP server for Netbox and it's pretty handy.
The impending deadline of Azure IP armageddon is nearly upon us. In March '26 a fairly major shift is taking place in Azure which will see a change to the default behaviour for outbound internet for Azure VMs. The change itself has been fairly well discussed but you can now get ahead of the curve with Azure Private Subnet and start building things as they will be after March 2026.
I happened upon the diagram below within the pages on default outbound internet access and it seemed a little counterintuitive. The decision flow seems to suggest that a VM will use the egress IP of a NAT gateway in preference to an assigned public IP (PIP).
The topic of documentation always comes up in tech and it's one that has had a lot of attention in the world of software development, which has led to some excellent solutions. In the world of infrastructure, however, the solutions have not been as readily available or adopted.
A couple of days ago, I saw a meme targeted at network engineers that mentioned "the VLAN add disaster." I immediately understood what it meant. It feels like such a well-known thing now, enough to warrant a place in a meme, that it's become part of our professional zeitgeist over the last decade in networking.
Most people think of Azure Public Load Balancer as that thing that spreads web traffic across multiple servers. Fair enough - that's what it does best. But it's actually a Swiss Army knife for network address translation that can solve some tricky connectivity problems you might not expect.
So here's something I definitely didn't see coming when I started my blog: I just got the email that I've been awarded Microsoft Most Valuable Professional (MVP) status for Cloud and Datacenter Management - On-Premises Networking.
The shift from traditional network perimeters to zero trust architectures represents one of the biggest changes in cybersecurity thinking over the past two decades. But there's a dangerous misconception floating around that zero trust means ditching network security controls for identity-based systems. This misunderstanding has led many organisations to roll out incomplete solutions that create new vulnerabilities while trying to fix legacy security problems.
Any time I have to do anything with OSPF I remind myself how it can be so damn awkward about MTU. A little while ago I was busy trying to integrate some Juniper SRX firewalls into a perimeter around some Cisco Nexus 7K and reached a problem that looked like MTU, smelled like MTU, quacked like MTU but we couldn't work out how it was MTU. Here's how it was MTU and what we learned.
In my previous post, I shared some basic latency tests across Azure networks. The results were pretty predictable: the closer things are physically, the faster they communicate. Not exactly groundbreaking.
But when I expanded my testing to include longer distances and different connection methods, I stumbled onto something genuinely surprising: PrivateLink connections can actually be faster than direct VNET peering - sometimes significantly so.
I had a look recently at Azure Subnet Peering which was mostly undocumented at the time except for a blog post by Jose Moreno, one of the gurus of Azure Networking.
I've already written a bit about various firewalls and their performance with FQDN filtering. I've also made the case for a 'less is more' approach to egress security where it makes sense. But the topic of FQDN filtering keeps coming up, so I thought I'd share a few more thoughts on it.
When I set out to explore network latency in Azure, I had a simple goal: to understand how physical distance affects performance. After all, we've all heard that farther apart means slower connections. But I wanted specifics - exactly how much slower? And how consistent is that performance? I also wanted to see how long lived TCP connections performed across the Azure network.
I'm sharing what I've learned from my first round of tests, setting a baseline that we can build on later.
I like to point out to people that it's easier to train a network person on cloud than it is to train a cloud person on networks. It's a glib generalisation but it holds true for the most part because there is so much to networking that comes from history and quite a lot of grounding that a seasoned network engineer or architect will already understand. A big chunk of the AWS and Azure networking certification covers BGP and that's one of the reasons they are considered quite hard for some but quite easy for others. BGP is a topic that many very experienced network engineers in enterprise networking can get through their entire career without touching, but for those who operate at scale or work with MSP and telco networks it's bread and butter.
I've been exploring a pretty niche feature preview that you can find documented here. In some cases, you might want to expand the size of a subnet, but if you have a constrained IP space, you might not have contiguous space available. Here's what I've found you can do.
A little while ago I set out to find a way to measure the overhead that FQDN filtering places on HTTPS web traffic. I've shared the results here, but I thought I'd take the opportunity to discuss the methodology in a bit more detail.
When you create a VNet in Azure or a VPC in AWS, you need to allocate a CIDR range for your subnets. There are key differences between these cloud providers when expanding networks, which can create challenges. Knowing these rules from the start helps you plan your CIDR ranges better. I'll start with what's similar across AWS and Azure, then look at the differences.
In a recent conversation about IPv6 adoption at a Western technology company, I witnessed a familiar scene play out. Engineers and architects discussed IPv6 implementation as an optional future consideration rather than an immediate necessity. 'We don't really need it yet', was the prevailing sentiment. This perspective, common among Western organisations, reveals a profound blindspot born of privilege – one that unconsciously perpetuates digital inequality on a global scale.
Anyone who has accidentally advertised too many prefixes and watched their ISP BGP peerings collapse (I'm looking at you, BT) knows that prefix limits are a common safeguard in networking. While exploring anycast configurations in Azure, I carefully noted the official Route Server prefix limit of 1,000 routes. However, I recently discovered something far more interesting in the fine print about how Azure actually calculates this limit.
I've recently been exploring one of the sneaky under-the-radar features that could be a game changer in the near future: Azure Subnet Peering. This is a feature that's already there in the API but not really documented or productised yet.
I've been asked to explain networks to people with no experience several times and it's hard to know where to start. There's so much history and so many computer science concepts that have led us to where we are today. I've always believed that to truly understand something, you need to be able to explain it to someone else. My goal here isn't just to explain the bits that make the internet work, but also to organise my own understanding and explore areas where I've taken things on faith instead of questioning why they exist. I'll start from nothing and rebuild the internet from scratch, solving the same problems that got us where we are today.
Let's explore solutions for global site load balancing.
I was chatting with a friend who's studying for his AZ-700 exam, and he showed me this rather neat way of finding your IP address using Google DNS.
The reason I fell down the rabbit hole with regard to finding my public IP was because of a section in an old Azure networking book my friend was reading. It said:
To allow Azure internal communication between resources in Virtual Networks and Azure services, Azure assigns public IP addresses to VMs, which identifies them internally. Let's call these public IP addresses AzPIP (this is an unofficial abbreviation). You can check the Azure internal Public IP address bound to the VM with the command dig TXT +short o-o.myaddr.google.com.
When packets travel through a cloud network, they face many decision points. Among these, one stands out as really important: the initial routing decision. At its heart is an algorithm that might seem strange at first - the Longest Prefix Match (LPM). Why do we prioritise longer prefix matches? Why not shorter ones, or why not simply use the first match we find? The answer lies in a fascinating mix of computing efficiency, network design, and how cloud computing has evolved.
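The idea is easier to see with a toy example. This Python sketch uses a hypothetical route table (the next-hop names are invented for illustration) and simply picks the most specific matching prefix, which is all LPM is:

```python
import ipaddress

# A toy route table mapping prefixes to next hops. The overlapping
# entries are deliberate: 10.1.2.5 matches all three routes.
routes = {
    ipaddress.ip_network("0.0.0.0/0"): "internet",
    ipaddress.ip_network("10.0.0.0/8"): "on-prem via VPN",
    ipaddress.ip_network("10.1.2.0/24"): "firewall NVA",
}

def next_hop(dst: str) -> str:
    """Longest Prefix Match: of all routes containing dst, use the
    one with the longest (most specific) prefix length."""
    addr = ipaddress.ip_address(dst)
    matches = [net for net in routes if addr in net]
    return routes[max(matches, key=lambda n: n.prefixlen)]

print(next_hop("10.1.2.5"))   # the /24 wins over the /8 and the default
print(next_hop("10.9.9.9"))   # only the /8 and the default match
print(next_hop("8.8.8.8"))    # falls through to the default route
```

If we matched shortest-first, or first-found, the specific /24 carve-out could never override the broader /8 — which is precisely why longer prefixes must win.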
In IT operations, there's a metric that network teams know all too well: Mean Time to Innocence (MTTI). It's how long it takes for a network team to prove they're not responsible for an outage or performance issue. While that might sound funny, it highlights a serious problem in how we structure our infrastructure teams.
The Business Case Challenge
I've found that traditional justifications for SD-WAN adoption have often focused on cost savings versus MPLS or enhanced network features. However, these arguments frequently fall short under scrutiny. The fundamental challenges that limited VPN adoption in enterprise networks – including performance consistency, reliability, and operational complexity – remain relevant despite improvements in internet infrastructure.
I wrote an article about a rather neat solution for global application delivery in Azure via anycast, however there are some limitations which exist to prevent transit routing in Azure that I'd like to discuss further.
The Case for Application-Level Controls
I've noticed that an organisation's approach to securing outbound internet traffic often reflects its security maturity more than its technical requirements. System-to-system communication, such as API calls to cloud services, presents fundamentally different challenges compared to user browsing. Understanding these differences is crucial for implementing effective security controls without unnecessary complexity or risk.
There used to be a great little website for route summarisation, and it did it far more intelligently than Cisco kit does. It looks like the site has dropped off the internet, which is a shame, but there is a handy Python library called netaddr which has the same capabilities.
I have written a little wrapper for it which will regex the prefixes out of a ‘show ip bgp’ and then list the summary routes. You pass the output of ‘show ip bgp’ as a text file; it's the only argument the script expects.
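The wrapper itself isn't reproduced here, but the idea can be sketched with the standard library alone: regex the prefixes out of the ‘show ip bgp’ text and let `ipaddress.collapse_addresses` do the summarisation (netaddr's `cidr_merge` behaves similarly). The sample table output below is invented for illustration:

```python
import ipaddress
import re

# Match only prefixes written with an explicit length, e.g. 10.0.0.0/24.
# This deliberately skips next-hop addresses and any entries the router
# prints classfully without a mask.
PREFIX_RE = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3}/\d{1,2})\b")

def summarise(show_ip_bgp_text: str) -> list[str]:
    """Extract prefixes from 'show ip bgp' output and collapse
    adjacent/contained networks into summary routes."""
    nets = {ipaddress.ip_network(p) for p in PREFIX_RE.findall(show_ip_bgp_text)}
    return [str(n) for n in ipaddress.collapse_addresses(sorted(nets))]

sample = """
   Network          Next Hop            Metric LocPrf Weight Path
*> 10.0.0.0/24      192.0.2.1                0             0 65000 i
*> 10.0.1.0/24      192.0.2.1                0             0 65000 i
*> 10.0.2.0/23      192.0.2.1                0             0 65000 i
"""
print(summarise(sample))  # → ['10.0.0.0/22']
```

Two /24s plus the adjacent /23 collapse cleanly into one /22, which is exactly the kind of tidy-up the old website used to do for you.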