NAT between identical networks using VRF (blog.oddbit.com)
92 points by ytch | 2023-02-20 07:03:40 | 67 comments




This is a very real problem when using a VPN into another network with the same IP space.

Though I'm not sure the complexity is worth it. Suppose it would be nice if Tailscale implemented this type of substitution-routing NAT for users.


I mean they kinda do by assigning each node a cgnat space IP?

They have NAT routing support for exposing nodes that don't/can't run Tailscale. That's the IP space I meant.

Another possible option for the VPN use case, depending on what you're trying to do and how you're doing it, is to just put the VPN into the VRF (or a netns) and only bind connections you intend to run through the VPN to that context. It's not particularly fantastic either but it's a little more straightforward.
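A minimal sketch of the VRF variant of that idea, assuming a WireGuard tunnel named wg0; the VRF name, routing table number, and target address are invented for illustration:

    # Confine the VPN to its own VRF so only explicitly bound commands use it
    # (wg0, vpn-vrf, table 100 and the target host are examples, not from the post).
    ip link add vpn-vrf type vrf table 100
    ip link set vpn-vrf up
    ip link set wg0 master vpn-vrf           # enslave the tunnel interface
    ip route add default dev wg0 table 100   # default route exists only inside the VRF

    # Only processes launched in the VRF context go over the tunnel:
    ip vrf exec vpn-vrf ssh user@10.0.0.5

The netns variant is the same idea with `ip netns add` / `ip netns exec` instead.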

> This is a very real problem when using a VPN

I've been really careful about selecting virtual network IP spaces throughout due to this concern.

The hardest part is accounting for the point-to-site VPNs on top of everything else (i.e. employees connecting from home/Starbucks/airport).

For internal corporate networks, I typically provision smaller, targeted networks to get around this. Lopping off all of 10.0.0.0/16 or 192.168.0.0/16 for your internal LAN puts you at a fairly high statistical likelihood of overlapping domestic ISP address spaces. Smaller, odd networks like 192.168.111.0/24 are much easier to plan around.


One network I used allocated from 172.16.0.0 – 172.31.255.255, but then Docker wouldn't work if you were in 172.17.0.0/16.

Ah, what a fun issue to figure out and resolve 10 minutes before end of day, especially if you are the one handling Docker and some dude in another country is handling the VPN that wants access to said Docker.

For me, this was WSL2's virtual machine and (IIRC) not the default bridge network, but the next one created by docker-compose.

I ran into this problem a long time ago with EKS. I picked a network for EKS, but it also needed a network for docker running on each node, and this happened to conflict with a network that we had peered to AWS (we were an ISP, this was our internal management network).

I learned on that day to put everything in our IP address management system (netbox). Allocate private networks like we allocated public networks to our customers. (It would have been nice if I allocated Docker's defaults before they had been allocated to an internal network, but ... hindsight 20/20. At least with that network in the IPAM system, people could figure out why it broke though.)


I have had this happen with Docker compose too in a much less complicated setup.

When you add a network in Compose, Docker allocates a pretty big address space each time, and eventually it'll run out of 172.16.0.0/12 space and move on to 192.168.0.0/16 networks.

Why can I get to this from the guest network but not the client network? Oh...

It sucks, but I manually set networks now. And keep track of them.
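One way to do the manual pinning, sketched with made-up names and ranges; the daemon's default-address-pools setting is what controls where auto-created networks (including Compose's) come from:

    # Pin a Compose/Docker network to a subnet you've actually allocated
    # (names and ranges invented for illustration).
    docker network create --subnet 192.168.111.0/24 app-net

    # Or change the pools Docker draws from for auto-created networks by
    # adding this to /etc/docker/daemon.json and restarting the daemon:
    #   { "default-address-pools": [ { "base": "10.213.0.0/16", "size": 24 } ] }
    sudo systemctl restart docker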


Yeah, for EKS you pick an IP range for docker, but it can't be changed after the cluster is created. So remember to pick right.

For virtual network IP space you can go with the 0.0.0.0/8 space (sometimes called "zeronet") as long as you are using kernel 5.4+ everywhere you want that network. Obviously 0.0.0.0/32 is still excluded.

Windows (and probably macOS?) still block that as of today so it can be an extra safe (and large) space if your only worry is "I don't want my virtual network space to collide with outside or client sourcing networks".


There is an extremely low chance that you'll find a collision with 30/8.

Tailscale does do something along these lines, but it's kind of surprising... If there is a tailscale route announced, it will take precedence over other routes, even if they are more specific. The primary use case for this seems to be if you are at a coffee shop, you can access a tailscale routed network that has the same IP range as the coffee shop.

The down side of this is that if you have, say, a wider announced network routed over your tailscale (say 10.0.0.0/16), and then you have machines within that network also connect to the tailscale (say a machine with 10.0.1.1/24), it now will try to reach machines on its local network interfaces via routing over the tailnet. Pretty unexpected.

https://github.com/tailscale/tailscale/issues/6231


We had this issue with Docker, which uses 172.19.0.0/16 for the default bridge. You can change it, but it was weird figuring that out during the first 2 months.

Tailscale does a fun thing where it maps overlapping IPv4 networks into different IPv6 networks:

https://tailscale.com/kb/1201/4via6-subnets/


Can we let IPv4 and NAT die already?

No. Even in IPv6-only-land, multihomed networks without PI space will have good reason to NAT.

If you generate a ULA prefix using the recommended algorithm, the likelihood of a collision between networks in the IPv6 addressing space is absolutely minuscule.
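For reference, RFC 4193 derives the 40-bit Global ID from a hash of the current time and a MAC address; a quick sketch that just pulls 40 random bits from the kernel gives equivalent collision behaviour (roughly 1 in 2^40 for any given pair of networks):

    # Generate a random ULA /48: fd00::/8 plus a 40-bit Global ID.
    printf 'fd%02x:%02x%02x:%02x%02x::/48\n' $(od -An -N5 -tu1 /dev/urandom)
    # e.g. fd3b:92af:17c4::/48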

If you're using ULAs, you don't NAT to avoid addressing collisions; you NAT so that your traffic is routable on the Internet. If you try to pass traffic sourced from a ULA to your upstream without NAT, it or its response is going to get dropped on the floor.

You wouldn't route traffic over the NAT using ULA to the outside world. You'd use GUA space for that.

My primary point was that the chance of collision between two private networks is very low, and thus NAT is not a thing that needs to exist.


Yes, exactly. And if you have two upstreams, there's no single GUA prefix that makes sense to use in all situations. You make your routing decision, then you NAT (er, sorry, NPTv6, which is Totally Not The Same Thing As NAT) to the GUA prefix corresponding with the network that you're egressing from.

If you don't need Internet connectivity, yes, NAT-free ULAs work fine.


NAT requires kernel connection tracking. NPT explicitly does not. There are a lot of useful implications to this.

Stateful NAT requires kernel connection tracking. Stateless NAT does not, and is still a form of NAT. It's sometimes used in IPv4 networks, even!

Didn't you mean stateful NAT when you were making the comparison?

That wasn't my intent, but I see how it reads that way now. The parenthetical was me griping about naming, not meant to update the meaning of the sentence. In the dual-upstream scenario, I'd use stateless NAT with a single on-link prefix (GUA or ULA).

> you NAT so that your traffic is routable on the Internet

Erm... why the hell would you use NAT for that?

One of the features of IPv6 is first-class support for multiple IP addresses on a single interface. Your interface should have one (or more) routable IP addresses that should be used for packets traveling to the public internet and one (or more) ULAs for reaching internal networks.


This is how it was envisioned, yes, but in practice there are issues with source address selection when you have multiple prefixes on-link for multihoming purposes. See https://www.rfc-editor.org/rfc/rfc5220 for a description of the problem.

> If you're using ULAs, you don't NAT to avoid addressing collisions; you NAT so that your traffic is routable on the Internet.

Note that "IPv6 NAT" really should be NPTv6:

* https://en.wikipedia.org/wiki/IPv6-to-IPv6_Network_Prefix_Tr...

It allows for 1:1 mapping of external IPv6 addresses to internal IPv6 addresses, without the silliness of port mapping and such.

Of course your firewall/network device can still have a default-deny rule so that only responses to internally-initiated requests get through. Stateful firewalls are still effective (and were invented before NAT).
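On Linux, this stateless prefix translation lives in ip6tables as the SNPT/DNPT targets (mangle table); a minimal sketch, with the ULA and GUA prefixes and the interface name invented:

    # Outbound: rewrite the internal ULA prefix to the delegated GUA prefix.
    ip6tables -t mangle -A POSTROUTING -o eth0 -s fd0d:5a96:71b2::/48 \
        -j SNPT --src-pfx fd0d:5a96:71b2::/48 --dst-pfx 2001:db8:42::/48
    # Inbound: map the GUA prefix back to the internal ULA prefix.
    ip6tables -t mangle -A PREROUTING -i eth0 -d 2001:db8:42::/48 \
        -j DNPT --src-pfx 2001:db8:42::/48 --dst-pfx fd0d:5a96:71b2::/48

Filtering still happens separately in the filter table; these rules only translate.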


But why would you use ULA addresses for anything other than internal networking? What benefits does it provide exactly?

And even for internal networking, why not just use properly addressable IPv6 addresses?

Because you're gonna need a firewall either way.

I have a couple of /40s to my name. My internal network has an assignment of /48. It's not announced (and is also firewalled off, not that it matters, since there's no routing table entry for it). The chance of collision is nil, because nobody should ever be using addresses within my IP space.

Endpoints that require external connectivity simply have two addresses on them: one that's routable, and one that isn't.


> What benefits does it provide exactly?

Address stability. If you use the prefix your ISP gives you via DHCPv6-PD or whatever, it might change on you, and then all of your hard-coded configs are wrong.

> And even for internal networking, why not just use properly addressable IPv6 addresses?

It costs $250/yr (and hours of bureaucracy navigation) to do this in the ARIN service region. If you've got a prefix, it's certainly a good way to use it! But we can't expect everyone to get PI space.


I finally decided to learn IPv6 and deploy it on my home network this year, and the lack of stability in the prefix I get from my ISP has been by far the biggest letdown. It basically neuters the whole “you don’t need NAT any more!” dream of IPv6.

I’ve taken to having both a ULA prefix and a public prefix for hosts in my subnet, but the public one is basically worthless because it changes seemingly every week. I had to put a ton of effort into making a templated pf.conf updated by a dhcpcd hook so that my firewall rules update automatically, but it’s still a shitshow. When my prefix changes, my router doesn’t seem to want to rescind the old RA’s so now I have two public prefixes floating around and half my hosts can’t get to the Internet any more. I had to drop the lifetime to <1hr to mitigate it but it’s a complete joke. If ipv4 fallback didn’t work I’d have a broken network every week.

At this point I’m considering just using NPTv6 and dropping the concept of routable IP’s for my internal hosts altogether. It’s just not worth it. At which point, it’s a stretch to even say IPv6 is worth it.


> At which point, it’s a stretch to even say IPv6 is worth it.

ISPs being retarded is not a fault of IPv6 though.


Sure, I absolutely agree, Comcast deserves to be named and shamed here. But “not worth it” to me doesn’t mean I’m making a judgment against IPv6 as a protocol suite, just that, in practice Comcast’s shitbaggery makes the whole effort hard to justify.

They can give you a static IPv6 prefix, too. But you have to pay extra for a static IPv4 address to get it (which makes what kind of sense?) and you must rent their equipment (ie their router, not just a modem) to get it. So that’s easily $30-$40 more a month (equipment rental plus static IP charge) they’re holding your network hostage for. Pay up or get re-prefixed every week.

It really makes me want to puke that they’re literally incentivized to fuck up my network to try and make the extra upcharge seem worth it. There’s no reason whatsoever they couldn’t just give me the same prefix forever. There’s no shortage of IPv6 space. If I had literally any other choice in ISP I’d drop them in a heartbeat. They should all be thrown in jail.

Somewhere out there there's a Comcast engineer whose management told them to intentionally configure their DHCPv6-PD server to forget (and likely intentionally shuffle) delegations, to pressure customers into ponying up for a static IP. Maybe you're reading this post some time in the future. I hate you and I wonder how you sleep at night.


I've had the same IPv6 prefix (/60) from Comcast for over 2 years now. I set up my DHCPv6 client with a stable DUID and even after my equipment being off for a couple of days I still pulled the exact same prefix delegation to my CPE.

In my old house where I lived for ~9 years, I had the same IPv6 prefix (/60) from Comcast for a little over 5 years since I turned on IPv6 in 2015 and had not changed until I moved to SF.

Sounds like there's something wrong with your CPE where it is not sending the same DUID to the DHCPv6 server and thus is getting a new prefix delegation each time.


For the first few weeks using IPv6 I used a UniFi security gateway with some pretty standard config (you don’t get to adjust your DUID or anything) and a Comcast business gateway as my modem (it’s also a router, so if there’s a DUID misconfiguration it’s Comcast’s fault). So the Comcast gateway got a /56 and further delegated a /60 to my USG (the Comcast modem has no bridge mode, this is how it has to work), and my prefix still changed 3 or 4 times.

I’ve since changed to my own modem and my own OpenBSD box with a statically configured DUID (randomly generated UUID persisted via the config file) in my dhcpcd.conf. My prefix still changed a few times.

I’ve heard a lot of people saying their prefix has been mostly stable, but it hasn’t been the case for me. Maybe my account is misconfigured on Comcast’s end, maybe something else on their end is wrong, but I’ve checked everything and it looks right on my end.

(My IPv4 address has remained perfectly stable this whole time too. Only my IPv6 prefix seems to be constantly changing. It's exactly the opposite of what I'd want; I couldn't care less if my IPv4 address changes, I only need my IPv6 prefix to be stable.)


> Address stability. If you use the prefix your ISP gives you via DHCPv6-PD or whatever, it might change on you, and then all of your hard-coded configs are wrong.

Right. I've been using my own address space at home for so long now that I'd forgotten how dumb ISPs can be.

I've got a BGP session on a VPS that's located in the same facility as my ISP (my upstream and my ISP are even peering there, over both v4 and v6. The only thing missing for me is IXP access, which the VPS provider offers at €50/month, so in this instance it's not worth paying for, but damn I'd love to give my own ISP the routes to my home network myself), so I'm just running a WireGuard tunnel from that box to my home (mainly because my ISP still doesn't offer any IPv6 whatsoever). This setup actually costs me less than a static IP from my ISP would. They charge €15/month for a static IP. My setup costs me €6/month for the VPS and an additional €5/month for the BGP session.

I could even do iBGP between the VPS and my home router over that tunnel, but that's far too hardcore. I think the /56 I've given for my home network will be enough for a couple of lifetimes. After all, it's 256 subnets of /64, and how many VLANs do I need at home? :D

Unfortunately not everyone can do this. :(


Yeah, and home routers rarely support ULAs. Hell, even UniFi, which is supposed to be semi-pro: nothing; you can't even disable DHCPv6 when you do DHCPv6-PD.

And then there is SRM on Synology... the amount of bugs I have reported on that is insane.

Which makes me think: is DNSMASQ the right tool for the job? It's extremely complicated in my opinion. If this, then that, but not when you do that thing over there.


UniFi technically supports it but you have to be willing to edit your gateway.config.json to do it, which is very error prone and super easy to screw up. So yeah, it’s still pretty awful. I switched to a plain OpenBSD box as my router because I hated UniFi’s IPv6 support so much.

Sadly the UDM range doesn't support gateway.config.json.

I think the only OS that handles IPv6 correctly is MikroTik's. But they have no decent all-in-ones.



This will never happen, I'm afraid :/

Just give it another 25 years.

IPv6 started rolling out... let's call it in 2000, which is probably a touch late but close enough, and https://www.google.com/intl/en/ipv6/statistics.html puts it at 43% of traffic, but that's positively biased (because it's only traffic on the public internet). So... I'm going with no, as a society we can't and/or won't.

> I'm going with no, as a society we can't and/or won't.

But we definitely should. The number of IPv6-only networks is growing by the day.

The next generation of tinkerers and internet engineers will have no IPv4 address space at all, because it makes no sense whatsoever paying $14k (at current prices) for /24 of legacy IP address space.

Unfortunately, the best we can hope for is for giant corporations like Cloudflare, Microsoft and Amazon to buy up most of the IPv4 space.


Tl;dr: remap each side to a different /24 using iptables and a VRF device. If you don't want to mess around with iptables, this can also be trivially achieved using OVS.
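A rough sketch of one direction of that, with interface names, the table number, and the stand-in /24s invented rather than taken from the article (which also adds the routes/route leaking needed between the VRFs):

    # eth1 faces the local 192.168.2.0/24; eth2 faces the remote copy of it.
    # Put the remote-facing interface in its own VRF so both copies of the
    # subnet can coexist in separate routing tables.
    ip link add vrf-remote type vrf table 200
    ip link set vrf-remote up
    ip link set eth2 master vrf-remote

    # Present the remote network locally as 192.168.3.0/24 with stateless
    # 1:1 NETMAP translation.
    iptables -t nat -A PREROUTING  -d 192.168.3.0/24 -j NETMAP --to 192.168.2.0/24
    iptables -t nat -A POSTROUTING -s 192.168.2.0/24 -o eth2 \
        -j NETMAP --to 192.168.3.0/24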



If each network also had unique dns services, then I think in theory you could run a split horizon DNS server on the middleman that rewrites results from the other network as the “alt” address.

E.g. in the example, the middleman could rewrite a query for `outernode0` that comes from the inner-node network to resolve to 192.168.3.10 (instead of .2.10).

I’m not sure which dns servers allow rewriting logic like that though.


The usual suspect, BIND, has a feature called Response Policy Zones that can serve different responses to different ACL matches (such as source IP). It can even autogenerate the modified records if you ask it politely.

Of course a screwdriver also has a feature called Quick Eyeball Removal which may be preferable. That or I'm just bitter about having done this :).


You can also achieve similar goals in dnsmasq via --localise-queries and --dynamic-host. It's a bit clunky but not complicated.

I run CoreDNS as internal DNS, and their advice for this scenario is to run another CoreDNS instance and solve it in config management. While not a pleasing answer, it's probably the easiest thing to do. Someone did write a plugin, but it requires a fork to get some hooks, and its PR was declined.

I am piggybacking my question on this discussion: as a beginner to Linux networking, are there good resources one can learn from (such as for the use of `ip link` for route management, as discussed in the article)?

Check the FRR project.

https://frrouting.org/


Who are these powerful sorcerers that implement those network features? It's one thing to figure out how to do that, but there must've been somebody who crafted all those components first. Truly impressed.

Network providers have been using VRF for decades. I was doing Cisco admin of VRF networks in the 2000s.

True, i didn’t think about that. The Linux implementation probably came afterwards and was based on the experience with those dedicated network devices.

So, whilst it's possible to do, and this blog is a brilliant guide to doing it, there are better ways.

If this is a network where you own either end, and you really do have the same subnets issued to different sites, then you would be better served by using virtual interfaces/VLANs to give yourself a second IP address (see the sketch below).

Or just use IPv6, and you have enough IPs to do what you want. You can use 6to4 or NAT64 to get IPv4 connectivity. Most modern networks terminate services at the edge with load balancers, so you can use them as 6-to-4 bridges.
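A minimal sketch of the VLAN-subinterface suggestion, with the VLAN ID and addresses invented:

    # Add a second, non-conflicting address on a VLAN subinterface
    # instead of NATing between the overlapping ranges.
    ip link add link eth0 name eth0.42 type vlan id 42
    ip addr add 10.42.0.1/24 dev eth0.42
    ip link set eth0.42 up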


Very cool. Surprised nftables doesn't support NETMAP yet though!


If you want to do VRF on Linux and don’t have a philosophical objection to systemd, I strongly recommend using systemd-networkd. VRF, wireguard, and many other complicated things are easy to configure out of the box and don’t require extra scripts or custom units.

My home router is a Raspberry Pi running Ubuntu server. VRF, WireGuard, VLANs, bonding, LLDP, router advertisements, prefix delegation, DHCP server and leases, all configured only with systemd-networkd.
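For anyone curious what that looks like, a sketch of just the VRF portion (file names, interface name, and table number are examples, not my actual config):

    # /etc/systemd/network/10-vrf-blue.netdev
    [NetDev]
    Name=vrf-blue
    Kind=vrf

    [VRF]
    Table=100

    # /etc/systemd/network/20-lan.network
    [Match]
    Name=eth1

    [Network]
    VRF=vrf-blue
    Address=192.168.2.1/24

Then `networkctl reload` picks it up.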


Network overlap, alright. If the hosts themselves do not overlap, I bet you can enable end-to-end communication with just /32 host routes.
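A sketch of what that could look like on the box in the middle (addresses and interface names invented); the end hosts may also need matching /32 routes via the router for addresses they'd otherwise consider on-link:

    # Only the subnets overlap; the individual hosts don't, so route per host.
    ip route add 192.168.2.10/32 dev eth1   # this host lives on one side
    ip route add 192.168.2.20/32 dev eth2   # this one lives on the other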

All of this just because people didn't want to figure out IPv6!

Any system connecting to the internet will need IPv4; if you go down the IPv6 route you're going to have to maintain a dual-stack infrastructure (with host, firewall, router configs, etc.) and get at least 2x the amount of headaches whenever anything goes wrong.

What is the business case for all that extra pain?


I was looking for some information on how VRFs compare to Linux network namespaces & found https://www.toddpigram.com/2017/03/vrf-for-linux-contributio... which does a good job of explaining why it makes sense to use VRFs even though there is some functionality overlap with network namespaces.

I also found some information about using VRFs in the documentation for RHEL 8: https://access.redhat.com/documentation/en-us/red_hat_enterp...

