Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't overwrite old DNS results with empty list #1260

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

johnmaguire
Copy link
Collaborator

This should improve DNS static_host_map entries for some users, such as the scenario described in this comment: #909 (comment)

The hostnamesResults.ips variable is used to build the list of available underlay IPs for a given VPN IP. All hostnames are queried on static_map.cadence and hr.ips is updated when the new list of resolved IPs is different from the old list. A more detailed explanation can be found here.

This could cause slow reconnects after connectivity loss:

  1. Have a static_host_map entry for a Lighthouse that points at a DNS name only
  2. Form a connection to the Lighthouse based on the result of the DNS query
  3. Lose internet connection
  4. DNS lookup fails, overwriting the IPs for the Lighthouse with an empty list
  5. Internet connectivity is restored
  6. No underlay route to Lighthouse until static_map.cadence

In addition to the above fix, I renamed NewHostnameResults to NewHostnamesResults which matches the struct's name and included vpnIp in logs related to the DNS resolution.

Some notes:

  • If a static IP address is defined in static_host_map in addition to a DNS name, prior DNS resolutions will still be overwritten
  • If a static_host_map entries defines multiple DNS names, and one resolves while another fails, any prior results for the failed DNS name will still be overwritten
  • A cleaner solution would involve tracking the results by hostname. This would allow us to avoid the previous two pitfalls. One could imagine these living in RemoteList.dnsCache, using the hostname as map key, similar to RemoteList.cache which uses the Lighthouse/self vpnIp as map key:

    nebula/remote_list.go

    Lines 198 to 201 in 3e6c755

    // These are maps to store v4 and v6 addresses per lighthouse
    // Map key is the vpnIp of the person that told us about this the cached entries underneath.
    // For learned addresses, this is the vpnIp that sent the packet
    cache map[netip.Addr]*cache

Example logs:

-- Demonstrating the new behavior
ERRO[0870] DNS resolution failed for static_map host     error="lookup foobar.johnmaguire.me: no such host" hostname=foobar.johnmaguire.me network=ip4 vpnIp=100.100.0.3
ERRO[1140] DNS resolution failed for static_map host     error="lookup home.johnmaguire.me: no such host" hostname=home.johnmaguire.me network=ip4 vpnIp=100.100.0.3
INFO[1140] No IPs resolved for hostnames, refusing to overwrite existing IPs  hostnames="[{home.johnmaguire.me 4242} {foobar.johnmaguire.me 4242}]" vpnIp=100.100.0.3

-- Demonstrating a known issue above (static IP in static_host_map)
ERRO[1140] DNS resolution failed for static_map host     error="lookup home.johnmaguire.me: no such host" hostname=home.johnmaguire.me network=ip4 vpnIp=100.100.0.4
INFO[1140] DNS results changed for host list             newSet="map[127.0.0.1:2424:{}]" origSet="&map[127.0.0.1:2424:{} 162.194.213.135:4242:{}]" vpnIp=100.100.0.4

Using config:

static_host_map:
  "100.100.0.3": ["home.johnmaguire.me:4242", "foobar.johnmaguire.me:4242"]
  "100.100.0.4": ["127.0.0.1:2424", "home.johnmaguire.me:4242"]

lighthouse:
  hosts:
    - "100.100.0.4"
    - "100.100.0.5"

Where foobar.johnmaguire.me was never resolvable but home.johnmaguire.me was resolvable at startup and then became unresolvable.

Comment on lines +136 to +137
if len(netipAddrs) == 0 && len(*origSet) != 0 {
l.WithFields(logrus.Fields{"hostnames": r.hostnames}).Info("No IPs resolved for hostnames, refusing to overwrite existing IPs")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how to behave in this scenario. LookupNetIP is asking the local resolver to either perform the DNS lookup or return its cached results. If it fails, then presumably there are no cached results because the TTL has expired. Should we keep hitting IPs when the TTL from prior queries has expired?
We could consider doing something else - Fast re-query on lookup errors? User could double the DNS lookup timeout? Or, never remove IPs from the list, just keep hitting whatever IPs we learned in the past? (I believe this is how the static IPs work today on reconfigs.)

@nbrownus nbrownus added this to the v2.0.0 milestone Nov 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants