TLD DNS Failure Modes

For the last 12 years I've had a personal project that checks each Top Level Domain (TLD)[1] delegated in the root zone[2] with some very simple DNSSec-enabled queries. I get an email every morning reporting the problems observed amongst the delegated TLDs overnight. If I detect a failure, I run additional checks on all reachable nameservers for the domain to grab as much troubleshooting information as possible.
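For readers unfamiliar with what a DNSSec enabled query looks like on the wire, here is a minimal sketch. It is not the checker described above. The function name `build_dnssec_query`, the default query ID and the 1232 byte UDP payload size are all assumptions chosen for illustration. The sketch builds an SOA query with an EDNS0 OPT record whose DO bit asks the server to include DNSSec records in its reply.

```python
import struct

def build_dnssec_query(name: str, qtype: int = 6, query_id: int = 0x1234) -> bytes:
    """Build a DNS query (default type 6 = SOA) with the EDNS0 DO bit set.

    Hypothetical helper for illustration only.
    """
    # Header: ID, flags (0 = standard query, recursion not requested),
    # QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=1 (the OPT record).
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 1)

    # Question: length-prefixed labels, root label, QTYPE, QCLASS=IN (1).
    labels = [label for label in name.rstrip(".").split(".") if label]
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)

    # OPT pseudo-record: root name, TYPE=41, CLASS=UDP payload size (assumed 1232),
    # TTL field carries extended RCODE, version and flags; 0x8000 is the DO bit.
    opt = b"\x00" + struct.pack("!HHIH", 41, 1232, 0x00008000, 0)
    return header + question + opt
```

Sending the resulting bytes over UDP port 53 to each delegated nameserver, and comparing what comes back, is the kind of check the rest of this article relies on.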

Occasionally I've emailed TLD operators to explain an outage I think they may not be aware of. But my success rate when contacting these operators hasn't been very good.

This year I've observed several outages or failures that I believe are examples of the types of issues that smaller TLD operators continually struggle with. So I'll list a few examples from the last couple of months and provide some suggestions about how to avoid these issues.

Terminology

For explanations of, and references to, some of the terms used in this article, see the Terminology section at the end.

Operator change or permanent loss of capability: .bw

Nameserver          Administrator
dns1.nic.net.bw     BTC
master.btc.net.bw   BTC
pch.nic.net.bw      PCH
ns-bw.afrinic.net   Afrinic

I'll start with .bw (Botswana). The .bw domain is delegated[3] to 4 nameservers. In a recurring theme for smaller ccTLDs in Africa, the operator uses a nameserver provided by PCH and one provided by Afrinic, in addition to 2 nameservers managed by the Botswana Telecommunications Corporation (BTC).

One of the nameservers (master.btc.net.bw) operated by BTC has been broken for several weeks at the time of writing. The zone still contains NS records pointing to this host and the delegation at the root has not been changed.

The other nameserver operated by BTC has a copy of the .bw zone that is several days older than the copies on the other 2 nameservers. This second BTC host does not have DNSSec records in the .bw zone. It's possible that the differences between the zone on the PCH and Afrinic hosts and the BTC hosted zone are limited entirely to DNSSec records and signing events, and that the delegation entries within the .bw zone are completely consistent amongst all 3 hosts. But since the SOA serials[4] are different, there's no easy way to check. I could compare the zone files if I had access. Unfortunately zonefile access for ccTLDs is too challenging for me to contemplate, so I won't know for sure if my guess is correct.
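Spotting a nameserver that lags behind its peers comes down to comparing SOA serials. The following is a minimal sketch, assuming the serials have already been collected from each host. The function names are hypothetical and the serial values in the usage example are made up, not the values observed for .bw. The comparison uses the wrap-around serial arithmetic from RFC 1982.

```python
def serial_gt(a: int, b: int) -> bool:
    """True if 32-bit SOA serial a is newer than b (RFC 1982 arithmetic)."""
    return a != b and ((a - b) % 2**32) < 2**31

def stale_nameservers(serials: dict[str, int]) -> dict[str, int]:
    """Map each lagging nameserver to how far its serial is behind the newest.

    With unix-timestamp serials the distance is roughly seconds of staleness.
    """
    newest = None
    for serial in serials.values():
        if newest is None or serial_gt(serial, newest):
            newest = serial
    return {ns: (newest - s) % 2**32 for ns, s in serials.items() if s != newest}

# Usage with made-up serials:
# stale_nameservers({"pch.nic.net.bw": 1700000600,
#                    "ns-bw.afrinic.net": 1700000600,
#                    "dns1.nic.net.bw": 1699600000})
```

This only shows that the copies differ, not how. Working out which records differ still needs the zone contents.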

Trim broken hosts

Operators tend to be nervous about making changes to their zone file, even when those changes might remove data that's causing errors for their consumers. This might be why .bw remains delegated to the broken nameserver. Given the length of time the host has been broken, there's no reason to believe it would continue to be well administered if it were recovered. More likely it would be a source of repeat outages. The root delegation should be updated to remove the host and the zone file should be updated to remove master.btc.net.bw from the NS set.

Some DNSSec is worse than none

With one host unreachable, that leaves 3 nameservers answering queries for .bw. Two of those contain DNSSec entries, one does not. I can only assume that the lack of DNSSec material on the second BTC host is due to a capability limit, because no sane person would choose to deliberately configure their domain this way. Since there's no DS in the root, there's no benefit to supplying DNSSec responses, but it does cause some problems. For example I see bogus denial of existence replies from the DNSSec capable nameservers. Unless the BTC host is about to gain DNSSec capability or be removed, I recommend removing all DNSSec material and taking steps to ensure that all three working nameservers have the same version of the zone.
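The mixed state described here can be detected mechanically once the record types returned by each nameserver are known. A minimal sketch follows, assuming each host has already been queried with the DO bit set and the record types in its reply collected into a set. The function name and the input shape are hypothetical.

```python
# Record types that only appear when a zone is DNSSec signed.
DNSSEC_TYPES = {"RRSIG", "DNSKEY", "NSEC", "NSEC3", "NSEC3PARAM"}

def split_by_dnssec(rrtypes_by_ns: dict[str, set[str]]) -> tuple[list[str], list[str]]:
    """Return (hosts returning DNSSec records, hosts returning none)."""
    signed = sorted(ns for ns, types in rrtypes_by_ns.items() if types & DNSSEC_TYPES)
    unsigned = sorted(ns for ns, types in rrtypes_by_ns.items() if not types & DNSSEC_TYPES)
    return signed, unsigned

def is_mixed(rrtypes_by_ns: dict[str, set[str]]) -> bool:
    """True when some hosts serve DNSSec records and others do not."""
    signed, unsigned = split_by_dnssec(rrtypes_by_ns)
    return bool(signed) and bool(unsigned)
```

A zone that is consistently signed or consistently unsigned passes this check. Only the mixed case is flagged.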

TLD where'd you go?: .xn--mgbai9azgqp6j (پاکستان)

Nameserver       Administrator
NS.NTC.NET.PK    NTC
NS1.NTC.NET.PK   NTC
NS2.NTC.NET.PK   NTC

All three of the nameservers to which the .xn--mgbai9azgqp6j (Pakistan) domain is delegated are unreachable. They have been unreachable for some time, and I have no expectation that if one or more becomes available, it will stay that way for any significant period. The issues with .xn--mgbai9azgqp6j are all too frequent in my experience. The IDN[5] version of a ccTLD represents the aspirations of a nation to have their own identity on the Internet, presented using their language and scripts. Unfortunately the technical reality is that almost nothing works seamlessly for IDN addresses, and this is particularly true for right to left scripts. So it isn't surprising to me when I see an IDN ccTLD turned off or abandoned and subsequently left in a broken state.

Some further observations

Although unreachable, I can see that the hosts all fall into 2 networks, both of which have cyber.net.pk as the domain for the contact in their reverse zones. Interestingly, they've used two different email addresses, which might mean 2 different teams. Probably due to a misunderstanding of the functionality of SOA records, one of the reverse zones actually contains an escape character to allow an @ in the admin email field.

There is a DS in the root zone for .xn--mgbai9azgqp6j, which may cause problems if the zone is ever set up on new infrastructure and returned to operations. It should be removed.

I suspect the operator will be unwilling to remove .xn--mgbai9azgqp6j from the root zone for fear that re-adding it in the future may be more difficult. This is especially true if it was originally delegated as part of the IDN Fast Track program. However from a technical perspective, removing it would be better than leaving a broken delegation in the root.

TLD operations can be hard to maintain: .xn--wgbh1c (مصر)

Nameserver       Administrator
ns1.dotmasr.eg   NTRA
ns2.dotmasr.eg   NTRA
ns3.dotmasr.eg   NTRA
ns4.dotmasr.eg   PCH

The domain .xn--wgbh1c (Egypt) has lame[6] delegations and replies with some odd records on its remaining functional nameservers. I draw a distinction between problems that are difficult to identify or fix and problems that an IT generalist could observe and resolve. I see plenty of issues that fall into the first category, especially amongst ccTLDs, and we can surmise that these problems continue because the operator lacks the necessary skills, or has lost them, and thus can't resolve the issues quickly. However it is into the second category that the issues with .xn--wgbh1c fall. A lack of DNS expertise should not be preventing the operator from being aware of and fixing the current problems.

Sometimes you just need to let go

The domain is delegated to 4 nameservers: ns[1-4].dotmasr.eg. ns4.dotmasr.eg has an address consistent with a PCH hosted nameserver. However the delegation is lame. Has someone not paid the bill? These services are often provided free, so perhaps there's another explanation. This domain isn't the only case where a delegation to PCH is now lame. The issue for .xn--wgbh1c has continued for months now, so the operator should be aware and should be capable of resolving the issue.

With the PCH host no longer answering queries for the domain, it should be removed from the NS set and the root delegation. There's no value in retaining it, even if there is a desire to make use of PCH's services in the future.
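A lame delegation is visible in the header of the reply itself. Below is a minimal sketch, assuming the raw reply to a query for the zone's own SOA record is already in hand. The function name and the three labels it returns are hypothetical. It treats a reply as healthy only when the authoritative answer (AA) flag is set and the response code is NOERROR.

```python
import struct

def classify_response(packet: bytes) -> str:
    """Classify a reply to an SOA query for the zone apex.

    Returns "authoritative", "lame" or "malformed" (labels are illustrative).
    """
    if len(packet) < 12:
        return "malformed"
    _id, flags, _qd, _an, _ns, _ar = struct.unpack("!HHHHHH", packet[:12])
    if not flags & 0x8000:          # QR bit clear: this is not a response
        return "malformed"
    authoritative = bool(flags & 0x0400)   # AA bit
    rcode = flags & 0x000F                 # 0 = NOERROR, 2 = SERVFAIL, 5 = REFUSED
    if authoritative and rcode == 0:
        return "authoritative"
    return "lame"
```

A host that never replies at all is a different failure (unreachable rather than lame) and would need a timeout check around the query.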

Ancient hints and a fast watch

I observed a few more issues with .xn--wgbh1c relating to DNSSec responses and operations. When querying the operator hosted nameservers, they would return hints for the nameserver addresses. These hints were signed with signatures that expired in 2020. The dotmasr.eg domain does contain DNSKEY records, including the ZSK used to generate the expired signatures, but the dotmasr.eg zone is not actually signed. This suggests a breakdown in technical operations that has not been addressed for years.
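Checking whether a signature is inside its validity window is simple once the RRSIG inception and expiration fields are extracted. A minimal sketch, assuming the fields are in the usual YYYYMMDDHHmmSS presentation format (UTC). The function names are hypothetical, the alternative seconds-since-epoch form is not handled, and the dates in the usage example are made up rather than taken from the dotmasr.eg records.

```python
from datetime import datetime, timezone

def parse_rrsig_time(text: str) -> datetime:
    """Parse an RRSIG timestamp in YYYYMMDDHHmmSS form as UTC."""
    return datetime.strptime(text, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)

def signature_state(inception: str, expiration: str, now: datetime) -> str:
    """Return "not yet valid", "expired" or "valid" for the given moment."""
    if now < parse_rrsig_time(inception):
        return "not yet valid"
    if now > parse_rrsig_time(expiration):
        return "expired"
    return "valid"

# Usage with made-up dates:
# signature_state("20200101000000", "20200131000000",
#                 datetime.now(timezone.utc))
```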

The .xn--wgbh1c zone appears to be signed correctly; however the root zone currently has both SHA-1 and SHA-256 DS entries. The SHA-1 entry should be removed.
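Finding DS records that are safe to retire can be sketched as below, assuming the DS set has been reduced to (key tag, algorithm, digest type) tuples. Digest type 1 is SHA-1, 2 is SHA-256 and 4 is SHA-384. The function name is hypothetical and the key tag in the usage example is invented.

```python
def redundant_sha1(ds_records: list[tuple[int, int, int]]) -> list[tuple[int, int, int]]:
    """Return SHA-1 DS entries that have a stronger sibling for the same key.

    Each record is (key_tag, algorithm, digest_type).
    """
    stronger = {(tag, alg) for tag, alg, digest in ds_records if digest in (2, 4)}
    return [
        (tag, alg, digest)
        for tag, alg, digest in ds_records
        if digest == 1 and (tag, alg) in stronger
    ]

# Usage with an invented key tag:
# redundant_sha1([(12345, 8, 1), (12345, 8, 2)])
```

A SHA-1 DS with no stronger sibling is deliberately not reported, since removing it would leave the zone without any DS for that key.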

A final amusing item I came across was the apex TXT record for .xn--wgbh1c, which appears to provide the time at which the zone file was last regenerated, in unix timestamp format. However when I queried, it was 30 minutes into the future. This won't cause any DNS issues unless the time source is shared by whatever is signing the zone. But I've not seen any evidence of RRSIG entries with inception times in the future.

Not all gTLDs were a good idea: .creditunion

Nameserver             Administrator
ns1.uniregistry.net    Tucows
ns2.uniregistry.info   Tucows
ns5.uniregistry.net    PCH
ns6.uniregistry.info   PCH

While being an IDN ccTLD domain seems to increase the likelihood that everyone will forget about you, some gTLDs also manage to become so unloved by their own operators as to simply stop working properly. I'm not quite sure why PCH may have hosted .creditunion, but whatever the reason was, that's history and the delegation has been lame for a couple of months now. I no longer pay attention to which registry operator is going out of business this week. However I'm sure there is a connection between operator changes and a failure to realise that their TLD no longer works properly.

The fix is simple. The delegation to the PCH addresses should be removed. ICANN monitors each gTLD operator as part of their Registry Agreement. Therefore I find it unlikely that the people capable of making the changes to the root zone and within the domain remain unaware of the problem or the appropriate resolution. Why such a simple fix hasn't been carried out, and why the operators haven't been forced to fix it, remains a mystery to me.

Are ccTLD operators given good advice?: .ls

Nameserver              Administrator
ns1.nic.ls              Lesotho NIC
ns2.nic.ls              Lesotho NIC
ls-ns.anycast.pch.net   PCH
ns-ls.afrinic.net       Afrinic

For many months now some nameservers for .ls (Lesotho) have responded to DNS queries with untrusted DNSSec material while others have no DNSSec records at all. I mentioned earlier in this article that partially responding with DNSSec can lead to inconsistent answers to consumers and errors. In addition to the inconsistencies, since .ls has no DS in the root, the DNSSec records that are returned to consumers provide no benefit.

As is common for smaller ccTLDs, this domain is delegated to nameservers managed by the operator and PCH. The zone serial appears to be the unix timestamp scheme favoured by many TLDs. If you're wondering, I always recommend this approach since it allows for simple automated generation and easy troubleshooting.
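The troubleshooting benefit of timestamp serials is easy to show: the serial alone reveals when the zone was generated. Below is a minimal sketch that guesses which scheme a serial follows and renders it in readable form. The function name is hypothetical, and the guess is a heuristic: it assumes date-based serials are exactly ten digits with a year between 1990 and 2099, which becomes ambiguous with epoch values from 2033 onwards.

```python
from datetime import datetime, timezone

def describe_serial(serial: int) -> str:
    """Guess the scheme behind an SOA serial and render it readably."""
    text = str(serial)
    if len(text) == 10:
        try:
            day = datetime.strptime(text[:8], "%Y%m%d")
        except ValueError:
            day = None
        # Treat as YYYYMMDDnn only when the date part is plausible.
        if day is not None and 1990 <= day.year <= 2099:
            return f"date-based: {day:%Y-%m-%d} revision {text[8:]}"
    if 0 < serial < 2**32:
        stamp = datetime.fromtimestamp(serial, tz=timezone.utc)
        return f"unix epoch: {stamp:%Y-%m-%d %H:%M:%S} UTC"
    return "unknown"

print(describe_serial(2024031501))  # date-based: 2024-03-15 revision 01
print(describe_serial(1700000000))  # unix epoch: 2023-11-14 22:13:20 UTC
```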

Based on my observations the primary server (ns1.nic.ls), which is managed by the operator, generates, or receives from the registry backend, an unsigned version of the TLD zone file. The PCH server either transfers this zone file from the primary or another hidden primary and then signs the zone inline, resulting in a serial value which is a few minutes after that of the primary. This discrepancy between serial values isn't an issue for the nameservers; the PCH instance will keep track of the unsigned serial internally. But this can become confusing when humans are trying to troubleshoot zone transfer issues. I see a lot of this in areas where network paths are unreliable and connectivity between nameservers can fail for long periods.

Complexity for its own sake?

So if there's no benefit to signing and serving a copy of the zone and it creates errors and inconsistencies for consumers, why do operators do it? It's possible the PCH host signs in-line by default, but that strikes me as a really poor design choice, so I'll assume that isn't the case for now. I suspect that the benefits of DNSSec are overplayed and the risks connected to inconsistent versions of a zone are underplayed in the forums within which many ccTLD operators educate themselves on TLD operations. Regardless of the reason, the TLD should be consistent when queried by a DNS consumer. Therefore assuming the operator's nameservers are not DNSSec capable, there should be no DNSSec records returned by any of the TLD nameservers.

Too many chefs: .td

Nameserver           Administrator
pch.nic.td           PCH
ns-td.afrinic.net    Afrinic
anycastdns1.nic.td   Gransy
anycastdns2.nic.td   Gransy

.td (Chad) came to my attention due to bogus responses, just like some of the other ccTLDs I've spoken about here that have some DNSSec capable hosts and some that aren't. However .td has a subtly different design and some additional problems that make me wonder if the operator is still actively involved in administering the zone file and nameservers.

Same same, but different - DNSSec

If you've read this far you won't be surprised that .td uses a PCH nameserver in addition to an Afrinic nameserver and two instances from Gransy. However, unlike the other small TLDs I've covered in this post, the primary server for .td is PCH. So it almost makes sense that the zone is DNSSec signed. Almost. There is no DS in the root zone for .td. For some reason the secondary servers do not have the DNSSec signed version of the zone file. I'm fairly confident that each of the secondaries is capable of hosting DNSSec signed zones. So what's going on here?

The primary server listed in a TLD zone's SOA record is often not the source from which a TLD's nameservers fetch the zone file. Typically there'll be a hidden primary nameserver that receives updates or the full zone file from a registry system. It will notify the publicly visible nameservers of changes to the zone, and they then transfer the zone from it. This can prevent one external operator depending on another external operator. Do each of the public nameservers transfer an unsigned zone and only the PCH instance inline signs? It is hard to determine from the outside. But what we can see is that once again we have a TLD's authoritative nameservers giving inconsistent and bogus replies.

How many extra operators is too many?

Uniquely amongst the group of TLDs I've discussed, .td exclusively outsources the nameserver function for TLD hosting to other organisations. The .td operator does not maintain their own hosts at all. Given the operational mistakes I've mentioned already, I should probably applaud this sensible approach of letting experts provide you with services they can maintain. However .td appears to have taken things too far and perhaps lost track of who is responsible for what, which has resulted in an ongoing outage for one of its providers. Let's see who they use.

  1. PCH provide the primary (possibly not a true primary, see above) nameserver, which also locally signs its own copy of the .td zone file.
  2. Afrinic provide a secondary nameserver which does not sign and therefore hosts a non DNSSec version of the zone.
  3. deSEC provide hosting for the nic.td subdomain. Both the PCH and Gransy nameservers use labels under nic.td. Unfortunately the entries for the Gransy nameservers have either been removed or were never present. This makes the Gransy nameservers unresolvable.
  4. Gransy provide 2 anycast nameserver instances. Their domain names are unresolvable, but they appear to have the current (unsigned) version of the .td zone file.

With 4 separate DNS providers, is it any surprise that there are so many inconsistencies in the way the zone is hosted and reachable?
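Nameserver names that don't resolve, like the Gransy entries in the list above, are among the easier faults to catch automatically. A minimal sketch follows. The function names are hypothetical, and the default resolver relies on whatever recursive resolver the local system uses, so results can vary by vantage point. The resolver is injectable so the logic can be exercised without touching the network.

```python
import socket
from typing import Callable, Iterable

def system_can_resolve(host: str) -> bool:
    """True if the local resolver returns at least one address for host."""
    try:
        socket.getaddrinfo(host, 53)
        return True
    except socket.gaierror:
        return False

def unresolvable(
    ns_names: Iterable[str],
    can_resolve: Callable[[str], bool] = system_can_resolve,
) -> list[str]:
    """Return the nameserver names that cannot be resolved to an address."""
    return [name for name in ns_names if not can_resolve(name)]

# Usage against the live DNS (results depend on the current state of nic.td):
# unresolvable(["pch.nic.td", "ns-td.afrinic.net",
#               "anycastdns1.nic.td", "anycastdns2.nic.td"])
```

Note that glue records in the parent zone can keep such hosts reachable for resolvers that follow the delegation, which is why a name lookup and a direct query by address can disagree.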

The Same Problems

The outlier in the TLDs I've mentioned in this article is .creditunion. There's no good reason for the issues that TLD is experiencing and the regulatory process to govern gTLDs should have already forced it to be fixed. I don't think there's a long term operational lesson here, except to hold ICANN more accountable in the way it enforces the technical availability requirements of its gTLD contracts.

The remainder of the TLDs I've spoken about all suffer from a lack of DNS operational knowledge on the part of the responsible organisation. This knowledge is distinct from simply understanding how DNS works. The role of Registry Operator involves coordinating suppliers of DNS and registration system functions over long periods of time. In the case of ccTLDs this might include dealing with changes in national governments. The lengthy periods of time involved and the lack of prestige and financial rewards in operating a small TLD mean that ccTLD operators may struggle to retain skilled staff. Typically Registry Operators will obtain training for their staff when making significant changes to their TLD, but the periods in between such changes may be many years long.

I suspect that in isolation, were I to present the current state configuration of their TLDs to each of the Registry Operators involved, they would immediately grasp the issues and have at least a general understanding of the appropriate fix. But due to capability reasons they may not be aware of the issues or may not be aware of the scale of impact. And due to a lack of confidence in their own DNS understanding, they are unwilling to make changes to their TLDs for fear of making things worse.

I can't say for certain whether the broken state of the ccTLDs described herein can be directly attributed to poor advice or lack of awareness. But I feel that many of the advisory organisations concerned with developing nations' Internet registries can do much better. For example I think the ccNSO within ICANN should prioritise operational best practices and DNS stability when considering their meeting and working group agendas.

More to come

The TLDs I've spoken about here all had recent long running failures or full outages. But there'll be more, many more, in the future. I don't bother writing blogs for transient or short term outages, but I will mention them on Mastodon, so check there if you can't get enough of the exciting drama that is Top Level Domain administration. I'll be sure to write more about any future long running outages I observe.

Terminology


  1. Top Level Domain: Examples of TLDs include .com, .au, .ae, .cloud and .ninja. They can also be called Internet Registries, although that term encompasses the additional registration back end responsibilities of TLD operators.

    ccTLDs are Country Code TLDs and are typically governed by an operator assigned by the national government of the country to which the ccTLD belongs. ccTLDs do not have an enforceable contract with ICANN.

    gTLDs are Generic TLDs and are mostly, but not exclusively, commercial ventures. In all cases gTLDs have an enforceable contract with ICANN.

    More detail under Top-Level Domain.

  2. Zone is used to describe the DNS content for a domain name as stored on an authoritative DNS server. While rarely actually configured as a file today, it is not uncommon to hear or read about a TLD's zone file. When used in this manner, people tend to refer to all the apex records for the TLD, the applicable NS records for the TLD, as well as the delegation entries for all its child domains that make up the bulk of a common TLD zone file. More detail under Zone.

  3. A delegation is the mechanism by which a parent domain (the root zone in the case of TLDs) points a DNS consumer to the responsible authoritative servers for a child domain. More detail under Delegation.

  4. The zone Serial is a field within the SOA record and is used to indicate the current version of the zone. Each content change to the zone should also result in a higher value than the previous serial number. Older DNS management practices that didn't automate DNS content deployment often used a scheme loosely borrowed from ISO-8601 where the values for YYYYMMDD preceded a single or double digit increment value for changes on a given day. Today most modern systems will use unix epoch values. More detail under SOA field names.

  5. Internationalised Domain Names (IDN) use an encoding (Punycode) that represents Unicode characters in DNS using only ASCII characters. Encoded labels carry the xn-- prefix. More detail under Internationalized Domain Name.

  6. Lame delegations occur when a domain is delegated to a host that is not configured to respond for that domain. More detail under Lame Delegation.
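As an illustration of the IDN encoding in item 5, the sketch below converts between the xn-- form of a single label and its Unicode form using Python's built-in punycode codec. The function names are hypothetical, and the sketch skips the normalisation and validity checks that a full IDNA implementation performs, so treat it as a way to read labels rather than to register them.

```python
def label_to_unicode(label: str) -> str:
    """Decode a single xn-- label to Unicode; other labels pass through."""
    if not label.lower().startswith("xn--"):
        return label
    return label[4:].encode("ascii").decode("punycode")

def label_to_ascii(label: str) -> str:
    """Encode a single Unicode label to its xn-- form; ASCII passes through."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")

# Usage: print the native-script form of the two IDN ccTLDs discussed above.
for tld in ("xn--mgbai9azgqp6j", "xn--wgbh1c"):
    print(tld, "->", label_to_unicode(tld))
```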