Upgrading Windows Server and lack of redundancy

TL;DR Redundancy is important!!!

I enjoy projects, and sometimes when I can’t afford new ones I just make something up. In this case, I’ve had an AD domain controller on Server 2012 R2 for around 4 years now. All the systems, both Windows and Linux, are joined to the domain. Permissions are controlled solely by group membership in AD and things work great! The domain controller is also the DHCP and DNS server for the domain (as you might expect, since I have a DC…). Anyway, I decided: hey, it’s 2017, why am I still running Server 2012 R2? Let’s upgrade to 2016! Well, I’ll kick this off by saying in-place upgrades are the worst; I never do them and I steer clients away from them all the time. Sadly, I was lazy and I did an upgrade. It actually went very well. I had to do a couple of extras, but it was very smooth overall. I still plan to do a clean rebuild in the near future.

On to the actual issues. (Sorry, this will be lacking some details, but upgrading Windows Server is very smooth and there isn’t much for me to say about it unless requested…) As I mentioned, this server also provides DNS. Since most machines on my network run Windows, which caches DNS by default, I didn’t think taking DNS down would be a big deal. Well, I was wrong, so wrong.

We’ll start with the first issue. I cut cable/sat a while back. We use Plex, Hulu, Amazon, and Netflix solely in our house (and an HD antenna for local HD news). Upon beginning the upgrade process, I was immediately alerted by my children that something was amiss. They couldn’t connect to Plex, they couldn’t access PBS Kids, and I had recently ripped the Moana disc to the Plex server… this was a HUGE problem. So I start clicking around on the TV and nothing is working. Finally I notice the error message in the top right saying “Not Connected”. Well, there we go! I reboot the TV, nothing. I unplug the TV and turn it back on, nothing. I look up at the ceiling where the (brand new) Meraki MR32 AP is and notice the blinking orange light… WTF

I check my cell phone, no wireless… well OK, let’s see what’s up. I pull up the Meraki dashboard and am quickly informed that the controller lost connection to my APs 30 minutes ago. OK, easy, the internet is down, let me reset the modem or fail over to the backup connection. I get downstairs, and things are working fine on my desktop… OK… let’s google. Well, as it turned out, if the AP can’t communicate with the cloud controller and DNS is not functioning, it simply stops responding. In addition, somehow my APs also thought they couldn’t find the uplink, which is strange since they’re PoE :/ OK, well, the kids can read some books (should’ve kept a BD player in the house somewhere… oops) while the wife and I go downstairs to watch our shows. We get downstairs to our hard-wired TV and I launch Hulu… nothing. Well! My Android TV doesn’t cache DNS. How nice.

At this point, I’ve realized I made a horrible choice in not having a secondary DNS server. In my defense, I used to have one when I ran a TomatoUSB edge router, but upon upgrading to real gear… well, it wasn’t a Linux box anymore, so I never set anything up. I check the upgrade progress and it’s 80% complete! Well, not too much longer. Interestingly, immediately upon completion of the upgrade, DNS was back up and running and things were magical again. The APs instantly reconnected and all was good. I made a note on my desk pad to build a secondary DNS server in the morning.

I got up to take the kids to school this morning and, upon my return, I immediately set out to build a DNS slave server. I didn’t really have the resources for another Windows box for AD DNS, so I picked the girls’ Linux KVM box, which has plenty of resources to handle bind9. Setup was pretty simple: first I added the new IP to the Name Servers tab in AD DNS, then I authorized zone transfers to the IPs listed under Name Servers. On the Linux host, I ran

apt-get install bind9

and began configuration. I’m pretty good with BIND; I actually wrote a nice HOWTO back in the late 90s for the Clarksville Linux Users Group, which no longer exists and took my HOWTO with it… 🙁

Let’s begin with the named.conf.local file:

zone "dznet.pwnz.org" {
  type slave;
  check-names ignore;
  masters { 192.168.128.30; };
  file "/var/lib/bind/dznet.pwnz.org";
  allow-transfer { none; };
};

As you can see, the zone is pulled only from the AD DNS server (the masters entry) and I don’t allow transfers from the slave. The check-names ignore is probably no longer necessary, but it was in the mid 2000s to work in conjunction with Windows 2000 Server DNS. Next I set up the options file to restrict who can query the system:

allow-query { 192.168.128.0/26; 192.168.3.0/24; };

The first subnet is the main subnet in my house (which will need to be expanded soon, stupid IoT) and the second is my OpenVPN subnet. Obviously, I want both to be able to query when necessary, but no one else.
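To make the effect of that allow-query line a bit more concrete, here is a minimal Python sketch that mimics the address matching only (BIND does the real enforcement). The names `ALLOWED_NETWORKS` and `may_query` are made up for illustration; the two networks are the ones from the config above.

```python
import ipaddress

# The two networks from the allow-query line above.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.168.128.0/26"),  # main house subnet
    ipaddress.ip_network("192.168.3.0/24"),    # OpenVPN subnet
]


def may_query(client_ip: str) -> bool:
    """Return True if client_ip falls inside one of the allowed networks."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in ALLOWED_NETWORKS)


if __name__ == "__main__":
    # Hypothetical client addresses, chosen to show both outcomes.
    for ip in ("192.168.128.30", "192.168.128.64", "192.168.3.10", "10.0.0.5"):
        print(ip, "allowed" if may_query(ip) else "refused")
```

Note that a /26 only covers 192.168.128.0 through 192.168.128.63, so a host at .64 would be refused. That is why the first subnet has to grow once the IoT gadgets use up the range.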

Once I finished these two very short snippets, I started the bind9 daemon and watched the zone transfer occur. To verify function, I went to my trusty test box and ran

dig @192.168.128.37 dznet.pwnz.org +norecurs

This does a lookup on the zone’s root (apex) name with recursion disabled, meaning ONLY the server I specified can answer, and only from data it already holds. Sort of a lookup-or-bust deal here. The results, as expected, showed my slave was working perfectly:


; <<>> DiG 9.10.3-P4-Ubuntu <<>> @192.168.128.37 dznet.pwnz.org +norecurs
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 42299
;; flags: qr aa ra; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 3

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;dznet.pwnz.org.                        IN      A

;; ANSWER SECTION:
dznet.pwnz.org.         600     IN      A       192.168.128.30

;; AUTHORITY SECTION:
dznet.pwnz.org.         3600    IN      NS      dznet-dc.dznet.pwnz.org.
dznet.pwnz.org.         3600    IN      NS      dznet-girls.dznet.pwnz.org.

;; ADDITIONAL SECTION:
dznet-dc.dznet.pwnz.org. 3600   IN      A       192.168.128.30
dznet-girls.dznet.pwnz.org. 3600 IN     A       192.168.128.37

;; Query time: 0 msec
;; SERVER: 192.168.128.37#53(192.168.128.37)
;; WHEN: Tue Feb 28 09:15:02 CST 2017
;; MSG SIZE  rcvd: 140
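The two things worth checking in that output are `status: NOERROR` and the `aa` (authoritative answer) flag in the header, which shows the slave answered from its own copy of the zone. As a rough sketch, assuming you capture the dig output as text, something like the following could check both automatically. The function names and `SAMPLE_HEADER` are hypothetical, and the parsing only relies on the header layout shown above.

```python
import re

# Header lines copied from the dig output above.
SAMPLE_HEADER = (
    ";; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 42299\n"
    ";; flags: qr aa ra; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 3\n"
)


def summarize_dig_header(dig_output: str) -> dict:
    """Pull the status and the header flags out of dig's text output."""
    status = re.search(r"status: (\w+)", dig_output)
    flags = re.search(r";; flags: ([a-z ]*);", dig_output)
    return {
        "status": status.group(1) if status else None,
        "flags": flags.group(1).split() if flags else [],
    }


def is_authoritative_answer(dig_output: str) -> bool:
    """True when the server replied NOERROR and set the aa flag."""
    summary = summarize_dig_header(dig_output)
    return summary["status"] == "NOERROR" and "aa" in summary["flags"]


if __name__ == "__main__":
    print(summarize_dig_header(SAMPLE_HEADER))
    print("authoritative:", is_authoritative_answer(SAMPLE_HEADER))
```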

Of course, I still went into AD and fully shut down DNS services to verify things were still operating properly… They weren’t. I forgot to update option 6 (DNS servers) in DHCP and renew the lease! Once I did this, with AD DNS shut down, everything was working. I turned AD DNS back on and everything was still working! Hopefully I don’t run into this issue anymore, especially since I have a calendar event to upgrade my ESXi host in a couple of weeks, which is where the DC, CUCM, NAS, and Asterisk PBX all live! WHEW.
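For a failover test that does not depend on what DHCP handed out, each name server can be queried directly. Below is a minimal sketch using only the Python standard library. It hand-builds a single A-record query with recursion not requested (the same idea as dig’s +norecurs) and checks the reply header. The function names, the fixed query ID, and the timeout are all assumptions for illustration, and a real tool would want more careful response parsing.

```python
import socket
import struct

QUERY_ID = 0x1234  # arbitrary fixed ID, fine for a one-off check


def build_query(name: str, query_id: int = QUERY_ID) -> bytes:
    """Build a minimal DNS query for an A record, recursion not requested."""
    # Header: id, flags (all zero, so the RD bit is off), 1 question, no records.
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    question = labels + b"\x00" + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


def response_ok(response: bytes, query_id: int = QUERY_ID) -> bool:
    """True if the reply matches our ID, has RCODE 0, and carries an answer."""
    if len(response) < 12:
        return False
    resp_id, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    is_response = bool(flags & 0x8000)  # QR bit
    rcode = flags & 0x000F
    return resp_id == query_id and is_response and rcode == 0 and ancount > 0


def server_answers(server: str, name: str, timeout: float = 2.0) -> bool:
    """Send the query to one server over UDP and report whether it answered."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, 53))
            response, _ = sock.recvfrom(4096)
        except OSError:  # covers timeouts and unreachable hosts
            return False
    return response_ok(response)
```

With AD DNS stopped, `server_answers("192.168.128.37", "dznet.pwnz.org")` should still come back True if the slave is doing its job, while the same call against 192.168.128.30 should time out and return False.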

Author: Will

I’m a Cisco Unified Communications consultant who dabbles in everything. I’ve been a Linux user since ’96, an Asterisk user since ’02, a Cisco route/switch guy since 2000 and various other things along the way. I have various degrees and certifications.
