Chapter 12: System Administration Fundamentals — Introduction to Digital Computation

Networks don't run themselves. Servers don't patch themselves. User accounts don't provision themselves. Someone has to do all of it — and when they do it badly, the network in Chapter 11 becomes the news story in tomorrow's headlines.

The Sysadmin Role

Sysadmin work splits cleanly in two: work you schedule, and work that schedules itself — usually at 3 AM. The planned side is deploying a new server, onboarding a new employee, applying patches on the regular cycle, testing backups before you need them. The reactive side is diagnosing a crashed service, recovering a deleted file, working through a security incident while the phone keeps ringing. Good administrators spend most of their time on the planned side, because solid processes shrink the reactive pile.

The job itself is responsibility for the health, availability, and security of an organization's IT infrastructure — the servers, the network, the user accounts, the backups, and the services that tie them together. In a small business, one person handles all of it alongside help-desk support. In a large organization, the responsibilities divide: network engineers manage switches and routers; systems engineers manage servers and virtualization; security engineers own firewalls, monitoring, and incident response; help-desk analysts handle end-user requests. However the work is divided, the goal is the same — keep things running, keep them secure, and know what to do when something breaks.

Data Centers

Chapter 11 introduced rack-mounted servers. Those racks live somewhere, and that somewhere matters enormously. A data center is a building whose only job is to keep computing infrastructure running — 24 hours a day, 365 days a year, regardless of weather, power fluctuations, or hardware failures. Every design decision is about eliminating single points of failure.

Power is the first concern. Data centers connect to the utility grid through multiple independent feeds, backed by banks of UPS (uninterruptible power supply) units that bridge the gap during outages, backed by diesel generators that take over within seconds and can run for days. A large enterprise data center might consume as much power as a small town. Cooling is the second concern: all that computing generates enormous heat. Precision air conditioning units maintain the room within a narrow temperature range, typically 65–75°F. Modern data centers use a hot aisle/cold aisle layout — server racks are arranged so their intake (front) faces one aisle and exhaust (rear) faces another, separating cold intake air from hot exhaust air to maximize cooling efficiency.

Organizations have three options for where to run their infrastructure: on-premises (in a server room they own and operate), colocation (renting space in someone else's data center — you own the servers but they provide the building, power, cooling, and connectivity), or cloud (renting virtualized compute resources with no physical hardware to manage). Most enterprises use a mix. The video below walks through what a large-scale data center actually looks like.

Video: Inside a Google Data Center

A tour of a Google data center showing the physical infrastructure: power systems, cooling, server hardware, and the scale of modern cloud computing. Credit: Google.

User Account Management

Every employee has a user account — a record in Active Directory (AD) storing their username, password hash, group memberships, email address, and other attributes. AD is the central authority for authentication and authorization in most Windows-based organizations. When you log in to a Windows workstation at a company, your credentials are verified against AD. When you open a shared drive, AD is consulted to check whether your account belongs to a group with read permission. Every service covered in Chapter 11 — file shares, email, domain login — traces back to an AD account.

Creating a new account is one of the first tasks a sysadmin learns: create the account in AD with the employee's name, set a temporary password requiring change at first login, add them to the appropriate security groups, and assign them a mailbox. Security groups are how permissions are managed at scale. Instead of granting permissions to individual users, you grant them to groups — a "Finance" group gets access to the finance share; an "IT Admins" group gets elevated privileges on servers. A new Finance employee gets added to the Finance group and immediately inherits all Finance permissions. When an employee leaves, you disable the account immediately (revoking all access at once), then delete it after a retention period. Disable before delete is standard practice: a disabled account can be re-enabled if you need to access the mailbox or audit history; a deleted account cannot.
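The group-based model above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Active Directory is implemented: the `Account` class, the `SHARE_ACCESS` table, and the share and user names are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical mapping of resources to the groups allowed to use them.
SHARE_ACCESS = {
    "finance-share": {"Finance"},
    "server-admin": {"IT Admins"},
}

@dataclass
class Account:
    username: str
    groups: set = field(default_factory=set)
    enabled: bool = True

def can_access(account: Account, resource: str) -> bool:
    """Allow access only if the account is enabled and belongs to an allowed group."""
    if not account.enabled:
        return False
    allowed_groups = SHARE_ACCESS.get(resource, set())
    return bool(account.groups & allowed_groups)

alice = Account("alice")
alice.groups.add("Finance")                 # onboarding: one group add grants every Finance permission
print(can_access(alice, "finance-share"))   # True
print(can_access(alice, "server-admin"))    # False
alice.enabled = False                       # offboarding: disabling revokes all access at once
print(can_access(alice, "finance-share"))   # False
```

Note that permissions are never attached to `alice` directly. Changing what Finance staff can reach means editing one entry in the table, not every account.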

Principle of Least Privilege. Users should have exactly the access they need to do their job — no more. An accountant doesn't need admin rights on the domain controller. A contractor doesn't need access to HR files. Over-provisioned accounts are a major attack vector: if a user's credentials are stolen, the attacker inherits everything that user had access to. Auditing group memberships regularly to remove stale permissions is a core sysadmin responsibility.

Group Policy

Group Policy is Active Directory's mechanism for centrally enforcing settings on every domain-joined machine. A Group Policy Object (GPO) is a bundle of settings that is applied when a user logs in or a computer starts up, and re-applied periodically in the background. Common uses: require a screensaver lock after 15 minutes of inactivity; redirect each user's Documents folder to a file server so it's automatically backed up; map shared network drives at login; disable USB storage devices on machines in sensitive areas; enforce password complexity requirements and expiration intervals. One change to a GPO propagates to thousands of machines the next time they refresh policy (by default, roughly every 90 minutes).

From an end-user perspective, Group Policy is the answer to most "why can't I do this on my work computer?" questions — IT has explicitly prevented it via policy. From a security perspective, Group Policy is one of the most powerful tools available: it enforces baseline security settings across the entire organization without requiring anyone to touch each machine individually.

DNS in Practice

DNS was covered conceptually in Chapter 10. The sysadmin view is different: you maintain the records. An organization's DNS splits into two zones. Internal (private) DNS serves the corporate intranet, resolving names like fileserver.corp.local or mail.corp.local to private IP addresses — these are never exposed to the public internet. External (public) DNS is what the internet queries when someone visits your website or delivers email to your domain; it's configured separately and managed through a domain registrar or public DNS hosting provider.

Every DNS record has a TTL (Time to Live): the number of seconds that resolvers are allowed to cache the answer before querying again. In steady state, long TTLs (86,400 seconds, or one day) reduce query load. Before a major DNS change, lower the TTL (to 300 seconds, say) so that once you switch the record, the old answer expires from caches quickly and the new one propagates fast. The lowered TTL itself only takes effect after one full cycle of the old TTL, because resolvers that cached the record under the long TTL keep it until that timer runs out. Plan ahead.
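The timing can be worked out with simple arithmetic. The sketch below assumes the worst case, a resolver that cached the record one moment before each change; the function names and the dates are made up for illustration.

```python
from datetime import datetime, timedelta

def earliest_safe_cutover(ttl_lowered_at: datetime, old_ttl_seconds: int) -> datetime:
    """A cache filled just before the TTL was lowered may keep the record for one full old TTL."""
    return ttl_lowered_at + timedelta(seconds=old_ttl_seconds)

def worst_case_stale_until(record_changed_at: datetime, current_ttl_seconds: int) -> datetime:
    """After the record changes, a resolver may serve the old answer for up to one current TTL."""
    return record_changed_at + timedelta(seconds=current_ttl_seconds)

lowered = datetime(2024, 3, 4, 9, 0)            # TTL dropped from 86,400 to 300 seconds
cutover = earliest_safe_cutover(lowered, 86_400)
print(cutover)                                   # 2024-03-05 09:00:00
print(worst_case_stale_until(cutover, 300))      # 2024-03-05 09:05:00
```

Lowering the TTL a day early shrinks the window of stale answers after the cutover from a day to five minutes.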

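The record types you'll encounter most often are A (a name to an IPv4 address), AAAA (a name to an IPv6 address), CNAME (an alias from one name to another), MX (the mail servers for a domain), and TXT (free-form text, commonly used for email authentication). DNS record confusion is the cause of a surprising number of email delivery failures and website outages.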

DHCP in Practice

From the sysadmin view, DHCP means configuring scopes: the pool of assignable IP addresses, the subnet mask, the default gateway, and the DNS server addresses clients receive. Every time a device joins the network, DHCP hands it an address and a lease time: often several days on a wired corporate network (Windows Server defaults to 8 days), and a few hours on guest wireless, where devices come and go. A client normally renews its lease well before it expires; if the lease does lapse, the address returns to the pool and the device may be assigned a different one next time. Scope sizing matters: a scope with 50 available addresses serving 80 devices will exhaust its pool and leave new devices with no IP. Monitor utilization and expand scopes before they fill up.
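Scope utilization is easy to compute by hand or in a script. The sketch below uses Python's standard `ipaddress` module; the address range, the lease counts, and the 80% warning threshold are illustrative assumptions, not values from any particular DHCP server.

```python
import ipaddress

def scope_size(first: str, last: str) -> int:
    """Number of assignable addresses in an inclusive range."""
    return int(ipaddress.IPv4Address(last)) - int(ipaddress.IPv4Address(first)) + 1

def utilization(active_leases: int, first: str, last: str) -> float:
    """Fraction of the scope currently leased out."""
    return active_leases / scope_size(first, last)

def needs_expansion(active_leases: int, first: str, last: str, threshold: float = 0.8) -> bool:
    """Flag the scope once utilization reaches the (assumed) warning threshold."""
    return utilization(active_leases, first, last) >= threshold

first, last = "10.0.1.100", "10.0.1.149"
print(scope_size(first, last))                  # 50
print(f"{utilization(45, first, last):.0%}")    # 90%
print(needs_expansion(45, first, last))         # True
print(max(0, 80 - scope_size(first, last)))     # 30 devices left without an address
```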

Reservations break from dynamic assignment without abandoning DHCP. Printers and servers need a consistent IP so users and applications can find them reliably. Rather than manually configuring static addresses on those devices (which bypasses DHCP entirely and risks conflicts), you create a reservation: a rule that says "whenever the device with MAC address aa:bb:cc:dd:ee:ff requests an address, always give it 10.0.1.50." The device still participates in DHCP and automatically receives the correct gateway and DNS settings — it just always gets the same address.
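The difference between a reservation and an ordinary lease comes down to one lookup. The toy allocator below is a sketch under simplifying assumptions (no lease timers, no conflict detection), and the `DhcpScope` class is hypothetical rather than any real server's API.

```python
class DhcpScope:
    """Toy DHCP scope: reserved MACs always get their fixed IP, others draw from the pool."""

    def __init__(self, pool, reservations):
        self.free = list(pool)                                   # dynamic addresses not yet leased
        self.reservations = {m.lower(): ip for m, ip in reservations.items()}
        self.leases = {}                                         # MAC -> currently leased IP

    def request(self, mac):
        mac = mac.lower()
        if mac in self.reservations:
            ip = self.reservations[mac]      # reservation: same address every time
        elif mac in self.leases:
            ip = self.leases[mac]            # renewal: keep the current address
        elif self.free:
            ip = self.free.pop(0)            # new client: next free dynamic address
        else:
            return None                      # pool exhausted
        self.leases[mac] = ip
        return ip

scope = DhcpScope(
    pool=["10.0.1.100", "10.0.1.101"],
    reservations={"aa:bb:cc:dd:ee:ff": "10.0.1.50"},
)
print(scope.request("AA:BB:CC:DD:EE:FF"))   # 10.0.1.50
print(scope.request("11:22:33:44:55:66"))   # 10.0.1.100
print(scope.request("aa:bb:cc:dd:ee:ff"))   # 10.0.1.50
```

The reserved address sits outside the dynamic pool here, so it can never be handed to another device by accident.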

Network Troubleshooting

When network connectivity fails, a methodical approach beats random clicking. Work from the bottom up: confirm physical layer first (is the cable plugged in? Is the link light on?), then check IP configuration, then test reachability, then test name resolution. Four commands handle most first-line Windows network troubleshooting.

ipconfig /all shows the machine's current IP configuration: address, subnet mask, default gateway, DNS servers, DHCP server, MAC address, and lease expiration. This is the first command to run when something network-related breaks. If the IP address starts with 169.254, the machine failed to obtain a DHCP lease and assigned itself an APIPA (Automatic Private IP Addressing) address. It has no useful network connectivity, and the DHCP problem must be resolved first.

ping sends ICMP echo requests and waits for replies. Successful replies confirm a working Layer 3 path between two hosts. ping 8.8.8.8 tests internet connectivity; ping fileserver.corp.local tests both name resolution and connectivity in one command. If ping to an IP address succeeds but ping to a hostname fails, the problem is DNS, not connectivity. Some firewalls block ICMP, so a missing ping reply doesn't always mean the host is down.

tracert (Windows) / traceroute (Linux/macOS) shows every router hop along the path to a destination, with round-trip time to each. When ping fails, tracert shows where the path breaks: the last hop that responds is as far as traffic is getting. A * * * line means a router didn't respond, either because it blocks ICMP or because it's the failure point. If later hops still answer, the silent router is only filtering, not failing.

nslookup queries DNS directly. nslookup google.com asks your configured DNS server to resolve the name and shows the result. If ping google.com fails but ping 8.8.8.8 succeeds, DNS is the issue — run nslookup google.com and nslookup google.com 1.1.1.1 (querying Cloudflare's public resolver as a comparison) to isolate whether the problem is your DNS server or your network configuration.
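The bottom-up order can be written down as a small decision function. This sketch runs no commands itself: its inputs are observations you would collect by hand from the link light, ipconfig /all, and the two ping tests, and the function name and messages are illustrative.

```python
import ipaddress

def diagnose(link_up: bool, ip_address: str, ping_ip_ok: bool, ping_name_ok: bool) -> str:
    """Walk the checks bottom-up and report the first layer that fails."""
    if not link_up:
        return "physical: check the cable and the link light"
    if ipaddress.ip_address(ip_address).is_link_local:
        # 169.254.0.0/16 is the APIPA range: no DHCP lease was obtained.
        return "dhcp: machine self-assigned an APIPA address"
    if not ping_ip_ok:
        return "connectivity: no reply by IP (or ICMP is blocked); try tracert"
    if not ping_name_ok:
        return "dns: IP reachable but the name does not resolve; try nslookup"
    return "ok: link, addressing, reachability, and name resolution all pass"

print(diagnose(True, "169.254.17.3", False, False))
print(diagnose(True, "10.0.1.57", True, False))
```

Each check is only meaningful once the one before it has passed, which is why a DNS test on a machine with an APIPA address tells you nothing.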

Patch Management

Software has bugs. Some are security vulnerabilities that attackers actively exploit. Vendors release patches to fix them, and the sysadmin's job is to get those patches deployed before attackers get there first. In Windows environments, most organizations run WSUS (Windows Server Update Services): a local server that downloads updates from Microsoft and distributes them to internal machines on a controlled schedule. Rather than every workstation independently hitting Microsoft's servers, all downloads flow from WSUS, and IT decides which updates are approved and when they deploy. A typical cycle: patches release on Patch Tuesday (the second Tuesday of each month); IT tests them on a small pilot group for a week; if nothing breaks, they approve the updates for broad rollout the following week.
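Because Patch Tuesday is defined by a rule rather than a fixed date, a patch calendar can be computed. The sketch below finds the second Tuesday of any month; the one-week pilot window is taken from the typical cycle described above, and the function names are made up for illustration.

```python
import calendar
from datetime import date, timedelta

def patch_tuesday(year: int, month: int) -> date:
    """Return the second Tuesday of the given month."""
    first_weekday = date(year, month, 1).weekday()            # Monday=0 ... Sunday=6
    days_until_tuesday = (calendar.TUESDAY - first_weekday) % 7
    return date(year, month, 1 + days_until_tuesday + 7)

def broad_rollout_earliest(year: int, month: int, pilot_days: int = 7) -> date:
    """Earliest broad rollout date after a pilot of `pilot_days` (assumed one week)."""
    return patch_tuesday(year, month) + timedelta(days=pilot_days)

print(patch_tuesday(2017, 3))            # 2017-03-14
print(broad_rollout_earliest(2017, 3))   # 2017-03-21
```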

Critical vulnerabilities break the normal cycle. A critical-severity remote code execution vulnerability might warrant emergency deployment the same night it's disclosed. The 2017 WannaCry ransomware outbreak, which disrupted hospitals, utilities, and corporations across 150 countries, exploited a Windows vulnerability for which Microsoft had released a patch two months earlier. Most affected systems were one missed patch cycle away from being protected.

Information Security: The CIA Triad

Information security (infosec) is organized around a three-part framework called the CIA triad. Every security control, every policy, every countermeasure can be understood in terms of which of these three properties it protects.

The CIA Triad — the three core properties of information security:
Confidentiality — only authorized parties can access the information.
Integrity — information has not been altered without authorization.
Availability — information and systems are accessible when authorized users need them.

Confidentiality is maintained through encryption, access controls, and physical security — each one a different kind of barrier between the data and everyone who shouldn't see it. Encryption makes the data unreadable without the key; access controls keep unauthorized accounts from opening the file in the first place; physical security keeps an attacker from walking out with the hardware. Integrity rests on hashing, digital signatures, and audit logs. A hash reveals any modification to a file. A digital signature breaks the moment a document is tampered with. An audit log records who changed what and when, so the alteration has a name attached. Availability is the work of redundancy, backups, patch management, and capacity planning — eliminating single points of failure, recovering when systems fail anyway, keeping systems patched and running, and noticing before a disk fills up.

The three properties trade off against each other. Strong encryption protects confidentiality, but a lost key makes the data permanently inaccessible to you, not just to attackers, so availability suffers. Strict access controls protect both confidentiality and integrity, but lock access down too tightly and legitimate users can't get their work done, which is also an availability cost. Good security design finds the right balance for the organization's specific risk profile rather than maximizing any single property.

Authentication and Authorization

Two terms that are often confused but mean distinct things:

Authentication — proving who you are. The system verifies your identity.
Authorization — determining what you're allowed to do. The system checks your permissions.

Authentication always comes first. When you type your username and password at a Windows login screen, you are being authenticated — Active Directory verifies that those credentials match a known account. Once authenticated, every resource access triggers authorization — the system checks whether your account (or the groups it belongs to) has permission to read, write, or execute that resource. You can be successfully authenticated but unauthorized to access a specific folder. The error "Access Denied" is an authorization failure, not an authentication failure.

Traditional authentication relies on a single factor: something you know (a password). Passwords have a fundamental weakness — they can be stolen, guessed, phished, or leaked in a data breach without the account owner knowing. Multi-factor authentication (MFA) requires two or more factors from different categories:

Something you know — password, PIN, security question
Something you have — a phone running an authenticator app, a hardware security key (YubiKey), a smart card
Something you are — fingerprint, face scan, retina scan (biometrics)

MFA doesn't just require two factors: they must come from different categories. A password plus a security question is two things you know, not MFA. A password plus a six-digit code from an authenticator app is MFA (know + have). If an attacker steals your password via phishing, MFA stops them at the second factor, because they don't have your phone. This is why MFA is one of the highest-impact security controls an organization can deploy. Microsoft has reported that MFA blocks over 99% of automated credential attacks.
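The "different categories" rule is simple enough to express as a check. In this sketch the factor names and the `FACTOR_CATEGORY` table are illustrative labels, not identifiers from any real identity product.

```python
# Illustrative mapping of factor types to the three categories listed above.
FACTOR_CATEGORY = {
    "password": "know",
    "pin": "know",
    "security_question": "know",
    "authenticator_code": "have",
    "hardware_key": "have",
    "smart_card": "have",
    "fingerprint": "are",
    "face_scan": "are",
}

def is_mfa(factors) -> bool:
    """True only if the factors span at least two distinct categories."""
    categories = {FACTOR_CATEGORY[f] for f in factors}
    return len(categories) >= 2

print(is_mfa(["password", "security_question"]))    # False: both are "know"
print(is_mfa(["password", "authenticator_code"]))   # True: know + have
```

Counting categories rather than factors is the whole point: two secrets that can be phished the same way give an attacker one obstacle, not two.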

MFA in practice. Modern organizations enforce MFA through their identity provider (Microsoft Entra ID, formerly Azure AD; Okta; etc.). When you log in, the identity provider verifies your password and then asks for the second factor: either a push notification sent to a registered phone app (Microsoft Authenticator, Duo) that you approve, or a six-digit code read from an authenticator app (Google Authenticator). Either way, the response proves you have the phone. Some high-security environments use hardware keys (YubiKey) that must be physically present during login, eliminating the phone-intercept risk.
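The six-digit codes shown by authenticator apps are typically produced by the TOTP algorithm (RFC 6238): the app and the server share a secret, and each independently computes an HMAC over the current 30-second time window. The sketch below follows the RFC using only the standard library. The secret is the RFC's published test value, not something to use in practice, and real deployments add details this omits (Base32-encoded secrets, tolerance for clock drift, rate limiting).

```python
import hashlib
import hmac
import struct
import time

def totp(secret: bytes, at=None, step: int = 30, digits: int = 6) -> str:
    """Time-based one-time password per RFC 6238, using HMAC-SHA1."""
    if at is None:
        at = time.time()
    counter = int(at) // step                                  # which 30-second window we are in
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                                    # dynamic truncation (RFC 4226)
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

RFC_TEST_SECRET = b"12345678901234567890"   # test secret published in RFC 6238
print(totp(RFC_TEST_SECRET, at=59))         # 287082
print(totp(RFC_TEST_SECRET))                # changes every 30 seconds
```

Because the code depends on a secret stored on the phone, knowing the password alone is not enough to compute it.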

Protecting Data: Encryption and Hashing

Two cryptographic tools protect data confidentiality and integrity: encryption makes data unreadable without a key; hashing produces a fixed-length fingerprint that reveals any change to the data. They are fundamentally different operations used for different purposes.

Symmetric Encryption

In symmetric encryption, the same key is used to encrypt and decrypt. The sender and receiver must both possess the same secret key, which means the key must be securely shared before communication can happen. The dominant modern algorithm is AES (Advanced Encryption Standard), used with 128-bit or 256-bit keys. AES is what protects encrypted hard drives (BitLocker), files, and archived backups. It's fast and extremely strong: a brute-force attack against AES-256 is computationally infeasible with any foreseeable technology.

Asymmetric Encryption

Asymmetric encryption uses a mathematically linked key pair: a public key that anyone can have, and a private key that only you hold. Data encrypted with the public key can only be decrypted with the corresponding private key. This solves the key distribution problem: you publish your public key openly, and anyone can send you an encrypted message that only you can decrypt. The best-known algorithm is RSA; elliptic-curve algorithms are increasingly common. In practice, asymmetric encryption is too slow for bulk data, so TLS (HTTPS) uses it only to establish a shared symmetric session key, then switches to a symmetric cipher such as AES for the actual data transfer. This is called a hybrid approach.

Hashing

A hash function takes an input of any length and produces a fixed-length output (the hash or digest). SHA-256 always outputs exactly 256 bits (64 hex characters) regardless of whether the input is one word or an entire database. Hash functions are one-way: it is computationally infeasible to reverse a hash back to its original input. They are also deterministic: the same input always produces the same hash. And they exhibit the avalanche effect: a single character change produces a completely different hash — not just a slightly different one. Hashing protects integrity, not confidentiality: anyone can compute the hash of a file, so hashes aren't secret. What they do is prove a file hasn't changed. When you download software, the developer publishes the SHA-256 hash; you compute the hash of your downloaded file and compare. If they match, the file is exactly what the developer published, byte for byte.
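These properties are easy to observe with Python's standard `hashlib` module. The sketch below is a minimal demonstration; the helper names are made up for illustration, and a real download check would read the installer from disk rather than from an in-memory byte string.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as 64 hex characters."""
    return hashlib.sha256(data).hexdigest()

def matches_published(data: bytes, published_hex: str) -> bool:
    """Compare a computed digest with the one a developer published."""
    return sha256_hex(data) == published_hex.strip().lower()

lower = sha256_hex(b"apple")
upper = sha256_hex(b"Apple")
print(len(lower), len(sha256_hex(b"x" * 1_000_000)))   # 64 64  (fixed length)
print(lower == sha256_hex(b"apple"))                   # True   (deterministic)
print(lower == upper)                                  # False  (avalanche effect)

# Count how many of the 64 hex positions differ after a one-character change.
print(sum(a != b for a, b in zip(lower, upper)))
```

The last line usually prints a number close to 60: almost every position changes, so the new digest gives no hint of how small the edit was.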

Hashing is also how passwords are stored securely. A well-designed system never stores your actual password; it stores a hash of it. When you authenticate, the system hashes what you typed and compares it to the stored hash. If they match, your password is correct. Since hashes are one-way, a database breach that exposes the password hashes doesn't directly expose the passwords. Modern password storage adds a random per-user salt before hashing, which prevents attackers from pre-computing a table of common password hashes (a rainbow table attack), and uses a deliberately slow algorithm (bcrypt, scrypt, Argon2, or PBKDF2) so that guessing passwords one at a time is expensive.
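Salted password storage can be sketched with the standard library's PBKDF2 function, a deliberately slow hash designed for passwords. This is an illustration, not a production recipe: the iteration count is an assumed value, the function names are hypothetical, and real systems should rely on a maintained password-hashing library.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000   # assumed work factor for illustration; tune to current guidance

def hash_password(password: str, salt=None):
    """Return (salt, digest). A fresh random salt is generated unless one is supplied."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest

def verify_password(password: str, salt: bytes, stored_digest: bytes) -> bool:
    """Re-hash the attempt with the stored salt and compare in constant time."""
    _, candidate = hash_password(password, salt)
    return hmac.compare_digest(candidate, stored_digest)

salt, stored = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, stored))   # True
print(verify_password("wrong guess", salt, stored))                    # False

# Two users with the same password get different stored hashes, because salts differ.
_, other = hash_password("correct horse battery staple")
print(stored == other)                                                 # False
```

Only the salt and the digest are stored. The password itself exists in memory just long enough to be hashed.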

Common Threats

Ransomware is currently the most disruptive threat facing organizations. The attack pattern: malware (typically delivered via a phishing email or an unpatched vulnerability) enters the network, moves laterally to reach file servers and backups, then encrypts everything, making all data inaccessible. The attackers demand payment for the decryption key, sometimes threatening to publish stolen data if the ransom isn't paid. Ransomware attacks the Availability pillar of the CIA triad directly (and, when data is stolen for extortion, Confidentiality as well). The defenses are layered: immutable offsite backups (so you can restore without paying), patched systems (so the initial entry is harder), MFA (so stolen credentials don't give attackers full access), and network segmentation (so malware can't move freely from one department to every server). A company that has tested its backups and can restore within its RTO (recovery time objective, the maximum downtime it can tolerate) has much more negotiating leverage than one that discovers its backups were also encrypted.

Phishing is the most common initial attack vector — a deceptive email designed to steal credentials or trick the recipient into running malware. Spear phishing targets specific individuals using personal details to appear credible (a "message from your CEO" requesting an urgent wire transfer). No technical patch exists for human judgment, which is why organizations layer email filtering (blocking known-malicious senders and attachments), user training (security awareness programs), and MFA (so a stolen password alone isn't enough) together. Social engineering is the broader category: manipulating people — not systems — into taking actions that compromise security. A caller pretending to be IT support and asking a user to disable their firewall is social engineering. Technical security is only as strong as the human decisions surrounding it.

Backup and Recovery

Backups are the last line of defense against data loss from any cause: hardware failure, ransomware, accidental deletion, fire, flood. The industry baseline is the 3-2-1 rule: keep 3 copies of data, on 2 different types of media, with 1 copy stored offsite. If ransomware encrypts your on-site servers and your on-site backup simultaneously, the offsite copy survives. "Offsite" today often means cloud backup storage: geographically separated from your primary data and, critically, isolated from the production network (air-gapped or made immutable) so ransomware can't reach or alter it.
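The 3-2-1 rule is a checklist, so it can be checked mechanically. In this sketch the `BackupCopy` record and the copy names are hypothetical, and the production data counts as one of the three copies.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BackupCopy:
    name: str
    media: str       # e.g. "disk", "tape", "cloud"
    offsite: bool

def satisfies_3_2_1(copies) -> bool:
    """At least 3 copies, on at least 2 media types, with at least 1 offsite."""
    return (
        len(copies) >= 3
        and len({c.media for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )

plan = [
    BackupCopy("production file server", "disk", offsite=False),
    BackupCopy("on-site backup appliance", "disk", offsite=False),
    BackupCopy("cloud backup vault", "cloud", offsite=True),
]
print(satisfies_3_2_1(plan))       # True
print(satisfies_3_2_1(plan[:2]))   # False: two copies, one media type, nothing offsite
```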

Three backup types offer different tradeoffs between storage cost and restore speed. A full backup is a complete copy of all selected data — simplest to restore from but requires the most time and storage. An incremental backup copies only data changed since the last backup (full or incremental) — fast and space-efficient, but restoring requires the last full backup plus every incremental since. A differential backup copies all data changed since the last full backup — larger than incremental but restoring only requires the last full plus the most recent differential. Most organizations run full backups weekly and incrementals nightly.
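The restore-time difference between incremental and differential backups shows up clearly when you list which backup sets a restore needs. The function below is a sketch of that reasoning: it walks backward from the newest backup, and the day labels are illustrative.

```python
def restore_chain(backups):
    """Given (label, kind) pairs in chronological order, list the sets needed to restore.

    kind is "full", "incremental", or "differential".
    """
    chain = []
    i = len(backups) - 1
    while i >= 0:
        label, kind = backups[i]
        chain.append(label)
        if kind == "full":
            return list(reversed(chain))       # a full backup ends the chain
        i -= 1
        if kind == "differential":
            # A differential covers everything since the last full, so skip straight to it.
            while i >= 0 and backups[i][1] != "full":
                i -= 1
    raise ValueError("no full backup found to anchor the restore")

incremental_week = [("Sun", "full"), ("Mon", "incremental"),
                    ("Tue", "incremental"), ("Wed", "incremental")]
differential_week = [("Sun", "full"), ("Mon", "differential"),
                     ("Tue", "differential"), ("Wed", "differential")]

print(restore_chain(incremental_week))    # ['Sun', 'Mon', 'Tue', 'Wed']
print(restore_chain(differential_week))   # ['Sun', 'Wed']
```

With incrementals, one damaged set in the middle of the week breaks every later restore point. With differentials, only the full and the latest differential have to be readable.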

Two business-level metrics define backup requirements. RPO (Recovery Point Objective) is how much data loss the organization can tolerate, measured in time: an RPO of 24 hours means backups must run at least daily, and any data created since the last backup is gone after a restore. RTO (Recovery Time Objective) is how long the organization can tolerate systems being offline: an RTO of 4 hours means recovery procedures must be fast enough to restore service within 4 hours. A company that has never tested a restore doesn't actually know its RTO — it only knows how long its backups take to run.

Monitoring and Alerting

You can't fix what you don't know is broken. On Windows, Event Viewer is the built-in log viewer. Every significant system event — a user login, a failed authentication attempt, a service crash, a disk error, a Group Policy update — generates an event logged at Information, Warning, or Error severity. After an incident, Event Viewer is the first stop: filter for Errors and Warnings in the time window around the problem and work backward. Security event logs record every authentication attempt; a spike in failed logins is a sign of a brute-force attack or a misconfigured service.

Most organizations augment Event Viewer with dedicated monitoring tools — PRTG, Zabbix, Nagios, or cloud-based services like Datadog — that continuously poll every server and network device for CPU utilization, memory usage, disk space, service status, and network throughput. When a threshold is crossed, the tool sends an alert to whoever is on call. A database server's disk is filling up at 3 AM; the monitoring system fires an alert to the on-call sysadmin's phone before it hits 100% and the database stops accepting writes. That is what monitoring is for. The goal is to discover problems before users do.
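At its core, a threshold alert is a comparison run on a schedule. The sketch below checks disk usage with the standard library; the 80% and 90% thresholds and the function names are assumptions for illustration, and a real monitoring tool would add scheduling, notification, and history.

```python
import shutil

def disk_alert(used_bytes: int, total_bytes: int, warn: float = 0.80, crit: float = 0.90) -> str:
    """Classify disk usage against (assumed) warning and critical thresholds."""
    fraction = used_bytes / total_bytes
    if fraction >= crit:
        return "CRITICAL"
    if fraction >= warn:
        return "WARNING"
    return "OK"

def check_path(path: str = ".") -> str:
    """Check the volume that holds `path` on the local machine."""
    usage = shutil.disk_usage(path)        # named tuple: total, used, free (in bytes)
    return disk_alert(usage.used, usage.total)

print(disk_alert(95, 100))   # CRITICAL
print(disk_alert(85, 100))   # WARNING
print(check_path("."))       # depends on the machine running it
```

Setting the warning threshold well below 100% is what buys the on-call administrator time to act before the database stops accepting writes.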

Chapter 13 asks what happens when you have fifty servers to manage instead of one — and how virtualization and cloud computing changed the answer.

Chapter 12 Quiz

1. After downloading a large software installer, you compute its SHA-256 hash and compare it to the value published on the developer's website. Which property of the CIA triad does this verify?

2. When you type your username and password to log in to Windows, you are being ______. When the system then checks whether your account can open a specific shared folder, you are being ______.

3. Which of the following correctly represents multi-factor authentication?

4. Your company moves its public website to a new server with a different IP address. Which DNS record type must be updated for visitors to reach the new server?

5. A DHCP reservation is used to:

6. You run ping 8.8.8.8 and get replies. You then run ping google.com and get "could not find host google.com." What is the most likely cause?

7. You encrypt a file with a key. Anyone who possesses that same key can decrypt it. This describes:

8. You change the word "apple" to "Apple" in a document and recompute the SHA-256 hash. The new hash is completely different from the old one. This property of hash functions is called:

9. A ransomware attack encrypts all files on a company's servers, making them inaccessible. Which aspect of the CIA triad does this most directly attack?

10. The 3-2-1 backup rule requires three copies of data on two different types of media. What does the "1" refer to?