Even the best set-up server requires occasional maintenance. Here’s a ten-point checklist for performing your own server upkeep and monitoring. Ticking these off can greatly reduce the risk of server failure.
1. Check Your Room Arrangement
The arrangement of your server room is vitally important. If you’ve inherited the job and had no say in the original setup, it’s well worth auditing the layout yourself.
Make sure your hardware has adequate air paths. It may be tempting, but don’t place your servers too close together. It may sound obvious, but also make sure your server is kept off the floor – you’d be surprised how many people make this rookie mistake.
Check that each piece of hardware is actually reached by the cooling system you have in place. Also make sure you have a temperature alarm so that if things do start to overheat, you find out before your hardware is totally cooked.
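As a rough illustration of the temperature alarm idea, the Python sketch below classifies a temperature reading against two thresholds. The function name and the default limits of 27 C and 32 C are assumptions picked for the example, not vendor figures. Use the limits from your own hardware documentation, and feed in readings from whatever sensor you actually have.

```python
def classify_temperature(reading_c: float,
                         warn_c: float = 27.0,
                         critical_c: float = 32.0) -> str:
    """Return "ok", "warning" or "critical" for a temperature in Celsius.

    The default thresholds are illustrative assumptions only.
    """
    if warn_c >= critical_c:
        raise ValueError("warn_c must be lower than critical_c")
    if reading_c >= critical_c:
        return "critical"
    if reading_c >= warn_c:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Hypothetical readings, e.g. collected once a minute from a rack sensor.
    for reading in (22.5, 28.0, 33.1):
        print(f"{reading:.1f} C -> {classify_temperature(reading)}")
```

Having a "warning" level below "critical" is the point of the exercise: it gives you time to react before the hardware is in real trouble.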
2. Physically Clean Your Server
And yes, I know your server is in a closed cabinet. You should still do this! Servers inhale dust and dirt all the time through their fans and ventilation, and this causes more issues than you’d think.
When your server sucks in dust and becomes clogged, as almost every machine eventually does, its performance and reliability start to degrade. When cooling is blocked, most modern CPUs and GPUs throttle their clock speeds to keep temperatures down.
So if your server is running slower than it should (and it may have happened so gradually that you’re experiencing the boiling frog effect), it’s time to break out the cleaning kit.
Using compressed air is one of the best ways to go about it, but make sure not to damage your fans.
3. Check Logs for Hardware Errors
Hardware failure is a major cause of server outages and data loss. While it may be tempting to rely on the power-on self-test (POST) to pick up this sort of error, there are a number of hardware issues that won’t register until after both POST and OS loading have finished.
Check your system logs for hardware issues. Examples include overheating notices, disk read errors, and network failures. All of these are early indicators that a hardware failure could be on the horizon.
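If you’d rather not read logs by eye, a small script can do a first pass. The Python sketch below counts log lines that look like the three warning types mentioned above. The keyword patterns and the sample log lines are assumptions made up for this example; real messages vary by OS, driver and vendor, so adjust the patterns to what your own logs contain.

```python
import re
from collections import Counter

# Illustrative keyword patterns only; tune these to your own log format.
PATTERNS = {
    "overheating": re.compile(r"thermal|overheat|temperature above", re.IGNORECASE),
    "disk": re.compile(r"i/o error|read error|bad sector", re.IGNORECASE),
    "network": re.compile(r"link is down|link down|carrier lost", re.IGNORECASE),
}


def count_hardware_warnings(lines):
    """Count how many lines match each category of hardware warning."""
    counts = Counter()
    for line in lines:
        for category, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[category] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample lines standing in for a real system log.
    sample = [
        "kernel: CPU0: Core temperature above threshold",
        "kernel: sda: I/O error, dev sda, sector 1234",
        "eth0: Link is Down",
        "sshd: session opened for user admin",
    ]
    for category, count in count_hardware_warnings(sample).items():
        print(f"{category}: {count}")
```

A rising count between maintenance runs is more telling than any single message, so it’s worth noting the numbers each time you check.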
If you do have hardware errors, try updating your drivers, particularly for your RAID card or GPU. This may resolve the issue without you having to toss out a perfectly good bit of hardware.
4. Check Disk Usage
It might sound simple, but running out of disk space can also lead to server failure and data loss. If you find yourself exceeding 90% of your disk capacity, either reduce your usage or add more storage.
If your disk usage reaches 100%, there’s a good chance your server will stop responding and your data may become corrupted or lost. This is another reason regular data use audits are a really good idea.
If that wasn’t enough motivation, it’s worth noting a smaller data footprint means faster recoveries. If you do experience a data disaster, a full recovery will be quicker if there was less data on the server to begin with.
Start by cleaning out old e-mails, unused files and logs, and software you no longer use. There’s no reason you should be using your active system as an archival system – this is what backups are for!
There are many different performance monitoring tools for Windows and Linux servers that can track disk usage for you, so you should make sure you have one installed.
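For a quick manual check of the 90% rule of thumb, the Python sketch below uses the standard library’s shutil.disk_usage. The function names and the default path are assumptions for this example, and the result can differ slightly from what your OS tools report (for instance because of reserved blocks), so treat it as a sanity check rather than a replacement for a proper monitoring tool.

```python
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Return the percentage of the disk holding `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.used / usage.total * 100


def usage_status(percent: float, threshold: float = 90.0) -> str:
    """Flag usage that exceeds the threshold (90% by default)."""
    return "over threshold" if percent > threshold else "ok"


if __name__ == "__main__":
    percent = disk_usage_percent(".")
    print(f"Disk usage: {percent:.1f}% ({usage_status(percent)})")
```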
5. Check For Application Updates
Out-of-date operating systems and software are a massive security risk. For an example, you don’t need to look any further than the recent WannaCry ransomware outbreak.
In this scenario, the ransomware disproportionately hit users with outdated operating systems such as Windows 7 and Windows Server 2003. The same applies to web apps you may be using, like WordPress, where an outdated install can lead to very direct security breaches.
You may not be aware that your OS or one of your apps has reached its End-of-Life (EOL), which means no more bug fixes or support. That’s why it’s worth listing out everything your server uses and ticking each item off one by one.
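One way to keep that list honest is to store it with an EOL date next to each entry. The Python sketch below is a minimal example of the idea. The product names and dates are hypothetical placeholders, not real EOL dates; look up the actual dates in each vendor’s lifecycle documentation.

```python
from datetime import date

# Hypothetical inventory: names and dates are placeholders, not real EOL data.
INVENTORY = {
    "example-os 1.0": date(2020, 1, 1),
    "example-cms 4.2": date(2031, 6, 30),
    "example-backup-agent 9": None,  # EOL date not looked up yet
}


def review_inventory(inventory, today):
    """Sort inventory entries into expired, supported and unknown."""
    report = {"expired": [], "supported": [], "unknown": []}
    for name in sorted(inventory):
        eol = inventory[name]
        if eol is None:
            report["unknown"].append(name)
        elif eol < today:
            report["expired"].append(name)
        else:
            report["supported"].append(name)
    return report


if __name__ == "__main__":
    for status, names in review_inventory(INVENTORY, date.today()).items():
        print(f"{status}: {', '.join(names) or 'none'}")
```

The "unknown" bucket matters as much as the "expired" one: anything in it is software whose support status you haven’t actually checked.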
It might sound obvious, but two of the biggest things you also want to keep up to date are your security and backup software. Anti-virus software relies on having current data on the threats it needs to protect you from, and backup software needs to be up to date on the kind of data it will be handling.
Don’t rely fully on automated updates. While you definitely shouldn’t turn off things like Windows Update, and automated updates do have their advantages, it’s worth having a bit of control over this process yourself. If you rely on automated updates alone, you may not notice when they stop (e.g. due to an EOL).
6. Perform Backup & Hardware Checks
A stress test is important to make sure your server will pull through when you need it. One of the best ways to do this is to perform a test restore from one of your backups to make sure it works. This should be part of your regular backup routine, but it’s good to incorporate it into your server maintenance checks as well.
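A test restore is only useful if you confirm that what came back matches what went in. One possible approach, sketched below in Python, is to compare checksums: build a manifest of SHA-256 digests at backup time, build another from the restored copy, and compare the two. The function names are assumptions for this example, and it says nothing about how your backup software performs the restore itself.

```python
import hashlib
from pathlib import Path


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Reads the whole file into memory; hash in chunks for big files.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def compare_manifests(source, restored):
    """Return (missing, changed) lists of file names for a restored copy."""
    missing = sorted(name for name in source if name not in restored)
    changed = sorted(name for name in source
                     if name in restored and source[name] != restored[name])
    return missing, changed
```

Two empty lists mean every file in the original manifest was restored with identical contents. Build the source manifest when the backup is taken, since live data will have changed by the time you run the test.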
Make sure the destination of your backups is properly set, and you’ve got enough room on the destination to store backups of your server. If you’re backing up three terabytes of data and your backup media is only two terabytes, you’re going to have issues. Likewise, you should make sure you’ve got enough room for your entire backup scheme and room for data growth.
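The sizing check is simple arithmetic, shown here as a small Python sketch. It assumes every retained copy is a full, uncompressed backup, which overestimates the space needed if your software uses incremental backups, compression or deduplication. The function names and the growth allowance are assumptions for this example.

```python
def required_capacity_tb(data_tb: float,
                         retained_copies: int = 1,
                         growth_rate: float = 0.0) -> float:
    """Estimate the space needed for full, uncompressed backup copies.

    growth_rate is the expected data growth as a fraction (0.5 = 50%).
    """
    if data_tb < 0 or retained_copies < 1 or growth_rate < 0:
        raise ValueError("inputs must be non-negative, with at least one copy")
    return data_tb * retained_copies * (1 + growth_rate)


def backup_fits(data_tb: float, media_tb: float,
                retained_copies: int = 1, growth_rate: float = 0.0) -> bool:
    """Return True if the backup scheme fits on the backup media."""
    return required_capacity_tb(data_tb, retained_copies, growth_rate) <= media_tb


if __name__ == "__main__":
    # The example from the text: 3 TB of data, 2 TB of backup media.
    print(backup_fits(3.0, 2.0))              # a single copy already fails
    print(required_capacity_tb(3.0, 2, 0.5))  # 3 TB x 2 copies x 1.5 = 9 TB
```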
As mentioned earlier, if you’ve got cold data, you should either delete it or archive it with a high compression rate.
You should also test your hardware under realistic load. This is a good way to find weak components before they break in production.
7. Consolidate Unnecessary Hardware & Virtualize
There are definite benefits to consolidating multiple older physical servers onto some newer hardware, and virtualizing all your machines. We’ve covered the benefits and how to perform a physical-to-virtual (P2V) transfer in an article series.
On a broader level, you want to remove any pieces of hardware from your server that you’re not using, specifically pieces of tech that present a high failure or heating risk. This includes unused PCI-E cards, power supplies, fans, and expansion cards.
Check your RAID cards to see how much life they’ve got left. They run notoriously hot, which shortens their lifespan quite a bit. On the other hand, your CPUs, GPUs, RAM and motherboard are all fairly low risk and less likely to fail.
8. Cloud Storage Incorporation
If possible, incorporate cloud storage into your server structure. From a backup point of view, it’s good to have an offsite as well as an onsite copy of your data, which gives you different levels of redundancy.
Depending on how you want to go, you may also want to incorporate cloud storage for your cold archival data.
9. Review User Accounts
Over time, there’s bound to be staff and other user changes, and client cancellations. You want to make sure to remove these users from the system.
Even in a company with an iron-clad policy of removing user accounts and changing passwords when people leave, you’re still bound to miss some permissions and accounts. It’s just the way things work. Staff may sign up for things without telling you, or just forget to document them.
Storing old sites and users is an obvious security and legal risk. Think of it like regular pruning that needs to be done.
It’s also worth going through and changing all the old passwords as part of a regular routine. A system security audit should generally be conducted at least four times a year.
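To make the account review less of a guessing game, you can flag accounts that haven’t logged in for a while. The Python sketch below assumes you can export a mapping of user names to last login dates from your system. The 90-day cutoff, the function name and the sample users are all assumptions for this example.

```python
from datetime import date, timedelta


def find_stale_accounts(last_logins, today, max_idle_days=90):
    """Return users who never logged in or have been idle too long.

    last_logins maps a user name to a date, or None if never logged in.
    """
    cutoff = today - timedelta(days=max_idle_days)
    stale = []
    for user in sorted(last_logins):
        last_seen = last_logins[user]
        if last_seen is None or last_seen < cutoff:
            stale.append(user)
    return stale


if __name__ == "__main__":
    # Hypothetical export of last login dates.
    logins = {
        "alice": date(2024, 12, 20),
        "bob": date(2024, 6, 1),
        "carol": None,
    }
    print(find_stale_accounts(logins, today=date(2025, 1, 1)))
```

A flagged account isn’t automatically one to delete. Treat the list as a set of candidates to review with whoever owns the account.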
10. Write A Detailed Maintenance Report
While you’re doing everything above, make sure to write a detailed maintenance report. There are plenty of good reasons to do this, both for you and for the organization.
- It shows your boss that you’re not just sitting around twiddling your thumbs.
- It shows exactly what has been done for everyone else’s knowledge, especially if someone comes along later and wants to know.
- It offers accountability if things go wrong – if you’ve taken all the right steps and can prove it, it helps you out if there is ever a major server outage.
- It’s a great way for you to remember what you’ve done, and check back through to see if you’re overdue to perform a certain kind of server maintenance.
- If you include costing, it helps with budget setting for the company. Your superiors can see what needs to be set aside for the next financial year instead of being caught by surprise.
In the report, include the type of maintenance you’re performing (updates, cleaning, repair, condition, deletion), the cost, when it was done, what intervals you’re planning to do it at, and any additional details.
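If you keep the report in a structured form, it’s easy to see what’s coming due. The Python sketch below models one report entry with the fields listed above and writes entries out as CSV. The class name, field names and sample values are assumptions for this example; a spreadsheet with the same columns would do the job just as well.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from datetime import date, timedelta


@dataclass
class MaintenanceEntry:
    task_type: str      # e.g. updates, cleaning, repair, condition, deletion
    performed_on: date
    cost: float
    interval_days: int  # planned gap before the task is repeated
    details: str = ""

    def next_due(self) -> date:
        """Date on which this task should be performed again."""
        return self.performed_on + timedelta(days=self.interval_days)


def to_csv(entries) -> str:
    """Render entries as CSV text, adding a computed next_due column."""
    buffer = io.StringIO()
    names = [f.name for f in fields(MaintenanceEntry)] + ["next_due"]
    writer = csv.DictWriter(buffer, fieldnames=names, lineterminator="\n")
    writer.writeheader()
    for entry in entries:
        row = asdict(entry)
        row["next_due"] = entry.next_due()
        writer.writerow(row)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical entry for a quarterly physical clean.
    entry = MaintenanceEntry("cleaning", date(2025, 1, 15), 40.0, 90,
                             "compressed air")
    print(to_csv([entry]))
```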
Make Sure You’ve Got The Right Software!
Your backup software is vital to keeping your servers running, and your business intact. Read our handy guide on low-cost and reliable backup software for every OS.