System Administration

Dave Kinchlea

For understandable reasons, virtually all software vendors ship their software in the most generic form they believe their customers will accept. By that I mean it arrives with known security issues such as default passwords, with device drivers and configurations designed to work with almost any possible hardware, and is otherwise packaged and installed in a way that minimizes the need for customer support. This approach is excellent for a small business because it can almost ignore IT costs and let the systems take care of themselves. Ignorance is bliss.

It doesn't take much growth, however, before a business starts to understand that computer systems require maintenance or they start to suffer. Hard drives fill up, fragment, and thrash; memory fills up and forces excessive paging and even swapping; software previously thought to be secure turns out to be vulnerable to remote attack.

System Administration encompasses the regular, periodic, and urgent actions required to service the underlying system infrastructure so that everything functions at the level the application requires. Properly understanding and containing these costs requires relying on a Service Level Agreement (SLA) for IT services, something usually found only in very large organizations that offer a centralized IT infrastructure to other departments and charge for the service. For IT administrators the following will be a familiar list; for others it might be a bit eye-opening.

  • Hard drive
    • look for and recognize the signs of hardware failure (disk versus adapter) -- this means that an administrator needs to know something of all the available types of hard drives and the ways of communicating with them; there are many of both and they continue to evolve
    • look for and recognize the signs of RAID failure -- this means that the administrator needs to know the trade-offs in redundancy, error detection, and error correction that each of the RAID levels makes, something that also continues to evolve
  • File System
    • monitor usage and keep it below 80% whenever possible (while statistics vary, performance generally starts to degrade as disk utilization increases). Note that 20% of 73GB is 14.6GB and 20% of 1TB is 200GB; it is very hard to leave that much space unused, so an administrator needs to balance budgets against performance degradation and also deal with the huge amount of digital waste without removing or making inaccessible "important" content (and it falls to the administrator to determine what "important" means)
    • monitor and enforce quotas to assure fair shared use and/or to support charge-back (note that quota management is specialized work that relatively few IT administrators deal with)
    • monitor and enforce appropriate access control
  • Operating System
    • monitor vendor for patches and hotfixes appropriate to local platform
    • test any applicable patches and hotfixes on an appropriate test platform (note that it is usually up to the administrator to determine what is appropriate for each patch/hotfix)
    • turn off/disable unused and unwanted applications so as not to waste resources and not to be vulnerable to unrelated security holes. Note that while some services, applications, and processes are obvious in what they do, others are not. On my very dedicated, single-purpose server there are over 125 processes not directly related to the application and nearly 150 in total. Nobody notices when resources are no longer wasted, but they certainly do when a mistake is made and a wanted process is stopped; this is much harder than it sounds
    • Anti-virus and anti-malware monitoring and defence (the administrator must also know where such tools should NOT be active)
  • Business Continuity
    • Backup -- on a schedule that meets defined SLAs, which may mean that backups have to happen during productive use without affecting it. This is another one of those problems that requires greater skill and knowledge as the volume grows. Using a gigabit switch into a theoretical device that can write at the same speed, a 1TB file system will take about 2.2 hours (1TB at 125MB/s), assuming 100% of the bandwidth is available and can be used. There are not a lot of devices that can write data at that speed and using 100% of the bandwidth is almost impossible, so this is a best-case scenario
      • Differential backups transfer much less data, copying only what has changed since the last full backup, which reduces backup time but increases restore time
    • Restore -- restoring a system is usually a lot more work than backing it up. Sometimes it is as easy as restoring a file or group of files, but more often it is much more complicated than that. This is particularly true for operating systems that offer "easy administration": the management of the running system often interferes with the backups and makes restoring to a previous state very cumbersome. These problems can be solved with more expensive management and/or disaster recovery software
    • End User Errors -- mistaken deletes and/or modifications; the administrator may need to provide recovery from these mistakes for at least a select few within the organization
    • Force Majeure -- big disasters are different from ordinary problems. A Force Majeure event is one that cannot itself be anticipated or avoided but must be planned for just the same: a building fire, natural disaster, riot, terror attack, etc. The premise is that the systems must be brought back literally from nothing within a specific time frame (perhaps 48 hours) as defined within the SLA
  • Supporting Applications -- the administrator(s) must also be knowledgeable enough to configure and maintain a large number of services such as:
    • Domain Name System (DNS)
    • Email transportation (SMTP)
    • Email delivery (POP3/IMAP/Groupware)
    • Network Time Protocol (NTP)
    • Simple Network Management Protocol (SNMP)
    • Secure Shell (ssh)
    • Dynamic Host Configuration Protocol (DHCP) or manual network configuration
    • Scripting (Perl, Shell, Command-line, PowerShell) to automate local tasks, assuring their completion and adherence to procedures
    • Web Server (Apache, IIS)
    • Java Application Server (JBoss, JRun, WebSphere) or a Servlet engine like Tomcat (and knowledge as to when one is appropriate over another)
    • Databases (Oracle, MS-SQL, MySQL, Sybase, DB2) -- this is a highly paid and sought-after skill in most organizations, often not appreciated in smaller organizations that can afford to mask the need with excessive physical resources
    • Scheduling services (taskman, cron) -- how to run and perform tasks on a schedule and during off-peak hours
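
The file-system item above asks the administrator to keep usage below 80%. The check and the free-space arithmetic behind it might look something like the following Python sketch. The function names (usage_percent, needs_attention, reserve_bytes) are invented for illustration and are not part of any particular monitoring tool; only the 80% threshold and the 73GB and 1TB figures come from the list.

```python
import shutil

# Threshold taken from the list above; adjust to local policy.
THRESHOLD_PERCENT = 80


def usage_percent(path):
    """Return how full the file system holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        return 0.0  # some pseudo file systems report no capacity
    return 100.0 * usage.used / usage.total


def needs_attention(path, threshold_percent=THRESHOLD_PERCENT):
    """True when the file system is fuller than the threshold."""
    return usage_percent(path) > threshold_percent


def reserve_bytes(capacity_bytes, threshold_percent=THRESHOLD_PERCENT):
    """Bytes that must stay free to remain at or under the threshold."""
    return capacity_bytes * (100 - threshold_percent) // 100


if __name__ == "__main__":
    # Hypothetical mount point; substitute whatever is relevant locally.
    path = "."
    print(f"{path}: {usage_percent(path):.1f}% used, "
          f"over threshold: {needs_attention(path)}")
    print("reserve on 73GB:", reserve_bytes(73 * 10**9), "bytes")
    print("reserve on 1TB:", reserve_bytes(10**12), "bytes")
```

Run from cron or a similar scheduler, a check like this only reports the problem; deciding what can safely be removed remains the administrator's job.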

Note that except for the potential of local accounts (remote desktops) none of the above addresses the reason these servers exist, it is just the necessary work in order that applications perform properly.
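
As a worked example of the backup-window estimate in the Business Continuity item, here is a small Python sketch of the arithmetic. It assumes decimal units (1TB = 10^12 bytes, gigabit = 10^9 bits per second) and introduces a hypothetical efficiency parameter for the share of the link that is actually usable. At the full line rate the transfer takes roughly 2.2 hours; a figure near 2.8 hours corresponds to about 100MB/s of usable throughput, that is, roughly 80% of the link.

```python
ONE_TB = 10**12   # bytes, decimal terabyte (assumption)
GIGABIT = 10**9   # bits per second


def transfer_hours(size_bytes, link_bits_per_second, efficiency=1.0):
    """Hours needed to move size_bytes over the link.

    efficiency is the fraction of raw bandwidth actually usable
    (an illustrative parameter, not a measured value).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    usable_bytes_per_second = link_bits_per_second / 8 * efficiency
    return size_bytes / usable_bytes_per_second / 3600


if __name__ == "__main__":
    for eff in (1.0, 0.8, 0.5):
        hours = transfer_hours(ONE_TB, GIGABIT, eff)
        print(f"{eff:.0%} of a gigabit link: {hours:.2f} hours")
```

Comparing the result against the backup window allowed by the SLA shows quickly whether a full backup is feasible or whether differential backups or faster hardware are needed.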