Big Server

(or, how I learned to stop worrying and love the hardware)

First off, an apology. This document is as rough as guts. It will attempt to explain, in brief, how I configured my new "big" server to be redundant and scalable. This document is for home users who want to build a Linux box which they can grow with more storage and more CPU capacity, essentially forever. Oh, and cheaply too.

I've been on a quest for scalability for several years. One can achieve a lot of scalability by using rackmount servers (Intel 1RU, 2RU, etc) and deploying a new machine for each function. This costs a lot -- too much for the home user. My solution provides some scalability at lower cost, with some assembly required.

I spent many hours designing this server and testing its performance and reliability. This document may save you some time.

The Hardware

Pentium III (Celeron) 1.3 GHz (Tualatin). It's cheap and performance is alright, so it's good value for money.
ASUS motherboard, TUSI-M/LAN. It's also cheap. In particular, it has onboard ethernet (100 Mbit), onboard graphics (SiS chipset) and onboard sound, as well as about 5 USB ports.
A-OPEN desktop case. It's cheap, but it's good, because it sits on a shelf in my rack. You've got to have a rack to get scalability. Buy a second-hand one - they don't become obsolete like CPUs. When you put servers into your rack, try to use desktop cases, because tower cases are harder to work with and less stable (more likely to tip when you are working on your gear). Don't make the mistake I made and put the rack in a corner; it is hard to get to the back, where you will be doing most of the cabling. My excuse was that there wasn't enough room for it anywhere else.
Linksys 16-port 100 Mbit switch. It's cheap (A$300) and works great.
512 Mbytes RAM. My mobo can take 1 gig max, so I will be upgrading it with another stick when funds permit. You definitely need a gig of RAM. This is the most expensive part of my server - spare no expense to get as much memory as you can.
Twin 80-gig Seagate IDE drives. Get the latest ones with fluid dynamic bearings (FDB); very quiet, very fast, very cheap (~A$210 each).
TV capture card (Flyvideo '98). Cheap, not too nasty. Currently adequate. Not required for scalability, but it is good for running a webcam or encoding live TV to DivX.
Ricoh MP7320A IDE CD-Writer (32x write speed). You need this for backup.
UPS (Uninterruptible Power Supply). No serious server should be without one.

Configuring the hardware

Set the machine to power up automatically when AC power is applied.
Configure CPU speed and memory speed, and run a memory tester (e.g. Memtest86) for 24 hours to make sure that your CPU and memory are working reliably. It's no fun to install and then find that the memory is dodgy or the clock speed is set too high, and have to reinstall. Memtest86 comes as a Debian package, and no doubt in other formats also.
Put the first disk drive on the primary IDE interface and the second disk drive on the secondary IDE interface. Stick the CD-RW on the secondary interface as a slave. This really kills performance whenever writing to CDs, so buy another IDE card as soon as you can. I did timing tests with a drive on the special UDMA cable on the primary interface, compared to the same drive on the secondary interface with an ordinary cable, and the performance difference was minimal. So don't fret about the cable.
Sometime in 2003 we will be seeing Serial ATA devices. The cables will be much smaller, and hence much nicer than the current ribbon cables. It will be easier to put more drives into one chassis. If you can wait until then, do it.
That's about it for the hardware.

Filesystem architecture

This is where the system design becomes interesting. We're using Software RAID, in particular RAID-1, which mirrors the device. On top of that, we use the LVM (Logical Volume Manager) to provide partitions which can grow or shrink on the fly. On top of LVM we use the ext3 filesystem, for its journalling (for filesystem integrity after a crash) and for its ability to resize an existing filesystem (although the filesystem has to be unmounted to do that, which sorta makes it a problem to resize the root filesystem).

Partitioning

Partition both devices identically, with 2 primary partitions each. Make the first one about 100 megs and the second one the rest of the disk (79.9 gigs or so). Set the type of all partitions to 0xFD (Linux RAID autodetect). Partition 1 will be for /boot; partition 2 will be used by the logical volume manager as a volume group called rootvg.
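
A minimal sketch of that layout, written as a small Python helper that only prints text and touches no disk. It assumes a current util-linux sfdisk, which reads partition descriptions in named-field form on standard input; the sfdisk of this document's era used a different input syntax. The device names /dev/hda and /dev/hdc are inferred from the primary-master and secondary-master placement described earlier.

```python
def sfdisk_script(boot_mib=100):
    """Return sfdisk input for the two-partition layout.

    Partition 1: boot_mib MiB, type fd (Linux RAID autodetect).
    Partition 2: the rest of the disk, also type fd.
    """
    lines = [
        f"size={boot_mib}MiB, type=fd",  # becomes half of the /boot mirror
        "type=fd",                       # no size given: use remaining space
    ]
    return "\n".join(lines) + "\n"


# Assumed device names; both disks get the identical script.
DISKS = ("/dev/hda", "/dev/hdc")

for disk in DISKS:
    print(f"# pipe the following into: sfdisk {disk}")
    print(sfdisk_script(), end="")
```

Generating one script and feeding it to both drives is the easy way to guarantee the two partition tables really are identical.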

RAID

Initialise 2 RAID-1 devices (md0 for partition 1 on both drives, and md1 for partition 2 on both drives).
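
As a rough sketch, the two mirrors could be created with mdadm along the following lines. mdadm is an assumption here; the text does not say which RAID tool was used, and the older raidtools package (mkraid plus /etc/raidtab) was the other common choice at the time. The helper only builds and prints the commands so they can be reviewed before being run as root.

```python
import shlex


def mirror_create_cmd(md_device, members):
    """Build the mdadm command that creates a RAID-1 array from `members`."""
    return [
        "mdadm", "--create", md_device,
        "--level=1",
        f"--raid-devices={len(members)}",
        *members,
    ]


# md0 mirrors partition 1 of both drives, md1 mirrors partition 2.
# /dev/hda and /dev/hdc are the assumed names of the two IDE drives.
ARRAYS = {
    "/dev/md0": ["/dev/hda1", "/dev/hdc1"],
    "/dev/md1": ["/dev/hda2", "/dev/hdc2"],
}

for md_device, members in ARRAYS.items():
    print(shlex.join(mirror_create_cmd(md_device, members)))
```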

Logical Volume Manager

Install the most recent sistina.com LVM tools package and patch your kernel sources to include the same level of the LVM driver in the kernel. Create a volume group rootvg containing one device, /dev/md1. What does this do? It means any LVM operations you perform, such as creating logical volumes or filesystems on those logical volumes, will all go through /dev/md1, which uses the Software RAID system to mirror the data onto your second drive. Thus your whole system is automatically mirrored, and if a drive dies, your system will still work in single-device mode while you obtain a replacement.
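
A minimal sketch of the volume group step, using the standard LVM commands pvcreate and vgcreate. The helper name is made up for illustration, and the commands are printed rather than executed.

```python
import shlex


def volume_group_cmds(vg_name, physical_volumes):
    """Commands to label each device as an LVM physical volume and
    then group them into one volume group."""
    cmds = [["pvcreate", pv] for pv in physical_volumes]
    cmds.append(["vgcreate", vg_name, *physical_volumes])
    return cmds


# rootvg sits entirely on the mirror, so everything in it is mirrored too.
for cmd in volume_group_cmds("rootvg", ["/dev/md1"]):
    print(shlex.join(cmd))
```

Because LVM only ever sees /dev/md1, it neither knows nor cares that there are two physical drives underneath.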

Partitions

/dev/md0 is a fixed-size partition. mkfs it as type ext3. Make sure your kernel has ext3 compiled in; it is easier than messing around with initrd to boot. When you configure boot-from-hard-disk, LILO understands RAID-1 devices and it will set up correct MBRs on each device so you can boot from either.

Create some logical volumes as you require. I created /dev/rootvg/root for the root filesystem, which also includes /usr, /var and /tmp. If you are running a big mail server, for example, you might wish to make /var a separate filesystem. I used to configure Linux boxes with separate filesystems for /, /tmp, /var, /usr, /usr/local and /home - but these days it is more sensible to share the available space. However, if you are running a multi-user server or have untrusted users on this machine, it is vitally important to have no world-writable directories on the root filesystem. That means at least /tmp and /var must be on a different partition.

I also created /dev/rootvg/home for my home directory. I took the opportunity to combine various directories I was using on the old server and on other servers into a single gluggy mess. There are a few gigs of duplicated files, but at least it's all in one place now.
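
The two logical volumes might be created roughly as follows. The sizes are placeholders chosen for illustration (the text does not give the real ones), and mkfs.ext3 is assumed to be available from e2fsprogs. Again the sketch prints the commands instead of running them.

```python
import shlex


def logical_volume_cmds(vg_name, lv_name, size):
    """Commands to create a logical volume and put ext3 on it."""
    device = f"/dev/{vg_name}/{lv_name}"
    return [
        ["lvcreate", "-L", size, "-n", lv_name, vg_name],
        ["mkfs.ext3", device],
    ]


# Hypothetical sizes; leave some of rootvg unallocated for later growth.
VOLUMES = {"root": "10G", "home": "40G"}

for lv_name, size in VOLUMES.items():
    for cmd in logical_volume_cmds("rootvg", lv_name, size):
        print(shlex.join(cmd))
```

Leaving part of the volume group unallocated is the point of the exercise: whichever filesystem fills up first can be given the spare space later.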

Complicated things

As mentioned, LILO knows about RAID-1. At least, I hope it does. I haven't had a failure of one of the disks yet.

Because the root filesystem is on LVM, you need to use an initial ramdisk. The LVM tools package (or Sistina's website docs) provides a method to create such a ramdisk. The ramdisk has to use the LVM tools to start the LVM system after the kernel has loaded from disk (/dev/hda1 or /dev/hdc1, which is /dev/md0 at runtime) but before the root filesystem is mounted. This ramdisk thing works fine.

I had the hardest time setting up my server. I was booting it over NFS for much of the setup phase, while testing the RAID-1 and LVM and filesystem growing and other things. NFSroot kernels can _only_ load the root filesystem over NFS, so you'll have to build both variants: an NFSroot kernel for the setup phase and a normal kernel that boots from the hard disk.

Scaling the beast

The first thing is that you can scale the storage to as much disk as you can fit into the chassis. When you add more disk, if you want it RAID-1, then do the same thing as the main system: create twin Linux RAID autodetect partitions (type 0xFD), create a RAID device, and add that device to rootvg in LVM. Or if you are not using RAID, then create 1 partition for the whole disk and add that; the usual partition type in that case is 0x8E (Linux LVM). Now you can put as much storage as you like into rootvg. You can then add new logical volumes or extend existing logical volumes to use the new storage on the new device(s).
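
A sketch of adding a second mirrored pair to rootvg. The device names /dev/hde1, /dev/hdg1 and /dev/md2 are hypothetical (they depend on where the extra IDE card lands), and mdadm is assumed as the RAID tool. The commands are printed for review, not executed.

```python
import shlex


def add_mirror_to_vg_cmds(vg_name, md_device, members):
    """Commands to build a new RAID-1 array and add it to an existing
    volume group as extra capacity."""
    return [
        ["mdadm", "--create", md_device, "--level=1",
         f"--raid-devices={len(members)}", *members],
        ["pvcreate", md_device],           # label the array for LVM
        ["vgextend", vg_name, md_device],  # grow the pool
    ]


# Hypothetical names for the new pair of drives and the new array.
for cmd in add_mirror_to_vg_cmds("rootvg", "/dev/md2", ["/dev/hde1", "/dev/hdg1"]):
    print(shlex.join(cmd))
```

After vgextend the new space is simply free extents in rootvg; no existing logical volume changes until you extend one.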

When you extend an existing logical volume, the filesystem on it does not grow by itself. To grow the filesystem to match, you need to unmount it and do a forced fsck on it before resizing.
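
The whole sequence for growing /home might look like the sketch below. The +20G figure is a placeholder, and the order (unmount, extend the volume, forced check, resize, remount) is one reasonable arrangement rather than the only one. resize2fs with no size argument grows the filesystem to fill the volume.

```python
import shlex


def grow_filesystem_cmds(lv_path, mount_point, extra):
    """Commands to extend a logical volume by `extra` and grow the
    ext3 filesystem on it while it is unmounted."""
    return [
        ["umount", mount_point],
        ["lvextend", "-L", f"+{extra}", lv_path],
        ["e2fsck", "-f", lv_path],   # forced check before resizing
        ["resize2fs", lv_path],      # no size given: fill the volume
        ["mount", lv_path, mount_point],
    ]


# Hypothetical amount of extra space.
for cmd in grow_filesystem_cmds("/dev/rootvg/home", "/home", "20G"):
    print(shlex.join(cmd))
```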

Assuming you have run out of drive bays in your chassis (it happens pretty quickly with desktop chassis), the next step is to set up another server. Make it identical to your first server - you will really appreciate being able to run the same kernels and whatnot. Make all your servers NFS servers, and rearrange your storage so that the new server has some files to share also. Try to balance the storage so 50% of the file data is on the first server and 50% on the second.

If you want to scale CPU power, you need to set up a cluster. I haven't done this myself, but I believe the application to use is called OpenMOSIX. Install OpenMOSIX on all servers you are going to combine and let them share processor power.

Distributed Filesystems

I have spent many hours researching distributed filesystems online. I have checked out Coda, AFS, OpenAFS, Frangipani, InterMezzo, GFS, Lustre and Berkeley xFS. My conclusion is that the state of play of distributed filesystems is still immature, so my big server runs NFS and will do for a while yet.

What you really want is a cluster filesystem where you can add new machines (with local disk storage) and have that disk storage added to an existing pool of storage, which combines to present a single image of the filesystem's contents. A serverless filesystem is a type of cluster filesystem where there's no "master" or "primary" server, and all existing servers are peers. That sounds good, but I don't think it's mandatory. I think that Lustre is the filesystem to use when it's time for me to start distributing my storage.

OpenMOSIX is apparently looking at providing some kind of cluster filesystem. One of the concerns of using OpenMOSIX is that all I/O is performed through the "home processor", which means that it is travelling over the network, maybe twice when we are using a distributed filesystem. It seems like a key requirement that a clustered processing system should support a single filesystem image across all processors, and not require the "home processor" to do all I/O work.

The end of the document

That's an overview of the state of play and the design. A future revision will identify key topics which I have utterly failed to address and clean up the descriptions and provide configuration examples.