As many of you may know, I have a lot of computers in my home. I deal with huge amounts of data (mostly video but a lot of other stuff too) and just having it all online means I have more than 10 systems here.
But I have a core of 4 systems that I work with daily and that make up my primary set of working files: My old workstation (pacdat), my new workstation (video), my file server (NFS1) and my backup and domain name master (NETFS)
A few months ago I decided to move much of the data that still resided on my old workstation (P4 2.0GHz, called "pacdat") to an NFS file server (NFS1), including my home directory, which is huge.
The old machine had several sets of mirrored drives of various sizes - usually the "sweet spot" size for whenever I purchased them - from 160 Gigs to 300 Gigs. My home directory has grown to outstrip each of these and in fact now has links to several such pairs of RAID 1 arrays. It was my intention to build a RAID 5 array of 320Gig drives that would do me for at least a year or so of growth at present rate - and host them on a single computer that I could mount from several of the systems in my home as needed.
All was going well - until Mother Nature stepped in a couple of weeks ago.
Read on for the tale of woe.
The synopsis:
- Two of the 6 drives in the RAID array lost tracks at the same spot at the same time, it appears - the array could have survived losing one, but not two at once
- the system first slowed down to a crawl - indicating there was a problem and making the copy to another system S L O W
- finally the computer itself died - first thought was the power supply
- nope - new power supply didn't work so replace mother board and RAM (and video cards but that's another story)
- finally gave up trying to recover any data - lost 6 months of "annoying stuff" - no documents or e-mail as they are on another system
- problems with the replacement system continued for two weeks
- now I have a fantastic workstation
The Details (and some thoughts and warnings)
As many of you know, I have a lot of computers around the house. Many are "archive" servers for video, and some are for backup of other systems both in-house and off-site.
I've had a series of personal workstations over the years - typically something near the leading edge but not quite on it. The system is more likely at the leading edge of the price-performance curve - best bang for the buck, at least when I purchase it. I also have the habit of "if it ain't broke, don't fix it" - so once I get my workstation to my liking, I tend to leave it alone for a year or two, or three.
My old workstation "pacdat" is a P4 2.6GHZ with 2 Gigs of RAM and IDE hard disk channels and started with a pair of mirrored (RAID 1) 160Gig drives. It's in a fairly large case with a hefty power supply, and over the course of 3 years I added 2 more pairs of drives, 200s and 300s, each time making another RAID 1 array of them.
My home directory outgrew each drive size in turn, and I ended up with some of my files spread across all three drive sets - which made some things a bit dicey. In fact, at one point I lost some files because I thought I was deleting symbolic links but ended up deleting the real files. Fortunately I had backups on DVD - one of the last such backups I'll likely ever make, as it ran to over 50 DVDs. I've used tapes in the past, but today's drives and tapes are both expensive per Gig saved, and fairly slow unless you really spend a lot of money. I'd gotten to the point where I was turning to simply using hard drives as backup media because they truly are the best bang for the buck for large storage.
Having RAID1 (mirror) arrays has proven reliable for me in the past, but I've done a lot with RAID5 (3+ drives with data spread across all such that losing any one drive does not lose data) in the past couple of years and I thought I'd give it a try for my own storage this time.
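The "losing any one drive does not lose data" property comes from parity. Here is a minimal Python sketch of the idea for a single stripe, assuming three data drives plus one parity block. The block contents and the `xor_blocks` helper are made up for illustration, and real RAID5 rotates the parity block across all the drives rather than keeping it on one.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data blocks, one per data drive in a single stripe.
data = [b"home", b"dirs", b"grow"]
parity = xor_blocks(data)  # the parity block for this stripe

# Lose drive 1: XOR of the surviving blocks plus parity gives its block back.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]

# Lose drives 0 and 1 together: a single parity block cannot separate two
# unknowns. All that can be computed is data[0] XOR data[1], not either alone.
combined = xor_blocks([data[2], parity])
assert combined == xor_blocks([data[0], data[1]])
```

The second half of the sketch is the situation in this story: with two drives damaged at once, the parity no longer pins down either missing block.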
So when I finally decided I'd put a new workstation together, I decided to put a RAID5 array in it. I got a mother board with 8 SATA hard drive ports on it, 4 "native" and 4 more on a secondary controller on the mother board. I purchased 8 320Gig Seagate SATA drives with the intention of setting things up with 7 in the array (giving me the equivalent of 6x320=1920 Gigs, or nearly 2 Terabytes of usable storage) with the 8th drive being a "hot spare" just in case something happened to one of the initial 7 drives.
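The capacity arithmetic for a plan like this is simple enough to check in a few lines. The sketch below assumes the usual RAID5 rule that one drive's worth of space goes to parity and that a hot spare adds nothing to usable space. The function names are made up for illustration.

```python
def raid5_usable_gb(drives_in_array, drive_size_gb):
    """Usable space of a RAID5 array: one drive's worth is used for parity."""
    if drives_in_array < 3:
        raise ValueError("RAID5 needs at least 3 drives")
    return (drives_in_array - 1) * drive_size_gb

def gb_to_tib(size_gb):
    """Convert decimal gigabytes (as printed on the drive box) to binary TiB."""
    return size_gb * 10**9 / 2**40

# 8 drives purchased: 7 in the array, 1 kept as a hot spare.
usable = raid5_usable_gb(7, 320)
print(usable)  # 1920
print(f"{gb_to_tib(usable):.2f} TiB as most tools would report it")
```

The gap between the two figures is only a difference in units: drive makers count in powers of ten, while most operating system tools count in powers of two.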
I purchased an Intel mother board - knowing that in general there were drivers for the Intel chipsets for Linux. I then spent the next week trying in vain to get the secondary drive controller working - it was not from Intel (I'll leave the really gory details for a tech article some time)
The bottom line was that I could only get 4 drives working on the new workstation, and I needed more than that would give me. So I re-purposed the internal drives as a RAID0 array (striped, no redundancy but FAST!) for video processing, and put the other drives plus some more into a spare box I had, turning it into a network file server.
Using Linux to create an NFS box is quick and easy. There are many off-the-shelf units that, though they don't tell you, are exactly that: Linux plus a bunch of hard drives with the standard Linux RAID software making them work. The NFS box (called NFS1 in my network) got a copy of my home directory and major files from PACDAT in late December last year, and happily took up the task of storing both my and my wife's files. I left the originals on PACDAT while I worked at moving other tasks from it in preparation for shutting it down and re-purposing it to something else.
I started working with my new workstation, pulling in video from the archives and starting to edit it down to pieces we could use on the web and sell - but I ran into more problems with it. It simply turned out to be completely unreliable. I'd be working away and poof! the system would log me out and present me with the login screen, closing and stopping all the various tasks I had been working on, some of them things that take days to finish.
I tried all manner of changes - BIOS tuning, re-seating the hardware, various incantations and potions suggested by others on the net, all to no avail. I got the system to the point where it would last for days and even weeks at a time - but just when I'd think things had finally sorted themselves out, it would reboot or lock up - sometimes when I was not even working at it.
I'd just about gotten to the point where I would throw in the towel and put something else together - or even go back to my older (and very stable) machine for the time being - when we had a crashing summer lightning storm, about two weeks ago now.
The Gods threw lightning bolts all around us - but none were all that close. I got up and watched - and counted (roughly 5 seconds = 1 mile) and none came within a mile. The problem is, our electrical system carries some jolts for more than a mile, and it seems one such jolt hit our house and specifically my NFS server.
Now the NFS server was on a UPS - an older one with new battery in it. I'd had it on a newer one in the basement for most of its short life, but Shirley's workstation was a lot older and slower than even my old PACDAT, and the NFS box had a hyper-thread capable CPU and 2 Gigs of RAM (compared to the old P4 2GHz 768 Megs of her workstation) so I moved the NFS machine up by her desk and plugged it into the old UPS, vowing to get a new one next time I was near a store. Getting her up and running was trivial since her home directory was already on the machine anyway and the mother board's built-in video card was more than capable of doing what she needed for e-mail and word processing.
Warning #1 - Older style UPS (uninterruptible power supplies) don't give much if any protection from power surges
But this UPS was a standby type with a relay in it - a relay that kicked the batteries in when the power went out, but otherwise simply passed line current through with no filtering. Newer "online" (double-conversion) UPS units run the load (computer) from the battery side all the time, and provide charging current to keep the battery charged when the power is on - there is no time delay if the power goes out, and the charge circuit provides very good line filtering.
The result of this was that at some point in the night my NFS system got some sort of surge that caused two of the 8 drives in it to stop working correctly. They didn't fail altogether, at least not immediately.
The first indication I had that something was wrong was that reading files was V E R Y slow - a snail's pace compared to normal. The screens took a long time to come up from sleep first, then moving from one virtual desktop to another took longer than normal. At first I thought it was something with my workstation, not the file server. I spent a fruitless hour or so investigating that.
Finally I looked at the NFS server - and discovered by using some drive testing utilities (hdparm) that two of the drives had dropped down to the lowest setting for "UDMA" - the two on one drive controller.
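For anyone wanting to automate that check, here is a minimal Python sketch. It assumes the usual `hdparm -i` layout, in which a "UDMA modes:" line lists the supported modes and marks the active one with an asterisk. The sample strings and the `active_udma_mode` helper are made up for illustration and are not captured from the drives in this story.

```python
import re

def active_udma_mode(hdparm_output):
    """Return the active UDMA mode number from `hdparm -i` text, or None."""
    for line in hdparm_output.splitlines():
        if line.strip().startswith("UDMA modes:"):
            # The active mode is the one prefixed with an asterisk.
            match = re.search(r"\*udma(\d+)", line)
            return int(match.group(1)) if match else None
    return None

# Hypothetical samples: a healthy drive and one that has fallen back to udma0.
SAMPLE_HEALTHY = " UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5 udma6\n"
SAMPLE_DEGRADED = " UDMA modes: *udma0 udma1 udma2 udma3 udma4 udma5 udma6\n"

for name, text in (("healthy", SAMPLE_HEALTHY), ("degraded", SAMPLE_DEGRADED)):
    print(name, "-> active mode udma%d" % active_udma_mode(text))
```

To use it for real, the text would come from running `hdparm -i` as root against each drive and comparing the result with what the drive reported when it was known to be good.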
I swapped out the drive controller for a spare - no better
I started copying files as fast as I could to another system (the workstation, it had lots of space, even if it was not "redundant") but was only a small way into them when the NFS system simply died.
OK - hmmm. Probably the power supply, but I don't have a spare that is large enough, so off to the store to pick one up.
New power supply didn't help - it appears the damage included the mother board or something on it - leave that for another day and go back to the store.
On the way there I got to thinking that this might be the opportunity I needed to create a new (and hopefully reliable) workstation to replace the old-new one that was still not reliable. For not too much more than a basic motherboard/cpu/ram setup I'd be able to shuffle the old machines down a notch and get my work back in order.
I should explain that I have 5 monitors on my desk - 4 on the main workstation and 1 on the old one, and that I really wanted all 5 (and maybe eventually more) on the one workstation. I've written about my aspirations with my previous "new" workstation (called VIDEO by the way) in Trees hate computers... and why I have so many monitors - but it comes down to the fact that I'm far more efficient with lots of screen real estate.
For various reasons I decided I wanted a machine that would use ATI video cards, and since I'd seen a board with 4 PCI Express slots on it one time I'd visited the store, I thought I'd want something similar just in case I decided I wanted more than 6 screens. (The cards I'd been looking at can handle 2 XVGA screens each.)
An ASUS mother board with ATI glue chips - and an AMD quad-core 64 CPU plus 4 Gigs of RAM plus 3 ATI video cards, and a pair of 1 Terabyte drives - hey, it's only money.
Back home, put the system together and try to re-boot from the NFS drive system. No problem booting - Linux kernel booted on the AMD 64 even though the previous CPU was 32 bit. But the RAID array would not come up - and nothing I did would help, even trying to force it and have it not "rebuild/resync" - there was enough damage on the two drives that the file system was toast.
OK - install the pair of Seagate Terabytes and throw out the two bad drives (ok - pack them up and send them in for warranty repair/replacement) - set up the remaining ones as a smaller RAID in the new machine and the two Terabytes as a RAID1 for my main files. Install Fedora Core 8. Nine is out, but when I went looking for drivers for the video cards they were not yet available.
Everything came up fine - but... with the 5 monitors on there were video artifacts. Again, nothing I could find on the net would fix the problem. I finally deep-sixed the ATI cards in favour of 3 new Nvidia ones and all is well. As I write this the system has been up for 4 days and has not even hiccupped. Here's hoping :)
The only really annoying thing is that I've lost about 6 months worth of "working" files along with a small number of photos I've taken. My e-mail and documents are stored on my laptop and backed up nightly whenever my machine is home.
Warning #2 - RAID (Redundant Array of Inexpensive Disks) does not substitute for doing backups
I didn't do backups of these files - I relied upon my RAID5, expecting to eventually (too late it turned out) put another machine together that would handle all the backing up. My current backup server is too small to add my huge files to those of my customers and laptop that I back up. In fact, most of the files I really miss are not all that large - they're things like the settings and setup for browsers and various programs, cache data, cookies, password files, note files and such - the things that accumulate in the "hidden" files. Other than that I lost some work in progress files - but I mostly have the originals, just not the edits. I've now set up a backup process to get at least these system files.
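A backup of the "hidden" files does not need to be elaborate. The sketch below is one possible approach in Python, not the process actually in use here: it packs the top-level dot files and dot directories of a home directory into a compressed tar archive. The `backup_hidden_files` name, its `skip` parameter, and the paths in the usage note are all made up for illustration.

```python
import tarfile
from pathlib import Path

def backup_hidden_files(home, archive_path, skip=()):
    """Archive the top-level hidden files and directories found under `home`.

    Returns the list of entry names that were added. `skip` can name entries
    to leave out, such as very large cache directories.
    """
    home = Path(home)
    saved = []
    with tarfile.open(str(archive_path), "w:gz") as tar:
        for entry in sorted(home.iterdir()):
            if entry.name.startswith(".") and entry.name not in skip:
                # Directories are added recursively; arcname keeps paths relative.
                tar.add(str(entry), arcname=entry.name)
                saved.append(entry.name)
    return saved
```

A nightly job could call something like `backup_hidden_files("/home/richard", "/mnt/backup/dotfiles.tar.gz")`, with the archive written to a different machine or drive than the one holding the home directory. Writing it to the same array would repeat the mistake described above.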
I also lost my photos. This caused me a bit of anguish, but a thought hit me - since I had taken only a fairly small number, maybe I could "undelete" the flash cards and recover some, if not all. A little utility called PhotoRec did exactly what I needed.
3 - The PC may be a fairly mature technology in general but there are enough changes at the hardware level taking place that configuring what you want (or want based on a belief of what is possible) can be frustrating and painful
4 - Open source (Linux) and web search (Google, etc.) combine to make even really esoteric systems doable
Note: work in progress - check back for additions - richard
Resources:
Richard's Digital Rag - Tips for Technology Users in the Real World
http://digital-rag.pacdat.net/article.php/fc9AMD9850LinuxAsusM3A32-MVP-RAID5