
[cobalt-users] Some notes from when I had to restore using raqbackup.sh (quite long)



I would like to reiterate the point others have made in the past about
getting raqbackup.sh installed. If you haven't done so, do it now.

Even if you have done it, you might want to read through the info below as
it points out some problems we had that you might experience.

Does anyone know how to modify the raqbackup.sh script so it keeps more than
two sets of backup files, or would this add too much time to the whole
process? Ideally I would like a full week of backup sets, just in case you
find that things had been deteriorating on the RAQ and files had got messed
up. I know I can go through the backup tapes, but that takes time which you
don't have if it all goes bang.
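
To make the question concrete, here is a rough sketch of the kind of
rotation I mean. It is not taken from raqbackup.sh: the function name and
the "current" / "day.N" directory names are made up for illustration, and
it assumes the backup sets sit in plain directories on the machine doing
the rotating.

```shell
#!/bin/sh
# Sketch only: rotate backup sets held as plain directories.
# "current" and "day.N" are hypothetical names, not raqbackup.sh's own.
rotate_sets() {
    base=$1   # directory holding the backup sets
    keep=$2   # how many older sets to keep, e.g. 7

    # Throw away the oldest set to make room.
    rm -rf "$base/day.$keep"

    # Shift day.(N-1) to day.N, working from the oldest to the newest.
    i=$keep
    while [ "$i" -gt 1 ]; do
        prev=$((i - 1))
        if [ -d "$base/day.$prev" ]; then
            mv "$base/day.$prev" "$base/day.$i"
        fi
        i=$prev
    done

    # Last night's set becomes day.1 and an empty "current" is created.
    if [ -d "$base/current" ]; then
        mv "$base/current" "$base/day.1"
    fi
    mkdir -p "$base/current"
}
```

Calling rotate_sets with a keep value of 7 just before the nightly backup
would leave a week of older sets next to the current one. The moves are
only renames, so the rotation itself should add very little time; the disk
space on the backup server is the more likely limit.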

Gordon.

-- 

One of my RAQ3i servers had started to report a lot of disk errors last
week, and I couldn't get in as the admin user via SSH. The admin GUI worked
fine, except that I couldn't edit the default site.

We were going to migrate all the sites this week, but unfortunately at
4:30am on a Saturday morning SMS messages came flooding into my phone! I
feared the worst, so I dialed into the office and got control of a PC that
had a serial cable connected directly to the RAQ so I could open up a
console window.

I entered a reboot command and waited to see all the usual boot-up messages,
but none appeared, and there was no further response to the keyboard.

Got to the office and tried to reboot it via the LCD screen. It gave me the
usual prompts but on confirming the request it immediately brought the
hostname and IP address onto the screen.

httpd complained about some entries, so I looked inside httpd.conf and
found that there was HTML code in the file, and also part of the
virtusertable from sendmail. Things were looking really bad.

Nothing left for it apart from getting a spare RAQ and going through the
restore process.

I checked our FTP server which holds the backups that get created by
raqbackup.sh. It was an excellent idea of the author's to include a
mechanism which keeps both the current and the previous day's backup set.

It looked like the cron job that does the backup kicked in at 4am as usual
but the RAQ locked up at 4:30am, so only half the set was on the server. The
".bak" directory had all the backup files from the previous day. Double
check that all the files really are there before you use the first backup
set.
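
As a rough sketch of that double check (this is not part of raqbackup.sh,
and the function name and directory argument are made up for illustration),
something like the following would flag any .tar.gz file that was cut off
part way through. It relies only on "gzip -t", which tests an archive
without unpacking it.

```shell
#!/bin/sh
# Sketch only: test every .tar.gz in a backup directory for truncation.
check_set() {
    dir=$1   # directory holding one backup set
    bad=0
    for f in "$dir"/*.tar.gz; do
        [ -e "$f" ] || continue      # directory has no .tar.gz files at all
        if gzip -t "$f" 2>/dev/null; then
            echo "ok   $f"
        else
            echo "BAD  $f"
            bad=1
        fi
    done
    # Non-zero return means at least one archive failed the test.
    return $bad
}
```

This only catches files that are damaged or incomplete. It can't tell you
that a file is missing altogether, so it is still worth comparing the list
of file names against the set in the ".bak" directory.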

The new RAQ was ready for the import. I foolishly downloaded the backup
files to my local drive before trying to upload them into the "admin" user's
directory. Don't do this, as the throughput I got was painfully slow. Make
sure you ssh/telnet into the RAQ and then establish an FTP connection from
there to the server with the backup files.

On my first attempt to restore, it looked like all the sites and users were
created, but in fact none of the sites had files in them (web files and log
files etc). This might have been something I did with the FTP upload from
the PC. I manually untar'd all the .tar.gz files and put the html files back
in.

Everything was going OK until I started to put the log files back in.
Unfortunately I expanded them into a directory under / instead of /home,
which filled up the partition, and then the RAQ completely died. Back to
square one.
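
With hindsight, a small guard along these lines would have saved me. It is
only a sketch under my own assumptions: the function name is made up, and
it assumes the archive is a gzipped tar and that "df -Pk" reports the free
space in the fourth column, as the POSIX output format does.

```shell
#!/bin/sh
# Sketch only: refuse to unpack an archive if the target partition
# does not have room for the uncompressed contents.
safe_extract() {
    archive=$1   # e.g. a .tar.gz of log files
    target=$2    # directory to unpack into, e.g. somewhere under /home

    # Uncompressed size of the archive, rounded up to whole kilobytes.
    need_kb=$(( $(gzip -dc "$archive" | wc -c) / 1024 + 1 ))

    # Free kilobytes on the partition that holds the target directory.
    free_kb=$(df -Pk "$target" | awk 'NR==2 {print $4}')

    if [ "$need_kb" -ge "$free_kb" ]; then
        echo "not enough room in $target: need ${need_kb}K, have ${free_kb}K" >&2
        return 1
    fi

    # -C makes tar unpack into the target rather than the current directory.
    gzip -dc "$archive" | tar -xf - -C "$target"
}
```

Naming the target directory explicitly is the main point: it stops the
files landing under / just because that happened to be where you were
sitting when you ran tar.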

The first run of the restore command just didn't work, but that must have
been down to me messing things up.

Now I had two broken RAQs! Not happy at all, very bleary eyed.

I then proceeded to use a restore CD on the original RAQ to reformat it and
get the Cobalt OS back on. We had two restore CDs, a v1.1 and a v2.2 I
think. I thought I would use the latest version as I assumed it would
require less patching. We must have wasted 2 hours trying to get the Cobalt
running, as it looks like the CD was a bit of a dud. Normally the restore
process whizzes through and you see regular changes on the LCD screen.
However, on ours it went through a few stages but then constantly showed
"running MFG tests". I searched the Cobalt lists and found a couple of
entries saying that it was an artifact and the box should work normally
after a reboot.

That might be true for a fully working box, but actually this implies a
failure of the restore process. I shut the machine down and tried to reboot
but it stuck on running MFG tests again.

Anyway, I reverted back to the v1.1 CD and things worked fine. We should
have stuck to what we know. I had not used that v2 CD so far but assumed it
would save me time, but how wrong I was.

I recommend giving the RAQ an internal LAN address, behind your firewall,
whilst you are patching it up so there is no chance of it getting attacked.

Once I got the RAQ up and running (and it's true that it should only take
15-20 minutes, with regular progress updates on the LCD screen), I was able
to get the FTP backup files onto it, and the restore command ploughed
through and re-created all the sites and users.

Important: turn off the mail server once you have got the RAQ onto its live
IP address. This will stop messages getting bounced as user unknown whilst
you are importing the sites back onto the server.

You should always double check every site. Hopefully you keep most of the
information in your company databases somewhere. For the most part it
looked like the web sites were up and running.

When I turned the mail server back on and tail'd the maillog file, I noticed
a user unknown error against one of the domains, which seemed strange as I
knew they had a catch-all. On checking the site there was no user there when
there should have been. I added that one back in manually.

I got some error messages showing up from sendmail saying it couldn't
rename a BOGUS mail file. I wondered why it was doing it only for specific
users, and someone spotted that it looked like people with full stops
(periods) in their username were having this problem.

I had previously created users with usernames like "g.fong". The Cobalt
admin GUI lets you do it, so I guess there is nothing wrong with it per se.
The reason why I did it in the first place was to avoid having to enter
something else in the aliases field, if the username was exactly what they
wanted as an e-mail address.

Anyway, this causes problems for the import function: a mailbox in the
/home/spool/mail directory for, say, g.fong would end up belonging to a
totally random user. You can see the results by doing an "ls -al".
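
If there are a lot of mailboxes, picking the wrong ones out of "ls -al" by
eye gets tedious. Here is a rough sketch that lists only the mailboxes
whose owner differs from the file name. The function name is made up, and
it assumes each mailbox file is named after the user who should own it.

```shell
#!/bin/sh
# Sketch only: list mailboxes whose owner does not match the file name.
list_mismatched() {
    spool=$1   # e.g. /home/spool/mail
    for f in "$spool"/*; do
        [ -f "$f" ] || continue
        # Third column of "ls -l" output is the owning user.
        owner=$(ls -ld "$f" | awk '{print $3}')
        name=${f##*/}
        [ "$owner" = "$name" ] || echo "$name is owned by $owner"
    done
}
```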

You can't do "chown g.fong g.fong" to rectify it, as it comes up with
invalid user because of the full stop in the username (presumably chown
reads the full stop as a separator between user "g" and group "fong"). To
get around the problem I moved all files of the form *.* into a temp
directory. I then sent each of those users an e-mail via the command line,
which creates a new mailbox. I then used cat to write the contents of the
old mailbox back into the newly created one, which has the correct
ownership and permissions, e.g.
cat /home/temp/g.fong > /home/spool/mail/g.fong
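
I did the cat step by hand for each user, but it could be looped. The
following is only a sketch of that loop: the function name is made up and
the two directories are passed in as arguments. It deliberately skips any
user whose new mailbox doesn't exist yet, because redirecting into a file
that isn't there would create it owned by whoever runs the script.

```shell
#!/bin/sh
# Sketch only: copy saved *.* mailboxes back into freshly created ones.
restore_dotted_mailboxes() {
    saved=$1   # where the old *.* mailboxes were moved, e.g. /home/temp
    spool=$2   # the live spool, e.g. /home/spool/mail
    for old in "$saved"/*.*; do
        [ -f "$old" ] || continue
        name=${old##*/}
        if [ -f "$spool/$name" ]; then
            # Redirecting into an existing file truncates it in place,
            # so the owner and mode it already has are kept.
            cat "$old" > "$spool/$name"
            echo "restored $name"
        else
            echo "skipped $name: no new mailbox yet, send a message first" >&2
        fi
    done
}
```

As with the single cat command, this overwrites the test message that
created the mailbox, along with anything else delivered in the meantime,
so it is best run with the mail server stopped.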

I think that is the end of my notes, although I want to mention something
about ssh/telnet sessions. We are behind a firewall which has timeout values
for TCP sessions. When doing mget * via FTP or running any command that may
pause for a minute or two, I generally press return on the keyboard
occasionally just to keep the connection alive. I hate it when the firewall
disconnects you from the RAQ and you never see the rest of the text from the
running process/script.
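
A different approach from pressing return, and not something I did at the
time, is to start the long-running command under nohup with its output
sent to a log file, so a dropped session loses nothing. This is a sketch
only; the function name and the log path in the usage note are made up.

```shell
#!/bin/sh
# Sketch only: run a command in the background, immune to hangups,
# with everything it prints saved to a log file.
run_logged() {
    log=$1   # file to collect stdout and stderr
    shift    # the remaining arguments are the command to run
    nohup "$@" > "$log" 2>&1 &
    echo "started pid $!, output in $log"
}
```

For example, run_logged /home/admin/restore.log followed by the command
you want to run, then "tail -f /home/admin/restore.log" to watch it. If
the firewall cuts you off, log back in and tail the file again.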

I hope the info above proves useful to somebody.

Regards,

Gordon.