bacula in the enterprise part 2

2011-07-23

Software

As mentioned many times, this is a FreeBSD-based environment. Some relevant sysinfo output is below:

Operating system release: FreeBSD 8.2-RELEASE
OS architecture: amd64
Kernel build dir location: /usr/obj/usr/src/sys/GENERIC
Currently booted kernel: /boot/kernel/kernel

Currently loaded kernel modules (kldstat(8)):
zfs.ko
opensolaris.ko

Bootloader settings for the Director/Database node:

The /boot/loader.conf has the following contents:

kern.ipc.semmni=1024
kern.ipc.semmns=2048
kern.ipc.semmnu=1024
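
Those kern.ipc.sem* lines raise FreeBSD's SysV semaphore limits; they are loader tunables, set at boot, and are presumably there for the catalog database that lives on this node. A quick sanity check after a reboot:

    # Confirm the semaphore tunables from /boot/loader.conf took effect
    sysctl kern.ipc.semmni kern.ipc.semmns kern.ipc.semmnu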

All of the storage nodes and the director are running a GENERIC kernel with very little system tweaking. One of the storage nodes has a Chelsio 10Gb controller, but it hasn’t seen a high enough load to crack the 1Gb/sec barrier.

I’m using Bacula from the ports tree, and the port is built with a special make flag to include gcc’s debugging symbols. Jenny worked on getting that set up when we were having some stability issues.

The Bacula configuration on the director node is backed by a git repository. It adds a little bit of complexity for a systems administrator who wants to add a client, but the benefit is clear: this backup project actually enforces change control and tracks who made every change.
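
The day-to-day workflow is not much more than editing under /usr/local/etc/bacula and committing; something along these lines (the hostname and ticket number here are made up):

    cd /usr/local/etc/bacula
    vi clients.d/9pm/somehost.conf          # or use the create_client script shown later
    git add clients.d/9pm/somehost.conf clients.conf
    git commit -m "add client somehost (redmine #123)"
    git log --oneline -- clients.conf       # who added which client, and when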

I’ve also set up Redmine as a project front-end, and I’ve begun to file tickets and reference which commit fixed what. This not only tracks my progress, it is also the first time I’ve had a backup server that is clearly documented and has some kind of accountability.

A snippet of the Redmine site

The Structure

I’ve compared projects like Bacula to a large box of Legos™. It doesn’t enforce any structure, so I’ve taken it upon myself to add meaning to the otherwise flat and incomprehensible bacula-dir.conf.

The Bacula Port on FreeBSD installs all configuration files in /usr/local/etc.

The Director node, write, contains only the following in /usr/local/etc/bacula-dir.conf:

@/usr/local/etc/bacula/bacula-dir.conf
@/usr/local/etc/bacula/storage.conf
@/usr/local/etc/bacula/clients.conf
@/usr/local/etc/bacula/messages.conf
@/usr/local/etc/bacula/schedules.conf
@/usr/local/etc/bacula/pools.conf

As you can see, I place everything under /usr/local/etc/bacula/.
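
A nice side effect of stitching everything together with @-includes is that the whole tree can be syntax-checked in one shot before reloading the Director (-t just parses the configuration and exits):

    bacula-dir -t -c /usr/local/etc/bacula-dir.conf && echo "director config OK"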

Here is a beautiful output of tree(1):

    bacula
    |-- bacula-dir.conf
    |-- bin
    |   |-- create_client.sh
    |   `-- package_list.sh
    |-- clients.conf
    |-- clients.d
    |   |-- 10am
    |   |-- 10pm
    |   |-- 11pm
    |   |-- 12am
    |   |-- 1am
    |   |-- 2am
    |   |-- 3am
    |   |-- 4am
    |   |-- 4pm
    |   |-- 5am
    |   |-- 5pm
    |   |-- 6am
    |   |-- 6pm
    |   |-- 7am
    |   |-- 7pm
    |   |-- 8am
    |   |-- 8pm
    |   |-- 9am
    |   |-- 9pm
    |   |-- TEMPLATE-mac
    |   |-- TEMPLATE-unix
    |   `-- TEMPLATE-win32
    |-- excludes.d
    |   |-- common.conf
    |   |-- mac.conf
    |   |-- unix.conf
    |   `-- win32.conf
    |-- messages.conf
    |-- pools.conf
    |-- schedules.conf
    |-- storage.conf
    `-- storage.d
        |-- write-01.conf
        |-- write-02.conf
        |-- write-03.conf
        |-- write-04.conf
        |-- write-05.conf
        `-- write-06.conf

Storage Nodes

All of the storage nodes use ZFS as the filesystem/volume manager.

write-06# zpool list
NAME         SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
filevol001  90.6T  33.3T  57.3T    36%  ONLINE  -

They all have one volume, /filevol001, and I created 512 “drives” within that volume. Effectively, each storage node has 512 drives, and each client is randomly assigned one.

Since I have 6 storage nodes, I wrote a little shell script to handle the directory creation:

#!/usr/bin/env bash
# Create the 512 "drive" directories on the file volume, owned by bacula
i=1

while [ $i -le 512 ]
do
    install -d -o bacula -g bacula -m 770 /filevol001/drive$i
    ((i++))
done

Simple, right? I also wrote a script to generate the bacula-sd.conf file on each storage node:

#!/bin/bash
    
usage()
{
cat << EOF
    Usage $0 NUMBER > /usr/local/etc/bacula-sd.conf
    
    Where "NUMBER" is just a single digit indicating which storage node this is.
    
    Example, for write-07:
    $ make_sd.sh 7 > /usr/local/etc/bacula-sd.conf
EOF
}

i=1
    
if [[ -z $1 ]]
then
    usage
    exit
fi

printf "Storage {\n"
printf "\tName = write-0$1.llnl.gov-sd\n"
printf "\tSDAddress = write-0$1.llnl.gov\n"
printf "\tSDPort = 9103\n"
printf "\tWorkingDirectory = \"/var/db/bacula\"\n"
printf "\tPid Directory = \"/var/run\"\n"
printf "\tMaximum Concurrent Jobs = 516\n"
printf "}\n"

printf "#\n"
printf "# List Directors who are permitted to contact Storage daemon\n"
printf "#\n"
printf "Director {\n"
printf "\tName = write.llnl.gov-dir\n"
printf "\tPassword = \"ItsASecret\"\n"
printf "}\n"

printf "#\n"
printf "# Restricted Director, used by tray-monitor to get the\n"
printf "#   status of the storage daemon\n"
printf "#\n"
printf "Director {\n"
printf "\tName = write.llnl.gov-mon\n"
printf "\tPassword = \"ItsANotherSecret\"\n"
printf "\tMonitor = yes\n"
printf "}\n"

printf "Messages {\n"
printf "\tName = Standard\n"
printf "\tdirector = write.llnl.gov-dir = all\n"
printf "}\n"


printf "Device {\n"
printf "\tName = W0$1FileStorage\n"
printf "\tMedia Type = File\n"
printf "\tArchive Device = /filevol001\n"
printf "\tLabelMedia = yes;\n"
printf "\tRandom Access = Yes;\n"
printf "\tAutomaticMount = yes;\n"
printf "\tRemovableMedia = no;\n"
printf "\tAlwaysOpen = no;\n"
printf "\tMaximum Concurrent Jobs = 2\n"
printf "}\n"
    
# One Device resource per "drive" directory; Bacula treats each as its own file-storage device
while [ $i -le 512 ]
do
    printf "\n"
    printf "Device {\n"
    printf "\tName = W0$1FileStorageD$i\n"
    printf "\tMedia Type = File\n"
    printf "\tArchive Device = /filevol001/drive$i\n"
    printf "\tLabelMedia = yes;\n"
    printf "\tRandom Access = Yes;\n"
    printf "\tAutomaticMount = yes;\n"
    printf "\tRemovableMedia = no;\n"
    printf "\tAlwaysOpen = no;\n"
    printf "\tMaximum Concurrent Jobs = 2\n"
    printf "}\n"
    ((i++))
done
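
On a storage node, generating and sanity-checking the result looks something like this (again, -t only asks the daemon to parse its config and exit):

    # On write-06: generate the config, then have bacula-sd parse it without starting
    ./make_sd.sh 6 > /usr/local/etc/bacula-sd.conf
    bacula-sd -t -c /usr/local/etc/bacula-sd.conf && echo "storage config OK"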

On the Director, a storage node definition is saved in /usr/local/etc/bacula/storage.d/write-0{N}.conf, which is included from /usr/local/etc/bacula/storage.conf:

@/usr/local/etc/bacula/storage.d/write-01.conf
@/usr/local/etc/bacula/storage.d/write-02.conf
@/usr/local/etc/bacula/storage.d/write-03.conf
@/usr/local/etc/bacula/storage.d/write-04.conf
@/usr/local/etc/bacula/storage.d/write-05.conf
@/usr/local/etc/bacula/storage.d/write-06.conf
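
I’m not pasting the storage.d files themselves, but the idea is that the Director needs a Storage resource for every device the storage daemons define: the Name and Device match the W0{N}FileStorageD{i} devices that make_sd.sh emits, and the Password matches the Director resource in that node’s bacula-sd.conf (write-01 is the odd one out, with un-prefixed FileStorageD names, as the create_client script below shows). A rough sketch of how one of those files could be generated:

    #!/usr/bin/env bash
    # Hypothetical sketch: emit Director-side Storage resources for node $1,
    # one per "drive" device that make_sd.sh defined on that node.
    i=1
    while [ $i -le 512 ]
    do
        printf "Storage {\n"
        printf "\tName = W0$1FileStorageD$i\n"
        printf "\tAddress = write-0$1.llnl.gov\n"
        printf "\tSDPort = 9103\n"
        printf "\tPassword = \"ItsASecret\"\n"
        printf "\tDevice = W0$1FileStorageD$i\n"
        printf "\tMedia Type = File\n"
        printf "\tMaximum Concurrent Jobs = 2\n"
        printf "}\n\n"
        ((i++))
    done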

Client Generation

There are two components: the TEMPLATE files (there are three: TEMPLATE-unix, TEMPLATE-win32, and TEMPLATE-mac) and the shell script.

The Client TEMPLATE File

Here is what one of the TEMPLATE files looks like:

    #
    # Client Definition; the Password here must match the one in
    #  the client's bacula-fd.conf Director definition.
    #
    # Using Vi/m, you can easily replace HOSTNAME with
    #  the short hostname of the client with:
    #  %s/HOSTNAME/yourhostname/
    #
    #
    
    Client {
        Name = HOSTNAME.llnl.gov
        Address = HOSTNAME.llnl.gov
        FDPort = 9102
        Catalog = Catalog001
        Password = "ItsASecret"
        File Retention = 40 days
        Job Retention = 1 months
        AutoPrune = yes
        Maximum Concurrent Jobs = 10
        Heartbeat Interval = 300
    }
    
    Console {
        Name = HOSTNAME.llnl.gov-acl
        Password = ItsASecret
        JobACL = "HOSTNAME.llnl.gov RestoreFiles", "HOSTNAME.llnl.gov"
        ScheduleACL = *all*
        ClientACL = HOSTNAME.llnl.gov
        FileSetACL = "HOSTNAME.llnl.gov FileSet"
        CatalogACL = Catalog001
        CommandACL = *all*
        StorageACL = *all*
        PoolACL = HOSTNAME.llnl.gov-File
    }
    
    
    Job {
        Name = "HOSTNAME.llnl.gov"
        Type = Backup
        Level = Incremental
        FileSet = "HOSTNAME.llnl.gov FileSet"
        Client = "HOSTNAME.llnl.gov"
        Storage = FileStorageD##
        Pool = HOSTNAME.llnl.gov-File
        Schedule = "@@"
        Messages = Standard
        Priority = 10
        Write Bootstrap = "/var/db/bacula/%c.bsr"
        Maximum Concurrent Jobs = 10
        Reschedule On Error = yes
        Reschedule Interval = 1 hour
        Reschedule Times = 1
        Max Wait Time = 30 minutes
        Cancel Lower Level Duplicates = yes
        Allow Duplicate Jobs = no
        RunScript {
            RunsWhen = Before
            FailJobOnError = no
            Command = "/etc/scripts/package_list.sh"
            RunsOnClient = yes
        }
    }
    
    Pool {
        Name = HOSTNAME.llnl.gov-File
        Pool Type = Backup
        Recycle = yes                  
        AutoPrune = yes                
        Volume Retention = 1 months    
        Maximum Volume Bytes = 10G     
        Maximum Volumes = 100          
        LabelFormat = "HOSTNAME.llnl.govFileVol"
        Maximum Volume Jobs = 5
    }
    
    Job {
        Name = "HOSTNAME.llnl.gov RestoreFiles"
        Type = Restore
        Client= HOSTNAME.llnl.gov
        FileSet="HOSTNAME.llnl.gov FileSet"
        Storage = FileStorageD##
        Pool = HOSTNAME.llnl.gov-File
        Messages = Standard
        #Where = /tmp/bacula-restores
    }
    
    FileSet {
        Name = "HOSTNAME.llnl.gov FileSet"
        Include {
            Options {
                signature = MD5
                compression = GZIP6
                fstype = ext2
                fstype = xfs
                fstype = jfs
                fstype = ufs
                fstype = zfs
                onefs = no
                Exclude = yes
                @/usr/local/etc/bacula/excludes.d/common.conf
            }
            File = /
            File = /usr/local
            Exclude Dir Containing = .excludeme
        }
        Exclude {
            @/usr/local/etc/bacula/excludes.d/unix.conf
        }
    }
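
The placeholders are HOSTNAME, the FileStorageD## storage name, and the @@ schedule. The create_client script below fills them in with sed, but a one-off by hand is just this (hostname, drive, and schedule made up):

    sed -e 's/HOSTNAME/somehost/g' \
        -e 's/FileStorageD##/W03FileStorageD217/' \
        -e 's/@@/9pm/' \
        clients.d/TEMPLATE-unix > clients.d/9pm/somehost.conf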

The Create Client Script

So here is what really makes creating clients easy for us: the create_client script.

I didn’t really want to do it this way, so part of me is very ashamed of this tool. I would have preferred to rewrite it in Python, or make a web page out of it and let admins create clients from their desktops. Or, I would have loved to create a Puppet module to handle this automagically (but that would exclude everything that isn’t running Puppet, which is huge).

With that disclaimer, here is my create_client shell script:

    #!/usr/bin/env bash
    # usage: cclient -t unix -s 12am -h hostname
    #
    
    umask 022
    
    # Variables
    ## Randomize Schedule
    SCHEDULES="4pm 5pm 6pm 7pm 8pm 9pm 10pm 11pm 12am 1am 2am 3am 4am 5am 6am 7am 8am 9am 10am"
    s=($SCHEDULES)
    num_s=${#s[*]}
    RAND_SCHED=${s[$((RANDOM%num_s))]}
    # Randomize which storage node we use
    NODES="write-06 write-01 write-06 write-01 write-02 write-03 write-04 write-05"
    n=($NODES)
    num_n=${#n[*]}
    RAND_NODE=${n[$((RANDOM%num_n))]}
    
    export DRIVE=`jot -r 1 1 512`
    export BDIR="/usr/local/etc/bacula"
    export TYPE="unix"
    export SCHEDULE=$RAND_SCHED
    export HOSTNAME=""
    export STORAGE_NODE=$RAND_NODE
    export GIT_DIR="/usr/local/etc/bacula/.git"
    export CLASS="desktop"
    
    if [ $(whoami) == "root" ]
    then
    cat << EOF
                    Please do not run this as root. This script runs a
                    git add/commit, which is how changes are managed and
                    tracked. If you run this as root, then it shows up
                    as carlson39 or root.
    
                    If you encounter a problem with your normal OUN account,
                    please contact Mike Carlson, or submit a bug here:
                    https://st-scm.llnl.gov/redmine/snt/projects/bacula/issues/new
    EOF
    
    exit 1
    fi
    
    usage()
    {
    cat << EOF
    
            Usage: $0 [OPTION]... -h HOSTNAME
    
            This script will generate a bacula client definition.
    
            OPTIONS:
            -s      schedule, (4pm|5pm|6pm|7pm|8pm|9pm|10pm|11pm|12am|1am|2am|3am|4am|5am|6am|7am|8am|9am|10am). The default schedule is random.
            -t      type, (unix|win32|mac), unix is the default
            -n      storage node (write-01|write-02|...), the default is random.
            -h      hostname (use the short hostname)
    EOF
    }
    
    cd $BDIR
    
    while getopts 'c:t:s:n:h:' OPTION
    do
            case $OPTION in
                    c)
                            CLASS=$OPTARG
                            ;;
                    t)
                            TYPE=$OPTARG
                            ;;
                    s)
                            SCHEDULE=$OPTARG
                            ;;
                    h)
                            HOSTNAME=$OPTARG
                            echo $HOSTNAME | egrep -q "(llnl.gov|ucllnl.org)"
                            if [ $? -eq 0 ]
                            then
                            HOSTNAME=`echo $HOSTNAME|sed -e 's/.llnl.gov//' -e 's/.ucllnl.org//'`
                            fi
    
                            ;;
                    n)
                            STORAGE_NODE=$OPTARG
                            ;;
                    ?)
                            usage
                            exit
                            ;;
            esac
    done
    
    if [[ -z $CLASS ]] || [[ -z $TYPE ]] || [[ -z $SCHEDULE ]] || [[ -z $HOSTNAME ]] || [[ -z $STORAGE_NODE ]]
    then
            usage
            exit 1
    fi
    
    grep -w $HOSTNAME $BDIR/clients.conf
    if [ $? -eq 0 ]
    then
            echo 'client '$HOSTNAME 'already exists...'
    else
            export RETRY_COUNT="2"
    
            if [ $STORAGE_NODE == "write-01" ]
            then
                    DRIVE=`jot -r 1 33 512`
                    sed -e 's/HOSTNAME/'$HOSTNAME'/g' -e 's/FileStorageD##/FileStorageD'$DRIVE'/' -e 's/\@\@/'$SCHEDULE'/' -e 's/RETRY_COUNT/'$RETRY_COUNT'/g' $BDIR/clients.d/TEMPLATE-$TYPE > $BDIR/clients.d/$SCHEDULE/$HOSTNAME.conf
                    echo \@$BDIR/clients.d/$SCHEDULE/$HOSTNAME.conf >> $BDIR/clients.conf
            else
                    export SN=`echo $STORAGE_NODE | cut -c 7-8`
                    sed -e 's/HOSTNAME/'$HOSTNAME'/g' -e 's/FileStorageD##/W'$SN'FileStorageD'$DRIVE'/' -e 's/\@\@/'$SCHEDULE'/' -e 's/RETRY_COUNT/'$RETRY_COUNT'/g' $BDIR/clients.d/TEMPLATE-$TYPE > $BDIR/clients.d/$SCHEDULE/$HOSTNAME.conf
                    echo \@$BDIR/clients.d/$SCHEDULE/$HOSTNAME.conf >> $BDIR/clients.conf
            fi
    
            chgrp st-bacula-admins $BDIR/clients.d/$SCHEDULE/$HOSTNAME.conf
            git add $BDIR/clients.d/$SCHEDULE/$HOSTNAME.conf $BDIR/clients.conf
            git commit
            echo 'created client definition: '$BDIR/clients.d/$SCHEDULE/$HOSTNAME.conf
            echo 'for '$HOSTNAME'.llnl.gov'
    fi

This is always a work in progress, but at its core it is a simple sed wrapper with a lot of randomization and a git commit.

Why all the randomization?

Because I had to add around 1000 clients in a VERY short amount of time. We didn’t have a problem pushing the Bacula client to all of the platforms, or the bacula-fd.conf file for that matter. What I could not do was spend the time to create and manage all of the resources for each client by hand. That is why I have so many devices/drives: so I can attempt a 1:1 client-to-device mapping without having to actually think about it.

So, I wrote ANOTHER script to wrap around this one for when I need to do bulk client creation. I’m not going to post that; it just loops through the above command.
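
Purely for illustration (this is not the actual wrapper), the idea is nothing more than a loop over a host list:

    #!/usr/bin/env bash
    # Hypothetical bulk wrapper: hosts.txt holds one short hostname per line
    while read -r host
    do
        ./create_client.sh -h "$host"
    done < hosts.txt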

Pre-Job command - Package List

I only do this on the Unix/Linux clients, and I thought it was a cool idea.

Yeah, I will pat myself on the back a little bit for that :)

I exclude the operating system from backups for two reasons: 1) to reduce backing up duplicate and reproducible data, and 2) our build/imaging process is so quick and clean that it is just faster to rebuild than to restore everything.

Still, I needed a way to capture the state of the installed packages/software.

This is where the pre-job command comes in handy. This part right here:

        RunScript {
            RunsWhen = Before
            FailJobOnError = no
            Command = "/etc/scripts/package_list.sh"
            RunsOnClient = yes
        }

That package_list.sh script looks like this:

#!/usr/bin/env bash
# Write the list of installed packages to a file that the backup will pick up

export PLIST="/root/plist.txt"
    
case "`uname -s`" in
Linux)
   if [ -x /usr/bin/lsb_release ]; then
       DIST=`lsb_release -d`
   fi
    
   # RHEL
   if [ -x /usr/bin/up2date ]; then
       rpm -qa > $PLIST
   fi
    
   # RHEL 5
   if [ -x /usr/bin/yum ]; then
       if [ -f /var/run/yum.pid ]; then
           echo "Yum currently in use, exiting gracefully..."
           exit 0
       else
           /usr/bin/yum list installed | awk '{print $1}' > $PLIST
       fi
   fi
    
   # Ubuntu
   if [ -x /usr/bin/dpkg ]; then
       /usr/bin/dpkg --get-selections | awk '{print $1}' > $PLIST
   fi
       ;;
    
FreeBSD)
   pkg_info|awk '{print $1}' > $PLIST
   ;;
SunOS)
   pkginfo |awk '{print $1}' > $PLIST
   ;;
esac

That file, /root/plist.txt, gets backed up.
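
And since it is just another file in the FileSet, pulling it back for a rebuilt machine is a normal restore. From bconsole, something like this (hypothetical client name):

    # Grab only the package list from the most recent backup
    restore client=somehost.llnl.gov current select
    cd /root
    mark plist.txt
    done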

Now we have a record of what was installed on our Unix platforms :)