Legato NetWorker
We use Legato to back up the majority of our systems. Here are some usage notes.
Client Setup
Checking backup results in Legato (CSCF staff)
- Log in to backup.cs (cscf.cs also works)
- (ssh to cscf, suw, ssh backup.cs)
- Run nwadmin
- nwadmin is an X application; if you are connecting from Windows, make sure your X server is running
- Go to Clients -> Indexes (or just click the "Indexes" icon)
- Scroll down to the machine in question
- Click on a file system
- Click on "Instances"
Note: you can run nwadmin from any Unix machine to check backups. Not everyone has an account on backup.cs, and anyone running nwadmin as root on backup.cs must be very careful.
nwadmin has not been available since version 7.3.3.
Instead, run mminfo on any Unix host, for example:
ubuntu1204-102[48]% mminfo -s backup.cs -c m160.cs -t "3 days ago" -ot
volume client date size level name
77001 m160.cs 13-05-01 2 KB full /data2
77001 m160.cs 13-05-01 2 KB full /data
77001 m160.cs 13-05-01 545 GB full /
Disk202 m160.cs 13-05-02 4 B incr /data2
Disk202 m160.cs 13-05-02 4 B incr /data
Disk202 m160.cs 13-05-02 8345 MB incr /
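When checking many clients, it can help to reduce this listing to the newest save per file system. The sketch below uses a hypothetical helper, latest_save, which is not a NetWorker tool. It reads mminfo output on standard input and prints the date and level of the most recent save set for one file system. It assumes the column layout shown above, the time ordering from -ot (oldest first), and sizes printed as a number plus a unit.

```shell
#!/bin/sh
# latest_save NAME
# Hypothetical helper: reads `mminfo ... -ot` output on stdin. Data rows look like
#   volume client date size-number size-unit level name
# and the function prints "DATE LEVEL" for the newest save set of file system NAME.
# Because -ot lists rows in time order, the last matching row wins.
# It skips the header line and returns 1 if NAME never appears.
latest_save() {
    awk -v fs="$1" '
        NR > 1 && $7 == fs { date = $3; level = $6; found = 1 }
        END { if (found) print date, level; else exit 1 }'
}
```

Usage: `mminfo -s backup.cs -c m160.cs -t "3 days ago" -ot | latest_save /` would print `13-05-02 incr` for the example listing above.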
Restore files from backups
- Quick guide:
- Run the command recover (become root first if you are recovering files for users other than yourself)
- help to get a list of available commands
- changetime to change the time you wish to recover files from
- ls to list files
- cd to change directories
- add to add files or directories for recovery
- rel if you need to relocate where recovered files are finally restored to.
- recover to actually do the restore after using the previously listed commands
- See also https://cs.uwaterloo.ca/cscf/internal/infrastructure/setups/services/backup/restore.shtml
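To restore a single path without the interactive browser, recover can also run non-interactively. The sketch below prints such a command so you can review it before running it. It assumes the -s (server), -t (as-of date), -d (destination directory) and -a (automatic, no browsing) options described in the recover man page; confirm them with `man recover` on your client. restore_cmd is a hypothetical helper name.

```shell
#!/bin/sh
# restore_cmd PATH WHEN DEST
# Hypothetical helper: prints (does not run) a non-interactive recover command
# that restores PATH as it existed at WHEN into directory DEST, so nothing
# in place is overwritten. Review the printed line, then run it
# (as root if PATH belongs to another user).
restore_cmd() {
    if [ $# -ne 3 ]; then
        echo "usage: restore_cmd PATH WHEN DEST" >&2
        return 2
    fi
    printf 'recover -s backup.cs -t "%s" -d "%s" -a "%s"\n' "$2" "$3" "$1"
}
```

Restoring into a scratch directory and copying files back by hand avoids overwrite prompts. It also does the same job as changetime, add, rel and recover in the interactive session.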
Monitor what Legato is doing
It's safer to run this from the client machine.
Alternatively, log in to the client machine, run recover, cd to the directory you want information about, and then type versions.
Brief Information about backups
The backup server is backup-0.cs (also known as backup.cs), a Dell PowerEdge R815 running Red Hat Enterprise Linux Server release 6.3. We are running NetWorker 8.1.1. There are four LTO 5 tape drives in a Qualstar RLS-8500 library (jukebox). The four drive devices are:
/dev/tape/by-id/scsi-350050763124cb325-nst (drive 1)
/dev/tape/by-id/scsi-35005076312486d24-nst (drive 2)
/dev/tape/by-id/scsi-350050763124c5454-nst (drive 3)
/dev/tape/by-id/scsi-350050763124c485f-nst (drive 4)
The jukebox consists of two cabinets. The top one is an expansion unit (slots 1 to 114); the bottom one is the main unit (slots 115 to 222). All four drives are in the top unit (Cabinet 1). The expansion unit has one I/O port on its left side (ports 1-4), and the main unit has two I/O ports, one on each side (ports 5-8 on the left, 9-12 on the right). Each I/O has 4 ports. The jukebox is operated from a touch screen. For example, to open an I/O, touch the "open" button next to it; push it back in to close it. The jukebox operation manual is on top of the unit and can also be found at 511000.pdf.
Sometimes the jukebox reports a "Fault". Read the information on the touch screen, fix the problem, then press "Dismiss" on the screen. There are 222 slots in the Qualstar library; the cleaning tape is in slot 222. Run nsrjb to list the tapes in all slots and the tapes mounted in all drives.
We run backups at four levels: incremental, weekly, monthly and full (end of term). All special saves (weekly, monthly and full) are cloned. Special saves are spread across the week to reduce load, so on any given day an incremental save and one special save are running. In addition, a full save of the bootstrap runs every day. All incremental and weekly saves for all machines have been moved to disk.
We divide all clients into three groups: CSCF_Default_Group, CSCF_Mac_Group and CSCF_Pc_Group. We also back up some MFCF machines, which are in MFCF_Default_Group. Tape (volume) naming convention:
Save level      LTO 3 tapes or disk volumes   LTO 5 tapes
Incremental     Disk2XX
Weekly          Disk3XX, 35XXX                37XXX
Weekly clone    45XXX                         47XXX
Monthly         55XXX                         57XXX
Monthly clone   65XXX                         67XXX
Full            75XXX                         77XXX
Full clone      85XXX                         87XXX
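The convention can be decoded mechanically. For a tape, the first digit gives the save level and the second digit gives the tape generation (5 for LTO 3, 7 for LTO 5). The sketch below is a hypothetical helper, volume_kind, that applies the table above. It assumes tape labels are exactly five digits and disk labels are "Disk" followed by three digits.

```shell
#!/bin/sh
# volume_kind VOLUME
# Hypothetical helper: decodes a volume label using the naming table above
# and prints "LEVEL, MEDIA". Prints "unknown" and returns 1 otherwise.
volume_kind() {
    case "$1" in
        Disk2[0-9][0-9]) echo "incremental, disk"; return 0 ;;
        Disk3[0-9][0-9]) echo "weekly, disk"; return 0 ;;
        [345678][57][0-9][0-9][0-9]) ;;          # five-digit tape label
        *) echo "unknown"; return 1 ;;
    esac
    case "$1" in                                 # first digit: save level
        3*) level="weekly" ;;
        4*) level="weekly clone" ;;
        5*) level="monthly" ;;
        6*) level="monthly clone" ;;
        7*) level="full" ;;
        8*) level="full clone" ;;
    esac
    case "$1" in                                 # second digit: tape generation
        ?5*) media="LTO 3" ;;
        ?7*) media="LTO 5" ;;
    esac
    echo "$level, $media"
}
```

For example, `volume_kind 77057` prints `full, LTO 5`.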
The NetApps (fs03.student.math, fs102.cs, and fs02.student.cs) are backed up once a week, as a special save on Saturdays. Backing up fs02 is much slower than backing up fs102.
Shut Down and Start NetWorker Server
Before shutting down the NetWorker server, always check that no user is recovering files and that no backups are running.
Use nsr_shutdown to shut down the NetWorker server. Use /etc/init.d/networker start to start the server and /etc/init.d/gst start to start the NetWorker Management Console (NMC). If the NetWorker server or NMC does not start after backup.cs reboots, run these two commands by hand.
Add New Client
Use /u/gxshen/bin/add_networker_NT_cs_client or /u/gxshen/bin/add_networker_cs_client to add a new client, depending on the client's architecture.
Withdraw and Deposit Tapes
- Use "nsrjb -w [-P port] volume" to withdraw a tape; if no port is given, the tape is moved to port 1. For example, if tape 75149 is in the left storage matrix of the bottom unit (slots 115-168), choose any of ports 5-8 and run nsrjb -w -P 5 75149. Then press the Open button to open the I/O port and take volume 75149 out.
- To deposit a tape, open the I/O port and load the tape into it. Then run nsrjb -d [-P port] -S slot_number volume, or nsrjb -d [-P port] -S slot for an unlabeled tape with a barcode. For example: nsrjb -d -P 7 -S 150 75147. With version 8.1.1, wait for the following prompt and answer yes:
Load the cartridges into the ports, and enter Yes to continue.
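The nearest I/O port follows from the slot layout described under "Brief Information about backups". The sketch below prints (but does not run) the matching nsrjb command. io_ports_for_slot, withdraw_cmd and deposit_cmd are hypothetical helpers. One assumption goes beyond the text: slots 169-222 are taken to be the right-hand matrix of the main unit, served by ports 9-12.

```shell
#!/bin/sh
# io_ports_for_slot SLOT
# Hypothetical helper: prints the I/O ports nearest to a storage slot.
# Slots 1-114: top expansion cabinet, ports 1-4.
# Slots 115-168: left matrix of the main unit, ports 5-8.
# ASSUMPTION: slots 169-222 are the right matrix of the main unit, ports 9-12.
io_ports_for_slot() {
    case "$1" in ''|*[!0-9]*) echo "not a slot number: $1" >&2; return 1 ;; esac
    if   [ "$1" -ge 1 ]   && [ "$1" -le 114 ]; then echo "1 2 3 4"
    elif [ "$1" -ge 115 ] && [ "$1" -le 168 ]; then echo "5 6 7 8"
    elif [ "$1" -ge 169 ] && [ "$1" -le 222 ]; then echo "9 10 11 12"
    else echo "slot out of range: $1" >&2; return 1
    fi
}

# withdraw_cmd VOLUME SLOT: print the withdraw command using the first nearby port.
withdraw_cmd() {
    ports=$(io_ports_for_slot "$2") || return 1
    set -- "$1" $ports              # $2 is now the first suggested port
    echo "nsrjb -w -P $2 $1"
}

# deposit_cmd VOLUME SLOT: print the deposit command using the first nearby port.
deposit_cmd() {
    ports=$(io_ports_for_slot "$2") || return 1
    slot=$2
    set -- "$1" $ports
    echo "nsrjb -d -P $2 -S $slot $1"
}
```

For example, `withdraw_cmd 75149 150` prints `nsrjb -w -P 5 75149`, which matches the withdraw example above.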
Inventory
We sometimes need to inventory the jukebox. In the nwadmin GUI, choose an empty drive. If no drive is empty, unmount the tape from a drive that is not currently in use. Then click Media -> Inventory, fill in the first and last slots, and click OK. Alternatively, run nsrjb -I -f device -S first[-last_slot], where device is one of the four drive devices listed above; running "nsrjb" also shows them.
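The device paths are long, so a small lookup by drive number helps avoid typos. The sketch below maps drives 1-4 to the /dev/tape/by-id paths listed under "Brief Information about backups" and prints (but does not run) the inventory command. drive_device and inventory_cmd are hypothetical helpers. Pick a drive that is empty, as described above.

```shell
#!/bin/sh
# drive_device N: print the device path of drive N (1-4).
drive_device() {
    case "$1" in
        1) echo /dev/tape/by-id/scsi-350050763124cb325-nst ;;
        2) echo /dev/tape/by-id/scsi-35005076312486d24-nst ;;
        3) echo /dev/tape/by-id/scsi-350050763124c5454-nst ;;
        4) echo /dev/tape/by-id/scsi-350050763124c485f-nst ;;
        *) echo "no such drive: $1" >&2; return 1 ;;
    esac
}

# inventory_cmd DRIVE FIRST [LAST]
# Hypothetical helper: prints the nsrjb inventory command for slot FIRST,
# or for slots FIRST-LAST when LAST is given.
inventory_cmd() {
    dev=$(drive_device "$1") || return 1
    range=$2
    if [ -n "$3" ]; then
        range="$2-$3"
    fi
    echo "nsrjb -I -f $dev -S $range"
}
```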
Label Tapes
A tape can be used only after it is labeled. To label tapes with nwadmin, choose a drive with no tape mounted, then choose Media -> Label -> Pool, fill in the first and last slots, and click OK. Alternatively, use /u/gxshen/bin/labelLto5Tapes.
Note: nwadmin is not available in version 8.1.1; use NetWorker Management Console to label tapes instead. The password for NetWorker Management Console is kept in the usual place under the letter B. To start NetWorker Management Console, launch mozilla on backup.cs and point it to http://backup-0.cs:8000
Re-enable a Drive Using NetWorker Management Console
Sometimes a drive is disabled or put into service mode after a certain number of errors. The drive must be re-enabled before it can be used again; check the drive before re-enabling it.
- In NetWorker Management Console, click NetWorker to launch the Administration window.
- In the Administration window, click Devices. The Devices detail table appears.
- Right-click the drive to be enabled, and select Properties. The Properties window appears.
- On the General tab, in the Status area, set Enabled to Yes.
- Click OK.
Clone tapes in MC 3015
We store the most recent clone tapes in MC 3015. Run /u/gxshen/bin/getOffsiteTapes to get the list of tapes; we usually do this once a week.
Daily Operation
on cscf.cs
- To monitor backups and recoveries, run nwadmin (/software/networker-7_client/bin/nwadmin), or run nsrwatch on any host that is a backup client.
- To expire tape volume(s), run /u/gxshen/bin/expireTapes on backup.cs
- To expire disk volume(s), run /u/gxshen/bin/expireDiskVolumes
- The output of nsrwatch will then show something like "space recovered from volume Disk20X or Disk30X".
Index files of the expired save sets on disk volumes are removed immediately. Because nsrim runs only once every 24 hours, the data on an expired disk volume is removed after the next nsrim run. To mount a disk volume, run /u/gxshen/bin/mountdiskvolume Disk20X or Disk30X.
Troubleshooting
Backup logs are in /nsr/logs on backup.cs; /nsr/logs/message and /nsr/logs/OLD/daemon.raw.1 are the most helpful. To read a daemon.raw file, render it with nsr_render_log, for example:
nsr_render_log /nsr/logs/OLD/daemon.raw.1 > /tmp/gs1
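Usually only a few lines of the rendered log matter. The sketch below renders a raw log and filters it in one step. search_daemon_log is a hypothetical helper. The NSR_RENDER variable is an assumption added only so the function can be tried on a plain-text file with NSR_RENDER=cat; by default it calls nsr_render_log as shown above.

```shell
#!/bin/sh
# search_daemon_log FILE PATTERN
# Hypothetical helper: renders FILE with nsr_render_log (or with $NSR_RENDER
# if set) and prints only the lines matching PATTERN, an extended regular
# expression compared case-insensitively.
search_daemon_log() {
    "${NSR_RENDER:-nsr_render_log}" "$1" | grep -iE "$2"
}
```

Usage: `search_daemon_log /nsr/logs/OLD/daemon.raw.1 'jukebox|cannot access'`.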
- "Waiting for drive" on the LCD display of the Qualstar jukebox
Power-cycling the jukebox usually fixes the problem. If a save process is running, kill it first, then power-cycle the jukebox.
- "Library Fault". The fault messages are:
Library Faulted:
Problem in Cabinet 2 ...
Faulted while attempting to place cartridge in destination slot
Press "Dismiss" to Restart
Check the message on the library's touch screen; if nothing is unusual, press "Dismiss".
Go to the machine room to have a look.
If a red letter is lit on the drive, pressing the reset button on the front of the drive usually fixes the problem. Then enable the drive using NetWorker Management Console.
If a tape has been ejected from a drive but NetWorker has not removed it, run nsrjb to find out which tape is in the drive and which slot it belongs in. Take the tape out of the drive and put it back in, enable the drive, and run
nsrjb -u -f device
where device is one of the four drive devices listed above (running "nsrjb" also shows them). If that doesn't fix the problem, put the tape in the drive, shut down and restart NetWorker, then run
nsrjb -u -f device
Since NetWorker 7.6.4, this situation seems to change the slot where the tape is stored, so you usually need to inventory the slot(s) found above.
If an orange "E" is flashing on a drive, we have usually lost the device file for that drive, /dev/nstN. Normally there are four:
[root@backup-0 ]# ls -l /dev/nst?
crw-rw---- 1 root tape 9, 128 Sep 19 14:40 /dev/nst0
crw-rw---- 1 root tape 9, 129 Sep 25 09:59 /dev/nst1
crw-rw---- 1 root tape 9, 130 Sep 19 14:44 /dev/nst2
crw-rw---- 1 root tape 9, 131 Sep 19 14:45 /dev/nst3
If the OS can still see the library (jukebox), atinfo shows something like this:
[root@backup-0 ]# /usr/local/sbin/atinfo -i all
[...]
B:T:L     Vendor   Product / Rev         Type    Capacity
          Serial Number
--------- -------- ---------------- ---- ------- ----------
0:1:0     IBM      ULTRIUM-HH5      D2A1 Tape    N/A
          SN:9068051328
0:1:1     QUALSTAR RLS-85           007D Changer N/A
          SN:213034410
0:2:0     IBM      ULTRIUM-HH5      D2A1 Tape    N/A
          SN:1068074057
0:3:0     IBM      ULTRIUM-HH5      D2A1 Tape    N/A
          SN:1068073885
The "QUALSTAR" line is the jukebox, and only three of the four drives appear in this output. To recover the missing drive, go to the back of the jukebox and slide the drive that has the "E" error out and back in. In the "ls -l /dev/nst?" output above, /dev/nst1 has a newer timestamp; that is the result of sliding the drive out and back in. If the OS cannot see the jukebox, slide the drive in question out and back in. If the jukebox is still not visible, power-cycle the jukebox and restart NetWorker.
- cannot access the hardware
If nsrwatch shows this error, wait for 2 minutes. It is usually followed by a message like "Hardware status of jukebox changed from 'cannot access the hardware' to 'ready'". Check the log /nsr/logs/message. This error also appears when an I/O port is opened or when the library cleans a tape drive. If it is a real error and the OS can see all drives and the jukebox, restart NetWorker. If the OS cannot see any drive, or can see all drives but not the jukebox (see above), power-cycle the jukebox and restart NetWorker. The jukebox uses the first drive for the library interface. Since we set drive 1 to read-only, we have never needed to power-cycle the jukebox for a "cannot access hardware" message.
- Waiting for more available space
We use the 'adv_file' type for disk volumes, so a save set is not continued from one disk volume onto another. If you see a message like '(alert) Waiting for more available space on filesystem /jet_disk1 for device /jet_disk1', run /u/gxshen/bin/deleoldsavesets. Alternatively, kill the save process on backup.cs, mark the volume as full (nsrmm -o full -y Diskvolume), mount another available disk volume (see /u/gxshen/bin/mountdiskvolume for hints), and run a make-up save for the aborted clients (found from the logs or emails).
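For the orange "E" case above, the first thing to check is whether all four /dev/nstN files still exist. The sketch below is a hypothetical helper, missing_nst_devices. It assumes the four drives map to /dev/nst0 through /dev/nst3, as in the listing above. The optional directory argument exists only so the function can be tried against a scratch directory.

```shell
#!/bin/sh
# missing_nst_devices [DIR]
# Hypothetical helper: checks that nst0..nst3 exist in DIR (default /dev).
# Prints the missing names and returns 1, or reports that all four exist.
missing_nst_devices() {
    dir=${1:-/dev}
    missing=""
    for n in 0 1 2 3; do
        [ -e "$dir/nst$n" ] || missing="$missing nst$n"
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "all four present"
}
```

If a device is reported missing, continue with the atinfo check and the slide-out/slide-in procedure described above.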
Please see the phone numbers at the end of 'Disaster Recovery' for hardware and EMC NetWorker support.
Upgrade NetWorker Server
- Record the latest bootstrap save set ID and its associated volume label. Run mminfo -B to get this information.
- Save a copy of the current configuration, i.e., /nsr/res
- Save a copy of /etc/{rpc,syslog.conf}
- Shut down the NetWorker server by running nsr_shutdown.
- Shut down NetWorker Console server by running /etc/init.d/gst stop
- Remove the earlier NetWorker release in the following order:
  - lgtonmc (NetWorker Management Console)
  - lgtoserv (server package)
  - lgtonode (storage node package)
  - lgtolicm (licensing manager package)
  - lgtoclnt (client package)
  - lgtoman (Man Pages)
- Install the new NetWorker release in the following order:
  - lgtoclnt
  - lgtonode
  - lgtoserv
  - lgtolicm
  - lgtoman
  - lgtonmc
- Start the NetWorker daemons by running /etc/init.d/networker start.
- Start the NetWorker Console server by running /etc/init.d/gst start
- Check that the daemons nsrd, nsrexecd, nsrindexd, nsrmmdbd and nsrmmd are running.
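The daemon check in the last step can be scripted. The sketch below is a hypothetical helper, missing_daemons. It reads a process-name list on standard input and reports which of the five daemons named above are absent. Feeding it from `ps -e -o comm=` is an assumption about how you list processes on backup.cs.

```shell
#!/bin/sh
# Daemons that should be running after an upgrade (from the step above).
REQUIRED_DAEMONS="nsrd nsrexecd nsrindexd nsrmmdbd nsrmmd"

# missing_daemons
# Hypothetical helper: reads one process name per line on stdin and prints
# which required daemons are absent (returning 1), or confirms all are there.
missing_daemons() {
    running=$(cat)
    missing=""
    for d in $REQUIRED_DAEMONS; do
        # Anchored match on the whole name, tolerating blanks from ps padding.
        printf '%s\n' "$running" | grep -q "^[[:space:]]*$d[[:space:]]*\$" ||
            missing="$missing $d"
    done
    if [ -n "$missing" ]; then
        echo "not running:$missing"
        return 1
    fi
    echo "all NetWorker daemons running"
}
```

Usage: `ps -e -o comm= | missing_daemons`.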
You may need to enter the license enabler code after upgrading. To enter the license enabler code:
- Start the NetWorker Management Console software if it is not started.
- Open the Administration window:
- In the Console window, click Enterprise.
- In the left pane, click a NetWorker server in the Enterprise list.
- In the right pane, click the application.
- From the Enterprise menu, select Launch Application. The Administration window is launched as a separate application.
- In the Administration window, click Configuration.
- In the left pane, select Registrations.
- From the File menu, select New. The Create Registration dialog box appears.
- In the Enabler Code attribute, type the enabler code.
- In the Name attribute, type the name of the license.
- (Optional) In the Comment attribute, type a description of the license.
- Click OK.
If you need to enter the enabler code, you also need an authorization code; request it by email from licensing@emc.com. Once you have the code, enter it with the procedure above, with one difference: instead of selecting New, click the registration you just created and enter the authorization code in the auth code field.
Legato NetWorker Directive File
General Description
During backup processes, Legato NetWorker uses directives to control how particular files are to be backed up, how descendant directories are searched, and how subsequent directives are processed.
On the backup server we use directives to skip /tmp, /cdrom, /var/tmp, /mnt, and /floppy. We also use them to back up /var/mail with mail-style file locking while preserving the "new mail has arrived" flag.
A .nsr directive file is parsed before any file in its directory is backed up, so any user can create a .nsr file in his or her home directory (or its subdirectories) to exclude files from backup. A privileged user can place a .nsr file in the root directory (/) to exclude a whole file system from backup. Each line of a .nsr file contains one directive. The most useful directive for a user is the skip directive, which excludes the specified files and directories from backup. Standard shell file pattern matching (*, [...], [!...], [x-y], ?) can be used to match file names. If a "+" precedes skip, the directive is propagated to subdirectories.
Examples of .nsr File
A /.nsr file containing:
<< /usr/src >>
+skip: core *.o
+compressasm: .
skips all files named core or *.o in /usr/src and its subdirectories. The other files in /usr/src are compressed during backup (and set up for automatic decompression on recover).
The following .nsr file skips everything in the directory it is placed in (and in its subdirectories). This is useful for skipping a directory used for large temporary files.
<< . >>
+skip: .
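A user can set this up for a scratch directory from the shell. The sketch below is a hypothetical helper, skip_directory. It writes exactly the two directive lines shown above into DIR/.nsr and refuses to overwrite an existing .nsr file, so directives already in place are not lost.

```shell
#!/bin/sh
# skip_directory DIR
# Hypothetical helper: writes the "skip everything" directive into DIR/.nsr,
# refusing to overwrite an existing .nsr file.
skip_directory() {
    if [ ! -d "$1" ]; then
        echo "not a directory: $1" >&2
        return 1
    fi
    if [ -e "$1/.nsr" ]; then
        echo "$1/.nsr already exists" >&2
        return 1
    fi
    printf '%s\n' '<< . >>' '+skip: .' > "$1/.nsr"
}
```

Usage: `skip_directory "$HOME/scratch"`.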
A .nsr file containing
<< . >>
skip: *.jpg *.gif
skips files named *.jpg or *.gif only in that directory, not in its subdirectories, because skip has no "+" sign.
The following example skips everything in /toberaw and /toberaw2:
<< /toberaw >>
+skip: .?* *
<< /toberaw2 >>
+skip: .?* *
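The /toberaw example lists both .?* and *. That suggests * does not match a name starting with "." here, just as in sh filename expansion. The sketch below roughly emulates that rule so you can test which names a skip line would catch. It is an assumption-based approximation, not NetWorker's own matcher, and nsr_skip_matches is a hypothetical helper.

```shell
#!/bin/sh
# nsr_skip_matches NAME PATTERN...
# Rough emulation (an assumption, not NetWorker's code): returns 0 if NAME
# matches any PATTERN using sh pattern matching, with the filename-expansion
# rule that a leading "." in NAME must be matched by a leading "." in PATTERN.
nsr_skip_matches() {
    name=$1
    shift
    for pat in "$@"; do
        case "$name" in
            .*) case "$pat" in .*) ;; *) continue ;; esac ;;
        esac
        # $pat is deliberately unquoted so it is treated as a pattern.
        case "$name" in $pat) return 0 ;; esac
    done
    return 1
}
```

Under this rule, `nsr_skip_matches .profile '*'` fails while `nsr_skip_matches .profile '.?*' '*'` succeeds. That is why both patterns are needed to skip everything.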
For more information about .nsr files, see the man pages nsr(5), nsr_directive(5), and uasm(1), on which this note is based.
Disaster Recovery
Legato NetWorker can recover client machines as well as the backup server itself. If our backup server, backup-0.cs, crashes, we need to perform a disaster recovery. The NetWorker server is currently installed under /nsr -> /fsys/nsr.
1. Reinstall the operating system; we are running Red Hat Enterprise Linux Server release 6.3. One Jetstor disk shelf is attached to backup.cs.
2. Reinstall the NetWorker software from the Legato web site; we are currently running NetWorker 8.1.1. Be sure to install the packages in the correct order:
lgtoclnt
lgtonode
lgtoserv
lgtolicm
lgtoman
lgtonmc
For example
# rpm -ivh lgtoclnt-8.1.1.2-1.x86_64.rpm lgtonode-8.1.1.2-1.x86_64.rpm \
lgtoserv-8.1.1.2-1.x86_64.rpm lgtolicm-8.1.1.2-1.x86_64.rpm
# rpm -ivh lgtoman-8.1.1.2-1.x86_64.rpm
# rpm -ivh lgtonmc-8.1.1-1.x86_64.rpm
# /opt/lgtonmc/bin/nmc_config
The installation guide can be found at docu50625_NetWorker-8.1-SP1-Installation-Guide--.pdf
3. Start NetWorker server by running
/etc/init.d/networker start
4. Configure the jukebox by running
jbconfig
Jbconfig is running on host backup-0.cs (Linux 2.6.32-279.14.1.el6.x86_64),
and is using backup-0.cs as the NetWorker server.
1) Configure an AlphaStor Library.
2) Configure an Autodetected SCSI Jukebox.
3) Configure an Autodetected NDMP SCSI Jukebox.
4) Configure an SJI Jukebox.
5) Configure an STL Silo.
6) Exit.
which activity do you want to perform? [1] 2
14484:jbconfig: Scanning SCSI buses; this may take a while ...
Installing 'Qualstar' jukebox - scsidev@6.0.0.
What name do you want to assign to this jukebox device? lto_jukebox
Turn NetWorker auto-cleaning on (yes / no) [yes]?
The following drive(s) can be auto-configured in this jukebox:
1> LTO Ultrium-3 @ 6.1.0 ==> /dev/nst0
2> LTO Ultrium-3 @ 6.2.0 ==> /dev/nst1
3> LTO Ultrium-3 @ 6.3.0 ==> /dev/nst2
These are all the drives that this jukebox has reported.
To change the drive model(s) or configure them as shared or NDMP drives,
you need to bypass auto-configure. Bypass auto-configure? (yes / no) [no]
Jukebox has been added successfully
The following configuration options have been set:
> Jukebox description to the control port and model.
> Autochanger control port to the port at which we found it.
> Networker managed tape autocleaning on.
> Barcode reading to on.
> Volume labels that match the barcodes.
> Slot intended to hold cleaning cartridge to 132. Please insure that a
cleaning cartridge is in that slot
> Number of times we will use a new cleaning cartridge to 5.
> Cleaning interval for the tape drives to 6 months.
You can review and change the characteristics of the autochanger and its
associated devices using the NetWorker Management Console.
Would you like to configure another jukebox? (yes/no) [no]
5. Reset the jukebox by running
nsrjb -HE
6. Recover the media database and resource configuration files.
1) Get the latest bootstrap save set ID from cscf_nw_maint@backup (currently gxshen, dlgawley, wcwince, daroloso, jjohnsto). That save set is most likely on a 77XXX or 87XXX tape. Run "nsrjb" to find out which slot that tape is in, then run
nsrjb -Inv -S# -f device-name
where # is the slot number found above and device-name is a drive of your choice, say /dev/nst1.
If you cannot get the bootstrap save set ID, run "scanner -B device-name" to find it.
For example, suppose the first and last lines of the bootstrap listing are:
date time level ssid file record volume
...
05/28/2014 03:43:24 AM full 2911212444 229 0 77057
2) Run
mmrecov -v (for example)
What is the name of the device you plan on using [/dev/nst0]? /dev/nst1
Enter the latest bootstrap save set id: 2911212444
Enter starting file number (if known) [0]: 229
Enter starting record number (if known) [0]: 0
Please insert the volume on which save set id 2911212444 started
into /dev/nst0. When you have done this, press :
Scanning /dev/nst0 for save set 2911212444; this may take a while...
7. Stop the Legato server by running
/etc/init.d/networker stop
8. Move the res directory out of the way and rename res.R to res:
cd /nsr; mv res res.tmp; mv res.R res
9. Restart Legato server by running
/etc/init.d/networker start
10. Reset the jukebox by running
nsrjb -j lto_jukebox -HE
and then re-inventory it:
nsrjb -j lto_jukebox -Iv
11. Run
nsrck -L7 to recover the indexes
12. Recover /.software/local from backups, or the whole xhier tree /.software if needed.
13. Run a test backup and recovery to make sure the server is fully recovered.
14. For more information, see
NetWorker_8.1_Server_Disaster_Recovery_and_Availability_Best_Practices_Guide.pdf
15. We have a hardware service contract with Qualstar: phone 1-877-444-1744 or email support@qualstar.com
16. For software support, contact EMC at 1-877-534-2867. Our site ID is 4347277
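In step 6, mmrecov asks for the save set ID, file number and record number, and you also need the volume. All four are the last four fields of the last line of the bootstrap listing. The sketch below is a hypothetical helper, bootstrap_answers, that extracts them from that listing on standard input. Counting fields from the end is a design choice so that the time column works with or without an AM/PM field. Data lines are recognized by a numeric save set ID.

```shell
#!/bin/sh
# bootstrap_answers
# Hypothetical helper: reads bootstrap listing lines on stdin, e.g.
#   date time level ssid file record volume
#   05/28/2014 03:43:24 AM full 2911212444 229 0 77057
# and prints the values from the last data line. Fields are counted from the
# end of the line; the header and "..." lines are skipped because their
# ssid column is not numeric. Returns 1 if no data line is found.
bootstrap_answers() {
    awk 'NF >= 7 && $(NF-3) ~ /^[0-9]+$/ {
             ssid = $(NF-3); file = $(NF-2); rec = $(NF-1); vol = $NF
         }
         END {
             if (ssid == "") exit 1
             printf "ssid=%s file=%s record=%s volume=%s\n", ssid, file, rec, vol
         }'
}
```

For the example in step 6, this prints `ssid=2911212444 file=229 record=0 volume=77057`. These are the values entered at the mmrecov prompts.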
Debian / Ubuntu clients
See
LinuxLegatoClientSetup.