NVR5100/SEB5100/DVR5300 RAID 5 Array & Linux XFS-Filesystem Troubleshooting

NOTICE

POTENTIAL FOR DATA LOSS.
The steps detailed in the resolution of this article may result in a loss of critical data if not performed properly. Before beginning these steps, make sure all important data is backed up in the event of data loss. If you are unsure, please contact Product Support Services prior to attempting the procedure below.

NOTICE

COMPLEX PROCEDURE REQUIRED.
The resolution of this article has many complex steps that may result in unforeseen results if not performed correctly. If you are at all unfamiliar with the requirements, please contact Product Support Services for assistance.

NOTICE

DISCONTINUED PELCO PRODUCT.
This product has been discontinued. For further information please visit pelco.com/support/discontinued-products

Issue

  • No color in the video playback timeline.
  • Unexplained color gaps in the video playback timeline.
  • Sudden, significant reduction in usual video retention time.
  • Endura Diagnostics reports errors in volumes, offline volumes, core dumps, or xfs errors.

Product Line

Pelco Video Management

Environment

  • Pelco Endura NVR5100 version 1.4+
  • Pelco Endura DVR5100 version 1.4+
  • Pelco Endura DVR5300 version 1.4+

Cause

  • Multiple Failed/Failing Hard Disk Drive(s) present in array.
  • Re-insertion of non-optimal HDD(s) into array (instead of replacement with new drive).
  • Failed/Failing RAID Array Adapter(s).
  • Excessive Heat or Vibration in installed environment.
  • Power outages or unstable/unclean power in installed environment.
  • Ungraceful restart/reboot of unit(s).
  • Compounding of minor Filesystem Corruptions caused by normal product use over time aka "wear and tear".
  • Possible Hardware or Software defect.

Resolution

note:  Exactly typed command syntax or selections are given in blue
note:  Except where directed, proceed through the steps of this article from top to bottom, in the order they are presented.
 

INDEX

Getting Started

Section 1. Reboot unit with OS Filesystem check
Section 2. SSH back into the unit and verify it has rebooted
Section 3. Check Hard Disk Drive status & RAID Array configuration
Section 4. Fully reconfigure an Array
Section 5. Check for offline/missing volumes
Section 6. Recreate volumes
Section 7. Search for and delete core dumps
Section 8. Xfs repair all volumes
Section 9. Verify Resolution
Addendum 4a

Getting Started 

1. Launch Endura Utilities, log in and press Search.
note:  The default login credentials for Endura Utilities version 2.2 and below is [Username: Administrator and Password: configapp], however version 2.3 and greater removed that unique login credential requirement, and instead authenticates against the standard Endura Credentials. The default login in that case is [Username: admin and Password: admin].

2. In the System Attributes tab, right-click the NVR/DVR in question and select SSH Into.

note:  The default Linux Administrative login credentials for Endura NVR/DVR/SEB units is [Username: root and Password: pel2899100 ] .
 
note:  If you have never SSH'd into the unit before, you will likely receive the following warning prompt...

...simply click Yes to proceed.

note: If you receive the following error...

Visit http://www.putty.org/ to download and then copy putty.exe into your workstation c:\windows\system32 folder.

note: If SSH fails to connect in any other fashion, the NVR/DVR may not be able to fully boot up, and you will need to connect a VGA Monitor and PC Keyboard directly at the local NVR/DVR console in order to proceed. 

 

Section 1. Reboot unit with OS Filesystem check

1. Begin a continuous ping of the device by right-clicking it from within Endura Utilities and choosing Continuous Ping....

2. shutdown -rF now
note: This can take 7-20 minutes, depending on the filesystem check and other factors. 

3. Watch the Continuous Ping window from step 1. Eventually the unit will stop replying to pings, which means it has fully gone down for reboot. Once it comes back up again (ping replies start to succeed again), proceed to Section 2.
note: If the unit does not stop responding to ping commands within 20 minutes, it means one of the services has failed to stop, or there is a hardware problem. If this happens, SSH in again and issue the halt command from the root prompt, then wait for the unit to turn off, and then power it back up using the front panel power button. Once it begins replying to the Continuous Ping from step 1, proceed to Section 2.
note: If the unit does not begin replying to Continuous ping after 20+ minutes, you will need to attach a local monitor and keyboard to the unit, and contact Pelco Technical Support @ 1 (800) 289-9100 for further assistance. The first troubleshooting step at this point will usually be to remove the HDDs (either the first or second set of 6, depending on where the system is becoming stuck at bootup), and then reboot without those drives in the system.



Section 2. SSH back into the unit and verify it has rebooted
1. uptime
This will tell you how long a unit has been online. This should now show only a few minutes or less, rather than hours or days.
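The same check can be made directly against /proc/uptime on the unit's Linux console; a minimal sketch (the first field of /proc/uptime is seconds since boot):

```shell
# Seconds since boot is the first field of /proc/uptime.
up_secs=$(cut -d' ' -f1 /proc/uptime)
# Drop the fractional part and convert to whole minutes.
up_min=$(( ${up_secs%.*} / 60 ))
echo "up for ${up_min} minute(s)"
# After a successful Section 1 reboot, this should print a small number.
```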



Section 3. Check Hard Disk Drive status & RAID Array configuration
1. megamgr
This brings you into the MegaRAID Manager (RAID Array Controller Configuration Utility).
note: the steps in this article assume the NVR/DVR/SEB being troubleshot is fully loaded, meaning each drive bay slot contains a hard disk drive.

2. Using the arrow keys, choose Objects > Physical.

This displays all drives on the first half (Adapter 1) of the unit.  
  • If everything is healthy, all inserted drives will be listed as * ONLIN.
  • If a drive is healthy but the array is not properly configured, it will show * READY.
  • If a drive is in the process of rebuilding, it will show * REBLD.
  • If a drive is Offline or a rebuild has failed, it will show * FAIL.
  • If the RAID Array Controller cannot communicate with a drive, the space where it should be will be completely blank, and you may receive a communication error when trying to select it with the arrow keys. If this happens, carefully remove and re-insert the drive into the slot. If this does not resolve the issue, the Hard Disk Drive, backplane, or Array Controller hardware is bad. Replace the drive and check again, or send the NVR/DVR unit into Pelco for Service.
note: By default - when entering megamgr - you are looking at the configuration of "Adapter 1". 
You must also perform all visual checks and steps in this section on Adapter 2 by pressing escape several times to get back to the main menu, then choosing Select Adapter > Adapter 2 ...

...and then Objects > Physical again.

note: The following Cases are conditions which must be observed/met before attempting any of the steps within that subsection. Please carefully review each option before proceeding. 
 
Case 1: If ALL DRIVES on both Adapters are * ONLIN  -
1. This is good, you now have two choices.
             
Choice 1a. Proceed with troubleshooting for fastest resolution -
1. Press escape and fully exit from megamgr back to the root ssh prompt, and then skip to “Section 5. Check for offline/missing volumes”.

Choice 1b. Optional Check RAID Array Parity Consistency to prevent future issues -
Consistency Check is a corrective/preventative measure to maintain the RAID Array Parity Integrity, which allows detection and automatic replacement of bad parity blocks. Finding a bad parity block during a rebuild of a failed drive is a serious problem, as the system does not have the redundancy to recover the data properly at that point*
note: The RAID Adapter/Controller models used in the Pelco DVR5300/NVR5100/SEB5100 are manufactured by LSI Logic, which recommends executing a RAID Consistency Check at least once monthly (see LSI Logic MegaRAID Configuration Software User's Guide sections 2.4.5 and 3.10*).
note: Consistency Check requires approximately 72 hours to complete, and performance is significantly impacted during this time (operations such as video search and playback will be noticeably sluggish).

1. Press escape to return to the main menu within megamgr, then select the Check Consistency menu option and press enter/return.
2. To execute a RAID Array Parity Data Consistency check on the Logical Drive for the Adapter in question, press spacebar, which will turn the highlighted entry yellow, and then press F10 and confirm Yes with enter/return to begin the check.
note: Do not proceed with any further steps in this article until Check Consistency has completed.
3. Once the Consistency Check has finished, press escape and fully exit from megamgr back to the root ssh prompt, and then skip to “Section 5. Check for offline/missing volumes” further down this article. 
 
Case 2: If NO MORE THAN 1 DRIVE on each Adapter is Offline(* FAIL)  -
1. Whenever possible, first replace the Offline Hard Disk Drive with a brand new, unused, Pelco Approved Drive. Wait 30 seconds to see if the drive goes into * REBLD state automatically.
2. If you wish to try and rebuild a drive without replacing it, or if the inserted drive does not begin rebuilding, use the arrow keys to highlight the drive and press enter, then select the Rebuild option.
note: * REBLD state will take several hours, and no further steps should be performed from this article until all drives are * ONLIN.
3. Once all drives are * ONLIN, return to Section 3, Case 1.

             SubCase 2a:  Rebuild Fails on an Adapter with NO MORE THAN 1 Offline(* FAIL) Drive -
             2a-1. If a brand new, unused, Pelco Approved Drive was not used above, do so now by returning to Case 2: Step 1.
             2a-2. If a brand new drive has also failed to rebuild, the RAID Array Parity Data is corrupt, or there is another hardware problem such as a bad Controller Card.

Case 3: Exactly 2 Offline(* FAIL) drives exist on the same Adapter -
In RAID 5, two failed drives usually mean the Array data (video in this case) is unsalvageable; however, there are rare instances where this is not the case.
To try and determine if the data may be salvageable:

1. Note the drive numbers which have failed on the Adapter in question, and then press escape and fully exit from megamgr back to the root ssh prompt.
2. Issue the following command to check the history showing when each drive has failed:
megarc -pdFailInfo -chAll -idAll -a1
note: In megamgr, the two RAID Array Adapters are referred to as "Adapter 1" and "Adapter 2"; however, when working with the Linux console command megarc (as shown in the above screenshot), these same adapters are referred to as a0 (which is Adapter 1) and a1 (Adapter 2). Based on this, the above command displays the output for the 2nd Adapter, aka "Adapter 2" within megamgr. Use -a0 instead if the failed drives are on the first adapter, aka "Adapter 1" within megamgr.

note:  Please bear in mind that the above screenshot is a less-than-ideal example, as the DVR unit it was taken from had only 1 drive failure, whereas the scenario we are working on in this section of the article applies to a unit where two drives have failed.

3. Note the time and date that the drives failed.
4. If both drives failed at approximately the same time (within a few seconds of one another), reboot the unit and then restart Section 3, otherwise proceed below to step 5.
5. Return to megamgr, and go back to Objects > Physical again.
6. Highlight the * FAIL drive which failed most recently according to the previous megarc output, press enter and select the option to Make Online.
note: This attempts to force the drive online, but will allow existing parity corruption to remain.  
7. Highlight the remaining * FAIL drive, press enter and then select the Rebuild option.
note: * REBLD state will take several hours, and no further steps should be performed from this article until all drives are * ONLIN.
note: Once * REBLD from step 7 has completed, all drives should show * ONLIN, however the first drive which was made online in step 6 is corrupt, and may cause future issues if left as-is.
8. Highlight the drive which was made online in step 6, press enter and select the Fail Drive option.
9. The Drive will go to * FAIL status. Highlight it again, press enter and then select the Rebuild option. This will clear up any corruption on the drive which was forced online.
note: * REBLD state will take several hours, and no further steps should be performed from this article until all drives are * ONLIN.
10. Once all steps are completed and all drives are * ONLIN, press escape and fully exit from megamgr back to the root ssh prompt, then skip to “Section 5. Check for offline/missing volumes”.
note: If either Make Online or Rebuild fails in steps 6-9, proceed to “Section 4. Fully reconfigure an Array” or Send the NVR/DVR/SEB unit into Pelco for Service and Repair.
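The mismatch between megamgr and megarc adapter numbering (Adapter 1/Adapter 2 versus -a0/-a1) is easy to get wrong. A small sketch of that mapping, using a hypothetical helper name for illustration:

```shell
# Hypothetical helper: translate a megamgr adapter number (1 or 2)
# into the -aN flag that the megarc console command expects.
megarc_adapter_flag() {
  echo "-a$(( $1 - 1 ))"
}

megarc_adapter_flag 1   # prints -a0 ("Adapter 1" in megamgr)
megarc_adapter_flag 2   # prints -a1 ("Adapter 2" in megamgr)
```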



Section 4. Fully reconfigure an Array 
note: if all drives are * ONLIN after the above steps, skip this section and go to “Section 5. Check for offline/missing volumes” instead.
note: If the steps in this section are needed, any video data on the Adapter/volume(s) in question is already lost. 
note: Only proceed with the steps in this section under direct supervision of Pelco Technical Support, or if you really know what you are doing. 
note: N = Adapter Number (0  = “Adapter 1”, and 1  = “Adapter 2”).
note: This will result in a complete loss of video.

1. Press escape and fully exit from megamgr, back to the root ssh prompt, and then stop the following services...
service hald stop
service nsdd stop
service syslog stop
service acpid stop
service postgresql stop
service PgAutoVacuum stop
 

2. Clear the array configuration, and then reboot. With DVR/NVR firmware v1.4 or greater, the RAID 5 configuration and volumes will be automatically recreated upon reboot.*
megarc -clrCfg -aN 
reboot 

* In some rare cases, the volumes may not be automatically created upon reboot. See Addendum 4a, if directed by Pelco Product Support.
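The service-stop sequence in step 1 above can also be run as a loop. This is a minimal sketch written as a dry run, so it only prints each command in order; removing the echo would actually stop the services on the unit:

```shell
# Dry run: print the stop command for each Endura service, in order.
# Remove "echo" to actually stop the services on the unit.
for svc in hald nsdd syslog acpid postgresql PgAutoVacuum; do
  echo service "$svc" stop
done
```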



Section 5. Check for offline/missing volumes
1. cat /proc/partitions

This shows all volumes.

2. df

This shows volumes which *are currently* online/mounted.

3. Examining the output from items 1. and 2., NVR/DVR Local volumes should be shown as:
/dev/sda1 linked to /data/local_0
/dev/sdb1 linked to /data/local_2
/dev/sdc1 linked to /data/local_1
/dev/sdd1 linked to /data/local_3

Any SEB volumes will be shown afterwards, and linked to the SEB5100 Universally Unique Identifier (UUID):
/dev/sde1 linked to /data/uuidaa03421d-4847-4b26-a8fb-5da0ee9f37ae-2_0
/dev/sdf1 linked to /data/uuidaa03421d-4847-4b26-a8fb-5da0ee9f37ae-2_2
/dev/sdg1 linked to /data/uuidaa03421d-4847-4b26-a8fb-5da0ee9f37ae-2_1
/dev/sdh1 linked to /data/uuidaa03421d-4847-4b26-a8fb-5da0ee9f37ae-2_3

note: Items listed that begin with /dev/hd** are Operating System components, and are not pertinent to this article.
note: With Firmware version 1.4, there should be 4 mounted video data volumes for each full NVR/DVR/SEB.  
note: There are exceptions to the above output, for instance if the NVR/DVR/SEB unit(s) was originally shipped or created using Firmware version 1.3 or below, or was manually recreated (incorrectly) at some point. Regardless, what is important is that all appropriate /data/local_* video data volumes exist, and that their sizes add up to the total amount of storage space for the NVR/DVR/SEB unit(s) in question (step 4 below).

4. Using the above information, answer the following questions -

Question 1. Are all /data/local_* and/or /data/uuid* volumes accounted for? (Do all the /dev/sdXX entries in the cat /proc/partitions output also show up in the df output?)
Question 2. For the specific NVR/DVR model number(s), is the appropriate amount of storage space accounted for amongst all /data/local_* entries?
Question 3. For any present SEB5100(s) models, is the appropriate amount of storage space accounted for amongst all /data/uuid* entries?
note: Due to RAID 5 Parity Data, capacity in bytes will be listed short by approximately 1 full Hard Disk Drive per Adapter. A fully loaded NVR/DVR/SEB contains 2 RAID 5 Arrays, meaning the total capacity displayed will be short by approximately 2 Hard Disk Drives.
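The expected shortfall can be sanity-checked with simple arithmetic. A sketch using hypothetical figures (6 drives of 500 GB per adapter, 2 adapters; adjust these to the unit actually being checked):

```shell
# Hypothetical figures for illustration; adjust to the unit being checked.
drives_per_adapter=6
drive_gb=500
adapters=2

# RAID 5 usable space per adapter is (N - 1) drives' worth, since
# parity consumes roughly one drive's capacity per adapter.
usable_gb=$(( (drives_per_adapter - 1) * drive_gb * adapters ))
echo "expected usable capacity: ${usable_gb} GB"   # 5000 GB with these figures
```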

If the answer to all of these questions is YES, skip to “Section 7. Search for and delete core dumps”.
If the answer to any of these questions is NO, attempt “Section 8. Xfs repair all volumes”, and then repeat this Section 5.
If this is your second attempt at this Section 5, continue to “Section 6. Recreate volumes”.
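Question 1 can also be answered mechanically by comparing the two listings. A self-contained sketch using sample device names (on the unit, you would build the two lists from the real cat /proc/partitions and df output; the comm pattern is what matters):

```shell
# Sample /proc/partitions device names (on the unit: taken from column 4).
printf '%s\n' sda1 sdb1 sdc1 sdd1 | sort > /tmp/parts.txt
# Sample df device names (on the unit: first column, /dev/ prefix stripped).
# sdd1 is deliberately missing here, simulating an unmounted volume.
printf '%s\n' sda1 sdb1 sdc1 | sort > /tmp/mounted.txt

# comm -23 prints lines present in the first file but not the second,
# i.e. partitions that exist but are not currently mounted.
comm -23 /tmp/parts.txt /tmp/mounted.txt   # prints: sdd1
```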


Section 6. Recreate volumes
note: If the steps in this section are needed, any video data on the volume(s) in question is already lost.
note: Only proceed with the steps in this section under direct supervision of Pelco Technical Support, or if you really know what you are doing.  

1. Stop all Services and unmount all data volumes:
service hald stop
service nsdd stop
service syslog stop
service acpid stop
service postgresql stop
service PgAutoVacuum stop
umount /usr/local/Pelco/Database
umount /var/log
swapoff /data/local_0/.swapfile
note: In cases where /data/local_0 is the missing volume in question, the above swapoff command will return an error stating "No such file or directory". If this happens, use the swapon command to check where the swapfile is located, as seen in the following screenshot...
 
...and then proceed with the previous swapoff command, replacing local_0 with local_1 or whichever other volume the swapon result indicated.
umount /data/* 

2. Write zeros to the volume:
note: Be sure you know which volumes need recreating and use the correct syntax for /dev/sdX below.
dd if=/dev/zero of=/dev/sdX bs=5M count=2

3. Run fdisk and recreate sdX partition:
fdisk /dev/sdX
type n for new
p for primary
1 for the first partition
press enter/return to accept the default values for the next 2 prompts
w to write the changes
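The keystroke sequence above can also be fed to fdisk non-interactively. This sketch only prints the sequence so it can be inspected first; the commented-out line shows how it would be piped into fdisk, which is destructive, so triple-check /dev/sdX before ever running it:

```shell
# The exact keystrokes from step 3: n (new), p (primary), 1 (first
# partition), two empty lines (accept defaults), w (write changes).
printf 'n\np\n1\n\n\nw\n'
# To apply non-interactively (DANGEROUS - verify /dev/sdX first):
# printf 'n\np\n1\n\n\nw\n' | fdisk /dev/sdX
```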

4. Verify the volume is no longer xfs:
file -s /dev/sdX1

5. Format the volume:
mkfs.xfs /dev/sdX1

6. Verify the volume is now xfs:
file -s /dev/sdX1

7. Reboot the unit:
reboot

note: This can take 7-25 minutes, depending on the filesystem check and other factors. You can use Endura Utilities right-click dialog option “Continuous Ping” to watch as the unit eventually goes down (stops replying to pings), and then comes back up again (pings start back up approximately 7-25 minutes later). 



Section 7. Search for and delete core dumps
1. Create a list inside the file /root/list:
find / -name core.* > /root/list 

2. Open the created file and look through the list, removing any entries that are not core dumps.
vi /root/list
note: You are now using the Linux vi editor. Press insert to begin editing; use enter/return to create new lines where needed, and delete or backspace to remove text.
note: If you aren't sure which files are not core dumps, it is recommended you contact Pelco Technical Support @ (800) 289-9100.  

3. Press escape to exit editing mode. Save the file changes by entering a colon (hold shift and the : key simultaneously), and then release shift and type wq and press enter/return.
note: If you wish to exit vi without saving the changes, enter q! instead of wq, and press enter/return

4. Delete the core dumps, which are all remaining files listed in the created text list.
for i in $(cat /root/list); do rm -rf "$i"; done
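Before deleting anything, it is worth doing a dry run of the loop. This self-contained sketch demonstrates the pattern on a scratch directory with hypothetical core dump names; on the unit, the list file would be /root/list:

```shell
# Build a scratch list of hypothetical core dumps to demonstrate the loop.
tmpdir=$(mktemp -d)
touch "$tmpdir/core.1234" "$tmpdir/core.5678"
printf '%s\n' "$tmpdir/core.1234" "$tmpdir/core.5678" > "$tmpdir/list"

# Dry run first: echo instead of rm, to see exactly what would be deleted.
while read -r f; do echo rm -f "$f"; done < "$tmpdir/list"

# Real pass: quoting "$f" handles paths containing spaces safely.
while read -r f; do rm -f "$f"; done < "$tmpdir/list"
ls "$tmpdir"   # only "list" should remain
```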



Section 8. Xfs repair all volumes
1. Stop all Services and unmount all data volumes:
service hald stop
service nsdd stop
service syslog stop
service acpid stop
service postgresql stop
service PgAutoVacuum stop
umount /usr/local/Pelco/Database
umount /var/log
swapoff /data/local_0/.swapfile
umount /data/*

 

2. Get a listing of the data volumes (i.e. sd*# volumes):
cat /proc/partitions


3. Run xfs_repair on each of the local NVR/DVR (sd*#) volumes
xfs_repair /dev/sda1
xfs_repair /dev/sdb1 
xfs_repair /dev/sdc1 
xfs_repair /dev/sdd1 
Should any volumes fail xfs_repair, go to step 4, otherwise skip to step 5. 

4. Attempt to manually mount and unmount the volume... 
note: XX = last digits of volume(s) needing this step, i.e. /dev/sdc1 or /dev/sdd1 
mount /dev/sdXX /mnt/tmp
umount /mnt/tmp

4a. Run xfs_repair again once the mount/unmount is complete
xfs_repair /dev/sdXX

4b. If xfs_repair still does not run, it may give an error stating the filesystem log contains “valuable metadata”. Use the -L switch to clear the log when trying again.
xfs_repair -L /dev/sdXX

5. Re-run all xfs_repair commands from step 3, for a total of 3 passes per volume, then go to step 6.
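The three repair passes over the four local volumes can be expressed as a nested loop. A sketch written as a dry run that only prints each command; removing the echo would run the repairs (only do so with the volumes unmounted, as in step 1):

```shell
# Dry run: three xfs_repair passes over each local data volume, in order.
# Remove "echo" to execute (volumes must be unmounted first).
for pass in 1 2 3; do
  for dev in /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1; do
    echo "pass $pass: xfs_repair $dev"
  done
done
```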

6. Reboot the NVR/DVR unit:
shutdown -rF now



Section 9. Verify Resolution
1. After the final reboot, use df to again ensure that all volumes are still mounted properly.
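The mount check can be scripted by counting /data/local_ entries in the df output. A self-contained sketch against sample df lines (on the unit, simply run df | grep -c '/data/local_' instead of using the here-document):

```shell
# Sample df output lines; on the unit, use:  df | grep -c '/data/local_'
count=$(grep -c '/data/local_' <<'EOF'
/dev/sda1  976762584 123456 976639128  1% /data/local_0
/dev/sdb1  976762584 123456 976639128  1% /data/local_2
/dev/sdc1  976762584 123456 976639128  1% /data/local_1
/dev/sdd1  976762584 123456 976639128  1% /data/local_3
EOF
)
echo "$count"   # a full NVR/DVR should show 4 local volumes
```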


2. Verify video recording and playback using WS5000/WS5200.
Open WS5000, drag a camera from the NVR/DVR in question to the video timeline at the bottom of the screen, and zoom in to the current date/time. Is there new video coming in? (Wait up to 10 minutes to see new color.)

3. If an issue remains, ask your Pelco Technical Support Representative for a Webex Session ID so that you can bring them into the Workstation PC via Remote Desktop Connection by visiting https://schneider-electric-ee.webex.com




-- end article --


 

Addendum 4a

note: The commands in this addendum may be required if Section 4 Array recreation does not properly take place after clearing an adapter configuration and rebooting the NVR/DVR/SEB in question.
note: If the steps in this section are needed, any video data on the Adapter/volume(s) in question is already lost. 
note: Only proceed with the steps in this section under direct supervision of Pelco Technical Support.
note: N = Adapter Number (0  = “Adapter 1”, and 1  = “Adapter 2”). 

1. First backup the configuration from the other adapter/array...
megarc -ScfgAndParm -fFileName -aN


2. Then restore it to the adapter/array you need to recreate....
megarc -RcfgAndParm -fFileName -aN

note:  In the above 2 screenshots, the second array, aka "adapter 1", is the one wherein 2 drives have failed and which we are having to recreate. We backed up the configuration from the first adapter, aka "adapter 0", to a file named "MyAdapter0Backup", and then restored that same file onto our second array, "adapter 1".
note: If there are any other errors causing the above commands not to work, the RAID Array Controller/Adapter is likely bad, and you will need to send the NVR/DVR unit into Pelco for Service.

3. Reboot the system
reboot  

4. Go to  “Section 5. Check for offline/missing volumes”.  


-- end addendum --