Hello,
I'm having a problem with our system when NetVault attempts to back up a client using the NetVault 'datachannel' (not using NFS, CIFS or local drives). Within a few seconds of a client job starting, the network goes away for the entire machine. The link indicates it is up/up, but all network sessions drop and no new ones can be established until you ifdown/ifup the interface. I can run backups all day long over NFS or CIFS (via UNC paths) without any network-related issues.
I can reproduce this issue fairly consistently, about 8 times out of 10. Sometimes it fails in under a minute; other times it may run for an hour or so and then fail. Just create a job that uses NetVault's internal method to transfer data over the network, then run that job. Within 30 seconds we lose connectivity to the backup server. dmesg output indicates nothing, nor are there any logs created under /var/log indicating what the problem might be. Comparing the kernel log output before and after the failure shows no differences. The network switch this machine is plugged into doesn't indicate any issues either: there are no errors on the interface, and as far as the switch is concerned, the host interface is up/up as well.
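For anyone who wants to repeat the before/after kernel log comparison, here is a minimal sketch in Python. It assumes the two dmesg captures have been saved as text; the function name `new_kernel_lines` is made up for illustration and is not part of NetVault or any driver tooling.

```python
import difflib


def new_kernel_lines(before, after):
    """Return the lines that appear in the 'after' capture but not in 'before'."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    # ndiff marks lines that exist only in the second input with a "+ " prefix.
    return [line[2:] for line in diff if line.startswith("+ ")]


if __name__ == "__main__":
    # Two identical captures, built from lines in the dmesg output quoted in this post.
    before = (
        "ixgbe 0000:03:00.1: Intel(R) 10 Gigabit Network Connection\n"
        "ixgbe: eth0 NIC Link is Up 10 Gbps, Flow Control: None\n"
    )
    after = before
    added = new_kernel_lines(before, after)
    print(added if added else "no new kernel messages")
```

With identical captures, as in this case, the list is empty, which matches the "no differences" result described above.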
We are using an Intel 82599EB 10-gigabit adapter, Debian 6 Linux with kernel 2.6.32-5 amd, and NetVault 9. The clients which cause this issue are Windows clients (Windows 7/8 and Windows Server 2008 R2 / 2012).
Does this sound like a familiar issue? I went through the README and the version of Debian we are using is supported, but I was unable to find any documents referencing the Intel 10-gigabit card.
These jobs are no larger or smaller than the ones we do over NFS/CIFS. Clients can be 1 gigabit or 10 gigabit; the behaviour remains the same.
I did a packet capture, but it doesn't seem to indicate anything of use: all of a sudden the flow stops, with no teardown. I can provide it if interested.
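To pin down where in the capture the flow stops, a rough Python sketch like the one below could list packet timestamps and find the largest silence between consecutive packets. It assumes a classic libpcap file with microsecond timestamps (the format tcpdump writes by default), not pcapng. The function names `packet_times` and `largest_gap` are hypothetical helpers invented for this example.

```python
import struct


def packet_times(data):
    """Return per-packet timestamps in seconds from the bytes of a classic pcap file."""
    magic = struct.unpack("<I", data[:4])[0]
    if magic == 0xA1B2C3D4:
        endian = "<"  # file was written little-endian
    elif magic == 0xD4C3B2A1:
        endian = ">"  # file was written big-endian
    else:
        raise ValueError("not a classic microsecond-resolution pcap file")
    times = []
    offset = 24  # skip the 24-byte global header
    while offset + 16 <= len(data):
        # Record header: ts_sec, ts_usec, captured length, original length.
        ts_sec, ts_usec, incl_len, _orig_len = struct.unpack(
            endian + "IIII", data[offset:offset + 16]
        )
        times.append(ts_sec + ts_usec / 1e6)
        offset += 16 + incl_len  # jump over the record header and packet bytes
    return times


def largest_gap(times):
    """Return (gap_seconds, index_of_packet_before_gap), or None for fewer than 2 packets."""
    if len(times) < 2:
        return None
    gaps = [(later - earlier, i)
            for i, (earlier, later) in enumerate(zip(times, times[1:]))]
    return max(gaps)
```

It would be called on the raw bytes of the capture, for example `packet_times(open("capture.pcap", "rb").read())`, where `capture.pcap` is a placeholder file name. The last entry in the returned list is the time of the final packet seen before the flow stopped.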
Below is some info in case it helps.
NetVault version 9, but the same problem existed on 8.x.
I tried another distro with an older kernel (2.6.27, I think it was), with the same result.
I have also downloaded, built and installed the newest driver Intel has for this card, with no change.
03:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection (rev 01)
Subsystem: Intel Corporation Ethernet Server Adapter X520-2
Flags: bus master, fast devsel, latency 0, IRQ 38
Memory at df300000 (64-bit, non-prefetchable) [size=512K]
I/O ports at ecc0 [size=32]
Memory at df2f8000 (64-bit, non-prefetchable) [size=16K]
Expansion ROM at df200000 [disabled] [size=512K]
Capabilities: [40] Power Management version 3
Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
Capabilities: [70] MSI-X: Enable+ Count=64 Masked-
Capabilities: [a0] Express Endpoint, MSI 00
Capabilities: [e0] Vital Product Data
Capabilities: [100] Advanced Error Reporting
Capabilities: [140] Device Serial Number 00-1b-21-ff-ff-a3-e4-90
Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
Kernel driver in use: ixgbe
--------------------------------------------------------------------------------------------------------------------------
sysctl edits:
kernel.shmmax=134217728
kernel.shmall=134217728
---------------------------------------------------------------------------------------------------------------------------
from dmesg:
ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver - version 2.0.44-k2
ixgbe: Copyright (c) 1999-2009 Intel Corporation.
ixgbe 0000:03:00.0: PCI INT A -> GSI 38 (level, low) -> IRQ 38
ixgbe 0000:03:00.0: setting latency timer to 64
ixgbe 0000:03:00.0: irq 64 for MSI/MSI-X
ixgbe 0000:03:00.0: irq 65 for MSI/MSI-X
--snip--
ixgbe 0000:03:00.1: irq 97 for MSI/MSI-X
ixgbe: 0000:03:00.1: ixgbe_init_interrupt_scheme: Multiqueue Enabled: Rx Queue count = 8, Tx Queue count = 8
ixgbe 0000:03:00.1: (PCI Express:5.0Gb/s:Width x8) 00:1b:21:a3:e4:91
ixgbe 0000:03:00.1: MAC: 2, PHY: 0, PBA No: fafa0e-090
ixgbe 0000:03:00.1: Intel(R) 10 Gigabit Network Connection
st: Version 20081215, fixed bufsize 32768, s/g segs 256
ixgbe: eth0 NIC Link is Up 10 Gbps, Flow Control: None
--------------------------------------------------------------------------------------------------------------------------------------------
Snippet from the NetVault log of a failed job:
Job Message 2013/03/25 15:43:20 63 Media backup01 (backup01: SL_ADICA0C0081225_LLA (ADIC Scalar i500)) Media in 'DRIVE 1:backup01' assigned to job ready for data transfer
Information 2013/03/25 15:43:20 63 Media backup01 Using network socket for data transfer
Background 2013/03/25 15:43:20 63 Media backup01 Sent Plugin space left estimate of 1447535 Mb
Information 2013/03/25 15:43:24 0 Media backup01 (backup01: SL_ADICA0C0081225_LLA (ADIC Scalar i500)) Added valid terminator to 'backup01 25 Mar 14:53-1' <BoltsBlipArchive1> in DRIVE 2:backup01 successfully
Background 2013/03/25 15:43:25 63 Data Plugin DF001 Data channel requested connection to 'backup01.toonboxent.com' (10.101.1.5)
Background 2013/03/25 15:43:25 63 Data Plugin DF001 Data channel connected to 'backup01.toonboxent.com' (10.101.1.5)
Background 2013/03/25 15:43:25 63 Data Plugin DF001 Data channel connected from '10.101.2.5' (10.101.2.5)
Background 2013/03/25 15:44:13 -1 System backup01 NetVault Backup running on 'DF001' has not responded to messages
Error 2013/03/25 15:44:13 63 Jobs backup01 Process running on 'DF001' has exited unexpectedly
Error 2013/03/25 15:44:13 63 Jobs backup01 Setting exit status to failed (2)
Error 2013/03/25 15:44:13 63 Jobs backup01 Job Status: Backup Failed
Job Message 2013/03/25 15:44:13 63 Jobs backup01 Finished job 63, phase 1 (instance 4)
Error 2013/03/25 15:44:15 63 Media backup01 backup01 SL_F0A1E95000 (IBM ULTRIUM-TD5): had channel error
Error 2013/03/25 15:44:15 63 Media backup01 Plugin has gone down
Error 2013/03/25 15:44:15 63 Media backup01 backup01 SL_F0A1E95000 (IBM ULTRIUM-TD5): had transfer aborted
Warning 2013/03/25 15:44:15 63 Media backup01 Data transfer to Mid 'backup01 25 Mar 15:43-1' aborted
Error 2013/03/25 15:44:15 63 Media backup01 (backup01: SL_ADICA0C0081225_LLA (ADIC Scalar i500)) Drive 'DRIVE 1:backup01' has completed its transfer
-