
10Gbps intel network goes down as soon as a job starts using NV data channel


Hello,

 

I'm having a problem with our system when NetVault attempts to back up a client using the NetVault 'datachannel' (not NFS, CIFS or local drives). Within a few seconds of a client job starting, the network goes away for the entire machine. The link indicates it is up/up, but all network sessions drop and no new ones can be established until you ifdown/ifup the interface. I can run backups all day long over NFS or CIFS (via UNC paths) without any network-related issues.

 

I can reproduce this issue fairly consistently, about 8 times out of 10. Sometimes it fails in under a minute; other times it may run for an hour or so and then fail. Just create a job that uses NetVault's internal method to transfer data over the network, then run that job. Typically within 30 seconds we lose connectivity to the backup server. dmesg output indicates nothing, nor are there any logs created under /var/log indicating what the problem might be. Comparing the kernel log output before and after the failure shows no differences. The network switch this machine is plugged into doesn't indicate any issues either: there are no errors on the interface, and as far as the switch is concerned the host interface is up/up as well.
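For anyone who wants to repeat the before/after kernel log comparison, here is a minimal Python sketch. It assumes the two dmesg captures have already been saved as text; the function name `new_kernel_lines` and the sample strings are made up for illustration and are not part of NetVault or the driver.

```python
import difflib
from typing import List


def new_kernel_lines(before: str, after: str) -> List[str]:
    """Return the lines that appear in `after` but not in `before`, in order."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    # ndiff prefixes added lines with "+ "; strip that two-character prefix.
    return [line[2:] for line in diff if line.startswith("+ ")]


# Hypothetical captures: one taken before the job starts, one after the hang.
before = "ixgbe: eth0 NIC Link is Up 10 Gbps, Flow Control: None\n"
after = before  # nothing new was logged, which matches what I see here

print(new_kernel_lines(before, after))
```

An empty list means the kernel logged nothing between the two captures.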

 

We are using an Intel 82599EB 10 gig adapter, Debian 6 Linux with kernel 2.6.32-5 amd, and NetVault 9. The clients which cause this issue are Windows clients (Windows 7/8 and Windows Server 2008 R2 / 2012).

 

Does this sound like a familiar issue? I went through the readme, and the version of Debian we are using is supported, but I was unable to find any documents referencing the Intel 10 gig card.

 

These jobs are no larger or smaller than the ones we do over NFS/CIFS. Clients can be 1 gig or 10 gig; the behaviour remains the same.

 

 

I did a packet capture, but it doesn't seem to indicate anything of use: all of a sudden the flow stops, with no teardown. I can provide it if anyone is interested.

 

 

 

 

Below is some info in case it helps.

 

NetVault version 9, but the same problem existed on 8.x.

I tried another distro with an older kernel (2.6.27, I think it was), same result.

 

I have also downloaded, built and installed the newest driver Intel has for this card, with no change.

 

 

03:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection (rev 01)
        Subsystem: Intel Corporation Ethernet Server Adapter X520-2
        Flags: bus master, fast devsel, latency 0, IRQ 38
        Memory at df300000 (64-bit, non-prefetchable) [size=512K]
        I/O ports at ecc0 [size=32]
        Memory at df2f8000 (64-bit, non-prefetchable) [size=16K]
        Expansion ROM at df200000 [disabled] [size=512K]
        Capabilities: [40] Power Management version 3
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
        Capabilities: [70] MSI-X: Enable+ Count=64 Masked-
        Capabilities: [a0] Express Endpoint, MSI 00
        Capabilities: [e0] Vital Product Data
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [140] Device Serial Number 00-1b-21-ff-ff-a3-e4-90
        Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
        Kernel driver in use: ixgbe

 

--------------------------------------------------------------------------------------------------------------------------

sysctl edits:

 

kernel.shmmax=134217728

kernel.shmall=134217728
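
Note that the two values above are numerically equal but are measured in different units: kernel.shmmax is a per-segment limit in bytes, while kernel.shmall is a system-wide limit in pages. A quick sketch of the arithmetic, assuming the usual 4 KiB page size on x86-64 (the page size is an assumption; `getconf PAGE_SIZE` reports the real value):

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, the common x86-64 default

shmmax_bytes = 134217728  # kernel.shmmax: max size of one segment, in bytes
shmall_pages = 134217728  # kernel.shmall: total shared memory, in pages

shmmax_mib = shmmax_bytes / 2**20              # bytes -> MiB
shmall_gib = shmall_pages * PAGE_SIZE / 2**30  # pages -> bytes -> GiB

print(f"shmmax = {shmmax_mib:.0f} MiB per segment")
print(f"shmall = {shmall_gib:.0f} GiB system-wide")
```

Under that assumption, shmmax works out to 128 MiB per segment and shmall to 512 GiB in total.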

 

---------------------------------------------------------------------------------------------------------------------------


 

from dmesg:

 

ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver - version 2.0.44-k2
ixgbe: Copyright (c) 1999-2009 Intel Corporation.
ixgbe 0000:03:00.0: PCI INT A -> GSI 38 (level, low) -> IRQ 38
ixgbe 0000:03:00.0: setting latency timer to 64
ixgbe 0000:03:00.0: irq 64 for MSI/MSI-X
ixgbe 0000:03:00.0: irq 65 for MSI/MSI-X
--snip--
ixgbe 0000:03:00.1: irq 97 for MSI/MSI-X
ixgbe: 0000:03:00.1: ixgbe_init_interrupt_scheme: Multiqueue Enabled: Rx Queue count = 8, Tx Queue count = 8
ixgbe 0000:03:00.1: (PCI Express:5.0Gb/s:Width x8) 00:1b:21:a3:e4:91
ixgbe 0000:03:00.1: MAC: 2, PHY: 0, PBA No: fafa0e-090
ixgbe 0000:03:00.1: Intel(R) 10 Gigabit Network Connection
st: Version 20081215, fixed bufsize 32768, s/g segs 256
ixgbe: eth0 NIC Link is Up 10 Gbps, Flow Control: None

 

--------------------------------------------------------------------------------------------------------------------------------------------

 

Snippet from the NetVault log of a failed job:

 

Job Message     2013/03/25 15:43:20     63 Media   backup01        (backup01: SL_ADICA0C0081225_LLA (ADIC Scalar i500)) Media in 'DRIVE 1:backup01' assigned to job ready for data transfer
Information     2013/03/25 15:43:20     63 Media   backup01        Using network socket for data transfer
Background      2013/03/25 15:43:20     63 Media   backup01        Sent Plugin space left estimate of 1447535 Mb
Information     2013/03/25 15:43:24     0 Media   backup01        (backup01: SL_ADICA0C0081225_LLA (ADIC Scalar i500)) Added valid terminator to 'backup01 25 Mar 14:53-1' <BoltsBlipArchive1> in DRIVE 2:backup01 successfully
Background      2013/03/25 15:43:25     63 Data Plugin     DF001   Data channel requested connection to 'backup01.toonboxent.com' (10.101.1.5)
Background      2013/03/25 15:43:25     63 Data Plugin     DF001   Data channel connected to 'backup01.toonboxent.com' (10.101.1.5)
Background      2013/03/25 15:43:25     63 Data Plugin     DF001   Data channel connected from '10.101.2.5' (10.101.2.5)
Background      2013/03/25 15:44:13     -1 System  backup01        NetVault Backup running on 'DF001' has not responded to messages
Error   2013/03/25 15:44:13     63 Jobs    backup01        Process running on 'DF001' has exited unexpectedly
Error   2013/03/25 15:44:13     63 Jobs    backup01        Setting exit status to failed (2)
Error   2013/03/25 15:44:13     63 Jobs    backup01        Job Status: Backup Failed
Job Message     2013/03/25 15:44:13     63 Jobs    backup01        Finished job 63, phase 1 (instance 4)
Error   2013/03/25 15:44:15     63 Media   backup01        backup01 SL_F0A1E95000 (IBM ULTRIUM-TD5): had channel error
Error   2013/03/25 15:44:15     63 Media   backup01        Plugin has gone down
Error   2013/03/25 15:44:15     63 Media   backup01        backup01 SL_F0A1E95000 (IBM ULTRIUM-TD5): had transfer aborted
Warning 2013/03/25 15:44:15     63      Media backup01        Data transfer to Mid 'backup01 25 Mar 15:43-1' aborted
Error   2013/03/25 15:44:15     63 Media   backup01        (backup01: SL_ADICA0C0081225_LLA (ADIC Scalar i500)) Drive 'DRIVE 1:backup01' has completed its transfer
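
To put a number on how long the data channel stayed up in the log above, here is a rough Python sketch that pulls the timestamps from the "Data channel connected to" line and the "has exited unexpectedly" line and subtracts them. The function name and the choice of marker strings are my own assumptions for illustration; this is not a NetVault tool, and other log layouts may need a different pattern.

```python
import re
from datetime import datetime

# Timestamps in the log look like "2013/03/25 15:43:25".
STAMP = re.compile(r"\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}")


def stamp_of(line):
    """Parse the first timestamp on a log line, or return None."""
    m = STAMP.search(line)
    return datetime.strptime(m.group(0), "%Y/%m/%d %H:%M:%S") if m else None


def seconds_until_failure(log_text,
                          start_marker="Data channel connected to",
                          fail_marker="has exited unexpectedly"):
    """Seconds between the first start_marker line and the next fail_marker line."""
    start = fail = None
    for line in log_text.splitlines():
        if start is None and start_marker in line:
            start = stamp_of(line)
        elif start is not None and fail is None and fail_marker in line:
            fail = stamp_of(line)
    if start is None or fail is None:
        return None
    return (fail - start).total_seconds()


# Two lines copied from the log above.
SAMPLE = """\
Background      2013/03/25 15:43:25     63 Data Plugin     DF001   Data channel connected to 'backup01.toonboxent.com' (10.101.1.5)
Error   2013/03/25 15:44:13     63 Jobs    backup01        Process running on 'DF001' has exited unexpectedly
"""

print(seconds_until_failure(SAMPLE))
```

For this job the gap between the two lines is 48 seconds (15:43:25 to 15:44:13).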

 

 


 

 
