
QLogic OFED+ Host Software
User Guide

QLogic OFED+ Version 1.5

D000046-005 B



Information furnished in this manual is believed to be accurate and reliable.
However, QLogic Corporation assumes no responsibility for its use, nor for any
infringements of patents or other rights of third parties, which may result
from its use. QLogic Corporation reserves the right to change product
specifications at any time without notice. Applications described in this
document for any of these products are for illustrative purposes only. QLogic
Corporation makes no representation nor warranty that such applications are
suitable for the specified use without further testing or modification. QLogic
Corporation assumes no responsibility for any errors that may appear in this
document.

No part of this document may be copied nor reproduced by any means, nor
translated nor transmitted to any magnetic medium without the express written
consent of QLogic Corporation. In accordance with the terms of their valid
QLogic agreements, customers are permitted to make electronic and paper copies
of this document for their own exclusive use.

Document Revision History

Rev. B, November 2010

Changes                                            Sections Affected

Added IB Bonding sub-section                       Section 3, page 3-6

Added IPATH_HCA_SELECTION_ALG to                   "Environment Variables" on page 4-20
Table 4-7, Environment Variables



Table of Contents

1 Introduction
How this Guide is Organized . . . . . . . . . . 1-1
Overview . . . . . . . . . . 1-2
Interoperability . . . . . . . . . . 1-3

2 Step-by-Step Cluster Setup and MPI Usage Checklists
Cluster Setup . . . . . . . . . . 2-1
Using MPI . . . . . . . . . . 2-2

3 TrueScale Cluster Setup and Administration
Introduction . . . . . . . . . . 3-1
Installed Layout . . . . . . . . . . 3-2
TrueScale and OpenFabrics Driver Overview . . . . . . . . . . 3-3
IPoIB Network Interface Configuration . . . . . . . . . . 3-4
IPoIB Administration . . . . . . . . . . 3-5
    Administering IPoIB . . . . . . . . . . 3-5
        Stopping, Starting and Restarting the IPoIB Driver . . . . . . . . . . 3-5
    Configuring IPoIB . . . . . . . . . . 3-6
        Editing the IPoIB Configuration File . . . . . . . . . . 3-6
IB Bonding . . . . . . . . . . 3-6
    Interface Configuration Scripts . . . . . . . . . . 3-7
        Red Hat EL4 Update 8 . . . . . . . . . . 3-7
        Red Hat EL5, All Updates . . . . . . . . . . 3-8
        SuSE Linux Enterprise Server (SLES) 10 and 11 . . . . . . . . . . 3-9
    Verify IB Bonding is Configured . . . . . . . . . . 3-10
Subnet Manager Configuration . . . . . . . . . . 3-12
QLogic Distributed Subnet Administration . . . . . . . . . . 3-13
    Applications that use Distributed SA . . . . . . . . . . 3-14
    Virtual Fabrics and the Distributed SA . . . . . . . . . . 3-14
    Configuring the Distributed SA . . . . . . . . . . 3-14
    Default Configuration . . . . . . . . . . 3-14
    Multiple Virtual Fabrics Example . . . . . . . . . . 3-15
    Virtual Fabrics with Overlapping Definitions . . . . . . . . . . 3-16
    Distributed SA Configuration File . . . . . . . . . . 3-19
        SID . . . . . . . . . . 3-19
        ScanFrequency . . . . . . . . . . 3-20
        LogFile . . . . . . . . . . 3-20
        Dbg . . . . . . . . . . 3-20
        Other Settings . . . . . . . . . . 3-21
MPI over uDAPL . . . . . . . . . . 3-21
Changing the MTU Size . . . . . . . . . . 3-21
Managing the TrueScale Driver . . . . . . . . . . 3-22
    Configure the TrueScale Driver State . . . . . . . . . . 3-23
    Start, Stop, or Restart TrueScale . . . . . . . . . . 3-23
    Unload the Driver/Modules Manually . . . . . . . . . . 3-24
    TrueScale Driver Filesystem . . . . . . . . . . 3-24
More Information on Configuring and Loading Drivers . . . . . . . . . . 3-25
Performance Settings and Management Tips . . . . . . . . . . 3-25
    Homogeneous Nodes . . . . . . . . . . 3-26
    Adapter and Other Settings . . . . . . . . . . 3-26
    Remove Unneeded Services . . . . . . . . . . 3-28
    Disable Powersaving Features . . . . . . . . . . 3-29
    Hyper-Threading . . . . . . . . . . 3-29
Host Environment Setup for MPI . . . . . . . . . . 3-29
    Configuring for ssh . . . . . . . . . . 3-30
        Configuring ssh and sshd Using shosts.equiv . . . . . . . . . . 3-30
        Configuring for ssh Using ssh-agent . . . . . . . . . . 3-32
    Process Limitation with ssh . . . . . . . . . . 3-33
Checking Cluster and Software Status . . . . . . . . . . 3-34
    ipath_control . . . . . . . . . . 3-34
    iba_opp_query . . . . . . . . . . 3-35
    ibstatus . . . . . . . . . . 3-36
    ibv_devinfo . . . . . . . . . . 3-36
    ipath_checkout . . . . . . . . . . 3-37

4 Running QLogic MPI on QLogic Adapters
Introduction . . . . . . . . . . 4-1
    QLogic MPI . . . . . . . . . . 4-1
    PSM . . . . . . . . . . 4-2
    Other MPIs . . . . . . . . . . 4-2
    Linux File I/O in MPI Programs . . . . . . . . . . 4-2
    MPI-IO with ROMIO . . . . . . . . . . 4-3
Getting Started with MPI . . . . . . . . . . 4-3
    Copy Examples . . . . . . . . . . 4-3
    Create the mpihosts File . . . . . . . . . . 4-3
    Compile and Run an Example C Program . . . . . . . . . . 4-4
    Examples Using Other Programming Languages . . . . . . . . . . 4-5
QLogic MPI Details . . . . . . . . . . 4-6
    Use Wrapper Scripts for Compiling and Linking . . . . . . . . . . 4-7
    Configuring MPI Programs for QLogic MPI . . . . . . . . . . 4-8
    To Use Another Compiler . . . . . . . . . . 4-9
        Compiler and Linker Variables . . . . . . . . . . 4-10
    Process Allocation . . . . . . . . . . 4-11
        TrueScale Hardware Contexts on the DDR and QDR InfiniBand Adapters . . . . . . . . . . 4-12
        Enabling and Disabling Software Context Sharing . . . . . . . . . . 4-13
        Restricting TrueScale Hardware Contexts in a Batch Environment . . . . . . . . . . 4-13
        Context Sharing Error Messages . . . . . . . . . . 4-14
        Running in Shared Memory Mode . . . . . . . . . . 4-14
    mpihosts File Details . . . . . . . . . . 4-15
    Using mpirun . . . . . . . . . . 4-16
    Console I/O in MPI Programs . . . . . . . . . . 4-18
    Environment for Node Programs . . . . . . . . . . 4-19
    Environment Variables . . . . . . . . . . 4-20
    Running Multiple Versions of TrueScale or MPI . . . . . . . . . . 4-22
    Job Blocking in Case of Temporary InfiniBand Link Failures . . . . . . . . . . 4-23
Performance Tuning . . . . . . . . . . 4-23
    CPU Affinity . . . . . . . . . . 4-23
    mpirun Tunable Options . . . . . . . . . . 4-24
MPD . . . . . . . . . . 4-24
    MPD Description . . . . . . . . . . 4-24
    Using MPD . . . . . . . . . . 4-25
QLogic MPI and Hybrid MPI/OpenMP Applications . . . . . . . . . . 4-25
Debugging MPI Programs . . . . . . . . . . 4-26
    MPI Errors . . . . . . . . . . 4-26
    Using Debuggers . . . . . . . . . . 4-27
QLogic MPI Limitations . . . . . . . . . . 4-28

5 Using Other MPIs
Introduction . . . . . . . . . . 5-1
Installed Layout . . . . . . . . . . 5-2
Open MPI . . . . . . . . . . 5-3
    Installation . . . . . . . . . . 5-3
    Setup . . . . . . . . . . 5-3
    Compiling Open MPI Applications . . . . . . . . . . 5-3
    Running Open MPI Applications . . . . . . . . . . 5-4
    Further Information on Open MPI . . . . . . . . . . 5-5
MVAPICH . . . . . . . . . . 5-5
    Installation . . . . . . . . . . 5-5
    Setup . . . . . . . . . . 5-5
    Compiling MVAPICH Applications . . . . . . . . . . 5-5
    Running MVAPICH Applications . . . . . . . . . . 5-6
    Further Information on MVAPICH . . . . . . . . . . 5-6
Managing Open MPI, MVAPICH, and QLogic MPI with the mpi-selector Utility . . . . . . . . . . 5-6
HP-MPI and Platform MPI 7 . . . . . . . . . . 5-8
    Installation . . . . . . . . . . 5-8
    Setup . . . . . . . . . . 5-8
    Compiling Platform MPI 7 Applications . . . . . . . . . . 5-8
    Running Platform MPI 7 Applications . . . . . . . . . . 5-9
    More Information on Platform MPI 7 . . . . . . . . . . 5-9
Platform (Scali) MPI 5.6 . . . . . . . . . . 5-9
    Installation . . . . . . . . . . 5-9
    Setup . . . . . . . . . . 5-9
    Compiling Platform MPI 5.6 Applications . . . . . . . . . . 5-9
    Running Platform MPI 5.6 Applications . . . . . . . . . . 5-10
    Further Information on Platform MPI 5.6 . . . . . . . . . . 5-10
Intel MPI . . . . . . . . . . 5-11
    Installation . . . . . . . . . . 5-11
    Setup . . . . . . . . . . 5-11
    Compiling Intel MPI Applications . . . . . . . . . . 5-13
    Running Intel MPI Applications . . . . . . . . . . 5-13
    Further Information on Intel MPI . . . . . . . . . . 5-14
Improving Performance of Other MPIs Over InfiniBand Verbs . . . . . . . . . . 5-14

6 Performance Scaled Messaging
Introduction . . . . . . . . . . 6-1
Virtual Fabric Support . . . . . . . . . . 6-2
Using SL and PKeys . . . . . . . . . . 6-2
Using Service ID . . . . . . . . . . 6-3
SL2VL mapping from the Fabric Manager . . . . . . . . . . 6-3
Verifying SL2VL tables on QLogic 7300 Series Adapters . . . . . . . . . . 6-4

7 Dispersive Routing

8 gPXE
gPXE Setup . . . . . . . . . . 8-1
    Required Steps . . . . . . . . . . 8-2
Preparing the DHCP Server in Linux . . . . . . . . . . 8-2
    Installing DHCP . . . . . . . . . . 8-2
    Configuring DHCP . . . . . . . . . . 8-3
Netbooting Over InfiniBand . . . . . . . . . . 8-4
    Prerequisites . . . . . . . . . . 8-5
    Boot Server Setup . . . . . . . . . . 8-5
    Steps on the gPXE Client . . . . . . . . . . 8-14
HTTP Boot Setup . . . . . . . . . . 8-14

A mpirun Options Summary
Job Start Options . . . . . . . . . . A-1
Essential Options . . . . . . . . . . A-1
Spawn Options . . . . . . . . . . A-2
Quiescence Options . . . . . . . . . . A-3
Verbosity Options . . . . . . . . . . A-3
Startup Options . . . . . . . . . . A-3
Stats Options . . . . . . . . . . A-4
Tuning Options . . . . . . . . . . A-5
Shell Options . . . . . . . . . . A-6
Debug Options . . . . . . . . . . A-6
Format Options . . . . . . . . . . A-7
Other Options . . . . . . . . . . A-7

B Benchmark Programs
Benchmark 1: Measuring MPI Latency Between Two Nodes . . . . . . . . . . B-1
Benchmark 2: Measuring MPI Bandwidth Between Two Nodes . . . . . . . . . . B-3
Benchmark 3: Messaging Rate Microbenchmarks . . . . . . . . . . B-4
Benchmark 4: Measuring MPI Latency in Host Rings . . . . . . . . . . B-5

C VirtualNIC Interface Configuration and Administration
VirtualNIC Interface Configuration and Administration . . . . . . . . . . C-1
    Getting Information about Ethernet IOCs on the Fabric . . . . . . . . . . C-1
    Editing the VirtualNIC Configuration file . . . . . . . . . . C-4
        Format 1: Defining an IOC using the IOCGUID . . . . . . . . . . C-5
        Format 2: Defining an IOC using the IOCSTRING . . . . . . . . . . C-6
        Format 3: Starting VNIC using DGID . . . . . . . . . . C-6
    VirtualNIC Failover Definition . . . . . . . . . . C-7
        Failover to a different Adapter port on the same Adapter . . . . . . . . . . C-7
        Failover to a different Ethernet port on the same Ethernet gateway . . . . . . . . . . C-7
        Failover to a port on a different Ethernet gateway . . . . . . . . . . C-8
        Combination method . . . . . . . . . . C-8
    Creating VirtualNIC Ethernet Interface Configuration Files . . . . . . . . . . C-8
    VirtualNIC Multicast . . . . . . . . . . C-9
    Starting, Stopping and Restarting the VirtualNIC Driver . . . . . . . . . . C-12
    Link Aggregation Configuring . . . . . . . . . . C-14
    Troubleshooting . . . . . . . . . . C-14
    VirtualNIC Configuration Variables . . . . . . . . . . C-14

D SRP Configuration
SRP Configuration Overview . . . . . . . . . . D-1
    Important Concepts . . . . . . . . . . D-1
QLogic SRP Configuration . . . . . . . . . . D-2
    Stopping, Starting and Restarting the SRP Driver . . . . . . . . . . D-3
    Specifying a Session . . . . . . . . . . D-3
        Determining the values to use for the configuration . . . . . . . . . . D-6
        Specifying an SRP Initiator Port of a Session by Card and Port Indexes . . . . . . . . . . D-8
        Specifying an SRP Initiator Port of Session by Port GUID . . . . . . . . . . D-8
    Specifying a SRP Target Port . . . . . . . . . . D-9
        Specifying a SRP Target Port of a Session by IOCGUID . . . . . . . . . . D-10
        Specifying a SRP Target Port of a Session by Profile String . . . . . . . . . . D-10
    Specifying an Adapter . . . . . . . . . . D-10
    Restarting the SRP Module . . . . . . . . . . D-10
    Configuring an Adapter with Multiple Sessions . . . . . . . . . . D-11
    Configuring Fibre Channel Failover . . . . . . . . . . D-13
        Failover Configuration File 1: Failing over from one SRP Initiator port to another . . . . . . . . . . D-14
        Failover Configuration File 2: Failing over from a port on the VIO hardware card to another port on the VIO hardware card . . . . . . . . . . D-15
        Failover Configuration File 3: Failing over from a port on a VIO hardware card to a port on a different VIO hardware card within the same Virtual I/O chassis . . . . . . . . . . D-16
        Failover Configuration File 4: Failing over from a port on a VIO hardware card to a port on a different VIO hardware card in a different Virtual I/O chassis . . . . . . . . . . D-17
    Configuring Fibre Channel Load Balancing . . . . . . . . . . D-18
        1 Adapter Port and 2 Ports on a Single VIO . . . . . . . . . . D-18
        2 Adapter Ports and 2 Ports on a Single VIO Module . . . . . . . . . . D-19
        Using the roundrobinmode Parameter . . . . . . . . . . D-20
    Configuring SRP for Native InfiniBand Storage . . . . . . . . . . D-21
        Notes . . . . . . . . . . D-23
        Additional Details . . . . . . . . . . D-23
        Troubleshooting . . . . . . . . . . D-23
OFED SRP Configuration . . . . . . . . . . D-24

E Integration with a Batch Queuing System
Using mpiexec with PBS . . . . . . . . . . E-1
Using SLURM for Batch Queuing . . . . . . . . . . E-2
    Allocating Resources . . . . . . . . . . E-3
    Generating the mpihosts File . . . . . . . . . . E-3
    Simple Process Management . . . . . . . . . . E-4
    Clean Termination of MPI Processes . . . . . . . . . . E-4
Lock Enough Memory on Nodes when Using SLURM . . . . . . . . . . E-5

F Troubleshooting
Using LEDs to Check the State of the Adapter . . . . . . . . . . F-1
BIOS Settings . . . . . . . . . . F-2
Kernel and Initialization Issues . . . . . . . . . . F-2
    Driver Load Fails Due to Unsupported Kernel . . . . . . . . . . F-3
    Rebuild or Reinstall Drivers if Different Kernel Installed . . . . . . . . . . F-3
    InfiniPath Interrupts Not Working . . . . . . . . . . F-3
    OpenFabrics Load Errors if ib_qib Driver Load Fails . . . . . . . . . . F-4
    InfiniPath ib_qib Initialization Failure . . . . . . . . . . F-5
    MPI Job Failures Due to Initialization Problems . . . . . . . . . . F-6
OpenFabrics and InfiniPath Issues . . . . . . . . . . F-6
    Stop InfiniPath Services Before Stopping/Restarting InfiniPath . . . . . . . . . . F-6
    Manual Shutdown or Restart May Hang if NFS in Use . . . . . . . . . . F-7
    Load and Configure IPoIB Before Loading SDP . . . . . . . . . . F-7
    Set $IBPATH for OpenFabrics Scripts . . . . . . . . . . F-7
    SDP Module Not Loading . . . . . . . . . . F-7
    ibsrpdm Command Hangs when Two Host Channel Adapters are Installed but Only Unit 1 is Connected to the Switch . . . . . . . . . . F-8
    Outdated ipath_ether Configuration Setup Generates Error . . . . . . . . . . F-8
System Administration Troubleshooting . . . . . . . . . . F-8
    Broken Intermediate Link . . . . . . . . . . F-8
Performance Issues . . . . . . . . . . F-9
    Large Message Receive Side Bandwidth Varies with Socket Affinity on Opteron Systems . . . . . . . . . . F-9
    Erratic Performance . . . . . . . . . . F-9
        Method 1 . . . . . . . . . . F-10
        Method 2 . . . . . . . . . . F-10
    Performance Warning if ib_qib Shares Interrupts with eth0 . . . . . . . . . . F-11
QLogic MPI Troubleshooting . . . . . . . . . . F-11
    Mixed Releases of MPI RPMs . . . . . . . . . . F-12
    Missing mpirun Executable . . . . . . . . . . F-12
    Resolving Hostname with Multi-Homed Head Node . . . . . . . . . . F-13
    Cross-Compilation Issues . . . . . . . . . . F-13
    Compiler/Linker Mismatch . . . . . . . . . . F-14
    Compiler Cannot Find Include, Module, or Library Files . . . . . . . . . . F-14
        Compiling on Development Nodes . . . . . . . . . . F-15
        Specifying the Run-time Library Path . . . . . . . . . . F-15
    Problem with Shell Special Characters and Wrapper Scripts . . . . . . . . . . F-16
    Run Time Errors with Different MPI Implementations . . . . . . . . . . F-17
    Process Limitation with ssh . . . . . . . . . . F-19
    Number of Processes Exceeds ulimit for Number of Open Files . . . . . . . . . . F-19
    Using MPI.mod Files . . . . . . . . . . F-20
    Extending MPI Modules . . . . . . . . . . F-20
    Lock Enough Memory on Nodes When Using a Batch Queuing System . . . . . . . . . . F-22
    Error Creating Shared Memory Object . . . . . . . . . . F-23
    gdb Gets SIG32 Signal Under mpirun -debug with the PSM Receive Progress Thread Enabled . . . . . . . . . . F-24
    General Error Messages . . . . . . . . . . F-25
    Error Messages Generated by mpirun . . . . . . . . . . F-25
        Messages from the QLogic MPI (InfiniPath) Library . . . . . . . . . . F-25
        MPI Messages . . . . . . . . . . F-27
        Driver and Link Error Messages Reported by MPI Programs . . . . . . . . . . F-29
    MPI Stats . . . . . . . . . . F-30

G ULP Troubleshooting
Troubleshooting VirtualNIC and VIO Hardware Issues . . . . . . . . . . G-1
    Checking the logical connection between the InfiniBand Host and the VIO hardware . . . . . . . . . . G-1
        Verify that the proper VirtualNIC driver is running . . . . . . . . . . G-2
        Verifying that the qlgc_vnic.cfg file contains the correct information . . . . . . . . . . G-2
        Verifying that the host can communicate with the I/O Controllers (IOCs) of the VIO hardware . . . . . . . . . . G-3
    Checking the interface definitions on the host . . . . . . . . . . G-6
        Interface does not show up in output of 'ifconfig' . . . . . . . . . . G-6
    Verify the physical connection between the VIO hardware and the Ethernet network . . . . . . . . . . G-7
Troubleshooting SRP Issues . . . . . . . . . . G-8
    ib_qlgc_srp_stats showing session in disconnected state . . . . . . . . . . G-8
    Session in 'Connection Rejected' state . . . . . . . . . . G-9
    Attempts to read or write to disk are unsuccessful . . . . . . . . . . G-11
    Four sessions in a round-robin configuration are active . . . . . . . . . . G-12
    Which port does a port GUID refer to . . . . . . . . . . G-12
    How does the user find a Host Channel Adapter port GUID . . . . . . . . . . G-13
    Need to determine the SRP driver version . . . . . . . . . . G-15

H Write Combining
Introduction . . . . . . . . . . H-1
Verify Write Combining is Working . . . . . . . . . . H-1
PAT and Write Combining . . . . . . . . . . H-2
MTRR Mapping and Write Combining . . . . . . . . . . H-2
    Edit BIOS Settings to Fix MTRR Issues . . . . . . . . . . H-2
    Use the ipath_mtrr Script to Fix MTRR Issues . . . . . . . . . . H-3

I Useful Programs and Files
Check Cluster Homogeneity with ipath_checkout . . . . . . . . . . I-1
Restarting InfiniPath . . . . . . . . . . I-1
Summary and Descriptions of Useful Programs . . . . . . . . . . I-2
    dmesg . . . . . . . . . . I-3
    iba_opp_query . . . . . . . . . . I-4
    ibhosts . . . . . . . . . . I-7
    ibstatus . . . . . . . . . . I-7
    ibtracert . . . . . . . . . . I-8
    ibv_devinfo . . . . . . . . . . I-9
    ident . . . . . . . . . . I-9
    ipathbug-helper . . . . . . . . . . I-10
    ipath_checkout . . . . . . . . . . I-10
    ipath_control . . . . . . . . . . I-12
    ipath_mtrr . . . . . . . . . . I-13
    ipath_pkt_test . . . . . . . . . . I-14
    ipathstats . . . . . . . . . . I-15
    lsmod . . . . . . . . . . I-15
    modprobe . . . . . . . . . . I-15
    mpirun . . . . . . . . . . I-15
    mpi_stress . . . . . . . . . . I-16
    rpm . . . . . . . . . . I-16
    strings . . . . . . . . . . I-17
Common Tasks and Commands . . . . . . . . . . I-17
Summary and Descriptions of Useful Files . . . . . . . . . . I-18
    boardversion . . . . . . . . . . I-19
    status_str . . . . . . . . . . I-19
    version . . . . . . . . . . I-21
Summary of Configuration Files . . . . . . . . . . I-21

J Recommended Reading
References for MPI . . . . . . . . . . J-1
Books for Learning MPI Programming . . . . . . . . . . J-1
Reference and Source for SLURM . . . . . . . . . . J-1
InfiniBand . . . . . . . . . . J-1
OpenFabrics . . . . . . . . . . J-1
Clusters . . . . . . . . . . J-2
Networking . . . . . . . . . . J-2
Rocks . . . . . . . . . . J-2
Other Software Packages . . . . . . . . . . J-2

List of Figures
3-1 QLogic OFED+ Software Structure . . . . . . . . . . 3-1
3-2 Distributed SA Default Configuration . . . . . . . . . . 3-16
3-3 Distributed SA Multiple Virtual Fabrics Example . . . . . . . . . . 3-17
3-4 Distributed SA Multiple Virtual Fabrics Configured Example . . . . . . . . . . 3-17
3-5 Virtual Fabrics with Overlapping Definitions . . . . . . . . . . 3-18
3-6 Virtual Fabrics with PSM_MPI Virtual Fabric Enabled . . . . . . . . . . 3-18
3-7 Virtual Fabrics with all SIDs assigned to PSM_MPI Virtual Fabric . . . . . . . . . . 3-19
3-8 Virtual Fabrics with Unique Numeric Indexes . . . . . . . . . . 3-19
C-1 Without IB_Multicast . . . . . . . . . . C-10
C-2 With IB_Multicast . . . . . . . . . . C-11

List of Tables
4-1 QLogic MPI Wrapper Scripts . . . . . . . . . . 4-7
4-2 Command Line Options for Scripts . . . . . . . . . . 4-7
4-3 Intel . . . . . . . . . . 4-9
4-4 Portland Group (PGI) . . . . . . . . . . 4-9
4-5 PathScale Compiler Suite . . . . . . . . . . 4-10
4-6 Available Hardware and Software Contexts . . . . . . . . . . 4-11
4-7 Environment Variables . . . . . . . . . . 4-20
5-1 Other Supported MPI Implementations . . . . . . . . . . 5-1
5-2 Open MPI Wrapper Scripts . . . . . . . . . . 5-3
5-3 MVAPICH Wrapper Scripts . . . . . . . . . . 5-5
5-4 Platform MPI 7 Wrapper Scripts . . . . . . . . . . 5-8
5-5 Platform MPI Wrapper Scripts . . . . . . . . . . 5-10
5-6 Intel MPI Wrapper Scripts . . . . . . . . . . 5-13
F-1 LED Link and Data Indicators . . . . . . . . . . F-1
I-1 Useful Programs . . . . . . . . . . I-2
I-2 ipath_checkout Options . . . . . . . . . . I-11
I-3 Common Tasks and Commands Summary . . . . . . . . . . I-17
I-4 Useful Files . . . . . . . . . . I-19
I-5 status_str File Contents . . . . . . . . . . I-19
I-6 Status—Other Files . . . . . . . . . . I-20
I-7 Configuration Files . . . . . . . . . . I-21


Preface

The QLogic OFED+ Host Software User Guide shows end users how to use the
installed software to set up the fabric. End users include both the cluster
administrator and the Message-Passing Interface (MPI) application programmers,
who have different but overlapping interests in the details of the technology.

For specific instructions about installing the QLogic QLE7140, QLE7240,
QLE7280, QLE7340, QLE7342, QMH7342, and QME7342 PCI Express® (PCIe®)
adapters, see the QLogic InfiniBand Adapter Hardware Installation Guide; for
the initial installation of the Fabric Software, see the QLogic Fabric Software
Installation Guide.

Intended Audience

This guide is intended for end users responsible for administration of a
cluster network as well as for end users who want to use that cluster.

This guide assumes that all users are familiar with cluster computing, that the
cluster administrator is familiar with Linux® administration, and that the
application programmer is familiar with MPI, vFabrics, VNIC, SRP, and
Distributed SA.

Related Materials

• QLogic InfiniBand Adapter Hardware Installation Guide
• QLogic Fabric Software Installation Guide
• Release Notes

Documentation Conventions

This guide uses the following documentation conventions:

• NOTE: provides additional information.

• CAUTION! indicates the presence of a hazard that has the potential of
  causing damage to data or equipment.

• WARNING!! indicates the presence of a hazard that has the potential of
  causing personal injury.


• Text in blue font indicates a hyperlink (jump) to a figure, table, or section
  in this guide; links to Web sites are shown in underlined blue. For example:

  Table 9-2 lists problems related to the user interface and remote agent.

  See “Installation Checklist” on page 3-6.

  For more information, visit www.qlogic.com.

• Text in bold font indicates user interface elements such as menu items,
  buttons, check boxes, or column headings. For example:

  Click the Start button, point to Programs, point to Accessories, and
  then click Command Prompt.

  Under Notification Options, select the Warning Alarms check box.

• Text in Courier font indicates a file name, directory path, or command line
  text. For example:

  To return to the root directory from anywhere in the file structure, type
  cd /root and press ENTER.

  Enter the following command: sh ./install.bin

• Key names and key strokes are indicated with UPPERCASE:

  Press CTRL+P.

  Press the UP ARROW key.

• Text in italics indicates terms, emphasis, variables, or document titles.
  For example:

  For a complete listing of license agreements, refer to the QLogic
  Software End User License Agreement.

  What are shortcut keys?

  To enter the date, type mm/dd/yyyy (where mm is the month, dd is the
  day, and yyyy is the year).

• Topic titles between quotation marks identify related topics either within
  this manual or in the online help, which is referred to as the help system
  throughout this document.

License Agreements

Refer to the QLogic Software End User License Agreement for a complete listing
of all license agreements affecting this product.


Technical Support

Availability

Customers should contact their authorized maintenance provider for technical
support of their QLogic InfiniBand products. QLogic-direct customers may
contact QLogic Technical Support; others will be redirected to their authorized
maintenance provider.

Visit the QLogic support Web site listed in Contact Information for the latest
firmware and software updates.

QLogic Technical Support for products under warranty is available during local
standard working hours, excluding QLogic Observed Holidays.

Training

QLogic offers training for technical professionals for all iSCSI, InfiniBand,
and Fibre Channel products. From the main QLogic web page at www.qlogic.com,
click the Education and Resources tab at the top, then click the Education &
Training tab on the left. The QLogic Global Training Portal offers online
courses, certification exams, and scheduling of in-person training.

Technical Certification courses cover installing, maintaining, and
troubleshooting QLogic SAN products. Upon demonstrating knowledge using live
equipment, QLogic awards a certificate identifying the student as a Certified
Professional. The training professionals at QLogic may be reached by e-mail at
training@qlogic.com.

Contact Information

Please feel free to contact your QLogic approved reseller or QLogic Technical
Support at any phase of integration for assistance. QLogic Technical Support
can be reached by the following methods:

Web     http://support.qlogic.com
Email   support@qlogic.com

Knowledge Database

The QLogic knowledge database is an extensive collection of QLogic product
information that you can search for specific solutions. We are constantly
adding to the collection of information in our database to provide answers to
your most urgent questions. Access the database from the QLogic Support Center:
http://support.qlogic.com.



1 Introduction

How this Guide is Organized

The QLogic OFED+ Host Software User Guide is organized into these sections:

• Section 1 provides an overview and describes interoperability.

• Section 2 describes how to set up your cluster to run high-performance MPI
  jobs.

• Section 3 describes the lower levels of the supplied QLogic OFED+ Host
  software. This section is of interest to a TrueScale cluster administrator.

• Section 4 helps the Message Passing Interface (MPI) programmer make the
  best use of the QLogic MPI implementation. Examples are provided for
  compiling and running MPI programs.

• Section 5 gives examples for compiling and running MPI programs with other
  MPI implementations.

• Section 6 describes QLogic Performance Scaled Messaging (PSM), which
  provides support for full Virtual Fabric (vFabric) integration, allowing
  users to specify an InfiniBand Service Level (SL) and Partition Key (PKey),
  or to provide a configured Service ID (SID) to target a vFabric.

• Section 7 describes dispersive routing in the InfiniBand® fabric, which
  avoids congestion hotspots by “spraying” messages across the multiple
  potential paths.

• Section 8 describes open-source Preboot Execution Environment (gPXE) boot,
  including installation and setup.

• Appendix A describes the most commonly used options to mpirun.

• Appendix B describes how to run QLogic’s performance measurement programs.

• Appendix C describes VirtualNIC interface configuration and administration,
  which provides virtual Ethernet connectivity.

• Appendix D describes SCSI RDMA Protocol (SRP) configuration, which allows
  the SCSI protocol to run over InfiniBand for Storage Area Network (SAN)
  usage.


• Appendix E describes two methods the administrator can use to allow users
  to submit MPI jobs through batch queuing systems.

• Appendix F provides information for troubleshooting installation, cluster
  administration, and MPI.

• Appendix G provides information for troubleshooting the upper layer
  protocol utilities in the fabric.

• Appendix H provides instructions for checking write combining and for using
  the Page Attribute Table (PAT) and Memory Type Range Registers (MTRR).

• Appendix I contains useful programs and files for debugging, as well as
  commands for common tasks.

• Appendix J contains a list of useful web sites and documents for a further
  understanding of the InfiniBand fabric and related information.

In addition, the QLogic InfiniBand Adapter Hardware Installation Guide contains
information on QLogic hardware installation, and the QLogic Fabric Software
Installation Guide contains information on QLogic software installation.

Overview

The material in this documentation pertains to a QLogic OFED cluster. A cluster
is defined as a collection of nodes, each attached to an InfiniBand-based
fabric through the QLogic interconnect.

The QLogic InfiniBand adapters are InfiniBand 4X adapters. The quad data rate
(QDR) adapters (QLE7340, QLE7342, QMH7342, and QME7342) have a raw data rate
of 40Gbps (data rate of 32Gbps). The double data rate (DDR) adapters (QLE7240
and QLE7280) have a raw data rate of 20Gbps (data rate of 16Gbps). The single
data rate (SDR) adapters (QLE7140) have a raw data rate of 10Gbps (data rate
of 8Gbps). In each case the data rate is the raw rate times 8/10, reflecting
the InfiniBand link’s 8b/10b encoding. The QLE7340, QLE7342, QMH7342, and
QME7342 adapters can also run in DDR or SDR mode, and the QLE7240 and QLE7280
can also run in SDR mode.

The QLogic adapters utilize standard, off-the-shelf InfiniBand 4X switches and
cabling. The QLogic interconnect is designed to work with all
InfiniBand-compliant switches.

NOTE:
If you are using the QLE7240 or QLE7280 and want to use DDR mode, then
DDR-capable switches must be used. Likewise, when using the QLE7300 series
adapters in QDR mode, a QDR switch must be used.
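After cabling, it is worth confirming that each link actually negotiated the
intended rate. The ibstatus utility (summarized in Appendix I) reports the link
state and rate for each port; the output below is only an illustrative sketch
for a QDR adapter, and device names (such as qib0), LIDs, and exact formatting
vary by system:

    # Query the state and negotiated rate of every InfiniBand port
    $ ibstatus
    Infiniband device 'qib0' port 1 status:
            base lid:        0x2
            sm lid:          0x1
            state:           4: ACTIVE
            phys state:      5: LinkUp
            rate:            40 Gb/sec (4X QDR)

A DDR-capable adapter connected through an SDR-only switch reports a lower rate
here, which is often the quickest way to spot a mismatched switch or cable.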


QLogic OFED+ software is interoperable with other vendors’ IBTA-compliant
InfiniBand adapters running compatible OFED releases. There are several options
for subnet management in your cluster:

• An embedded subnet manager can be used in one or more managed switches.
  QLogic offers the QLogic Embedded Fabric Manager (FM) for both DDR and QDR
  switch product lines supplied by your InfiniBand switch vendor.

• A host-based subnet manager can be used. QLogic provides the QLogic Fabric
  Manager (FM) as a part of the QLogic InfiniBand Fabric Suite.

Interoperability

QLogic OFED+ participates in the standard InfiniBand subnet management protocols for configuration and monitoring. Note that:

• QLogic OFED+ (including Internet Protocol over InfiniBand (IPoIB)) is interoperable with other vendors' InfiniBand adapters running compatible OFED releases.

• The QLogic MPI stack is not interoperable with other InfiniBand adapters and target channel adapters. Instead, it uses an InfiniBand-compliant, vendor-specific protocol that is highly optimized for QLogic MPI and the QLogic PSM API.

• In addition to supporting MPI over verbs, QLogic provides PSM, a high-performance, InfiniBand-compliant, vendor-specific protocol. MPIs run over PSM will not interoperate with other adapters.

NOTE:
See the OpenFabrics web site at www.openfabrics.org for more information on the OpenFabrics Alliance.



2 Step-by-Step Cluster Setup and MPI Usage Checklists

Cluster Setup

This section describes how to set up your cluster to run high-performance Message Passing Interface (MPI) jobs.

Perform the following tasks when setting up the cluster. These include BIOS, adapter, and system settings.

1. Make sure that hardware installation has been completed according to the instructions in the QLogic InfiniBand Adapter Hardware Installation Guide, and that software installation and driver configuration has been completed according to the instructions in the QLogic Fabric Software Installation Guide. To minimize management problems, the compute nodes of the cluster must have very similar hardware configurations and identical software installations. See "Homogeneous Nodes" on page 3-26 for more information.

2. Check that the BIOS is set properly according to the instructions in the QLogic InfiniBand Adapter Hardware Installation Guide.

3. Set up the Distributed Subnet Administration (SA) to correctly synchronize your virtual fabrics. See "QLogic Distributed Subnet Administration" on page 3-13.

4. Adjust settings, including setting the appropriate MTU size. See "Adapter and Other Settings" on page 3-26.

5. Remove unneeded services. QLogic recommends turning irqbalance off. See "Remove Unneeded Services" on page 3-28.

6. Disable powersaving features. See "Disable Powersaving Features" on page 3-29.

7. Check other performance tuning settings. See "Performance Settings and Management Tips" on page 3-25.

8. If using Intel® processors, turn off Hyper-Threading. See "Hyper-Threading" on page 3-29.


9. Set up the host environment to use ssh. Two methods are discussed in "Host Environment Setup for MPI" on page 3-29.

10. Verify the cluster setup. See "Checking Cluster and Software Status" on page 3-34.

Using MPI

1. Verify that the QLogic hardware and software has been installed on all the nodes you will be using, and that ssh is set up on your cluster (see all the steps in the Cluster Setup checklist).

2. Copy the examples to your working directory. See "Copy Examples" on page 4-3.

3. Make an mpihosts file that lists the nodes where your programs will run. See "Create the mpihosts File" on page 4-3.

4. Compile the example C program using the default wrapper script mpicc. Use mpirun to run it. See "Compile and Run an Example C Program" on page 4-4.

5. Try the examples with other programming languages, C++, Fortran 77, and Fortran 90, in "Examples Using Other Programming Languages" on page 4-5.

6. To test using other MPIs that run over PSM, such as MVAPICH, Open MPI, HP-MPI, Platform MPI, and Intel MPI, see Section 5 Using Other MPIs.

7. To switch between multiple versions of Open MPI, MVAPICH, and QLogic MPI, use the mpi-selector. See "Managing Open MPI, MVAPICH, and QLogic MPI with the mpi-selector Utility" on page 5-6; a short usage sketch follows this list.

8. Refer to "QLogic MPI Details" on page 4-6 for more information about QLogic MPI, and to "Performance Tuning" on page 4-23 to read more about runtime performance tuning.

9. Refer to Section 5 Using Other MPIs to learn about using other MPI implementations.
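For reference, a typical mpi-selector session might look like the following. The version names shown are illustrative and depend on which MPI RPMs are installed on your system, so treat this as a sketch rather than exact output:

$ mpi-selector --list
mvapich_gcc-1.2
openmpi_gcc-1.4-qlc
qlogic_mpi-1.0

$ mpi-selector --set qlogic_mpi-1.0
$ mpi-selector --query
default:qlogic_mpi-1.0
level:user

The new default typically takes effect in shells opened after the change.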



3 TrueScale Cluster Setup and Administration

Introduction

This section describes what the cluster administrator needs to know about the QLogic OFED+ software and system administration.

The TrueScale driver ib_qib, QLogic Performance Scaled Messaging (PSM), the accelerated Message-Passing Interface (MPI) stack, the protocol and MPI support libraries, and other modules are components of the QLogic OFED+ software. This software provides the foundation that supports the MPI implementation.

Figure 3-1 illustrates these relationships. Note that HP-MPI, Scali, MVAPICH, MVAPICH2, and Open MPI can run either over PSM or OpenFabrics User Verbs. The QLogic Virtual Network Interface Controller (VNIC) driver module is also illustrated in the figure.

Figure 3-1 shows MPI applications in user space running over QLogic MPI, or over HP-MPI, Scali, MVAPICH, MVAPICH2, and Open MPI by way of either the QLogic OFED+ communication library (PSM) or User Verbs; HP-MPI and Intel MPI by way of uDAPL; and the QLogic FM by way of the uMAD API. In kernel space, TCP/IP runs over IPoIB and VNIC on the QLogic OFED+ driver, which drives the QLogic InfiniBand adapter hardware.

Figure 3-1. QLogic OFED+ Software Structure


Installed Layout

This section describes the default installed layout for the QLogic OFED+ software and QLogic-supplied MPIs.

The QLogic MPI is installed in:
/usr/mpi/qlogic

The shared libraries are installed in:
/usr/mpi/qlogic/lib for 32-bit applications
/usr/mpi/qlogic/lib64 for 64-bit applications

MPI include files are in:
/usr/mpi/qlogic/include

MPI programming examples and the source for several MPI benchmarks are in:
/usr/mpi/qlogic/share/mpich/examples

NOTE:
If QLogic MPI is installed in an alternate location, the argument passed to --prefix (your location) replaces the default /usr/mpi/qlogic prefix. QLogic MPI binaries, documentation, and libraries are installed under that prefix. However, a few configuration files are installed in /etc regardless of the desired --prefix.

If you have installed the software into an alternate location, the $MPICH_ROOT environment variable needs to match --prefix.
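For example, if the software was installed with --prefix=/opt/mpi/qlogic (an illustrative path, not a default), each user of an sh or bash shell could add the following to a login script. The PATH addition is shown as a convenience and is an assumption, not a requirement stated by this guide:

export MPICH_ROOT=/opt/mpi/qlogic
export PATH=$MPICH_ROOT/bin:$PATH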

QLogic OFED+ utility programs are installed in:
/usr/bin

Documentation is found in:
/usr/share/man
/usr/share/doc/infinipath
/usr/share/doc/mpich-infinipath

License information is found only in /usr/share/doc/infinipath. QLogic OFED+ Host Software user documentation can be found on the QLogic web site on the software download page for your distribution.

Configuration files are found in:
/etc/sysconfig

Init scripts are found in:
/etc/init.d


The TrueScale driver modules in this release are installed in:
/lib/modules/$(uname -r)/updates/kernel/drivers/infiniband/hw/qib

Most of the other OFED modules are installed under the infiniband subdirectory. Other modules are installed under:
/lib/modules/$(uname -r)/updates/kernel/drivers/net

The RDS modules are installed under:
/lib/modules/$(uname -r)/updates/kernel/net/rds

QLogic-supplied Open MPI and MVAPICH RPMs with PSM support and compiled with GCC, PathScale, PGI, and the Intel compilers are now installed in directories using this format:
/usr/mpi/<compiler>/<mpi>-<mpi_version>-qlc

For example:
/usr/mpi/gcc/openmpi-1.4-qlc

TrueScale and OpenFabrics Driver Overview

The TrueScale ib_qib module provides low-level QLogic hardware support, and is the base driver for both MPI/PSM programs and general OpenFabrics protocols such as IPoIB and Sockets Direct Protocol (SDP). The driver also supplies the Subnet Management Agent (SMA) component.

Optional configurable OpenFabrics components and their default settings at startup are:

• IPoIB network interface. This component is required for TCP/IP networking for running Ethernet traffic over the TrueScale link. It is not running until it is configured.

• VNIC. It is not running until it is configured.

• OpenSM. This component is disabled at startup. It can be installed on one node as a master, with another node as a standby, or disabled on all nodes except where it will be used as an SM.

• SRP (OFED and QLogic modules). SRP is not running until the module is loaded and the SRP devices on the fabric have been discovered.

• MPI over uDAPL (can be used by Intel MPI or HP-MPI). IPoIB must be configured before MPI over uDAPL can be set up.

Other optional drivers can now be configured and enabled, as described in "IPoIB Network Interface Configuration" on page 3-4.


Complete information about starting, stopping, and restarting the QLogic OFED+ services is in "Managing the TrueScale Driver" on page 3-22.

IPoIB Network Interface Configuration

The following instructions show you how to manually configure your OpenFabrics IPoIB network interface. QLogic recommends using the QLogic OFED+ Host Software Installation package, which automatically installs the IPoIB network interface configuration. This example assumes that you are using sh or bash as your shell, all required QLogic OFED+ and OpenFabrics RPMs are installed, and your startup scripts have been run (either manually or at system boot).

For this example, the IPoIB network is 10.1.17.0 (one of the networks reserved for private use, and thus not routable on the Internet), with an 8-bit host portion. In this case, the netmask must be specified.

This example assumes that no hosts files exist, the host being configured has the IP address 10.1.17.3, and DHCP is not used.

NOTE:
Instructions are only for this static IP address case. Configuration methods for using DHCP will be supplied in a later release.

1. Type the following command (as a root user):
ifconfig ib0 10.1.17.3 netmask 0xffffff00

2. To verify the configuration, type:
ifconfig ib0
ifconfig ib1

The output from this command will be similar to:

ib0   Link encap:InfiniBand  HWaddr 00:00:00:02:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
      inet addr:10.1.17.3  Bcast:10.1.17.255  Mask:255.255.255.0
      UP BROADCAST RUNNING MULTICAST  MTU:4096  Metric:1
      RX packets:0 errors:0 dropped:0 overruns:0 frame:0
      TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
      collisions:0 txqueuelen:128
      RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)


3. Type:
ping -c 2 -b 10.1.17.255

The output of the ping command will be similar to the following, with a line for each host already configured and connected:

WARNING: pinging broadcast address
PING 10.1.17.255 (10.1.17.255) 56(84) bytes of data.
64 bytes from 10.1.17.3: icmp_seq=0 ttl=64 time=0.022 ms
64 bytes from 10.1.17.1: icmp_seq=0 ttl=64 time=0.070 ms (DUP!)
64 bytes from 10.1.17.7: icmp_seq=0 ttl=64 time=0.073 ms (DUP!)

The IPoIB network interface is now configured.

4. Restart (as a root user) by typing:
/etc/init.d/openibd restart

NOTE:
• The configuration must be repeated each time the system is rebooted.
• IPoIB-CM (Connected Mode) is enabled by default. The setting in /etc/infiniband/openib.conf is SET_IPOIB_CM=yes. To use datagram mode, change the setting to SET_IPOIB_CM=no.
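To avoid re-entering the static configuration after every reboot, the interface can instead be described in an ifcfg file, in the same style as the bonding examples later in this section. The following is a minimal sketch for the 10.1.17.3 host used above, assuming a Red Hat-style /etc/sysconfig/network-scripts/ifcfg-ib0 file (adjust the path for your distribution):

DEVICE=ib0
BOOTPROTO=static
IPADDR=10.1.17.3
NETMASK=255.255.255.0
ONBOOT=yes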

IPoIB Administration

Administering IPoIB

Stopping, Starting and Restarting the IPoIB Driver

QLogic recommends using the QLogic IFS Installer TUI to stop, start, and restart the IPoIB driver. Refer to the QLogic Fabric Software Installation Guide for more information. To use the command line to stop, start, and restart the IPoIB driver, use the following commands.

To stop the IPoIB driver, use the following command:
/etc/init.d/openibd stop

To start the IPoIB driver, use the following command:
/etc/init.d/openibd start


To restart the IPoIB driver, use the following command:
/etc/init.d/openibd restart

Configuring IPoIB

QLogic recommends using the QLogic IFS Installer TUI to configure the IPoIB driver. Refer to the QLogic Fabric Software Installation Guide for more information. To use the command line to configure the IPoIB driver, use the following commands.

Editing the IPoIB Configuration File

1. For each IP Link Layer interface, create an interface configuration file, /etc/sysconfig/network-scripts/ifcfg-NAME, where NAME is the value of the NAME field specified in the CREATE block. The following is a sample /etc/sysconfig/network-scripts/ifcfg-ib1 file:

DEVICE=ib1
BOOTPROTO=static
BROADCAST=192.168.18.255
IPADDR=192.168.18.120
NETMASK=255.255.255.0
ONBOOT=yes

NOTE:
For IPoIB, the INSTALL script for the adapter now helps the user create the ifcfg files.

2. After modifying the /etc/sysconfig/ipoib.cfg file, restart the IPoIB driver with the following:
/etc/init.d/openibd restart

IB Bonding

IB bonding is a high availability solution for IPoIB interfaces. It is based on the Linux Ethernet Bonding Driver and was adapted to work with IPoIB. The support for IPoIB interfaces is only for the active-backup mode; other modes should not be used. QLogic only supports bonding across Host Channel Adapter ports at this time. Bonding port 1 and port 2 on the same Host Channel Adapter is not supported in this release.


Interface Configuration Scripts

Create interface configuration scripts for the ibX and bondX interfaces. Once the configurations are in place, perform a server reboot or a service network restart. For SLES operating systems (OS), a server reboot is required. Refer to the following standard syntax for bonding configuration by the OS.

NOTE:
For all of the following OS configuration script examples that set MTU, MTU=65520 is valid only if all IPoIB slaves operate in connected mode and are configured with the same value. For IPoIB slaves that work in datagram mode, use MTU=2044. If the MTU is not set correctly or the MTU is not set at all (set to the default value), performance of the interface may be lower.

Red Hat EL4 Update 8

The following is an example for bond0 (master). The file is named /etc/sysconfig/network-scripts/ifcfg-bond0:

DEVICE=bond0
IPADDR=192.168.1.1
NETMASK=255.255.255.0
NETWORK=192.168.1.0
BROADCAST=192.168.1.255
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
TYPE=Bonding
MTU=65520
BONDING_OPTS="primary=ib0 updelay=0 downdelay=0"

The following is an example for ib0 (slave). The file is named /etc/sysconfig/network-scripts/ifcfg-ib0:

DEVICE=ib0
USERCTL=no
ONBOOT=yes
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
TYPE=InfiniBand
PRIMARY=yes


The following is an example for ib1 (slave 2). The file is named /etc/sysconfig/network-scripts/ifcfg-ib1:

DEVICE=ib1
USERCTL=no
ONBOOT=yes
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
TYPE=InfiniBand

Add the following lines to the file /etc/modprobe.conf:

alias bond0 bonding
options bond0 miimon=100 mode=1 max_bonds=1

Red Hat EL5, All Updates

The following is an example for bond0 (master). The file is named /etc/sysconfig/network-scripts/ifcfg-bond0:

DEVICE=bond0
IPADDR=192.168.1.1
NETMASK=255.255.255.0
NETWORK=192.168.1.0
BROADCAST=192.168.1.255
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
MTU=65520
BONDING_OPTS="primary=ib0 updelay=0 downdelay=0"

The following is an example for ib0 (slave). The file is named /etc/sysconfig/network-scripts/ifcfg-ib0:

DEVICE=ib0
USERCTL=no
ONBOOT=yes
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
TYPE=InfiniBand
PRIMARY=yes


The following is an example for ib1 (slave 2). The file is named /etc/sysconfig/network-scripts/ifcfg-ib1:

DEVICE=ib1
USERCTL=no
ONBOOT=yes
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
TYPE=InfiniBand

Add the following lines to the file /etc/modprobe.conf:

alias bond0 bonding
options bond0 miimon=100 mode=1 max_bonds=1

SuSE Linux Enterprise Server (SLES) 10 and 11

The following is an example for bond0 (master). The file is named /etc/sysconfig/network/ifcfg-bond0:

DEVICE="bond0"
TYPE="Bonding"
IPADDR="192.168.1.1"
NETMASK="255.255.255.0"
NETWORK="192.168.1.0"
BROADCAST="192.168.1.255"
BOOTPROTO="static"
USERCTL="no"
STARTMODE="onboot"
BONDING_MASTER="yes"
BONDING_MODULE_OPTS="mode=active-backup miimon=100 primary=ib0 updelay=0 downdelay=0"
BONDING_SLAVE0=ib0
BONDING_SLAVE1=ib1
MTU=65520


The following is an example for ib0 (slave). The file is named /etc/sysconfig/network/ifcfg-ib0:

DEVICE='ib0'
BOOTPROTO='none'
STARTMODE='off'
WIRELESS='no'
ETHTOOL_OPTIONS=''
NAME=''
USERCONTROL='no'
IPOIB_MODE='connected'

The following is an example for ib1 (slave 2). The file is named /etc/sysconfig/network/ifcfg-ib1:

DEVICE='ib1'
BOOTPROTO='none'
STARTMODE='off'
WIRELESS='no'
ETHTOOL_OPTIONS=''
NAME=''
USERCONTROL='no'
IPOIB_MODE='connected'

Verify the following line is set to the value of yes in /etc/sysconfig/boot:

RUN_PARALLEL="yes"

Verify IB Bonding is Configured

After the configuration scripts are updated and the service network is restarted (or the server is rebooted), use the following CLI commands to verify that IB bonding is configured:

cat /proc/net/bonding/bond0
ifconfig


Example of cat /proc/net/bonding/bond0 output:

# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.2.3 (December 6, 2007)

Bonding Mode: fault-tolerance (active-backup) (fail_over_mac)
Primary Slave: ib0
Currently Active Slave: ib0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: ib0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 80:00:04:04:fe:80

Slave Interface: ib1
MII Status: up
Link Failure Count: 0
Permanent HW addr: 80:00:04:05:fe:80


Example of ifconfig output:

st2169:/etc/sysconfig # ifconfig
bond0   Link encap:InfiniBand  HWaddr 80:00:00:02:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
        inet addr:192.168.1.1  Bcast:192.168.1.255  Mask:255.255.255.0
        inet6 addr: fe80::211:7500:ff:909b/64 Scope:Link
        UP BROADCAST RUNNING MASTER MULTICAST  MTU:65520  Metric:1
        RX packets:120619276 errors:0 dropped:0 overruns:0 frame:0
        TX packets:120619277 errors:0 dropped:137 overruns:0 carrier:0
        collisions:0 txqueuelen:0
        RX bytes:10132014352 (9662.6 Mb)  TX bytes:10614493096 (10122.7 Mb)

ib0     Link encap:InfiniBand  HWaddr 80:00:00:02:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
        UP BROADCAST RUNNING SLAVE MULTICAST  MTU:65520  Metric:1
        RX packets:118938033 errors:0 dropped:0 overruns:0 frame:0
        TX packets:118938027 errors:0 dropped:41 overruns:0 carrier:0
        collisions:0 txqueuelen:256
        RX bytes:9990790704 (9527.9 Mb)  TX bytes:10466543096 (9981.6 Mb)

ib1     Link encap:InfiniBand  HWaddr 80:00:00:02:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
        UP BROADCAST RUNNING SLAVE MULTICAST  MTU:65520  Metric:1
        RX packets:1681243 errors:0 dropped:0 overruns:0 frame:0
        TX packets:1681250 errors:0 dropped:96 overruns:0 carrier:0
        collisions:0 txqueuelen:256
        RX bytes:141223648 (134.6 Mb)  TX bytes:147950000 (141.0 Mb)

Subnet Manager Configuration

QLogic recommends using the QLogic Fabric Manager to manage your fabric. Refer to the QLogic Fabric Manager User Guide for information on configuring the QLogic Fabric Manager.

OpenSM is an optional component of the OpenFabrics project that provides a Subnet Manager (SM) for InfiniBand networks. This package can be installed on all machines, but only needs to be enabled on the machine in the cluster that will act as a subnet manager. You can't use OpenSM if any of your InfiniBand switches provide a subnet manager, or if you are running a host-based SM.

WARNING!!
Don't run OpenSM with QLogic FM in the same fabric.


If you are using the Installer tool, you can set the OpenSM default behavior at the time of installation.

OpenSM only needs to be enabled on the node that acts as the subnet manager, so use the chkconfig command (as a root user) to enable it on the node where it will be run:
chkconfig opensmd on

The command to disable it on reboot is:
chkconfig opensmd off

You can start opensmd without rebooting your machine by typing:
/etc/init.d/opensmd start

You can stop opensmd again by typing:
/etc/init.d/opensmd stop

If you want to pass any arguments to the OpenSM program, modify the following file, and add the arguments to the OPTIONS variable:
/etc/init.d/opensmd

For example, to use the UPDN algorithm instead of the Min Hop algorithm:
OPTIONS="-R updn"

For more information on OpenSM, see the OpenSM man pages, or look on the OpenFabrics web site.

QLogic Distributed Subnet Administration

As InfiniBand clusters are scaled into the Petaflop range and beyond, a more efficient method for handling queries to the Fabric Manager is required. While the Fabric Manager can configure and operate that many nodes, under certain conditions it can become overloaded with queries from those same nodes.

For example, consider an InfiniBand fabric consisting of 1,000 nodes, each with 4 processors. When a large MPI job is started across the entire fabric, each process needs to collect InfiniBand path records for every other node in the fabric, and every single process is going to be querying the subnet manager for these path records at roughly the same time. With 4,000 processes each querying paths to the 999 other nodes, this amounts to a total of roughly 3.9 million path queries just to start the job!

In the past, MPI implementations have side-stepped this problem by hand-crafting path records themselves, but this solution cannot be used if advanced fabric management techniques such as virtual fabrics and mesh/torus configurations are being used. In such cases, only the subnet manager itself has enough information to correctly build a path record between two nodes.


The Distributed Subnet Administration (SA) solves this problem by allowing each node to locally replicate the path records needed to reach the other nodes on the fabric. At boot time, each Distributed SA queries the subnet manager for information about the relevant parts of the fabric, backing off whenever the subnet manager indicates that it is busy. Once this information is in the Distributed SA's database, it is ready to answer local path queries from MPI or other InfiniBand applications. If the fabric changes (due to a switch failure or a node being added or removed from the fabric), the Distributed SA updates the affected portions of the database. The Distributed SA is installed and runs on every node in the fabric, except the management node running QLogic FM.

Applications that use Distributed SA

The QLogic PSM Library has been extended to take advantage of Distributed SA. Therefore, all MPIs that use the QLogic PSM library can take advantage of the Distributed SA. Other applications must be modified specifically to take advantage of it. For developers writing applications that use the Distributed SA, refer to the header file /usr/include/Infiniband/ofedplus_path.h for information on modifying applications to use the Distributed SA. This file can be found on any node where the Distributed SA is installed. For further assistance, please contact QLogic Support.

Virtual Fabrics and the Distributed SA

The Distributed SA is designed to be aware of Virtual Fabrics, but to only store records for those Virtual Fabrics that match Service ID records in the Distributed SA's configuration file. In addition, the Distributed SA recognizes when multiple SIDs match the same Virtual Fabric and will only store one copy of each path record within a Virtual Fabric. Next, SIDs that match more than one Virtual Fabric will be assigned to a single Virtual Fabric. Finally, Virtual Fabrics that do not match SIDs in the Distributed SA's database will be ignored.

Configuring the Distributed SA

In order to absolutely minimize the number of queries made by the Distributed SA, it is important to configure it correctly, both to match the configuration of the Fabric Manager and to exclude those portions of the fabric that will not be used by applications using the Distributed SA. The configuration file for the Distributed SA is named /etc/sysconfig/iba/qlogic_sa.conf.

Default Configuration

As shipped, the QLogic Fabric Manager creates a single virtual fabric, called "Default", and maps all nodes and Service IDs to it, and the Distributed SA ships with a configuration that lists a set of thirty-one Service IDs (SIDs 0x1000117500000000 through 0x100011750000000f and 0x1 through 0xf). This results in an arrangement like the one shown in Figure 3-2.


The InfiniBand fabric defines Virtual Fabric "Default" (Pkey 0xffff, SID range 0x0-0xffffffffffffffff); the Distributed SA stores Virtual Fabric "Default" (Pkey 0xffff) with SID ranges 0x1-0xf and 0x1000117500000000-0x100011750000000f.

Figure 3-2. Distributed SA Default Configuration

If you are using the QLogic FM in its default configuration, and you are using the standard QLogic PSM Service IDs, this arrangement will work fine and you will not need to modify the Distributed SA's configuration file. Notice, however, that the Distributed SA has restricted the range of Service IDs it cares about to those that were defined in its configuration file. Attempts to get path records using other SIDs will not work, even if those other SIDs are valid for the fabric.

Multiple Virtual Fabrics Example

A person configuring the physical InfiniBand fabric may want to limit how much InfiniBand bandwidth MPI applications are permitted to consume. In that case, they may re-configure the QLogic Fabric Manager, turning off the "Default" Virtual Fabric and replacing it with several other Virtual Fabrics.

In Figure 3-3, the administrator has divided the physical fabric into four virtual fabrics: "Admin" (used to communicate with the Fabric Manager), "Storage" (used by SRP), "PSM_MPI" (used by regular MPI jobs), and a special "Reserved" fabric for special high-priority jobs.


The InfiniBand fabric defines Virtual Fabric "Admin" (Pkey 0x7fff), Virtual Fabric "Reserved" (Pkey 0x8002, SID range 0x10-0x1f), Virtual Fabric "Storage" (Pkey 0x8001, SID 0x0000494353535250), and Virtual Fabric "PSM_MPI" (Pkey 0x8003, SID ranges 0x1-0xf and 0x1000117500000000-0x100011750000000f); the Distributed SA stores only the "PSM_MPI" records.

Figure 3-3. Distributed SA Multiple Virtual Fabrics Example

Because the Distributed SA was not configured to include the SID range 0x10 through 0x1f, it has simply ignored the "Reserved" VF. Adding those SIDs to the qlogic_sa.conf file solves the problem, as shown in Figure 3-4.

With the additional SIDs configured, the Distributed SA stores records for both Virtual Fabric "Reserved" (Pkey 0x8002, SID range 0x10-0x1f) and Virtual Fabric "PSM_MPI" (Pkey 0x8003, SID ranges 0x1-0xf and 0x1000117500000000-0x100011750000000f).

Figure 3-4. Distributed SA Multiple Virtual Fabrics Configured Example

Virtual Fabrics with Overlapping Definitions

As defined, SIDs should never be shared between Virtual Fabrics. Unfortunately, it is very easy to accidentally create such overlaps. Figure 3-5 shows an example with overlapping definitions.


The fabric defines Virtual Fabric "Default" (Pkey 0xffff, SID range 0x0-0xffffffffffffffff) together with Virtual Fabric "PSM_MPI" (Pkey 0x8002, SID ranges 0x1-0xf and 0x1000117500000000-0x100011750000000f), while the Distributed SA is looking for SID ranges 0x1-0xf and 0x1000117500000000-0x100011750000000f.

Figure 3-5. Virtual Fabrics with Overlapping Definitions

In Figure 3-5, the fabric administrator enabled the "PSM_MPI" Virtual Fabric without modifying the "Default" Virtual Fabric. As a result, the Distributed SA sees two different virtual fabrics that match its configuration file.

In Figure 3-6, the person administering the fabric has created two different Virtual Fabrics without turning off the Default, and two of the new fabrics have overlapping SID ranges.

The fabric defines Virtual Fabric "Reserved" (ID 2, Pkey 0x8003, SID range 0x1-0xf), Virtual Fabric "Default" (Pkey 0xffff, SID range 0x0-0xffffffffffffffff), and Virtual Fabric "PSM_MPI" (ID 1, Pkey 0x8002, SID ranges 0x1-0xf and 0x1000117500000000-0x100011750000000f), while the Distributed SA is looking for SID ranges 0x1-0xf and 0x1000117500000000-0x100011750000000f.

Figure 3-6. Virtual Fabrics with PSM_MPI Virtual Fabric Enabled

In Figure 3-6, the administrator enabled the "PSM_MPI" fabric, and then added a new "Reserved" fabric that uses one of the SID ranges that "PSM_MPI" uses. When a path query has been received, the Distributed SA deals with these conflicts as follows:


First, any virtual fabric with a Pkey of 0xffff is declared to be the "Default". The "Default" Virtual Fabric is treated as a special case by the Distributed SA: it is used only as a last resort. Stored SIDs are only mapped to the default if they do not match any other Virtual Fabrics. Thus, in the example of Figure 3-6, the Distributed SA will assign all the SIDs in its configuration file to the "PSM_MPI" Virtual Fabric, as shown in Figure 3-7.

The fabric defines Virtual Fabric "Default" (Pkey 0xffff, SID range 0x0-0xffffffffffffffff) and Virtual Fabric "PSM_MPI" (Pkey 0x8002, SID ranges 0x1-0xf and 0x1000117500000000-0x100011750000000f); the Distributed SA assigns all of its configured SIDs to "PSM_MPI".

Figure 3-7. Virtual Fabrics with all SIDs assigned to PSM_MPI Virtual Fabric

Second, the Distributed SA handles overlaps by taking advantage of the fact that Virtual Fabrics have unique numeric indexes. (These IDs can be seen by using the command "iba_saquery -o vfinfo".) The Distributed SA will always assign a SID to the Virtual Fabric with the lowest ID number, as shown in Figure 3-8. This ensures that all copies of the Distributed SA in the InfiniBand fabric will make the same decisions about assigning SIDs. However, it also means that the behavior of your fabric can be affected by the order in which you configured the virtual fabrics.

The fabric defines Virtual Fabric "Reserved" (ID 2, Pkey 0x8003, SID range 0x1-0xf), Virtual Fabric "Default" (Pkey 0xffff, SID range 0x0-0xffffffffffffffff), and Virtual Fabric "PSM_MPI" (ID 1, Pkey 0x8002, SID ranges 0x1-0xf and 0x1000117500000000-0x100011750000000f); the Distributed SA assigns the overlapping SIDs to "PSM_MPI", the Virtual Fabric with the lowest ID.

Figure 3-8. Virtual Fabrics with Unique Numeric Indexes


In Figure 3-8, the Distributed SA assigns all overlapping SIDs to the "PSM_MPI" fabric because it has the lowest index.

NOTE:
The Distributed SA makes these assignments not because they are right, but because they allow the fabric to work even though there are errors. The correct solution in these cases is to redefine the fabric so that no node will ever be a member of two Virtual Fabrics that service the same SID.

Distributed SA Configuration File

The Distributed SA configuration file is /etc/sysconfig/iba/qlogic_sa.conf. It has several settings, but normally administrators will only need to deal with two or three of them.

SID

The SID is the primary configuration setting for the Distributed SA, and it can be specified multiple times. When the Distributed SA starts, it loads information about the Virtual Fabrics specified by the Fabric Manager, and each SID is mapped to a single virtual fabric. Multiple SIDs can be mapped to the same virtual fabric. The default configuration for the Distributed SA includes all the SIDs defined in the default QLogic FM configuration for use by MPI.

The SID arguments have a very particular logic that must be understood for correct operation. A SID= argument defines one Service ID that is associated with a single virtual fabric. In addition, multiple SID= arguments can point to a single virtual fabric. For example, suppose a virtual fabric has three sets of SIDs associated with it: 0x0a1 through 0x0a3, 0x1a1 through 0x1a3, and 0x2a1 through 0x2a3. You would define this as:

SID=0x0a1
SID=0x0a2
SID=0x0a3
SID=0x1a1
SID=0x1a2
SID=0x1a3
SID=0x2a1
SID=0x2a2
SID=0x2a3

NOTE:
A SID of zero is not supported at this time. Instead, the OPP libraries treat zero values as "unspecified".


ScanFrequency

Periodically, the Distributed SA will completely resynchronize its database. This also occurs if the Fabric Manager is restarted. ScanFrequency defines the minimum number of seconds between complete resynchronizations. It defaults to 600 seconds, or 10 minutes. On very large fabrics, increasing this value can help reduce the total amount of SM traffic. For example, to set the interval to 15 minutes, add this line to the bottom of the qlogic_sa.conf file:

ScanFrequency=900

LogFile

Normally, the Distributed SA logs special events to /var/log/messages. This parameter allows you to specify a different destination for the log messages. For example, to direct Distributed SA messages to their own log, add this line to the bottom of the qlogic_sa.conf file:

LogFile=/var/log/SAReplica.log

Dbg

This parameter controls how much logging the Distributed SA will do. It can be set to a number between one and seven, where one indicates no logging and seven includes informational and debugging messages. To change the Dbg setting for the Distributed SA, find the line in qlogic_sa.conf that reads Dbg=5 and change it to a different value, between 1 and 7. The value of Dbg changes the amount of logging that the Distributed SA generates as follows:

• Dbg=1 or Dbg=2: Alerts and Critical Errors. Only errors that will cause the Distributed SA to terminate will be reported.

• Dbg=3: Errors. Errors will be reported, but nothing else. (Includes Dbg=1 and Dbg=2)

• Dbg=4: Warnings. Errors and warnings will be reported. (Includes Dbg=3)

• Dbg=5: Normal. Some normal events will be reported along with errors and warnings. (Includes Dbg=4)

• Dbg=6: Informational Messages. In addition to the normal logging, the Distributed SA will report detailed information about its status and operation. Generally, this will produce too much information for normal use. (Includes Dbg=5)


• Dbg=7: Debugging. This should only be turned on at the request of QLogic Support. It will generate so much information that system operation will be impacted. (Includes Dbg=6)

Other Settings

The remaining configuration settings for the Distributed SA are generally only useful in special circumstances and are not needed in normal operation. The sample qlogic_sa.conf configuration file contains a brief description of each.
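Putting the preceding settings together, a hand-edited qlogic_sa.conf might contain entries like the following. This is a minimal sketch using only the parameters described above; the SID values shown are from the default PSM set, and the other values are illustrative:

# Service IDs to replicate (one SID= line per Service ID)
SID=0x1000117500000000
SID=0x1

# Resynchronize with the subnet manager at most every 15 minutes
ScanFrequency=900

# Send Distributed SA messages to a dedicated log
LogFile=/var/log/SAReplica.log

# Normal logging level
Dbg=5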

MPI over uDAPL

Intel MPI can be run over uDAPL, which uses InfiniBand Verbs. uDAPL is the user mode version of the Direct Access Provider Library (DAPL), and is provided as a part of the OFED packages. IPoIB must also be configured.

The setup for Intel MPI is described in the following steps:

1. Make sure that DAPL 1.2 (not version 2.0) is installed on every node. In this release the packages are called compat-dapl. They can be installed with the QLogic OFED Host Software package.

2. Verify that there is a /etc/dat.conf file. The file dat.conf contains a list of interface adapters supported by uDAPL service providers. In particular, it must contain mapping entries for OpenIB-cma for dapl 1.2.x, in a form similar to this (all on one line):

OpenIB-cma u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib0 0" ""

3. On every node, type the following command (as a root user):

# modprobe rdma_ucm

To ensure that the module is loaded when the driver is loaded, add RDMA_UCM_LOAD=yes to the /etc/infiniband/openib.conf file. (Note that rdma_cm is also used, but it is loaded automatically.)

4. Bring up an IPoIB interface on every node, for example, ib0. See the instructions for configuring IPoIB for more details.
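As a quick sanity check before launching Intel MPI jobs, you can confirm that the module is loaded and that the dat.conf mapping is present. This is a sketch; the exact lsmod output varies by system:

# lsmod | grep rdma_ucm
# grep OpenIB-cma /etc/dat.conf
OpenIB-cma u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib0 0" ""

If the first command prints nothing, rdma_ucm is not loaded; if the second prints no mapping line, edit /etc/dat.conf as described in step 2.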

For more information on using Intel MPI, see Section 5.

Changing the MTU Size

The Maximum Transmission Unit (MTU) is set to 4K and enabled in the driver by default. To see the current MTU size, and the maximum supported by the adapter, type the command:

$ ibv_devinfo
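For example, to show just the MTU fields, the ibv_devinfo output can be filtered as follows. The values shown are illustrative and depend on the adapter and the driver settings:

$ ibv_devinfo | grep -i mtu
        max_mtu:        4096 (5)
        active_mtu:     4096 (5)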


To change the driver default back to 2K MTU, add this line (as a root user) into /etc/modprobe.conf (or /etc/modprobe.conf.local):

options ib_qib ibmtu=4

Restart the driver as described in "Managing the TrueScale Driver" on page 3-22.

NOTE:
To use 4K MTU, set the switch to have the same 4K default. If you are using QLogic switches, the following applies:

• For the externally managed 9024, use 4.2.2.0.3 firmware (9024DDR4KMTU_firmware.emfw) for the 9024 EM. This has the 4K MTU default, for use on fabrics where 4K MTU is required. If 4K MTU support is not required, then use the 4.2.2.0.2 DDR *.emfw file for DDR externally-managed switches. Use FastFabric (FF) to load the firmware on all the 9024s on the fabric.

• For the 9000 chassis, use the most recent 9000 code, 4.2.4.0.1. The 4K MTU support is in 9000 chassis version 4.2.1.0.2 and later. For the 9000 chassis, when the FastFabric 4.3 (or later) chassis setup tool is used, the user is asked to select an MTU. FastFabric can then set that MTU in all the 9000 internally managed switches. The change will take effect on the next reboot. Alternatively, for the internally managed 9000s, the ismChassisSetMtu Command Line Interface (CLI) command can be used. This should be executed on every switch and both hemispheres of the 9240s.

• For the 12000 switches, refer to the QLogic FastFabric User Guide for externally managed switches, and to the QLogic FastFabric CLI Reference Guide for the internally managed switches.

For reference, see the QLogic FastFabric User Guide and the QLogic 9000 CLI Reference Guide. Both are available from the QLogic web site.

For other switches, see the vendors' documentation.

Managing the TrueScale Driver

The startup script for ib_qib is installed automatically as part of the software installation, and normally does not need to be changed. It runs as a system service.

The primary configuration file for the TrueScale driver ib_qib and other modules and associated daemons is /etc/infiniband/openib.conf.


Normally, this configuration file is set up correctly at installation, and the drivers are loaded automatically during system boot once the RPMs have been installed. However, the ib_qib driver has several configuration variables that set reserved buffers for the software, define events to create trace records, and set the debug level.

If you are upgrading, your existing configuration files will not be overwritten.

See the ib_qib man page for more details.

Configure the TrueScale Driver State

Use the following commands to check or configure the state. These methods will not reboot the system.

To check the configuration state, use this command. You do not need to be a root user:

$ chkconfig --list openibd

To enable the driver, use the following command (as a root user):

# chkconfig --level 2345 openibd on

To disable the driver on the next system boot, use the following command (as a root user):

# chkconfig openibd off

NOTE:
This command does not stop and unload the driver if the driver is already loaded.

Start, Stop, or Restart TrueScale

Restart the software if you install a new QLogic OFED+ Host Software release, change driver options, or do manual testing.

QLogic recommends using the QLogic IFS Installer TUI to stop, start, and restart the IPoIB driver. Refer to the QLogic Fabric Software Installation Guide for more information. To use the command line to stop, start, and restart (as a root user) the TrueScale support, use the following syntax:

# /etc/init.d/openibd [start | stop | restart]

WARNING!!
If the QLogic Fabric Manager or OpenSM is configured and running on the node, it must be stopped before using the openibd stop command, and may be started after using the openibd start command.


WARNING!!
Stopping or restarting openibd terminates any QLogic MPI, VNIC, and SRP processes, as well as any OpenFabrics processes that are running at the time. QLogic recommends stopping all processes prior to using the openibd command.

This method will not reboot the system. The following set of commands shows how to use this script.

When you need to determine which TrueScale and OpenFabrics modules are running, use the following command. You do not need to be a root user:

$ lsmod | egrep 'ipath_|ib_|rdma_|findex'

You can check to see if opensmd is running by using the following command (as a root user); if there is no output, opensmd is not configured to run:

# /sbin/chkconfig --list opensmd | grep -w on

Unload the Driver/Modules Manually

You can also unload the driver/modules manually without using /etc/init.d/openibd. Use the following series of commands (as a root user):

# umount /ipathfs
# fuser -k /dev/ipath* /dev/infiniband/*
# lsmod | egrep '^ib_|^rdma_|^iw_' | xargs modprobe -r

TrueScale Driver Filesystem

The TrueScale driver supplies a filesystem for exporting certain binary statistics to user applications. By default, this filesystem is mounted in the /ipathfs directory when the TrueScale script is invoked with the start option (for example, at system startup). The filesystem is unmounted when the TrueScale script is invoked with the stop option (for example, at system shutdown).

Here is a sample layout of a system with two cards:

/ipathfs/0/flash
/ipathfs/0/port2counters
/ipathfs/0/port1counters
/ipathfs/0/portcounter_names
/ipathfs/0/counter_names
/ipathfs/0/counters
/ipathfs/driver_stats_names


/ipathfs/driver_stats
/ipathfs/1/flash
/ipathfs/1/port2counters
/ipathfs/1/port1counters
/ipathfs/1/portcounter_names
/ipathfs/1/counter_names
/ipathfs/1/counters

The driver_stats file contains general driver statistics. There is one numbered subdirectory per TrueScale device on the system. Each numbered subdirectory contains the following per-device files:

• port1counters
• port2counters
• flash

The port1counters and port2counters files contain counters for the device, for example, interrupts received, bytes and packets in and out, and so on. The flash file is an interface for internal diagnostic commands.

The file counter_names provides the names associated with each of the counters in the binary port#counters files, and the file driver_stats_names provides the names for the stats in the binary driver_stats files.

More Information on Configuring and Loading Drivers

See the modprobe(8), modprobe.conf(5), and lsmod(8) man pages for more information. Also see the file /usr/share/doc/initscripts-*/sysconfig.txt for more general information on configuration files.

Performance Settings and Management Tips

The following sections provide suggestions for improving performance and simplifying cluster management. Many of these settings will be done by the system administrator. User-level runtime performance settings are shown in "Performance Tuning" on page 4-23.


Homogeneous Nodes

To minimize management problems, the compute nodes of the cluster should have very similar hardware configurations and identical software installations. A mismatch between the TrueScale software versions can also cause problems. Old and new libraries must not be run within the same job. It may also be useful to distinguish between the TrueScale-specific drivers and those that are associated with kernel.org, OpenFabrics, or are distribution-built. The most useful tools are:

• ident (see "ident" on page I-9)
• ipathbug-helper (see "ipathbug-helper" on page I-10)
• ipath_checkout (see "ipath_checkout" on page I-10)
• ipath_control (see "ipath_control" on page I-12)
• mpirun (see "mpirun" on page I-15)
• rpm (see "rpm" on page I-16)
• strings (see "strings" on page I-17)

NOTE:
Run these tools to gather information before reporting problems and requesting support.

Adapter and Other Settings

The following adapter and other settings can be adjusted for better performance.

• Use an InfiniBand MTU of 4096 bytes instead of 2048 bytes, if available, with the QLE7340, QLE7342, QLE7240, QLE7280, and QLE7140. 4K MTU is enabled in the TrueScale driver by default. To change this setting for the driver, see "Changing the MTU Size" on page 3-21.

• Use a PCIe Max Read Request size of at least 512 bytes with the QLE7340, QLE7342, QLE7240, and QLE7280. QLE7240 and QLE7280 adapters can support sizes from 128 bytes to 4096 bytes in powers of two. This value is typically set by the BIOS.

For QLogic 7300 and 7200 Series Host Channel Adapters, to improve peak IB bandwidth on Nehalem and Harpertown CPU systems, set PCIe parameters as follows:

• Set PCIe Max Read Request to 4096 bytes.
• Set PCIe Max Payload to 256 bytes by adding, as root, the following line to the /etc/modprobe.conf file:

options ib_qib pcie_caps=0x51

The above should be sufficient on Intel Nehalem CPUs or newer.


On Intel Harpertown CPUs, it may be beneficial to add a<br />

pcie_coalesce=1 parameter to this line.<br />

  On AMD CPUs (PCIe Gen1), no ib_qib parameter changes are recommended.

  Alternatively, these PCIe parameters can also be set in the BIOS on some systems.

• Use a PCIe Max Payload size of 256 bytes, where available, with the QLE7340, QLE7342, QLE7240, and QLE7280. The QLE7240 and QLE7280 adapters can support 128, 256, or 512 bytes. This value is typically set by the BIOS as the minimum value supported by both the PCIe card and the PCIe root complex.

• Make sure that write combining is enabled. The x86 Page Attribute Table (PAT) mechanism that allocates Write Combining (WC) mappings for the PIO buffers has been added and is now the default. If PAT is unavailable or PAT initialization fails for some reason, the code will generate a message in the log and fall back to the MTRR mechanism. See Appendix H Write Combining for more information.

• Check the PCIe bus width. If slots have a smaller electrical width than mechanical width, lower than expected performance may occur. Use this command to check the PCIe bus width:

  $ ipath_control -iv

  This command also shows the link speed.

• Experiment with non-default CPU affinity while running single-process-per-node latency or bandwidth benchmarks. Latency may be slightly lower when using different CPUs (cores) from the default. On some chipsets, bandwidth may be higher when run from a non-default CPU or core. See "Performance Tuning" on page 4-23 for more information on using taskset with QLogic MPI. With another MPI, consult its documentation to see how to force a benchmark to run with a CPU affinity different from the default. With OFED micro-benchmarks such as those from the qperf or perftest suites, taskset will work for setting CPU affinity; see the example below.
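  For example, to pin a perftest benchmark to a particular core (a sketch; the core number and benchmark name are illustrative):

  $ taskset -c 1 ib_write_bw

  Repeat with different core numbers to find the best-performing mapping for your system.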

• Turn off C-states to improve MPI latency (ping-pong) benchmarks on Nehalem systems. In the BIOS, look for advanced CPU settings, and set the C-State parameter to "disable."


• Allocate all chip resources to a single port on a dual-port card. The singleport parameter, when set to a non-zero value at driver load, causes dual-port TrueScale cards to act as single-port cards, with only InfiniBand port 1 enabled. The board identification string is not affected (that is, a QLE7342 with singleport set will still be identified as a QLE7342, not a QLE7340). The default value of this parameter is 0; however, it may be set to 1 during the IFS installation process.

  By default, the QLogic Installation TUI will configure a dual-port card so that all chip resources are directed to port 1 to maximize performance. This also saves power by not enabling the second port. If you want to use both ports, specify dual ports when the installation asks about single-port operation.
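  If you prefer to set the parameter by hand rather than through the Installation TUI, a line such as the following can be added to /etc/modprobe.conf (a sketch; use /etc/modprobe.conf.local on SLES, and restart the driver afterward):

  options ib_qib singleport=1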

Remove Unneeded Services

The cluster administrator can enhance application performance by minimizing the set of system services running on the compute nodes. Since these are presumed to be specialized computing appliances, they do not need many of the service daemons normally running on a general Linux computer.

Following are several groups constituting a minimal necessary set of services. These are all services controlled by chkconfig. To see the list of services that are enabled, use the command:

$ /sbin/chkconfig --list | grep -w on

Basic network services are:

• network
• ntpd
• syslog
• xinetd
• sshd

For system housekeeping, use:

• anacron
• atd
• crond

If you are using Network File System (NFS) or yellow pages (yp) passwords:

• rpcidmapd
• ypbind
• portmap
• nfs
• nfslock
• autofs

To watch for disk problems, use:


• smartd
• readahead

The service comprising the TrueScale driver and SMA is:

• openibd

Other services may be required by your batch queuing system or user community.

If your system is running the daemon irqbalance, QLogic recommends turning it off. Disabling irqbalance will enable more consistent performance with programs that use interrupts. Use this command:

# /sbin/chkconfig irqbalance off

See "Erratic Performance" on page F-9 for more information.

Disable Powersaving Features

If you are running benchmarks or large numbers of short jobs, it is beneficial to disable the powersaving features, since these features may be slow to respond to changes in system load.

For RHEL4 and RHEL5, run this command as a root user:

# /sbin/chkconfig --level 12345 cpuspeed off

For SLES 10 and SLES 11, run this command as a root user:

# /sbin/chkconfig --level 12345 powersaved off

After running either of these commands, reboot the system for the changes to take effect.

Hyper-Threading

If you are using Intel NetBurst® processors that support Hyper-Threading, QLogic recommends turning off Hyper-Threading in the BIOS, which will provide more consistent performance. You can check and adjust this setting using the BIOS Setup utility. For specific instructions, follow the hardware documentation that came with your system.

Host Environment Setup for MPI

After the QLogic OFED+ Host software and the GNU (GCC) compilers have been installed on all the nodes, the host environment can be set up for running MPI programs.


Configuring for ssh

Running MPI programs with the command mpirun on a TrueScale cluster depends, by default, on the secure shell ssh to launch node programs on the nodes. In QLogic MPI, mpirun uses the secure shell command ssh to start instances of the given MPI program on the remote compute nodes without the need for interactive password entry on every node.

To use ssh, you must have generated Rivest, Shamir, Adleman (RSA) or Digital Signature Algorithm (DSA) keys, public and private. The public keys must be distributed and stored on all the compute nodes so that connections to the remote machines can be established without supplying a password.

You or your administrator must set up the ssh keys and associated files on the cluster. There are two methods for setting up ssh on your cluster. The first method, the shosts.equiv mechanism, is typically set up by the cluster administrator. The second method, using ssh-agent, is more easily accomplished by an individual user.

NOTE:
• rsh can be used instead of ssh. To use rsh, set the environment variable MPI_SHELL=rsh. See "Environment Variables" on page 4-20 for information on setting environment variables. Also see "Shell Options" on page A-6 for information on setting shell options in mpirun.
• rsh has a limit on the number of concurrent connections it can have, typically 255, which may limit its use on larger clusters.

Configuring ssh and sshd Using shosts.equiv

This section describes how the cluster administrator can set up ssh and sshd through the shosts.equiv mechanism. This method is recommended, provided that your cluster is behind a firewall and accessible only to trusted users. "Configuring for ssh Using ssh-agent" on page 3-32 shows how an individual user can accomplish the same thing using ssh-agent.

The example in this section assumes the following:

• Both the cluster nodes and the front end system are running the openssh package as distributed in current Linux systems.
• All cluster end users have accounts with the same account name on the front end and on each node, by using Network Information Service (NIS) or another means of distributing the password file.
• The front end used in this example is called ip-fe.


• Root or superuser access is required on ip-fe and on each node to configure ssh.
• ssh, including the host's key, has already been configured on the system ip-fe. See the sshd and ssh-keygen man pages for more information.

To use shosts.equiv to configure ssh and sshd:

1. On the system ip-fe (the front end node), change the /etc/ssh/ssh_config file to allow host-based authentication. Specifically, this file must contain the following four lines, all set to yes. If the lines are already there but commented out (with an initial #), remove the #.

   RhostsAuthentication yes
   RhostsRSAAuthentication yes
   HostbasedAuthentication yes
   EnableSSHKeysign yes

2. On each of the TrueScale node systems, create or edit the file /etc/ssh/shosts.equiv, adding the name of the front end system. Add the line:

   ip-fe

   Change the file to mode 600 when you are finished editing.
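   For example, this step can be scripted as follows (a sketch; run as root on each node, with ip-fe standing in for your front end's name):

   # echo ip-fe >> /etc/ssh/shosts.equiv
   # chmod 600 /etc/ssh/shosts.equiv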

3. On each of the TrueScale node systems, create or edit the file /etc/ssh/ssh_known_hosts. You will need to copy the contents of the file /etc/ssh/ssh_host_dsa_key.pub from ip-fe to this file (as a single line), and then edit that line to insert ip-fe ssh-dss at the beginning of the line. This is very similar to the standard known_hosts file for ssh. An example line might look like this (displayed as multiple lines, but a single line in the file):

   ip-fe ssh-dss
   AAzAB3NzaC1kc3MAAACBAPoyES6+Akk+z3RfCkEHCkmYuYzqL2+1nwo4LeTVW
   pCD1QsvrYRmpsfwpzYLXiSJdZSA8hfePWmMfrkvAAk4ueN8L3ZT4QfCTwqvHV
   vSctpibf8n
   aUmzloovBndOX9TIHyP/Ljfzzep4wL17+5hr1AHXldzrmgeEKp6ect1wxAAAA
   FQDR56dAKFA4WgAiRmUJailtLFp8swAAAIBB1yrhF5P0jO+vpSnZrvrHa0Ok+
   Y9apeJp3sessee30NlqKbJqWj5DOoRejr2VfTxZROf8LKuOY8tD6I59I0vlcQ
   812E5iw1GCZfNefBmWbegWVKFwGlNbqBnZK7kDRLSOKQtuhYbGPcrVlSjuVps
   fWEju64FTqKEetA8l8QEgAAAIBNtPDDwdmXRvDyc0gvAm6lPOIsRLmgmdgKXT
   GOZUZ0zwxSL7GP1nEyFk9wAxCrXv3xPKxQaezQKs+KL95FouJvJ4qrSxxHdd1
   NYNR0DavEBVQgCaspgWvWQ8cL
   0aUQmTbggLrtD9zETVU5PCgRlQL6I3Y5sCCHuO7/UvTH9nneCg==

   Change the file to mode 600 when you are finished editing.


4. On each node, the system file /etc/ssh/sshd_config must be edited, so that the following four lines are uncommented (no # at the start of the line) and set to yes. (These lines are usually there, but are commented out and set to no by default.)

   RhostsAuthentication yes
   RhostsRSAAuthentication yes
   HostbasedAuthentication yes
   PAMAuthenticationViaKbdInt yes

5. After creating or editing the three files in Steps 2, 3, and 4, sshd must be restarted on each system. If you are already logged in via ssh (or any other user is logged in via ssh), their sessions or programs will be terminated, so restart only on idle nodes. Type the following (as root) to notify sshd to use the new configuration files:

   # killall -HUP sshd

   NOTE:
   This command terminates all ssh sessions into that system. Run it from the console, or have a way to log into the console in case of any problem.

At this point, any end user should be able to log in to the ip-fe front end system and use ssh to log in to any TrueScale node without being prompted for a password or passphrase.

Configuring for ssh Using ssh-agent

The ssh-agent, a daemon that caches decrypted private keys, can be used to store the keys. Use ssh-add to add your private keys to ssh-agent's cache. When ssh establishes a new connection, it communicates with ssh-agent to acquire these keys, rather than prompting you for a passphrase.

The process is described in the following steps:

1. Create a key pair. Use the default file name, and be sure to enter a passphrase.

   $ ssh-keygen -t rsa

2. Enter a passphrase for your key pair when prompted. Note that the key agent does not survive X11 logout or system reboot:

   $ ssh-add

3. The following command tells ssh that your key pair should let you in:

   $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys


   Edit the ~/.ssh/config file so that it reads like this:

   Host *
   ForwardAgent yes
   ForwardX11 yes
   CheckHostIP no
   StrictHostKeyChecking no

   This file forwards the key agent requests back to your desktop. When you log into a front end node, you can use ssh to compute nodes without passwords.

4. Follow your administrator's cluster policy for setting up ssh-agent on the machine where you will be running ssh commands. Alternatively, you can start the ssh-agent by adding the following line to your ~/.bash_profile (or equivalent in another shell):

   eval `ssh-agent`

   Use back quotes rather than single quotes. Programs started in your login shell can then locate the ssh-agent and query it for keys.

5. Finally, test by logging into the front end node, and from the front end node to a compute node, as follows:

   $ ssh frontend_node_name
   $ ssh compute_node_name

For more information, see the man pages for ssh(1), ssh-keygen(1), ssh-add(1), and ssh-agent(1).

Process Limitation with ssh

Process limitation with ssh is primarily an issue when using the mpirun option -distributed=off. The default setting is now -distributed=on; therefore, in most cases, ssh process limitations will not be encountered. This limitation for the -distributed=off case is described in the following paragraph. See "Process Limitation with ssh" on page F-19 for an example of an error message associated with this limitation.

MPI jobs that use more than 10 processes per node may encounter an ssh throttling mechanism that limits the number of concurrent per-node connections to 10. If you need to use more processes, you or your system administrator must increase the value of MaxStartups in your /etc/ssh/sshd_config file.
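For example, the relevant line in /etc/ssh/sshd_config might be changed to read as follows (a sketch; the value 64 is illustrative and should be at least the number of processes you start per node). Restart sshd afterward, as in Step 5 of "Configuring ssh and sshd Using shosts.equiv":

MaxStartups 64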


Checking Cluster and Software Status

ipath_control

InfiniBand status, link speed, and PCIe bus width can be checked by running the program ipath_control. Sample usage and output are as follows:

$ ipath_control -iv
$Id: QLogic OFED Release 1.4.2 $ $Date: 2009-03-10-10:15 $
0: Version: ChipABI 2.0, InfiniPath_QLE7280, InfiniPath1 5.2, PCI 2, SW Compat 2
0: Status: 0xe1 Initted Present IB_link_up IB_configured
0: LID=0x1f MLID=0xc042 GUID=00:11:75:00:00:ff:89:a6 Serial: AIB0810A30297
0: HRTBT:Auto RX_polarity_invert:Auto RX_lane_reversal: Auto
0: LinkWidth:4X of 1X|4X Speed:DDR of SDR|DDR
0: LocalBus: PCIe,2500MHz,x16


iba_opp_query

iba_opp_query is used to check the operation of the Distributed SA. You can run it from any node to verify that the replica on that node is working correctly. See "iba_opp_query" on page I-4 for detailed usage information.

# iba_opp_query --slid 0x31 --dlid 0x75 --sid 0x107
Query Parameters:
   resv1        0x0000000000000107
   dgid         ::
   sgid         ::
   dlid         0x75
   slid         0x31
   hop          0x0
   flow         0x0
   tclass       0x0
   num_path     0x0
   pkey         0x0
   qos_class    0x0
   sl           0x0
   mtu          0x0
   rate         0x0
   pkt_life     0x0
   preference   0x0
   resv2        0x0
   resv3        0x0
Using HCA qib0
Result:
   resv1        0x0000000000000107
   dgid         fe80::11:7500:79:e54a
   sgid         fe80::11:7500:79:e416
   dlid         0x75
   slid         0x31
   hop          0x0
   flow         0x0
   tclass       0x0
   num_path     0x0
   pkey         0xffff
   qos_class    0x0
   sl           0x1
   mtu          0x4
   rate         0x6
   pkt_life     0x10
   preference   0x0
   resv2        0x0
   resv3        0x0

ibstatus

Another useful program is ibstatus. Sample usage and output are as follows:

$ ibstatus
Infiniband device 'qib0' port 1 status:
        default gid:   fe80:0000:0000:0000:0011:7500:00ff:89a6
        base lid:      0x1f
        sm lid:        0x1
        state:         4: ACTIVE
        phys state:    5: LinkUp
        rate:          20 Gb/sec (4X DDR)

ibv_devinfo

ibv_devinfo queries RDMA devices. Use the -v option to see more information. Sample usage:

$ ibv_devinfo
hca_id: qib0
        fw_ver:           0.0.0
        node_guid:        0011:7500:00ff:89a6
        sys_image_guid:   0011:7500:00ff:89a6
        vendor_id:        0x1175
        vendor_part_id:   29216
        hw_ver:           0x2
        board_id:         InfiniPath_QLE7280
        phys_port_cnt:    1
        port:   1
                state:        PORT_ACTIVE (4)
                max_mtu:      4096 (5)
                active_mtu:   4096 (5)
                sm_lid:       1
                port_lid:     31
                port_lmc:     0x00


ipath_checkout

ipath_checkout is a bash script that verifies that the installation is correct and that all the nodes of the network are functioning and mutually connected by the TrueScale fabric. It must be run on a front end node, and requires specification of a nodefile. For example:

$ ipath_checkout [options] nodefile

The nodefile lists the hostnames of the nodes of the cluster, one hostname per line. The format of nodefile is as follows:

hostname1
hostname2
...

For more information on these programs, see "ipath_control" on page I-12, "ibstatus" on page I-7, and "ipath_checkout" on page I-10.



4 Running QLogic MPI on QLogic Adapters

Introduction

QLogic MPI

This section provides information on using the QLogic Message-Passing Interface (MPI). Examples are provided for setting up the user environment, and for compiling and running MPI programs.

The MPI standard is a message-passing library or collection of routines used in distributed-memory parallel programming. It is used in data exchange and task synchronization between processes. The goal of MPI is to provide portability and efficient implementation across different platforms and architectures.

QLogic's implementation of the MPI standard is derived from the MPICH reference implementation version 1.2.7. The QLogic MPI (TrueScale) libraries have been highly tuned for the QLogic interconnect, and will not run over other interconnects.

QLogic MPI is an implementation of the original MPI 1.2 standard. The MPI-2 standard provides several enhancements of the original standard. Of the MPI-2 features, QLogic MPI includes only the MPI-IO features implemented in ROMIO version 1.2.6 and the generalized MPI_Alltoallw communication exchange.

The QLogic MPI implementation in this release supports hybrid MPI/OpenMP and other multi-threaded programs, as long as only one thread uses MPI. For more information, see "QLogic MPI and Hybrid MPI/OpenMP Applications" on page 4-25.


PSM

The PSM TrueScale Messaging API, or PSM API, is QLogic's low-level user-level communications interface for the TrueScale family of products. Other than using some environment variables with the PSM prefix, MPI users typically need not interact directly with PSM. The PSM environment variables apply to other MPI implementations as long as the environment with the PSM variables is correctly forwarded. See "Environment Variables" on page 4-20 for a summary of the commonly used environment variables.

For more information on PSM, email QLogic at support@qlogic.com.

Other MPIs

In addition to QLogic MPI, other high-performance MPIs such as HP-MPI version 2.3, Open MPI version 1.4, Ohio State University MVAPICH version 1.2, MVAPICH2 version 1.4, and Scali (Platform) MPI have been ported to the PSM interface.

Open MPI, MVAPICH, HP-MPI, and Scali also run over InfiniBand Verbs (the OpenFabrics Alliance API that provides support for user-level upper-layer protocols such as MPI). Intel MPI, although not ported to the PSM interface, is supported over uDAPL, which uses InfiniBand Verbs. For more information, see Section 5 Using Other MPIs.

Linux File I/O in MPI Programs

MPI node programs are Linux programs that can execute file I/O operations to local or remote files in the usual ways, through the APIs of the language in use. Remote files are accessed via a network file system, typically NFS. Parallel programs usually need to have some data in files to be shared by all of the processes of an MPI job. Node programs can also use non-shared, node-specific files, such as for scratch storage for intermediate results or for a node's share of a distributed database.

There are different ways of handling file I/O of shared data in parallel programming. You may have one process, typically on the front end node or on a file server, that is the only process to touch the shared files; it passes data to and from the other processes via MPI messages. Alternatively, the shared data files can be accessed directly by each node program. In this case, the shared files are available through some network file support, such as NFS. Also in this case, the application programmer is responsible for ensuring file consistency, either through proper use of file locking mechanisms offered by the operating system and the programming language, such as fcntl in C, or by using MPI synchronization operations.


MPI-IO with ROMIO

MPI-IO is the part of the MPI-2 standard that supports collective and parallel file I/O operations. One advantage of using MPI-IO is that it can take care of managing file locks when file data is shared among nodes.

QLogic MPI includes ROMIO version 1.2.6, a high-performance, portable implementation of MPI-IO from Argonne National Laboratory. ROMIO includes everything defined in the I/O chapter of the MPI-2 standard except support for file interoperability and user-defined error handlers for files. Of the MPI-2 features, QLogic MPI includes only the MPI-IO features implemented in ROMIO version 1.2.6 and the generalized MPI_Alltoallw communication exchange. See the ROMIO documentation at http://www.mcs.anl.gov/romio for details.

NFS, PanFS, and local (UFS) support is enabled.

Getting Started with MPI

This section shows how to compile and run some simple example programs that are included in the QLogic OFED+ Host software product. Compiling and running these examples enables you to verify that QLogic MPI and its components have been properly installed on the cluster. See "QLogic MPI Troubleshooting" on page F-11 if you have problems compiling or running these examples.

These examples assume that your cluster's policy allows you to use the mpirun script directly, without having to submit the job to a batch queuing system.

Copy Examples

Start by copying the examples to your working directory:

$ cp /usr/mpi/qlogic/share/mpich/examples/basic/* .

or

$ cp /usr/share/mpich/examples/basic/* .

Create the mpihosts File

Next, create an MPI hosts file in the same working directory. It contains the host names of the nodes in your cluster that run the examples, with one host name per line. Name this file mpihosts. The contents can be in the following format:

hostname1
hostname2
...

More details on the mpihosts file can be found in "mpihosts File Details" on page 4-15.


Compile and Run an Example C Program

In this step you will compile and run your MPI program.

QLogic MPI uses some shell scripts to find the appropriate include files and libraries for each supported language. Use the script mpicc to compile an MPI program in C, and the script mpirun to execute the file.

The supplied example program cpi.c computes an approximation to pi. First, compile it to an executable named cpi. For example:

$ mpicc -o cpi cpi.c

By default, mpicc runs the GNU gcc compiler, and is used for both compiling and linking, the same function as the gcc command.

NOTE:
For information on using other compilers, see "To Use Another Compiler" on page 4-9.

Then, run the program with several different specifications for the number of processes:

$ mpirun -np 2 -m mpihosts ./cpi
Process 0 on hostname1
Process 1 on hostname2
pi is approximately 3.1416009869231241,
Error is 0.0000083333333309
wall clock time = 0.000149

In this example, ./cpi designates the executable of the example program in the working directory. The -np parameter to mpirun defines the number of processes to be used in the parallel computation. Here is an example with four processes, using the same two hosts in the mpihosts file:

$ mpirun -np 4 -m mpihosts ./cpi
Process 3 on hostname1
Process 0 on hostname2
Process 2 on hostname2
Process 1 on hostname1
pi is approximately 3.1416009869231249,
Error is 0.0000083333333318
wall clock time = 0.000603


Generally, mpirun tries to distribute the specified number of processes evenly among the nodes listed in the mpihosts file. However, if the number of processes exceeds the number of nodes listed in the mpihosts file, then some nodes will be assigned more than one instance of the program.

When you run the program several times with the same value of the -np parameter, the output lines may display in different orders. This is because they are issued by independent asynchronous processes, so their order is non-deterministic.

Details on other ways of specifying the mpihosts file are provided in "mpihosts File Details" on page 4-15.

More information on the mpirun options is in "Using mpirun" on page 4-16 and Appendix A mpirun Options Summary. "Process Allocation" on page 4-11 explains how processes are allocated by using hardware and software contexts.

Examples Using Other Programming Languages

This section gives similar examples for computing pi in Fortran 77 and Fortran 90. Fortran 95 usage is similar to Fortran 90. The C++ example uses the traditional "Hello, World" program. All programs are located in the same directory.

fpi.f is a Fortran 77 program that computes pi in a way similar to cpi.c. Compile, link, and run it as follows:

$ mpif77 -o fpi fpi.f
$ mpirun -np 2 -m mpihosts ./fpi

pi3f90.f90 is a Fortran 90 program that does the same computation. Compile, link, and run it as follows:

$ mpif90 -o pi3f90 pi3f90.f90
$ mpirun -np 2 -m mpihosts ./pi3f90

The C++ program hello++.cc is a parallel processing version of the traditional "Hello, World" program. Notice that this version makes use of the external C bindings of the MPI functions if the C++ bindings are not present.

D000046-005 B 4-5


4–Running <strong>QLogic</strong> MPI on <strong>QLogic</strong> Adapters<br />

<strong>QLogic</strong> MPI Details<br />

Compile and run it as follows:

$ mpicxx -o hello hello++.cc
$ mpirun -np 10 -m mpihosts ./hello
Hello World! I am 9 of 10
Hello World! I am 2 of 10
Hello World! I am 4 of 10
Hello World! I am 1 of 10
Hello World! I am 7 of 10
Hello World! I am 6 of 10
Hello World! I am 3 of 10
Hello World! I am 0 of 10
Hello World! I am 5 of 10
Hello World! I am 8 of 10

Each of the scripts invokes the GNU compiler for the respective language and the linker. See "To Use Another Compiler" on page 4-9 for an example of how to use other compilers. The use of mpirun is the same for programs in all languages.

QLogic MPI Details

The following sections provide more details on the use of QLogic MPI. These sections assume that you are familiar with standard MPI. For more information, see the references in "References for MPI" on page J-1. This implementation includes the man pages from the MPICH implementation for the numerous MPI functions.

4-6 D000046-005 B


4–Running <strong>QLogic</strong> MPI on <strong>QLogic</strong> Adapters<br />

<strong>QLogic</strong> MPI Details<br />

Use Wrapper Scripts for Compiling and Linking

The scripts in Table 4-1 invoke the compiler and linker for programs in each of the respective languages, and take care of referring to the correct include files and libraries in each case.

Table 4-1. QLogic MPI Wrapper Scripts

Wrapper Script Name   Language
mpicc                 C
mpicxx                C++
mpif77                Fortran 77
mpif90                Fortran 90
mpif95                Fortran 95

On x86_64, these scripts (by default) call the GNU compiler and linker. To use other compilers, see "To Use Another Compiler" on page 4-9.

These scripts all provide the command line options listed in Table 4-2.

Table 4-2. Command Line Options for Scripts

Command         Meaning
-help           Provides help
-show           Lists each of the compiling and linking commands that would
                be called without actually calling them
-echo           Gets verbose output of all the commands in the script
-compile_info   Shows how to compile a program
-link_info      Shows how to link a program

In addition, each of these scripts allows a command line option for specifying a different compiler/linker as an alternative to the GNU Compiler Collection (GCC). For more information, see "To Use Another Compiler" on page 4-9.

Most other command line options are passed on to the invoked compiler and linker. The GNU compiler and alternative compilers all accept numerous command line options. See the GCC compiler documentation and the man pages for gcc and gfortran for complete information on available options. See the corresponding documentation for any other compiler/linker you may call for its options. Man pages for mpif90(1), mpif77(1), mpicc(1), and mpiCC(1) are available.

D000046-005 B 4-7


4–Running <strong>QLogic</strong> MPI on <strong>QLogic</strong> Adapters<br />

<strong>QLogic</strong> MPI Details<br />

Configuring MPI Programs for QLogic MPI

When configuring an MPI program (generating header files and/or Makefiles) for QLogic MPI, you usually need to specify mpicc, mpicxx, and so on as the compiler, rather than gcc, g++, etc.

Specifying the compiler is typically done with commands similar to the following, assuming that you are using sh or bash as the shell:

$ export CC=mpicc
$ export CXX=mpicxx
$ export F77=mpif77
$ export F90=mpif90
$ export F95=mpif95

The shell variables will vary with the program being configured, but these examples show frequently used variable names. If you use csh, use commands similar to the following:

$ setenv CC mpicc

You may need to pass arguments to configure directly, for example:

$ ./configure -cc=mpicc -fc=mpif77 -c++=mpicxx -c++linker=mpicxx

You may also need to edit a Makefile to achieve this result, adding lines similar to:

CC=mpicc
F77=mpif77
F90=mpif90
F95=mpif95
CXX=mpicxx

In some cases, the configuration process may specify the linker. QLogic recommends that the linker be specified as mpicc, mpif90, etc. in these cases. This specification automatically includes the correct flags and libraries, rather than trying to configure to pass the flags and libraries explicitly. For example:

LD=mpicc
LD=mpif90

These scripts pass appropriate options to the various compiler passes to include header files, required libraries, etc. While the same effect can be achieved by passing the arguments explicitly as flags, the required arguments may vary from release to release, so it is good practice to use the provided scripts.

4-8 D000046-005 B


4–Running <strong>QLogic</strong> MPI on <strong>QLogic</strong> Adapters<br />

<strong>QLogic</strong> MPI Details<br />

To Use Another Compiler

QLogic MPI, and all other MPIs that run on TrueScale, support a number of compilers in addition to the default GNU Compiler Collection (GCC, including gcc, g++, and gfortran; g77 is not supported), versions 3.3 and later. These include the PathScale Compiler Suite 3.0, 3.1, and 3.2; PGI 5.2, 6.0, 7.1, 8.0, and 9.0; and Intel 9.x, 10.1, and 11.x.

NOTE:
The PathScale compiler suite is not supported on the RHEL 4 U8 or the SLES 11 distribution.

These compilers can be invoked on the command line by passing options to the wrapper scripts. Command line options override environment variables, if set.

Tables 4-3, 4-4, and 4-5 show the options for each of the compilers. In each case, ..... stands for the remaining options to the wrapper script, the options to the compiler in question, and the names of the files on which it operates.

Table 4-3. Intel

Compiler        Command
C               $ mpicc -cc=icc .....
C++             $ mpicc -CC=icpc .....
Fortran 77      $ mpif77 -fc=ifort .....
Fortran 90/95   $ mpif90 -f90=ifort .....
                $ mpif95 -f95=ifort .....

Table 4-4. Portland Group (PGI)

Compiler        Command
C               mpicc -cc=pgcc .....
C++             mpicc -CC=pgCC .....
Fortran 77      mpif77 -fc=pgf77 .....
Fortran 90/95   mpif90 -f90=pgf90 .....
                mpif95 -f95=pgf95 .....


Table 4-5. PathScale Compiler Suite

Compiler        Command
C               mpicc -cc=pathcc .....
C++             mpicc -CC=pathCC .....
Fortran 77      mpif77 -fc=pathf95 .....
Fortran 90/95   mpif90 -f90=pathf95 .....
                mpif95 -f95=pathf95 .....

NOTE: pathf95 invokes the Fortran 77, Fortran 90, and Fortran 95 compilers.

Also, use mpif77, mpif90, or mpif95 for linking; otherwise, .true. may have the wrong value.

If you are not using the provided scripts for linking, link a sample program using the -show option as a test (without the actual build) to see what libraries to add to your link line. Some examples of using the PGI compilers follow.

For Fortran 90 programs:

$ mpif90 -f90=pgf90 -show pi3f90.f90 -o pi3f90
pgf90 -I/usr/include/mpich/pgi5/x86_64 -c -I/usr/include pi3f90.f90 -c
pgf90 pi3f90.o -o pi3f90 -lmpichf90 -lmpich -lmpichabiglue_pgi5

Fortran 95 programs will be similar to the above.

For C programs:

$ mpicc -cc=pgcc -show cpi.c
pgcc -c cpi.c
pgcc cpi.o -lmpich -lpgftnrtl -lmpichabiglue_pgi5

Compiler and Linker Variables

When you use environment variables (e.g., $MPICH_CC) to select the compiler that mpicc (and the other wrapper scripts) will use, the scripts will also set the matching linker variable (for example, $MPICH_CLINKER), if it is not already set. When both the environment variable and a command line option (e.g., -cc=gcc) are used, the command line option takes precedence.
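For example (a sketch; icc stands in for whichever supported compiler you are using):

$ export MPICH_CC=icc
$ mpicc -o cpi cpi.c

Here the script also sets $MPICH_CLINKER to match, since it was not set explicitly.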

When both the compiler and linker variables are set, and they do not match for the compiler you are using, the MPI program may fail to link; or, if it links, it may not execute correctly. For a sample error message, see "Compiler/Linker Mismatch" on page F-14.


Process Allocation

Normally, MPI jobs are run with each node program (process) being associated with a dedicated QLogic InfiniBand adapter hardware context that is mapped to a CPU.

If the number of node programs is greater than the available number of hardware contexts, software context sharing increases the number of node programs that can be run. Each adapter supports four software contexts per hardware context, so up to four node programs (from the same MPI job) can share that hardware context. There is a small additional overhead for each shared context.

Table 4-6 shows the maximum number of contexts available for each adapter.

Table 4-6. Available Hardware and Software Contexts

Adapter           Available Hardware Contexts    Available Contexts when
                  (same as number of             Software Context Sharing
                  supported CPUs)                is Enabled
QLE7140           4                              16
QLE7240/QLE7280   16                             64
QLE7342/QLE7340   16                             64

The default hardware context/CPU mappings can be changed on the TrueScale DDR and QDR InfiniBand Adapters (QLE72x0 and QLE734x). See "TrueScale Hardware Contexts on the DDR and QDR InfiniBand Adapters" on page 4-12 for more details.

Context sharing is enabled by default. How the system behaves when context sharing is enabled or disabled is described in "Enabling and Disabling Software Context Sharing" on page 4-13.

When running a job in a batch system environment where multiple jobs may be running simultaneously, it is useful to restrict the number of TrueScale contexts that are made available on each node of an MPI job. See "Restricting TrueScale Hardware Contexts in a Batch Environment" on page 4-13.

Errors that may occur with context sharing are covered in "Context Sharing Error Messages" on page 4-14.

There are multiple ways of specifying how processes are allocated. You can use the mpihosts file, the -np and -ppn options with mpirun, and the MPI_NPROCS and PSM_SHAREDCONTEXTS_MAX environment variables. How these are set is covered later in this document.

D000046-005 B 4-11


4–Running <strong>QLogic</strong> MPI on <strong>QLogic</strong> Adapters<br />

<strong>QLogic</strong> MPI Details<br />

TrueScale Hardware Contexts on the DDR and QDR InfiniBand Adapters

On the QLE7240 and QLE7280 DDR adapters, adapter receive resources are statically partitioned across the TrueScale contexts according to the number of TrueScale contexts enabled. The following defaults are automatically set according to the number of online CPUs in the node:

For four or fewer CPUs: 5 (4 + 1 for kernel)
For five to eight CPUs: 9 (8 + 1 for kernel)
For nine or more CPUs: 17 (16 + 1 for kernel)

On the QLE7340 and QLE7342 QDR adapters, adapter receive resources are statically partitioned across the TrueScale contexts according to the number of TrueScale contexts enabled. The following defaults are automatically set according to the number of online CPUs in the node:

For four or fewer CPUs: 6 (4 + 2)
For five to eight CPUs: 10 (8 + 2)
For nine or more CPUs: 18 (16 + 2)

The additional contexts on the QDR adapters support the kernel on each port (one per port).

Performance can be improved in some cases by disabling TrueScale hardware contexts when they are not required, so that the resources can be partitioned more effectively.

To disable this behavior, explicitly configure for the number you want to use with the cfgctxts module parameter in the file /etc/modprobe.conf (or /etc/modprobe.conf.local on SLES). The maximum that can be set is 17 on DDR InfiniBand Adapters and 18 on QDR InfiniBand Adapters.

The driver must be restarted if this default is changed. See "Managing the TrueScale Driver" on page 3-22.
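For example, to fix the number of contexts explicitly (a sketch; the value 16 is illustrative and must not exceed the maximums given above):

options ib_qib cfgctxts=16

Add the line to /etc/modprobe.conf (or /etc/modprobe.conf.local on SLES), then restart the driver.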


NOTE:
In rare cases, setting contexts automatically on DDR and QDR InfiniBand Adapters can lead to sub-optimal performance where one or more TrueScale hardware contexts have been disabled and a job is run that requires software context sharing. Since the algorithm ensures that there is at least one TrueScale context per online CPU, this case occurs only if the CPUs are over-subscribed with processes (which is not normally recommended). In this case, it is best to override the default to use as many TrueScale contexts as are available, which minimizes the amount of software context sharing required.

Enabling and Disabling Software Context Sharing

By default, context sharing is enabled; it can also be specifically disabled.

Context Sharing Enabled: The MPI library provides PSM with the local process layout so that TrueScale contexts available on each node can be shared if necessary; for example, when running more node programs than contexts. All PSM jobs assume that they can make use of all available TrueScale contexts to satisfy the job requirement and try to give a context to each process.

When context sharing is enabled on a system with multiple QLogic adapter (TrueScale) boards (units) and the IPATH_UNIT environment variable is set, the number of TrueScale contexts made available to MPI jobs is restricted to the number of contexts available on that unit. When multiple TrueScale devices are present, this restricts the use to a specific TrueScale unit. By default, all configured units are used in round-robin order.

Context Sharing Disabled: Each node program tries to obtain exclusive access to a TrueScale hardware context. If no hardware contexts are available, the job aborts.

To explicitly disable context sharing, set this environment variable in one of the two following ways:

PSM_SHAREDCONTEXTS=0
PSM_SHAREDCONTEXTS=NO

The default value of PSM_SHAREDCONTEXTS is 1 (enabled).

Restricting TrueScale Hardware Contexts in a Batch Environment

If required for resource sharing between multiple jobs in batch systems, you can restrict the number of TrueScale hardware contexts that are made available on each node of an MPI job by setting that number in the PSM_SHAREDCONTEXTS_MAX environment variable.

D000046-005 B 4-13


4–Running <strong>QLogic</strong> MPI on <strong>QLogic</strong> Adapters<br />

<strong>QLogic</strong> MPI Details<br />

For example, if you are running two different jobs on nodes using the QLE7140, set PSM_SHAREDCONTEXTS_MAX to 2 instead of the default 4. Each job would then have at most two of the four available hardware contexts. Both of the jobs that want to share a node would have to set PSM_SHAREDCONTEXTS_MAX=2 on that node before sharing begins.
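A minimal sketch of what each such job's launch might look like (bash; ./your_app is an illustrative name, and PSM_ variables are propagated to the node programs by mpirun):

$ export PSM_SHAREDCONTEXTS_MAX=2
$ mpirun -np 2 -m mpihosts ./your_app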

However, note that setting PSM_SHAREDCONTEXTS_MAX=2 as a clusterwide default would unnecessarily penalize nodes that are dedicated to running single jobs. Therefore, a per-node setting, or some level of coordination with the job scheduler in setting the environment variable, is recommended.

If some nodes have more cores than others, then the setting must be adjusted appropriately for the number of cores on each node.

Additionally, you can explicitly configure the number of contexts you want to use with the cfgctxts module parameter. This will override the default settings (on the QLE7240 and QLE7280) based on the number of CPUs present on each node. See "TrueScale Hardware Contexts on the DDR and QDR InfiniBand Adapters" on page 4-12.

Context Sharing Error Messages

The error message when the context limit is exceeded is:

No free InfiniPath contexts available on /dev/ipath

This message appears when the application starts.

Error messages related to contexts may also be generated by ipath_checkout or mpirun. For example:

PSM found 0 available contexts on InfiniPath device

The most likely cause is that the cluster has processes using all the available PSM contexts. Clean up these processes before restarting the job.

Running in Shared Memory Mode

QLogic MPI supports running exclusively in shared memory mode; no QLogic adapter is required for this mode of operation. This mode is used for running applications on a single node rather than on a cluster of nodes.

To enable shared memory mode, use either a single node in the mpihosts file or use these options with mpirun:

$ mpirun -np=<N> -ppn=<N>

<N> needs to be equal in both cases.

NOTE:
For this release, <N> must be ≤ 64.
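For example, to run eight ranks entirely in shared memory on the local node (a sketch; ./your_app is an illustrative name):

$ mpirun -np=8 -ppn=8 ./your_app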


When you are using a non-QLogic MPI that uses the TrueScale PSM layer, ensure that the parallel job is contained on a single node and set the PSM_DEVICES environment variable:

PSM_DEVICES="shm,self"

If you are using QLogic MPI, you do not need to set this environment variable; it is set automatically if np == ppn.

When running on a single node with QLogic MPI, no InfiniBand adapter hardware is required if -disable-dev-check is passed to mpirun.

mpihosts File Details

As noted in "Create the mpihosts File" on page 4-3, an mpihosts file (also called a machines file, nodefile, or hostsfile) has been created in your current working directory. This file names the nodes on which the node programs may run.

The two supported formats for the mpihosts file are:

hostname1
hostname2
...

or

hostname1:process_count
hostname2:process_count
...

In the first format, if the -np count (number of processes to spawn in the mpirun command) is greater than the number of lines in the machine file, the hostnames will be repeated (in order) as many times as necessary for the requested number of node programs.

In the second format, process_count can be different for each host, and is normally the number of available processors on the node. When not specified, the default value is one. The value of process_count determines how many node programs will be started on that host before using the next entry in the mpihosts file. When the full mpihosts file is processed, and there are additional processes requested, processing starts again at the start of the file.

NOTE:
To create an mpihosts file, use the ibhosts program. It will generate a list of available nodes that are already connected to the switch.

There are several alternative ways of specifying the mpihosts file:

• As noted in "Compile and Run an Example C Program" on page 4-4, you can use the command line option -m:

  $ mpirun -np n -m mpihosts [other options] program-name

  In this case, if the named file cannot be opened, the MPI job fails.

  An alternate mechanism to -m for specifying hosts is -H or -hosts, followed by a host list. The host list can follow one of these examples:

  host-[01-02,04,06-08], or
  host-01,host-02,host-04,host-06,host-07,host-08

• When neither the -m nor the -H option is used, mpirun checks the environment variable MPIHOSTS for the name of the MPI hosts file. If this variable is defined and the file it names cannot be opened, the MPI job fails.

• In the absence of the -m option, the -H option, and the MPIHOSTS environment variable, mpirun uses the file ./mpihosts, if it exists.

• If none of these four methods of specifying the hosts file are used, mpirun looks for the file ~/.mpihosts.

If you are working in the context of a batch queuing system, it may provide a job submission script that generates an appropriate mpihosts file.

Using mpirun

The script mpirun is a front end program that starts a parallel MPI job on a set of nodes in a TrueScale cluster. mpirun may be run on any i386 or x86_64 machine inside or outside the cluster, as long as it is on a supported Linux distribution and has TCP connectivity to all TrueScale cluster machines to be used in a job.

The script starts, monitors, and terminates the node programs. mpirun uses ssh (secure shell) to log in to individual cluster machines, and prints any messages that the node program prints on stdout or stderr on the terminal where mpirun is invoked.


NOTE:
The mpi-frontend-* RPM needs to be installed on all nodes that will be using mpirun. Alternatively, you can use the mpirun option -distributed=off, which requires that only the mpi-frontend RPM is installed on the node where mpirun is invoked. Using -distributed=off can have a negative impact on mpirun's performance when running large-scale jobs. More specifically, this option increases the memory usage on the host where mpirun is started and will slow down the job startup, since it will spawn MPI processes serially.

The general syntax is:

$ mpirun [mpirun_options...] program-name [program options]

program-name is usually the pathname to the executable MPI program. When the MPI program resides in the current directory and the current directory is not in your search path, then program-name must begin with './', for example:

./program-name

Unless you want to run only one instance of the program, use the -np option, for example:

$ mpirun -np n [other options] program-name

This option spawns n instances of program-name. These instances are called node programs.

Generally, mpirun tries to distribute the specified number of processes evenly among the nodes listed in the mpihosts file. However, if the number of processes exceeds the number of nodes listed in the mpihosts file, then some nodes will be assigned more than one instance of the program.

Another command line option, -ppn, instructs mpirun to assign a fixed number p of node programs (processes) to each node, as it distributes n instances among the nodes:

$ mpirun -np n -m mpihosts -ppn p program-name

This option overrides the :process_count specifications, if any, in the lines of the mpihosts file. As a general rule, mpirun distributes the n node programs among the nodes without exceeding, on any node, the maximum number of instances specified by the :process_count option. The value of the :process_count option is specified by either the -ppn command line option or in the mpihosts file.


NOTE:
When the -np value is larger than the number of nodes in the mpihosts file times the -ppn value, mpirun cycles back through the hostsfile, assigning additional node programs per host.

Typically, the number of node programs should not be larger than the number of processor cores, at least not for compute-bound programs.

The -np option specifies the number of processes to spawn. If this option is not set, then the environment variable MPI_NPROCS is checked. If MPI_NPROCS is not set, the default is to determine the number of processes based on the number of hosts in the machinefile -M or the list of hosts -H.

-ppn processes-per-node

This option creates up to the specified number of processes per node.

Each node program is started as a process on one node. While a node program may fork child processes, the children themselves must not call MPI functions.

The -distributed=on|off option has been added to mpirun. This option reduces overhead by enabling mpirun to start processes in parallel on multiple nodes. Initially, mpirun spawns one mpirun child per node from the root node, each of which in turn spawns the number of local processes for that particular node. Control the use of the distributed mpirun job-spawning mechanism with this option:

-distributed [=on|off]

The default is on. To change the default, put this option in the global mpirun.defaults file or a user-local file. See "Environment for Node Programs" on page 4-19 and "Environment Variables" on page 4-20 for details.

mpirun monitors the parallel MPI job, terminating when all the node programs in that job exit normally, or if any of them terminates abnormally.

Killing the mpirun program kills all the processes in the job. Use CTRL+C to kill mpirun.

Console I/O in MPI Programs

mpirun sends any output printed to stdout or stderr by any node program to the terminal. This output is line-buffered, so the lines output from the various node programs will be non-deterministically interleaved on the terminal. Using the -l option to mpirun will label each line with the rank of the node program where it was produced.
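For example, to label each line of output with the producing rank (the process count and mpihosts file are placeholders):

$ mpirun -np 4 -m mpihosts -l program-name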

Node programs do not normally use interactive input on stdin, and by default, stdin is bound to /dev/null. However, for applications that require standard input redirection, QLogic MPI supports two mechanisms to redirect stdin:


• When mpirun is run from the same node as MPI rank 0, all input piped to the mpirun command is redirected to rank 0.

• When mpirun is not run from the same node as MPI rank 0, or if the input must be redirected to all or specific MPI processes, the -stdin option can redirect a file as standard input to all nodes (or to a particular node) as specified by the -stdin-target option.
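A sketch of the second mechanism, assuming -stdin-target takes the destination rank (input.dat is a placeholder file):

$ mpirun -np 4 -m mpihosts -stdin input.dat -stdin-target 1 program-name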

Environment for Node Programs

TrueScale-related environment variables are propagated to node programs. These include environment variables that begin with the prefix IPATH_, PSM_, MPI_ or LD_. Some other variables (such as HOME) are set or propagated by ssh(1).

NOTE:
The environment variable LD_BIND_NOW is not supported for QLogic MPI programs. Not all symbols referenced in the shared libraries can be resolved on all installations. (They provide a variety of compatible behaviors for different compilers, etc.) Therefore, the libraries are built to run in lazy binding mode; the dynamic linker evaluates and binds to symbols only when needed by the application in a given runtime environment.

mpirun checks for these environment variables in the shell where it is invoked, and then propagates them correctly. The environment on each node is whatever it would be for the user's login via ssh, unless you are using a Multi-Purpose Daemon (MPD) (see "MPD" on page 4-24).

Environment variables are evaluated in descending order of precedence, as follows:

1. Set in the default shell environment on a remote node (for example, ~/.bashrc or equivalents).
2. Set in the -rcfile.
3. Set in the current shell environment for the mpirun command.
4. If none of the previous has been set, the default value of the environment variable is used.

As noted in the above list, using an mpirunrc file overrides any environment variables already set by the user. You can set environment variables for the node programs with the -rcfile option of mpirun with the following command:

$ mpirun -np n -m mpihosts -rcfile mpirunrc program_name

In the absence of this option, mpirun checks to see if a file called $HOME/.mpirunrc exists in the user's home directory. In either case, the file is sourced by the shell on each node when the node program starts.
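A minimal .mpirunrc might set only non-interactive environment variables, for example (paths are placeholders; an sh-style shell is assumed):

export LD_LIBRARY_PATH=/opt/mylibs/lib:$LD_LIBRARY_PATH
export OMP_NUM_THREADS=4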


The .mpirunrc file cannot contain any interactive commands. It can contain commands that output on stdout or stderr.

There is a global options file for mpirun arguments. The default location of this file is:

/opt/infinipath/etc/mpirun.defaults

You can use an alternate file by setting the environment variable PSC_MPIRUN_DEFAULTS_PATH. See the mpirun man page for more information.

Environment Variables

Table 4-7 contains a summary of the environment variables that are used by TrueScale and mpirun.

Table 4-7. Environment Variables

MPICH_ROOT
    This variable is used by mpirun to find the mpirun-ipath-ssh executable, set up LD_LIBRARY_PATH, and set up a prefix for all InfiniPath pathnames. This variable is used by the --prefix argument (or is the same as --prefix) if installing TrueScale RPMs in an alternate location.
    Default: Unset

IPATH_PORT
    Specifies the port to use for the job, 1 or 2. Specifying 0 autoselects the port.
    Default: Unset

IPATH_SL
    Service Level for QDR adapters; used to work with the switch's vFabric feature.
    Default: Unset

IPATH_UNIT
    This variable is for context sharing. When multiple TrueScale devices are present, this variable restricts the use to a specific TrueScale unit. By default, all configured units are used in round-robin order.
    Default: Unset

LD_LIBRARY_PATH
    This variable specifies the path to the run-time library. It is often set in the .mpirunrc file.
    Default: Unset


MPICH_CC
    This variable selects the compiler to use for mpicc, and others.

MPICH_CCC
    This variable selects the compiler to use for mpicxx, and others.

MPICH_F90
    This variable selects the compiler to use for mpif90, and others.

MPIHOSTS
    This variable sets the name of the machines (mpihosts) file.
    Default: Unset

MPI_NPROCS
    This variable specifies the number of MPI processes to spawn.
    Default: Unset

MPI_SHELL
    Specifies the name of the program to log into remote hosts.
    Default: ssh, unless MPI_SHELL is defined.

PSM_DEVICES
    Non-QLogic MPI users can set this variable to enable running in shared memory mode on a single node. This variable is automatically set for QLogic MPI.
    Default: PSM_DEVICES="self,ipath"

PSC_MPIRUN_DEFAULTS_PATH
    This variable sets the path to a user-local mpirun defaults file.
    Default: /opt/infinipath/etc/mpirun.defaults

PSM_SHAREDCONTEXTS
    This variable overrides automatic context sharing behavior. YES is equivalent to 1 (see Default).
    Default: PSM_SHAREDCONTEXTS=1

PSM_SHAREDCONTEXTS_MAX
    This variable restricts the number of TrueScale contexts that are made available on each node of an MPI job.
    Default: PSM_SHAREDCONTEXTS_MAX=4 on the QLE7140; up to 16 on the QLE7240 and QLE7280 (set automatically based on the number of CPUs on the node)

IPATH_HCA_SELECTION_ALG
    Specifies the Host Channel Adapter/port selection algorithm. The default, Round Robin, allocates Host Channel Adapters in round-robin fashion. The older mechanism, Packed, fills all contexts on one Host Channel Adapter before allocating from the next. For example, with two single-port Host Channel Adapters, the default setting (IPATH_HCA_SELECTION_ALG="Round Robin") allows two or more MPI processes per node to use both Host Channel Adapters, achieving performance improvements over what can be achieved with one Host Channel Adapter.

Running Multiple Versions of TrueScale or MPI

The variable MPICH_ROOT sets a root prefix for all InfiniPath-related paths. It is used by mpirun to find the mpirun-ipath-ssh executable, and it also sets up the LD_LIBRARY_PATH for new programs. Consequently, multiple versions of the TrueScale software releases can be installed on some or all nodes, and QLogic MPI and other versions of MPI can be installed at the same time. MPICH_ROOT may be set in the environment, in mpirun.defaults, or in an rcfile (such as .mpirunrc, .bashrc, or .cshrc) that will be invoked on remote nodes.

If you have installed the software into an alternate location using the --prefix option with rpm, --prefix would have been set to $MPICH_ROOT.

If MPICH_ROOT is not set, the normal PATH is used unless mpirun is invoked with a full pathname.
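For example, to run against an alternate installation (the prefix path shown is a placeholder):

$ export MPICH_ROOT=/opt/infinipath-alt
$ mpirun -np 4 -m mpihosts program-name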

NOTE:
mpirun-ssh was renamed mpirun-ipath-ssh to avoid name conflicts with other MPI implementations.


Job Blocking in Case of Temporary InfiniBand Link Failures

By default, as controlled by mpirun's quiescence parameter -q, an MPI job is killed for quiescence in the event of an InfiniBand link failure (or unplugged cable). This quiescence timeout occurs under one of the following conditions:

• A remote rank's process cannot reply to out-of-band process checks.
• MPI is inactive on the InfiniBand link for more than 15 minutes.

To keep remote process checks but disable triggering quiescence for temporary InfiniBand link failures, use the -disable-mpi-progress-check option with a nonzero -q option. To disable quiescence triggering altogether, use -q 0. No matter how these options are used, link failures (temporary or other) are always logged to syslog.

If the link is down when the job starts and you want the job to continue blocking until the link comes up, use the -t -1 option.
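For example, to keep the remote process checks but ride out transient link failures (the -q value shown is illustrative; see the mpirun man page for its exact units and default):

$ mpirun -np 4 -m mpihosts -q 15 -disable-mpi-progress-check program-name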

Performance Tuning

These methods may be used at runtime. Performance settings that are typically set by the system administrator are listed in "Performance Settings and Management Tips" on page 3-25.

CPU Affinity

InfiniPath attempts to run each node program with CPU affinity set to a separate logical processor, up to the number of available logical processors. If CPU affinity is already set (with sched_setaffinity() or with the taskset utility), then InfiniPath will not change the setting.

Use the taskset utility with mpirun to specify the mapping of MPI processes to logical processors. This combination makes the best use of available memory bandwidth or cache locality when running on dual-core Symmetric MultiProcessing (SMP) cluster nodes.

The following example uses the NASA Advanced Supercomputing (NAS) Parallel Benchmark's Multi-Grid (MG) benchmark and the -c option to taskset.

$ mpirun -np 4 -ppn 2 -m $hosts taskset -c 0,2 bin/mg.B.4
$ mpirun -np 4 -ppn 2 -m $hosts taskset -c 1,3 bin/mg.B.4

The first command forces the programs to run on CPUs (or cores) 0 and 2. The second command forces the programs to run on CPUs 1 and 3. See the taskset man page for more information on usage.

To turn off CPU affinity, set the environment variable IPATH_NO_CPUAFFINITY. This environment variable is propagated to node programs by mpirun.
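For example (the value 1 is illustrative; the text above only requires that the variable be set):

$ IPATH_NO_CPUAFFINITY=1 mpirun -np 4 -m mpihosts program-name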

mpirun Tunable Options

There are some mpirun options that can be adjusted to optimize communication. The most important one is:

-long-len, -L [default: 64000]

This option determines the message length at which the rendezvous protocol (rather than the eager protocol) is used. The default value for -L was chosen for optimal unidirectional communication; applications with this kind of traffic pattern benefit from the higher default value. Other values for -L are appropriate for different communication patterns and data sizes. For example, applications with bidirectional traffic patterns may benefit from using a lower value. Experimentation is recommended.
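For example, to experiment with a lower rendezvous threshold for a bidirectional workload (the value 32000 is purely illustrative):

$ mpirun -np 4 -m mpihosts -L 32000 program-name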

Two other useful options are:

-long-len-shmem, -s [default: 16000]

This option determines the message length at which the rendezvous protocol (rather than the eager protocol) is used for intra-node communications, that is, messages going through shared memory. The InfiniPath rendezvous messaging protocol uses a two-way handshake (with MPI synchronous send semantics) and receive-side DMA.

-rndv-window-size, -W [default: 262144]

When sending a large message using the rendezvous protocol, QLogic MPI splits it into a number of fragments at the source and recombines them at the destination. Each fragment is sent as a single rendezvous stage. This option specifies the maximum length of each fragment. The default is 262144 bytes.

For more information on tunable options, type:

$ mpirun -h

The complete list of options is contained in Appendix A.

MPD

The Multi-Purpose Daemon (MPD) is an alternative to mpirun for launching MPI jobs. It is described briefly in the following sections.

MPD was developed by Argonne National Laboratory (ANL) as part of the MPICH-2 system. While the ANL MPD had some advantages over the use of their mpirun (faster launching, better cleanup after crashes, better tolerance of node failures), the QLogic mpirun offers the same advantages.

The disadvantage of MPD is reduced security, since it does not use ssh to launch node programs. It is also more complex to use than mpirun because it requires starting a ring of MPD daemons on the nodes. Therefore, QLogic recommends using the normal mpirun mechanism for starting jobs, as described previously in this chapter. However, if you want to use MPD, it is included in the InfiniPath software.

Using MPD

To start an MPD environment, use the mpdboot program. You must provide mpdboot with a file that lists the machines that will run the mpd daemon. The format of this file is the same as for the mpihosts file in the mpirun command.

Here is an example of how to run mpdboot:

$ mpdboot -f hostsfile

After mpdboot has started the MPD daemons, it will print a status message and drop into a new shell.

To leave the MPD environment, exit from this shell. This terminates the daemons.

To use rsh instead of ssh with mpdboot, set the environment variable MPD_RSH to the pathname of the desired remote shell. For example:

$ MPD_RSH=`which rsh` mpdboot -n 16 -f hosts

To run an MPI program from within the MPD environment, use the mpirun command. You do not need to provide an mpihosts file or a count of CPUs; by default, mpirun uses all nodes and CPUs available within the MPD environment.

To check the status of the MPD daemons, use the mpdping command.
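Putting these steps together, a typical MPD session might look like the following (hostsfile and the process count are placeholders):

$ mpdboot -f hostsfile       # starts the daemons and drops into a new shell
$ mpdping                    # checks that the daemons respond
$ mpirun -np 8 mpi_app_name  # uses all nodes and CPUs in the MPD environment
$ exit                       # leaving the shell terminates the daemons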

NOTE:
To use MPD, the software package mpi-frontend-*.rpm and python (available with your distribution) must be installed on all nodes. See the QLogic Fabric Software Installation Guide for more details on software installation.

QLogic MPI and Hybrid MPI/OpenMP Applications

QLogic MPI supports hybrid MPI/OpenMP applications, provided that MPI routines are called only by the master OpenMP thread. This is called the funneled thread model. Instead of MPI_Init/MPI_INIT (for C/C++ and Fortran, respectively), the program can call MPI_Init_thread/MPI_INIT_THREAD to determine the level of thread support, and the value MPI_THREAD_FUNNELED will be returned.


To use this feature, the application must be compiled with both OpenMP and MPI code enabled. To do this, use the -mp flag on the mpicc compile line.

As mentioned previously, MPI routines can be called only by the master OpenMP thread. The hybrid executable is executed as usual using mpirun, but typically only one MPI process is run per node and the OpenMP library creates additional threads to utilize all CPUs on that node. If there are sufficient CPUs on a node, you may want to run multiple MPI processes and multiple OpenMP threads per node.

The number of OpenMP threads is typically controlled by the OMP_NUM_THREADS environment variable in the .mpirunrc file. (OMP_NUM_THREADS is used by other compilers' OpenMP products, but is not a QLogic MPI environment variable.) Use this variable to adjust the split between MPI processes and OpenMP threads. Usually, the number of MPI processes (per node) times the number of OpenMP threads is set to match the number of CPUs per node. An example case would be a node with four CPUs, running one MPI process and four OpenMP threads. In this case, OMP_NUM_THREADS is set to four. OMP_NUM_THREADS applies on a per-node basis.
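A sketch of the four-CPU case described above (two nodes, one MPI process per node, four OpenMP threads each; hybrid_app is a placeholder, and an sh-style shell is assumed for the rc file):

# in ~/.mpirunrc, sourced on each node:
export OMP_NUM_THREADS=4

$ mpicc -mp hybrid_app.c -o hybrid_app
$ mpirun -np 2 -m mpihosts -ppn 1 hybrid_app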

See "Environment for Node Programs" on page 4-19 for information on setting environment variables.

At the time of publication, the MPI_THREAD_SERIALIZED and MPI_THREAD_MULTIPLE models are not supported.

NOTE:
When there are more threads than CPUs, both MPI and OpenMP performance can be significantly degraded due to over-subscription of the CPUs.

Debugging MPI Programs

Debugging parallel programs is substantially more difficult than debugging serial programs. Thoroughly debugging the serial parts of your code before parallelizing is good programming practice.

MPI Errors

Almost all MPI routines (except MPI_Wtime and MPI_Wtick) return an error code, either as the function return value in C functions or as the last argument in a Fortran subroutine call. Before the value is returned, the current MPI error handler is called. By default, this error handler aborts the MPI job. Therefore, you can get information about MPI exceptions in your code by providing your own handler for MPI_ERRORS_RETURN. See the MPI_Errhandler_set man page for details.

NOTE:
MPI does not guarantee that an MPI program can continue past an error.

See the standard MPI documentation referenced in Appendix J for details on the MPI error codes.

Using Debuggers

The InfiniPath software supports the use of multiple debuggers, including pathdb, gdb, and the system call tracing utility strace. These debuggers let you set breakpoints in a running program, and examine and set its variables.

Symbolic debugging is easier than machine language debugging. To enable symbolic debugging, you must have compiled with the -g option to mpicc so that the compiler includes symbol tables in the compiled object code.

To run your MPI program with a debugger, use the -debug or -debug-no-pause and -debugger options for mpirun. See the man pages for pathdb, gdb, and strace for details. When running under a debugger, you get an xterm window on the front-end machine for each node process, so you can control the different node processes as desired.

To use strace with your MPI program, the syntax is:

$ mpirun -np n -m mpihosts strace program-name

The following features of QLogic MPI facilitate debugging:

• Stack backtraces are provided for programs that crash.

• The -debug and -debug-no-pause options are provided for mpirun. These options make each node program start with debugging enabled. The -debug option allows you to set breakpoints and start running programs individually. The -debug-no-pause option allows postmortem inspection. Be sure to set -q 0 when using -debug (see the example after this list).

• Communication between mpirun and node programs can be printed by specifying the mpirun -verbose option.

• MPI implementation debug messages can be printed by specifying the mpirun -psc-debug-level option. This option can substantially impact the performance of the node program.

• Support is provided for progress timeout specifications, deadlock detection, and generating information about where a program is stuck.


• Several misconfigurations (such as mixed use of 32-bit/64-bit executables) are detected by the runtime.

• A formatted list containing information useful for high-level MPI application profiling is provided by using the -print-stats option with mpirun. Statistics include minimum, maximum, and median values for message transmission protocols, as well as more detailed information for expected and unexpected message reception. See "MPI Stats" on page F-30 for more information and a sample output listing.
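For example, to start each node program under the debugger with quiescence detection disabled, as recommended above (the process count and mpihosts file are placeholders):

$ mpirun -np 4 -m mpihosts -debug -q 0 program-name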

NOTE:
The TotalView® debugger can be used with the Open MPI supplied in this release. Consult the TotalView documentation for more information.

QLogic MPI Limitations

The current version of QLogic MPI has the following limitations:

• There are no C++ bindings to MPI; use the extern C MPI function calls.

• In MPI-IO file I/O calls in the Fortran binding, offset or displacement arguments are limited to 32 bits. Thus, for example, the second argument of MPI_File_seek must be between -2^31 and 2^31-1, and the argument to MPI_File_read_at must be between 0 and 2^32-1.

5 Using Other MPIs

Introduction

This section provides information on using other MPI implementations. Support for multiple high-performance MPI implementations has been added. Most implementations run over both PSM and OpenFabrics Verbs (see Table 5-1).

Table 5-1. Other Supported MPI Implementations

Open MPI 1.5
    Runs over: PSM, Verbs
    Compiled with: GCC, Intel, PGI, PathScale
    Comments: Provides some MPI-2 functionality (one-sided operations and dynamic processes). Available as part of the QLogic download. Can be managed by mpi-selector.

MVAPICH version 1.2
    Runs over: PSM, Verbs
    Compiled with: GCC, Intel, PGI, PathScale
    Comments: Provides MPI-1 functionality. Available as part of the QLogic download. Can be managed by mpi-selector.

MVAPICH2 version 1.4
    Runs over: PSM, Verbs
    Compiled with: GCC, Intel, PGI, PathScale
    Comments: Provides MPI-2 functionality. Can be managed by mpi-selector.

Platform MPI 7 and HP-MPI 2.3
    Runs over: PSM, Verbs
    Compiled with: GCC (default)
    Comments: Provides some MPI-2 functionality (one-sided operations). Available for purchase from HP.


Platform (Scali) 5.6
    Runs over: PSM, Verbs
    Compiled with: GCC (default)
    Comments: Provides MPI-1 functionality. Available for purchase from Platform.

Intel MPI version 4.0
    Runs over: TMI/PSM, uDAPL
    Compiled with: GCC (default)
    Comments: Provides MPI-1 and MPI-2 functionality. Available for purchase from Intel.

Table Notes: MVAPICH and Open MPI have been compiled for PSM to support the following versions of the compilers:

• (GNU) gcc 4.1.0
• (PathScale) pathcc 3.2
• (PGI) pgcc 9.0
• (Intel) icc 11.1

These MPI implementations run on multiple interconnects, and have their own mechanisms for selecting the interconnect on which they run. Basic information about using these MPIs is provided in this section. However, for more detailed information, see the documentation provided with the version of MPI that you want to use.

Installed Layout

By default, the MVAPICH and Open MPI MPIs are installed in this directory tree:

/usr/mpi/<compiler>/<mpi>-<mpi_version>

The QLogic-supplied MPIs precompiled with the GCC, PathScale, PGI, and Intel compilers will also have -qlc appended after <mpi_version>. For example:

/usr/mpi/gcc/openmpi-1.5-qlc

If a prefixed installation location is used, /usr is replaced by $prefix.

The following examples assume that the default path for each MPI implementation to mpirun is:

/usr/mpi/<compiler>/<mpi>/bin/mpirun

Again, /usr may be replaced by $prefix. This path is sometimes referred to as $mpi_home/bin/mpirun in the following sections.

See the documentation for HP-MPI, Intel MPI, and Platform MPI for their default installation directories.

Open MPI

Open MPI is an open source MPI-2 implementation from the Open MPI Project. Pre-compiled versions of Open MPI version 1.5 that run over PSM and are built with the GCC, PGI, PathScale, and Intel compilers are available with the QLogic download.

Open MPI that runs over Verbs and is pre-compiled with the GNU compiler is also available.

Open MPI can be managed with the mpi-selector utility, as described in "Managing Open MPI, MVAPICH, and QLogic MPI with the mpi-selector Utility" on page 5-6.

Installation

Follow the instructions in the QLogic Fabric Software Installation Guide for installing Open MPI. Newer versions than the one supplied with this release can be installed after QLogic OFED 1.4.2 has already been installed; these may be downloaded from the Open MPI web site. Note that versions released after the QLogic OFED 1.4.2 release will not be supported.

Setup

If you use the mpi-selector tool, the necessary setup is done for you. If you do not use this tool, put your Open MPI installation directory in the PATH by adding <mpi_home>/bin to the PATH, where <mpi_home> is the directory path where the desired MPI was installed.

Compiling Open MPI Applications

As with QLogic MPI, QLogic recommends that you use the included wrapper scripts that invoke the underlying compiler (see Table 5-2).

Table 5-2. Open MPI Wrapper Scripts

Wrapper Script Name         Language
mpicc                       C
mpiCC, mpicxx, or mpic++    C++
mpif77                      Fortran 77
mpif90                      Fortran 90

To compile your program in C, type:

$ mpicc mpi_app_name.c -o mpi_app_name

Running Open MPI Applications

By default, Open MPI shipped with the InfiniPath software stack will run over PSM once it is installed.

Here is an example of a simple mpirun command running with four processes:

$ mpirun -np 4 -machinefile mpihosts mpi_app_name

To specify the PSM transport explicitly, add --mca mtl psm to the above command line.

To run over InfiniBand Verbs instead, use this mpirun command line:

$ mpirun -np 4 -machinefile mpihosts --mca btl sm --mca btl openib,self --mca mtl ^psm mpi_app_name

The following option enables shared memory:

--mca btl sm

The following option enables the openib transport and communication to self:

--mca btl openib,self

The following option disables the PSM transport:

--mca mtl ^psm

In these options, btl stands for byte transport layer and mtl for matching transport layer. PSM transport works in terms of MPI messages; OpenIB transport works in terms of byte streams.

Alternatively, you can use Open MPI with a sockets transport running over IPoIB, for example:

$ mpirun -np 4 -machinefile mpihosts --mca btl sm --mca btl tcp,self --mca btl_tcp_if_exclude eth0 --mca btl_tcp_if_include ib0 --mca mtl ^psm mpi_app_name

Note that eth0 and psm are excluded, while ib0 is included. These instructions may need to be adjusted for your interface names.

Note that in Open MPI, machinefile is also known as the hostfile.


Further Information on Open MPI

For more information about Open MPI, see:

http://www.open-mpi.org/
http://www.open-mpi.org/faq

MVAPICH

Pre-compiled versions of MVAPICH 1.2 built with the GNU, PGI, PathScale, and Intel compilers, and that run over PSM, are available with the QLogic download.

MVAPICH that runs over Verbs and is pre-compiled with the GNU compiler is also available.

MVAPICH can be managed with the mpi-selector utility, as described in "Managing Open MPI, MVAPICH, and QLogic MPI with the mpi-selector Utility" on page 5-6.

Installation

To install MVAPICH, follow the instructions in the appropriate installation guide. Newer versions than the one supplied with this release can be installed after QLogic OFED 1.4.2 has already been installed; these may be downloaded from the MVAPICH web site. Note that versions released after the QLogic OFED 1.4.2 release will not be supported.

Setup

To launch MPI jobs, the MVAPICH installation directory must be included in PATH and LD_LIBRARY_PATH.

When using sh for launching MPI jobs, run the command:

$ source /usr/mpi/<compiler>/<mpi>/bin/mpivars.sh

When using csh for launching MPI jobs, run the command:

$ source /usr/mpi/<compiler>/<mpi>/bin/mpivars.csh

Compiling MVAPICH Applications

As with QLogic MPI, QLogic recommends that you use the included wrapper scripts that invoke the underlying compiler (see Table 5-3).

Table 5-3. MVAPICH Wrapper Scripts

Wrapper Script Name    Language
mpicc                  C
mpiCC, mpicxx          C++
mpif77                 Fortran 77
mpif90                 Fortran 90

To compile your program in C, type:

$ mpicc mpi_app_name.c -o mpi_app_name

To check the default configuration for the installation, check the following file:

/usr/mpi/<compiler>/<mpi>/etc/mvapich.conf

Running MVAPICH Applications

By default, the MVAPICH shipped with the InfiniPath software stack runs over PSM once it is installed.

Here is an example of a simple mpirun command running with four processes:

$ mpirun -np 4 -hostfile mpihosts mpi_app_name

Password-less ssh is used unless the -rsh option is added to the command line above.

Further Information on MVAPICH

For more information about MVAPICH, see:

http://mvapich.cse.ohio-state.edu/

Managing Open MPI, MVAPICH, and QLogic MPI with the mpi-selector Utility

When multiple MPI implementations have been installed on the cluster, you can use the mpi-selector to switch between them. The MPIs that can be managed with the mpi-selector are:

• Open MPI
• MVAPICH
• MVAPICH2
• QLogic MPI

The mpi-selector is an OFED utility that is installed as a part of QLogic OFED 1.4.2. Its basic functions include:

• Listing available MPI implementations
• Setting a default MPI to use (per user or site wide)
• Unsetting a default MPI to use (per user or site wide)
• Querying the current default MPI in use

Following is an example of listing and selecting an MPI:

$ mpi-selector --list
mpi-1.2.3
mpi-3.4.5
$ mpi-selector --set mpi-3.4.5

The new default takes effect in the next shell that is started. See the mpi-selector man page for more information.
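To inspect or clear the default afterward, the utility's query and unset functions can be used; the flags below are a sketch, so confirm them against the mpi-selector man page:

$ mpi-selector --query
$ mpi-selector --unset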

For QLogic MPI inter-operation with the mpi-selector utility, you must install all QLogic MPI RPMs using a prefixed installation. Once the $prefix for QLogic MPI has been determined, install qlogic-mpi-register with the same $prefix; this registers QLogic MPI with the mpi-selector utility and shows QLogic MPI as an available MPI implementation with the four different compilers. See the QLogic Fabric Software Installation Guide for information on prefixed installations.

The example shell scripts mpivars.sh and mpivars.csh, for registering with mpi-selector, are provided as part of the mpi-devel RPM in the $prefix/share/mpich/mpi-selector-{intel,gnu,pathscale,pgi} directories.

For all non-GNU compilers that are installed outside standard Linux search paths, set up the paths so that compiler binaries and runtime libraries can be resolved. For example, set LD_LIBRARY_PATH, both in your local environment and in an rc file (such as .mpirunrc, .bashrc, or .cshrc) that is invoked on remote nodes. See "Environment for Node Programs" on page 4-19 and "Compiler and Linker Variables" on page 4-10 for information on setting up the environment, and "Specifying the Run-time Library Path" on page F-15 for information on setting the run-time library path. Also see "Run Time Errors with Different MPI Implementations" on page F-17 for information on run time errors that may occur if there are MPI version mismatches.

NOTE:
The Intel-compiled versions require that the Intel compiler be installed and that paths to the Intel compiler runtime libraries be resolvable from the user's environment. The version used is Intel 10.1.012.

HP-MPI and Platform MPI 7

Platform Computing acquired HP-MPI from HP. Platform MPI 7 (formerly HP-MPI) is a high-performance, production-quality implementation of the Message Passing Interface (MPI), with full MPI-2 functionality. HP-MPI/Platform MPI 7 is distributed by over 30 commercial software vendors, so you may need to use it with certain HPC applications, even if you do not purchase the MPI separately.

Installation

Follow the instructions for downloading and installing Platform MPI 7 from the Platform Computing web site.

Setup

Edit two lines in the hpmpi.conf file as follows.

Change:

MPI_ICMOD_PSM__PSM_MAIN = "^ib_ipath"

to:

MPI_ICMOD_PSM__PSM_MAIN = "^"

Change:

MPI_ICMOD_PSM__PSM_PATH = "^ib_ipath"

to:

MPI_ICMOD_PSM__PSM_PATH = "^"
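As a sketch, both edits can be made non-interactively with sed; the configuration file path varies by installation and is a placeholder here:

$ sed -i -e 's/PSM_MAIN = "^ib_ipath"/PSM_MAIN = "^"/' \
      -e 's/PSM_PATH = "^ib_ipath"/PSM_PATH = "^"/' /opt/hpmpi/etc/hpmpi.conf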

Compiling Platform MPI 7 Applications

As with QLogic MPI, QLogic recommends that you use the included wrapper scripts that invoke the underlying compiler (see Table 5-4).

Table 5-4. Platform MPI 7 Wrapper Scripts

Wrapper Script Name    Language
mpicc                  C
mpiCC                  C++
mpif77                 Fortran 77
mpif90                 Fortran 90

To compile your program in C using the default compiler, type:

$ mpicc mpi_app_name.c -o mpi_app_name

Running Platform MPI 7 Applications

Here is an example of a simple mpirun command running with four processes over PSM:

$ mpirun -np 4 -hostfile mpihosts -PSM mpi_app_name

To run over InfiniBand Verbs, type:

$ mpirun -np 4 -hostfile mpihosts -IBV mpi_app_name

To run over TCP (which could be IPoIB if the hostfile is set up for IPoIB interfaces), type:

$ mpirun -np 4 -hostfile mpihosts -TCP mpi_app_name

More Information on Platform MPI 7

For more information on Platform MPI 7, see the Platform Computing web site.

Platform (Scali) MPI 5.6

Platform MPI 5.6 was formerly known as Scali MPI Connect. The version tested with this release is 5.6.4.

Installation

Follow the instructions for downloading and installing Platform MPI 5.6 from the Platform (Scali) web site.

Setup

To run over PSM by default, add the line networks=infinipath to the file /opt/scali/etc/ScaMPI.conf.

If running over InfiniBand Verbs, Platform MPI needs to know which InfiniBand adapter to use. This is achieved by creating the file /opt/scali/etc/iba_params.conf containing a line such as:

hcadevice=qib0

For a second InfiniPath card, qib1 would be used, and so on.

Compiling Platform MPI 5.6 Applications

As with QLogic MPI, QLogic recommends that you use the included wrapper scripts that invoke the underlying compiler (see Table 5-5). The scripts default to using gcc/g++/g77.

Table 5-5. Platform MPI Wrapper Scripts

Wrapper Script Name    Language
mpicc                  C
mpic++                 C++
mpif77                 Fortran 77
mpif90                 Fortran 90

To compile your program in C using the default compiler, type:

$ mpicc mpi_app_name.c -o mpi_app_name

To invoke another compiler, in this case PathScale, use the -cc1 option, for example:

$ mpicc -cc1 pathcc mpi_app_name.c -o mpi_app_name

Running Platform MPI 5.6 Applications

Once installed, Platform MPI uses the PSM transport by default. Here is an example of a simple mpirun command running with four processes over PSM:

$ mpirun -np 4 -machinefile mpihosts mpi_app_name

If you have not set /opt/scali/etc/ScaMPI.conf to use PSM by default, specify PSM explicitly with the -networks option:

$ mpirun -np 4 -machinefile mpihosts -networks infinipath mpi_app_name

To run Scali MPI over InfiniBand Verbs, type:

$ mpirun -np 4 -machinefile mpihosts -networks ib,smp mpi_app_name

This command indicates that ib is used for inter-node communications and smp is used for intra-node communications.

To run over TCP (or IPoIB), type:

$ mpirun -np 4 -machinefile mpihosts -networks tcp,smp mpi_app_name

Further Information on Platform MPI 5.6

For more information on using Platform MPI 5.6, see:

http://www.platform.com/cluster-computing/platform-mpi

Intel MPI

Intel MPI version 4.0 is the version tested with this release.

Installation

Follow the instructions for download and installation of Intel MPI from the Intel web site.

Setup

Intel MPI can be run over the Tag Matching Interface (TMI). The setup for running over TMI is described in the following steps:

1. Make sure that the TMI psm provider is installed on every node and that all nodes have the same version installed. In this release it is called tmi-1.0 and is supplied with the QLogic OFED+ host software package. It can be installed either with the QLogic OFED+ Host Software installation or using the rpm files after the QLogic OFED+ Host Software tar file has been unpacked. For example:

$ rpm -qa | grep tmi
tmi-1.0-1

2. Verify that there is an /etc/tmi.conf file. It should be installed by the tmi-1.0-1 RPM. The file tmi.conf contains a list of TMI psm providers. In particular, it must contain an entry for the PSM provider in a form similar to:

psm 1.0 libtmip_psm.so " " # Comments OK

Intel MPI can also be run over uDAPL, which uses InfiniBand Verbs. uDAPL is the user-mode version of the Direct Access Provider Library (DAPL), and is provided as a part of the OFED packages. You will also have to have IPoIB configured. The setup for running over uDAPL is described in the following steps:

1. Make sure that DAPL 1.2 or 2.0 is installed on every node and that all nodes have the same version installed. In this release they are called compat-dapl. Both versions are supplied with the OpenFabrics RPMs and are included in the QLogic OFED+ Host Software package. They can be installed either with the QLogic OFED+ Host Software installation or using the rpm files after the QLogic OFED+ Host Software tar file has been unpacked. For example:

Using DAPL 1.2:

$ rpm -qa | grep compat-dapl
compat-dapl-1.2.12-1.x86_64.rpm
compat-dapl-debuginfo-1.2.12-1.x86_64.rpm
compat-dapl-devel-1.2.12-1.x86_64.rpm
compat-dapl-devel-static-1.2.12-1.x86_64.rpm
compat-dapl-utils-1.2.12-1.x86_64.rpm

Using DAPL 2.0:

$ rpm -qa | grep dapl
dapl-devel-static-2.0.19-1
compat-dapl-1.2.14-1
dapl-2.0.19-1
dapl-debuginfo-2.0.19-1
compat-dapl-devel-static-1.2.14-1
dapl-utils-2.0.19-1
compat-dapl-devel-1.2.14-1
dapl-devel-2.0.19-1

2. Verify that there is an /etc/dat.conf file. It should be installed by the dapl-<version> RPM. The file dat.conf contains a list of interface adapters supported by uDAPL service providers. In particular, it must contain mapping entries for OpenIB-cma for dapl 1.2.x and ofa-v2-ib for dapl 2.0.x, in a form similar to the following (each entry on one line):

OpenIB-cma u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib0 0" ""
ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib0 0" ""

3. On every node, type the following command (as a root user):

# modprobe rdma_ucm

To ensure that the module is loaded whenever the driver is loaded, add RDMA_UCM_LOAD=yes to the /etc/infiniband/openib.conf file. (Note that rdma_cm is also used, but it is loaded automatically.)

4. Bring up an IPoIB interface on every node, for example, ib0. See the instructions for configuring IPoIB for more details.

Intel MPI has different bin directories for 32-bit (bin) and 64-bit (bin64); 64-bit is the most commonly used.

To launch MPI jobs, the Intel installation directory must be included in PATH and LD_LIBRARY_PATH.

When using sh for launching MPI jobs, run the following command:

$ source <intel_mpi_install>/bin64/mpivars.sh

When using csh for launching MPI jobs, run the following command:

$ source <intel_mpi_install>/bin64/mpivars.csh

Substitute bin if using 32-bit.

Compiling Intel MPI Applications

As with QLogic MPI, QLogic recommends that you use the included wrapper scripts that invoke the underlying compiler. The default underlying compiler is GCC, including gfortran. Note that there are more compiler drivers (wrapper scripts) with Intel MPI than are listed here (see Table 5-6); check the Intel documentation for more information.

Table 5-6. Intel MPI Wrapper Scripts

Wrapper Script Name    Language
mpicc                  C
mpiCC                  C++
mpif77                 Fortran 77
mpif90                 Fortran 90
mpiicc                 C (uses Intel C compiler)
mpiicpc                C++ (uses Intel C++ compiler)
mpiifort               Fortran 77/90 (uses Intel Fortran compiler)

To compile your program in C using the default compiler, type:

$ mpicc mpi_app_name.c -o mpi_app_name

To use the Intel compiler wrappers (mpiicc, mpiicpc, mpiifort), the Intel compilers must be installed and resolvable from the user's environment.

Running Intel MPI Applications

Here is an example of a simple mpirun command running with four processes:

$ mpirun -np 4 -f mpihosts mpi_app_name

For more information, follow the Intel MPI instructions for usage of mpirun, mpdboot, and mpiexec (mpirun is a wrapper script that invokes both mpdboot and mpiexec). Remember to use -r ssh with mpdboot if you use ssh.

Pass the following option to mpirun to select TMI:

-genv I_MPI_FABRICS tmi
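For example, a complete launch over TMI, assuming the 64-bit environment script has been sourced as shown earlier:

$ source <intel_mpi_install>/bin64/mpivars.sh
$ mpirun -np 4 -f mpihosts -genv I_MPI_FABRICS tmi mpi_app_name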

Pass the following option to mpirun to select uDAPL:

For uDAPL 1.2:
-genv I_MPI_DEVICE rdma:OpenIB-cma

For uDAPL 2.0:
-genv I_MPI_DEVICE rdma:ofa-v2-ib

To help with debugging, you can add one of these options to the Intel mpirun command:

For TMI:
-genv TMI_DEBUG 1

For uDAPL:
-genv I_MPI_DEBUG 2

Further Information on Intel MPI

For more information on using Intel MPI, see: http://www.intel.com/

Improving Performance of Other MPIs Over InfiniBand Verbs

Performance of MPI applications when using an MPI implementation over InfiniBand Verbs can be improved by tuning the InfiniBand MTU size.

NOTE:
No manual tuning is necessary for PSM-based MPIs, since the PSM layer determines the largest possible InfiniBand MTU for each source/destination path.

The maximum supported MTU size of InfiniBand adapter cards is 4K. Support for a 4K InfiniBand MTU requires switch support for 4K MTU. The method to set the InfiniBand MTU size varies by MPI implementation:

• Open MPI defaults to the lower of either the InfiniBand MTU size or the switch MTU size.

• MVAPICH defaults to an InfiniBand MTU size of 1024 bytes. This can be overridden by setting an environment variable:

$ export VIADEV_DEFAULT_MTU=MTU4096

Valid values are MTU256, MTU512, MTU1024, MTU2048, and MTU4096. This environment variable must be set for all processes in the MPI job. To do so, use ~/.bashrc or /usr/bin/env.

• HP-MPI over InfiniBand Verbs automatically determines the InfiniBand MTU size.

• Platform (Scali) MPI defaults to an InfiniBand MTU of 1KB. This can be changed by adding a line to /opt/scali/etc/iba_params.conf, for example:

mtu=2048

A value of 4096 is not allowed by the Scali software (as of Scali Connect 5.6.0); in this case, a default value of 1024 bytes is used. This problem has been reported to support at Platform Inc. The largest value that can currently be used is 2048 bytes.

• Intel MPI over uDAPL (which uses InfiniBand Verbs) automatically determines the InfiniBand MTU size.



6 Performance Scaled<br />

Messaging<br />

Introduction<br />

Performance Scaled Messaging (PSM) provides support for full Virtual Fabric<br />

(vFabric) integration, allowing users to specify InfiniBand Service Level (SL) and<br />

Partition Key (PKey), or to provide a configured Service ID (SID) to target a<br />

vFabric. Support for using InfiniBand path record queries to the <strong>QLogic</strong> Fabric<br />

Manager during connection setup is also available, enabling alternative switch<br />

topologies such as Mesh/Torus. Note that this relies on the Distributed SA cache<br />

from FastFabric.<br />

All PSM-enabled MPIs can leverage these capabilities transparently, but only two MPIs (QLogic MPI and OpenMPI) are configured to support them natively. Native support here means that MPI-specific mpirun switches are available to activate/deactivate these features. Other MPIs require the use of environment variables to leverage these capabilities. With MPI applications, the environment variables need to be propagated across all nodes/processes, not just the node from where the job is submitted/run. The mechanisms to do this are MPI specific, but for two common MPIs the following may be helpful:

• OpenMPI: Use -x ENV_VAR=ENV_VAL on the mpirun command line. Example:

mpirun -np 2 -machinefile machinefile -x PSM_ENV_VAR=PSM_ENV_VAL prog prog_args

• MVAPICH2: Use mpirun_rsh to perform job launch. Do not use mpiexec or mpirun. Specify the environment variable and value on the mpirun_rsh command line before the program argument. Example:

mpirun_rsh -np 2 -hostfile machinefile PSM_ENV_VAR=PSM_ENV_VAL prog prog_args

Some of the features available require appropriate versions of associated software and firmware for correct operation. These requirements are listed in the relevant sections.


Virtual Fabric Support

Virtual Fabric (vFabric) in PSM is supported with the QLogic Fabric Manager. The latest version of the QLogic Fabric Manager contains a sample qlogic_fm.xml file with pre-configured vFabrics for PSM. Sixteen unique Service IDs have been allocated for PSM-enabled MPI vFabrics to ease their testing; however, any Service ID can be used. Refer to the QLogic Fabric Manager User Guide for how to configure vFabrics.

There are two ways to use vFabric with PSM. The "legacy" method requires the user to specify the appropriate SL and PKey for the vFabric in question. For complete integration with vFabrics, users can now specify a Service ID (SID) that identifies the vFabric to be used. PSM automatically obtains the SL and PKey to use for the vFabric from the QLogic Fabric Manager via path record queries.

Using SL and PKeys

SL and PKeys can be specified natively for OpenMPI and QLogic MPI. For other MPIs, use the list of environment variables below to specify the SL and PKey. The environment variables need to be propagated across all processes for correct operation.

NOTE:
This is available with OpenMPI v1.3.4rc4 and later only.

• OpenMPI: Use the mca parameters mtl_psm_ib_service_level and mtl_psm_ib_pkey to specify the SL and PKey on the mpirun command line. Example:

mpirun -np 2 -machinefile machinefile -mca mtl_psm_ib_service_level SL -mca mtl_psm_ib_pkey Pkey prog prog_args

• QLogic MPI: Requires use of the IPATH_SL environment variable to specify the SL and the -p switch to mpirun for the PKey. Example:

IPATH_SL=SL mpirun -np 2 -m machinefile -p Pkey prog prog_args

• Other MPIs can use the following environment variables, propagated across all processes. This process is MPI-library specific, but samples for OpenMPI and MVAPICH2 are listed in the "Introduction" on page 6-1.

IPATH_SL=SL # Service Level to use, 0-15
PSM_PKEY=Pkey # PKey to use


Using Service ID

Full vFabric integration with PSM is available, allowing the user to specify a SID. For correct operation, PSM requires the following components to be available and configured correctly:

• QLogic host Fabric Manager configuration: PSM MPI vFabrics need to be configured and enabled correctly in the qlogic_fm.xml file. Sixteen unique SIDs have been allocated in the sample file.

• The OFED+ library needs to be installed on all nodes. This is available as part of the FastFabric tools.

• The QLogic Distributed SA needs to be installed, configured, and activated on all the nodes. This is part of the FastFabric tools. Refer to the QLogic FastFabric User Guide for how to configure and activate the Distributed SA. The SIDs configured in the QLogic Fabric Manager configuration file should also be provided to the Distributed SA for correct operation.

The Service ID can be specified natively for OpenMPI and QLogic MPI. For other MPIs, use the following list of environment variables. The environment variables need to be propagated across all processes for correct operation.

• OpenMPI: Use the mca parameters mtl_psm_ib_service_id and mtl_psm_path_query to specify the Service ID on the mpirun command line. Example:

mpirun -np 2 -machinefile machinefile -mca mtl_psm_path_query opp -mca mtl_psm_ib_service_id SID prog prog_args

• QLogic MPI: Use the -P and -S switches on the mpirun command line to specify the path record query library (always opp, for OFED Plus Path, in this release) and the Service ID to use. Example:

mpirun -np 2 -m machinefile -P opp -S SID prog prog_args

• Other MPIs can use the following environment variables:

PSM_PATH_REC=opp # Path record query mechanism to use; always specify opp
PSM_IB_SERVICE_ID=SID # Service ID to use

SL2VL Mapping from the Fabric Manager

PSM is able to use the SL2VL table as programmed by the QLogic Fabric Manager. Prior releases required manual specification of the SL2VL mapping via an environment variable.

Verifying SL2VL Tables on QLogic 7300 Series Adapters

iba_saquery can be used to get the SL2VL mapping for any given port; however, QLogic 7300 series adapters also export the SL2VL mapping via sysfs files. These files are used by PSM to implement the SL2VL tables automatically. The SL2VL tables are per port and available under /sys/class/infiniband/<hca name>/ports/<port #>/sl2vl. The sl2vl directory contains 16 files, numbered 0-15, one per SL. Reading an SL file returns the VL programmed for that SL.
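For example, assuming an adapter named qib0, the mapping for port 1 can be listed with a small shell loop (adapter and port names will differ on your system):

  for sl in /sys/class/infiniband/qib0/ports/1/sl2vl/*; do
      echo "SL $(basename $sl) -> VL $(cat $sl)"
  done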

7 Dispersive Routing

InfiniBand uses deterministic routing that is keyed from the Destination LID (DLID) of a port. The Fabric Manager programs the forwarding tables in a switch to determine the egress port a packet takes based on the DLID.

Deterministic routing can create hotspots even in full bisection bandwidth (FBB) fabrics for certain communication patterns, if the communicating node pairs map onto a common upstream link based on the forwarding tables. Since routing is based on DLIDs, the InfiniBand fabric provides the ability to assign multiple LIDs to a physical port using a feature called LID Mask Control (LMC). The total number of DLIDs assigned to a physical port is 2^LMC, with the LIDs assigned sequentially. The common InfiniBand fabric uses an LMC of 0, meaning each port has one LID assigned to it. With non-zero LMC fabrics, there are multiple potential paths through the fabric to reach the same physical port: multiple DLID entries in the port forwarding table can map to different egress ports.

Dispersive routing, as implemented in PSM, attempts to avoid the congestion hotspots described above by “spraying” messages across these paths. A congested path will not bottleneck messages flowing down the alternate paths that are not congested. The current implementation of PSM supports fabrics with a maximum LMC of 3 (8 LIDs assigned per port). This can result in a maximum of 64 possible paths between a SLID, DLID pair ([SLID, DLID], [SLID, DLID+1], [SLID, DLID+2] ... [SLID, DLID+7], [SLID+1, DLID], [SLID+1, DLID+1] ... [SLID+7, DLID+7]). Keeping state associated with this many paths requires a large amount of memory, and empirical data shows little performance gain beyond a small set of multiple paths. Therefore, PSM reduces the number of paths actually used in the above case to 8, where the only paths considered for transmission are [SLID, DLID], [SLID+1, DLID+1], [SLID+2, DLID+2] ... [SLID+7, DLID+7]. This makes the resource requirements manageable while providing most of the benefits of dispersive routing (congestion avoidance by utilizing multiple paths).
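As an illustration of the arithmetic only (not a product tool), the following sketch enumerates the reduced path set [SLID+N, DLID+N] for hypothetical base LIDs and an LMC of 3:

  # Hypothetical LID values; 2^LMC paths are printed.
  SLID=0x20; DLID=0x40; LMC=3
  for ((n = 0; n < (1 << LMC); n++)); do
      printf "path %d: SLID 0x%x -> DLID 0x%x\n" $n $((SLID + n)) $((DLID + n))
  done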


Internally, PSM utilizes dispersive routing differently for small and large messages. Large messages are any messages greater than or equal to 64K. For large messages, the message is split into fragments of 128K by default (each called a window). Each message window is sprayed across a distinct path between ports. All packets belonging to a window utilize the same path; however, the windows themselves can take different paths through the fabric. PSM assembles the windows that make up an MPI message before delivering it to the application. This allows limited out-of-order semantics through the fabric to be maintained with little overhead. Small messages, on the other hand, always utilize a single path when communicating with a remote node; however, different processes executing on a node can utilize different paths for their communication between the nodes. For example, consider two nodes A and B, each with 8 processor cores, a fabric configured with an LMC of 3 (so PSM constructs 8 paths through the fabric as described above), and a 16-process MPI application that spans these nodes (8 processes per node). Then:

• Each MPI process is automatically bound to a given CPU core numbered between 0-7. PSM does this at startup to get improved cache hit rates and other benefits.
• Small messages sent from a process on core N will use path N.

NOTE:
Only path N will be used by this process for all communications to any process on the remote node.

• For a large message, each process will utilize all of the 8 paths and spray the windowed messages across them.

The above highlights the default path selection policy that is active in PSM when running on fabrics configured with a non-zero LMC. There are 3 other path selection policies that determine how to select the path (or path index from the set of available paths) used by a process when communicating with a remote node. The above path policy is called adaptive. The 3 remaining path policies are static policies that assign a static path on job startup for both small and large message transfers.

• Static_Src: Only one path per process is used for all remote communications. The path index is based on the CPU number on which the process is running.

NOTE:
Multiple paths are still used in the fabric if multiple processes (each on a different CPU) are communicating.


• Static_Dest: The path selection is based on the CPU index of the destination process. Multiple paths can be used if data is transferred to different remote processes within a node. If multiple processes from node A send a message to a single process on node B, only one path will be used across all processes.
• Static_Base: The only path used is the base path [SLID, DLID] between nodes, regardless of the LMC of the fabric or the number of paths available. This is similar to how PSM operated up to the IFS 5.1 release.

NOTE:
A fabric configured with an LMC of 0 operates as the Static_Base policy even with the default adaptive policy enabled, since only a single path exists between any pair of ports.
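The policy is chosen at job launch through the PSM environment. The variable name below (PSM_PATH_SELECTION) and its values mirror the policy names above, but are an assumption to verify against the PSM documentation for your release:

  # Assumed variable name; values mirror the policies described above.
  export PSM_PATH_SELECTION=adaptive    # or static_src, static_dest, static_base
  mpirun -np 16 -ppn 8 -H host1,host2 prog prog_args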



8 gPXE

gPXE Setup

gPXE is an open source (GPL) network bootloader. It provides a direct replacement for proprietary PXE ROMs. See http://etherboot.org/wiki/index.php for documentation and general information.

At least two machines and a switch are needed (or connect the two machines back-to-back and run the QLogic Fabric Manager on the server):

• A DHCP server
• A boot server or HTTP server (can be the same machine as the DHCP server)
• A node to be booted

Use a QLE7340 or QLE7342 adapter for the node.

The following software is included with the QLogic OFED+ installation software package:

• gPXE boot image
• Patch for the DHCP server
• Tool to install the gPXE boot image in the EPROM of the card
• Sample gPXE script

Everything that can be done with the proprietary PXE loader over Ethernet can be done with the gPXE loader over IB. The gPXE boot code is only a mechanism to load an initial boot image onto the system. It is up to the downloaded boot image to do the rest.

For example, the boot image could be:

• A stand-alone memory test program
• A diskless kernel image that mounts its file systems via NFS. Refer to http://www.faqs.org/docs/Linux-HOWTO/Diskless-HOWTO.html
• A Linux install image like kickstart, which then installs software to the local hard drive(s). Refer to http://www.faqs.org/docs/Linux-HOWTO/KickStart-HOWTO.html
• A second stage boot loader
• A live CD Linux image
• A gPXE script

Required Steps

1. Download a copy of the gPXE image. The components are located at:
   • The executable to flash the EXPROM on the TrueScale InfiniBand adapters: /usr/sbin/ipath_exprom
   • The gPXE driver for QLE7200 series InfiniBand adapters (the EXPROM image): /usr/share/infinipath/gPXE/iba7220.rom
   • The gPXE driver for QLE7300 series InfiniBand adapters (the EXPROM image): /usr/share/infinipath/gPXE/iba7322.rom

2. In order for dhcpd to correctly load, assign IP addresses to the InfiniBand adapter GUID. The dhcpd on the existing DHCP server may need to be patched; the patch is provided via the gPXE rpm installation.
3. Write the ROM image to the InfiniBand adapter. This only needs to be done once per InfiniBand adapter.
   ipath_exprom -e -w iba7xxx.rom
   In some cases, executing the above command results in a hang. If you experience a hang, type CTRL+C to quit, then execute one flag at a time:
   ipath_exprom -e iba7xxx.rom
   ipath_exprom -w iba7xxx.rom
4. Enable booting from the InfiniBand adapter (gPXE device) in the BIOS.

Preparing the DHCP Server in Linux

Installing DHCP

When the boot session starts, the gPXE firmware attempts to bring up an adapter network link. If it succeeds in bringing up a connected link, the gPXE firmware communicates with the DHCP server. The DHCP server assigns an IP address to the gPXE client and provides it with the location of the boot program.

gPXE requires that the DHCP server run on a machine that supports IP over IB.


NOTE:
Prior to installing DHCP, make sure that QLogic OFED+ is already installed on your DHCP server.

1. Download and install the latest DHCP server from www.isc.org.
   Standard DHCP fields holding a MAC address are not large enough to contain an IPoIB hardware address. To overcome this problem, DHCP over InfiniBand messages convey a client identifier field used to identify the DHCP session. This client identifier field can be used to associate an IP address with a client identifier value, such that the DHCP server will grant the same IP address to any client that conveys this client identifier.
2. Unpack the latest downloaded DHCP server.
   tar zxf dhcp-release.tar.gz
3. Uncomment the line /* #define USE_SOCKETS */ in dhcp-release/includes/site.h
4. Change to the main directory.
   cd dhcp-release

NOTE:
If there is an older version of DHCP installed, save it before continuing with the following steps.

5. Configure the source.
   ./configure
6. When the configuration of DHCP is finished, build the DHCP server.
   make
7. When DHCP has successfully finished building, install it.
   make install

Configuring DHCP

1. From the client host, find the GUID of the Host Channel Adapter by using p1info, or look at the GUID label on the InfiniBand adapter.
2. Turn the GUID into a MAC address, appending the port of the InfiniBand adapter that is going to be used: b0 for port0 or b1 for port1.


For example, for a GUID that reads 0x00117500005a6eec, the MAC address would read: 00:11:75:00:00:5a:6e:ec:b0
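The conversion can also be scripted. This sketch simply splits the GUID's hex digits into colon-separated byte pairs and appends the port suffix:

  # Turn a GUID into the dhcpd MAC form; append b0 for port0 or b1 for port1.
  guid=0x00117500005a6eec
  hex=${guid#0x}                                # strip the leading 0x
  mac=$(echo $hex | sed 's/../&:/g; s/:$//')    # insert colons between byte pairs
  echo "${mac}:b0"                              # -> 00:11:75:00:00:5a:6e:ec:b0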

3. Add the MAC address to the DHCP server. The following is a sample /etc/dhcpd.conf file that specifies the Host Channel Adapter GUID for the hardware address:

   #
   # DHCP Server Configuration file.
   # see /usr/share/doc/dhcp*/dhcpd.conf.sample
   #
   ddns-update-style none;
   subnet 10.252.252.0 netmask 255.255.255.0 {
       option subnet-mask 255.255.255.0;
       range dynamic-bootp 10.252.252.100 10.252.252.109;
       host hl5-0 {
           hardware unknown-32 00:11:75:00:00:7e:c1:b0;
           option host-name "hl5";
       }
       host hl5-1 {
           hardware unknown-32 00:11:75:00:00:7e:c1:b1;
           option host-name "hl5";
       }
   }
   filename "http://10.252.252.1/images/uniboot/uniboot.php";

   In this example, host hl5 has a dual-port InfiniBand adapter. hl5-0 corresponds to port 0, and hl5-1 corresponds to port 1 on the adapter.
4. Restart the DHCP server.

Netbooting Over InfiniBand

The following procedures are an example of netbooting over InfiniBand using an HTTP boot server.

Prerequisites

• The required steps from above have been executed.
• The BIOS has been configured to enable booting from the InfiniBand adapter. The gPXE InfiniBand device should be listed as the first boot device.
• An Apache server has been configured with PHP on your network, and is configured to serve pages out of /vault.
• It is understood in this example that users have their own tools and files for diskless booting with an HTTP boot server.

Boot Server Setup

NOTE:
The dhcpd and Apache configuration files referenced in this example are included as examples, and are not part of the QLogic OFED+ installed software. Your site boot servers may be different; see their documentation for equivalent information.
Instructions on installing and configuring a DHCP server or a boot server are beyond the scope of this document.

Configure the boot server for your site.

NOTE:
gPXE supports several file transfer methods, such as TFTP, HTTP, and iSCSI. This example uses HTTP, since it generally scales better and is the preferred choice.

NOTE:
This step involves setting up an HTTP server and should be done by a user who understands server setup on the HTTP server being used.

1. Install Apache.
2. Create an images.conf file and a kernels.conf file and place them in the /etc/httpd/conf.d directory. This sets up the following aliases and tells Apache where to find them:
   /images maps to http://10.252.252.1/images/
   /kernels maps to http://10.252.252.1/kernels/


The following is an example of the images.conf file (the Directory container tags were lost in extraction and are reconstructed from the Alias paths):

   Alias /images /vault/images
   <Directory /vault/images>
       AllowOverride All
       Options Indexes FollowSymLinks
       Order allow,deny
       Allow from all
   </Directory>

The following is an example of the kernels.conf file:

   Alias /kernels /boot
   <Directory /boot>
       AllowOverride None
       Order allow,deny
       Allow from all
   </Directory>

3. Make a uniboot directory:
   mkdir -p /vault/images/uniboot
4. Create an initrd.img file.
   Prerequisites:
   • “gPXE Setup” on page 8-1 has been completed.
   • “Preparing the DHCP Server in Linux” on page 8-2 has been completed.
   To add an InfiniBand driver into the initrd file, the InfiniBand modules need to be copied to the diskless image. The host machine needs to be pre-installed with the QLogic OFED+ Host Software that is appropriate for the kernel version the diskless image will run. The QLogic OFED+ Host Software is available for download from http://driverdownloads.qlogic.com/QLogicDriverDownloads_UI/default.aspx

NOTE:
The remainder of this section assumes that QLogic OFED+ has been installed on the host machine.


WARNING!!
The following procedure modifies critical files used in the boot procedure. It must be executed by users with expertise in the boot process. Improper application of this procedure may prevent the diskless machine from booting.

a. If the /vault/images/initrd.img file is already present on the server machine, back it up. For example:
   cp -a /vault/images/initrd.img /vault/images/initrd.img.bak

b. The infinipath rpm installs the file /usr/share/infinipath/gPXE/gpxe-qib-modify-initrd with contents similar to the following example. You can either run the script to generate a new initrd image, or use it as an example and customize it as appropriate for your site.

# This assumes you will use the currently running version of linux, and
# that you are starting from a fully configured machine of the same type
# (hardware configuration), and BIOS settings.
#
# start with a known path, to get the system commands
PATH=/sbin:/usr/sbin:/bin:/usr/bin:$PATH
# start from a copy of the current initrd image
mkdir -p /var/tmp/initrd-ib
cd /var/tmp/initrd-ib
kern=$(uname -r)
if [ -e /boot/initrd-${kern}.img ]; then
    initrd=/boot/initrd-${kern}.img
elif [ -e /boot/initrd ]; then
    initrd=/boot/initrd
else
    echo Unable to locate correct initrd, fix script and re-run
    exit 1
fi
cp ${initrd} initrd-ib-${kern}.img
# Get full original listing
gunzip -dc initrd-ib-${kern}.img | cpio -it --quiet | grep -v '^\.$' | sort -o Orig-listing
# start building modified image
rm -rf new # for retries
mkdir new
cd new
# extract previous contents
gunzip -dc ../initrd-ib-${kern}.img | cpio --quiet -id
# add infiniband modules
mkdir -p lib/ib
find /lib/modules/${kern}/updates -type f | \
    egrep '(iw_cm|ib_(mad|addr|core|sa|cm|uverbs|ucm|umad|ipoib|qib).ko|rdma_|ipoib_helper)' | \
    xargs -I '{}' cp -a '{}' lib/ib
# Some distros have ipoib_helper, others don't require it
if [ -e lib/ib/ipoib_helper ]; then
    helper_cmd='/sbin/insmod /lib/ib/ipoib_helper.ko'
fi
# On some kernels, the qib driver will require the dca module
if modinfo -F depends ib_qib | grep -q dca; then
    cp $(find /lib/modules/$(uname -r) -name dca.ko) lib/ib
    dcacmd='/sbin/insmod /lib/ib/dca.ko'
else
    dcacmd=
fi
# IB requires loading an IPv6 module. If you do not have it in your initrd, add it
if grep -q ipv6 ../Orig-listing; then
    # already added, and presumably insmod'ed, along with any dependencies
    v6cmd=
else
    echo -e 'Adding IPv6 and related modules\n'
    cp /lib/modules/${kern}/kernel/net/ipv6/ipv6.ko lib
    IFS=' ' v6cmd='echo "Loading IPV6"
/sbin/insmod /lib/ipv6.ko'
    # Some versions of IPv6 have dependencies, add them.
    xfrm=$(modinfo -F depends ipv6)
    if [ ${xfrm} ]; then
        cp $(find /lib/modules/$(uname -r) -name ${xfrm}.ko) lib
        IFS=' ' v6cmd='/sbin/insmod /lib/'${xfrm}'.ko
'"$v6cmd"
        crypto=$(modinfo -F depends $xfrm)
        if [ ${crypto} ]; then
            cp $(find /lib/modules/$(uname -r) -name ${crypto}.ko) lib
            IFS=' ' v6cmd='/sbin/insmod /lib/'${crypto}'.ko
'"$v6cmd"
        fi
    fi
fi
# we need insmod to load the modules; if it is not present, copy it
mkdir -p sbin
grep -q insmod ../Orig-listing || cp /sbin/insmod sbin
echo -e 'NOTE: you will need to config ib0 in the normal way in your booted root filesystem, in order to use it for NFS, etc.\n'
# Now build the commands to load the additional modules. We add them just after
# the last existing insmod command, so all other dependences will be resolved.
# You can change the location if desired or necessary.
# Loading order is important. You can verify the order works ahead of time
# by running "/etc/init.d/openibd stop", and then running these commands
# manually by cut and paste.
# This will work on SLES, although different than the standard mechanism.
cat > ../init-cmds <<EOF
# Start of IB module block
$v6cmd
echo "loading IB modules"
/sbin/insmod /lib/ib/ib_addr.ko
/sbin/insmod /lib/ib/ib_core.ko
/sbin/insmod /lib/ib/ib_mad.ko
/sbin/insmod /lib/ib/ib_sa.ko
/sbin/insmod /lib/ib/ib_cm.ko
/sbin/insmod /lib/ib/ib_uverbs.ko
/sbin/insmod /lib/ib/ib_ucm.ko
/sbin/insmod /lib/ib/ib_umad.ko
/sbin/insmod /lib/ib/iw_cm.ko
/sbin/insmod /lib/ib/rdma_cm.ko
/sbin/insmod /lib/ib/rdma_ucm.ko
$dcacmd
/sbin/insmod /lib/ib/ib_qib.ko
$helper_cmd
/sbin/insmod /lib/ib/ib_ipoib.ko
echo "finished loading IB modules"
# End of IB module block
EOF
# first get the line number where we append (after the last insmod if any,
# otherwise at the start)
line=$(egrep -n insmod init | sed -n '$s/:.*//p')
if [ ! "${line}" ]; then line=1; fi
sed -e "${line}r ../init-cmds" init > init.new
# show the difference, then rename
echo -e 'Differences between original and new init command script\n'
diff init init.new
mv init.new init
chmod 700 init
# now rebuild the initrd image
find . | cpio --quiet -H newc -o | gzip > ../initrd-${kern}.img
cd ..
# get the file list in the new image
gunzip -dc initrd-${kern}.img | cpio --quiet -it | grep -v '^\.$' | sort -o New-listing
# and show the differences.
echo -e '\nChanges in files in initrd image\n'
diff Orig-listing New-listing
# copy the new initrd to wherever you have configured the dhcp server to look
# for it (here we assume it's /images)
mkdir -p /images
cp initrd-${kern}.img /images
echo -e '\nCompleted initrd for IB'
ls -l /images/initrd-${kern}.img

c. Run the /usr/share/infinipath/gPXE/gpxe-qib-modify-initrd script to create the initrd.img file.
   At this stage, the initrd.img file is ready and located where the DHCP server was configured to look for it.

5. Create a uniboot.php file and save it to /vault/images/uniboot.

NOTE:
The uniboot.php file generates a gPXE script that will attempt to boot from the /boot/vmlinuz-2.6.18-128.el5 kernel. If you want to boot from a different kernel, edit uniboot.php with the appropriate kernel string in the $kver variable.

The following is an example of a uniboot.php file (the opening tag and the ternary ? operators were lost in extraction and are restored; the body of the final echo is not reproduced here):

<?php
header ( 'Content-type: text/plain' );
function strleft ( $s1, $s2 ) {
    return substr ( $s1, 0, strpos ( $s1, $s2 ) );
}
function baseURL() {
    $s = empty ( $_SERVER["HTTPS"] ) ? '' :
        ( ( $_SERVER["HTTPS"] == "on" ) ? "s" : "" );
    $protocol = strleft ( strtolower ( $_SERVER["SERVER_PROTOCOL"] ), "/" ).$s;
    $port = ( $_SERVER["SERVER_PORT"] == "80" ) ? "" : ( ":".$_SERVER["SERVER_PORT"] );
    return $protocol."://".$_SERVER['SERVER_NAME'].$port;
}
$baseurl = baseURL();
$selfurl = $baseurl.$_SERVER['REQUEST_URI'];
$dirurl = $baseurl.( dirname ( $_SERVER['SCRIPT_NAME'] ) );
$kver = "2.6.18-164.11.1.el5";
echo


1. Copy the kernel image named by the $kver variable in uniboot.php to the boot server. This is the kernel that will boot. This file can be copied from any machine that has RHEL5.3 installed.
2. Start httpd.

Steps on the gPXE Client

1. Ensure that the Host Channel Adapter is listed as the first bootable device in the BIOS.
2. Reboot the test node(s) and enter the BIOS boot setup. This is highly dependent on the BIOS for the system, but you should see a menu for boot options and a submenu for boot devices. Select gPXE IB as the first boot device.
   When you power on the system or press the reset button, the system executes the boot code on the Host Channel Adapter, which queries the DHCP server for the IP address and the boot image to download. Once the boot image is downloaded, the BIOS/Host Channel Adapter is finished and the boot image is ready.
3. Verify that the system boots off of the kernel image on the boot server. The best way to do this is to boot into a different kernel from the one installed on the hard drive of the client, or to unplug the hard drive on the client and verify that on boot up, a kernel and file system exist.

HTTP Boot Setup

gPXE supports booting diskless machines. To enable the use of an IB driver, the (remote) kernel or initrd image must include the driver and be configured to load it. This can be achieved either by compiling the Host Channel Adapter driver into the kernel, or by adding the device driver module to the initrd image and loading it.

1. Make a new directory:
   mkdir /vault/images/uniboot
2. Change directories:
   cd /vault/images/uniboot
3. Create an initrd.img file using the information and example in Step 4 of Boot Server Setup.
4. Create a uniboot.php file using the example in Step 5 of Boot Server Setup.


5. Create an images.conf file and a kernels.conf file using the examples in Step 2 of Boot Server Setup, and place them in the /etc/httpd/conf.d directory.
6. Edit the /etc/dhcpd.conf file to boot the clients using HTTP:
   filename "http://172.26.32.9/images/uniboot/uniboot.php";
7. Restart the DHCP server.
8. Start HTTP if it is not already running:
   /etc/init.d/httpd start



A mpirun Options Summary

This section summarizes the most commonly used options to mpirun. See the mpirun(1) man page for a complete listing.

Job Start Options

-mpd
This option is used after running mpdboot to start a daemon, rather than using the default ssh protocol to start jobs. See the mpdboot(1) man page for more information. None of the other mpirun options (with the exception of -h) are valid when using this option.

-ssh
This option uses the ssh program to start jobs, either directly or through distributed startup. This is the default.

Essential Options

-H, -hosts hostlist
When this option is used, the list of possible hosts to run on is taken from the specified hostlist, which has precedence over the -machinefile option. The hostlist can be comma-delimited or quoted as a space-delimited list. The hostlist specification allows a compressed representation of the form host-[01-02,04,06-08], which is equivalent to: host-01,host-02,host-04,host-06,host-07,host-08
If the -np count is unspecified, it is adjusted to the number of hosts in the hostlist. If the -ppn count is specified, each host receives that many processes.

-machinefile filename, -m filename
This option specifies the machines (mpihosts) file that contains the list of hosts to be used for this job. The default is $MPIHOSTS, then ./mpihosts, and finally ~/.mpihosts.

-nonmpi
This option runs a non-MPI program, and is required if the node program makes no MPI calls. This option allows non-QLogic MPI applications to use mpirun’s parallel spawning mechanism.


-np np
This option specifies the number of processes to spawn. If this option is not set, the environment variable MPI_NPROCS is checked. If MPI_NPROCS is not set, the default is to determine the number of processes based on the number of hosts in the machinefile (-m) or the list of hosts (-H).

-ppn processes-per-node
This option creates up to the specified number of processes per node. By default, a limit is enforced that depends on how many InfiniPath contexts are supported by the node (which depends on the hardware type and the number of InfiniPath cards present).
InfiniPath context (port) sharing is supported, beginning with the InfiniPath 2.0 release. This feature allows running up to four times as many processes per node as was previously possible, with a small additional overhead for each shared context.
Context sharing is enabled automatically if needed, and use of the full number of available contexts is assumed. To restrict the number of contexts, use the environment variable PSM_SHAREDCONTEXTS_MAX to divide the available number of contexts. Context sharing behavior can be overridden with the environment variable PSM_SHAREDCONTEXTS: setting it to zero disables context sharing, so jobs that require more than the available number of contexts cannot be run; setting it to one (the default) causes context sharing to be enabled if needed.
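For example (the context counts and process counts here are hypothetical):

  # Restrict the number of contexts available to the job
  PSM_SHAREDCONTEXTS_MAX=4 mpirun -np 32 -ppn 16 -m mpihosts prog prog_args
  # Disable context sharing entirely
  PSM_SHAREDCONTEXTS=0 mpirun -np 32 -ppn 16 -m mpihosts prog prog_args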

-rcfile node-shell-script
This is the startup script for setting the environment on the nodes. Before starting node programs, mpirun checks to see if a file called .mpirunrc exists in the user’s home directory. If the file exists, it is sourced into the running remote shell. Use -rcfile node-shell-script or .mpirunrc to set paths and other environment variables such as LD_LIBRARY_PATH.
Default: $HOME/.mpirunrc
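A minimal .mpirunrc might only extend the search paths; the directories below are illustrative:

  # $HOME/.mpirunrc -- sourced into the remote shell before the node program starts
  export PATH=/usr/mpi/qlogic/bin:$PATH
  export LD_LIBRARY_PATH=/usr/mpi/qlogic/lib64:$LD_LIBRARY_PATH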

Spawn Options

-distributed [=on|off]
This option controls use of the distributed mpirun job spawning mechanism. The default is on. To change the default, put this option in the global mpirun.defaults file or a user-local file (see the environment variable PSC_MPIRUN_DEFAULTS_PATH for details). When the option appears more than once on the command line, the last setting controls the behavior.
Default: on

Quiescence Options

-disable-mpi-progress-check
This option disables the MPI communication progress check without disabling the ping reply check. If quiescence or a lack of ping reply is detected, the job and all compute processes are terminated.

-i, -ping-interval seconds
This option specifies the number of seconds to wait between ping packets to mpirun (if -q > 0).
Default: 60

-q, -quiescence-timeout seconds
This option specifies the wait time (in seconds) for quiescence (absence of MPI communication or lack of ping reply) on the nodes. It is useful for detecting deadlocks. A value of zero disables quiescence detection.
Default: 900

Verbosity Options

-job-info
This option prints brief job startup and shutdown timing information.

-no-syslog
When this option is specified, critical errors are not sent through syslog. By default, critical errors are sent to the console and through syslog.

-V, -verbose
This option prints diagnostic messages from mpirun itself. The verbose option is useful in troubleshooting. Verbose output also lists the IPATH_* and PSM_* environment variable settings that affect MPI operation.

Startup Options

-I, -open-timeout seconds
This option keeps trying to open the InfiniPath device for the specified number of seconds. If seconds is -1 (negative one), the node program waits indefinitely. Use this option to avoid having all queued jobs in a batch queue fail when a node fails for some reason or is taken down for administrative purposes. The -t option is also normally set to -1.


-k, -kill-timeout seconds
This option indicates the time to wait for the other ranks after the first rank exits.
Default: 60

-listen-addr <hostname|IPv4>
This option specifies the hostname (or IPv4 address) to listen on for incoming socket connections. It is useful for a multihomed mpirun front-end host. By default, mpirun assumes that ranks can independently resolve the hostname obtained on the head node with gethostname(2). To change the default, put this option in the global mpirun.defaults file or a user-local file.

-runscript
This is the script used to run the node program.

-t, -timeout seconds
This option waits the specified time (in seconds) for each node to establish a connection back to mpirun. If seconds is -1 (negative one), mpirun waits indefinitely.
Default: 60

Stats Options

-M [=stats_types], -print-stats [=stats_types]
Statistics include minimum, maximum, and median values for message transmission protocols, as well as more detailed information for expected and unexpected message reception. If the option is provided without an argument, stats_types is assumed to be mpi. The following stats_types can be specified:

mpi       Shows an MPI-level summary (expected, unexpected messages)
ipath     Shows a summary of InfiniPath interconnect communication
p2p       Shows detailed per-MPI-rank communication information
counters  Shows low-level InfiniPath device counters
devstats  Shows InfiniPath driver statistics
all       Shows statistics for all stats_types

One or more statistics types can be specified by separating them with a comma. For example, -print-stats=ipath,counters displays InfiniPath communication protocol statistics as well as low-level device counter statistics. For details, see “MPI Stats” on page F-30.


-statsfile file-prefix
This option specifies an alternate file to receive the output from the -print-stats option.
Default: stderr

-statsmode absolute|diffs
When printing process statistics with the -print-stats option, this option specifies whether the printed statistics are the absolute values of the QLogic adapter chip counters and registers, or the differences between those values at the start and end of the process.
Default mode: diffs

Tuning Options

-L, -long-len length
This option determines the message length at which the rendezvous protocol is used. The InfiniPath rendezvous messaging protocol uses a two-way handshake (with MPI synchronous send semantics) and receive-side DMA.
Default: 64000

-N, -num-send-bufs buffer-count
QLogic MPI uses the specified number as the number of packets that can be sent without having to wait for an acknowledgement from the receiver. Each packet contains approximately 2048 bytes of user data.
Default: 512

-s, -long-len-shmem length
This option specifies the message length at which the rendezvous protocol is used for intra-node communications. The InfiniPath rendezvous messaging protocol uses a two-way handshake (with MPI synchronous send semantics) and receive-side DMA.
Default: 16000

-W, -rndv-window-size length
When sending a large message using the rendezvous protocol, QLogic MPI splits the message into a number of fragments at the source and recombines them at the destination. Each fragment is sent as a single rendezvous stage. This option specifies the maximum length of each fragment.
Default: 262144 bytes


Shell Options

-shell shell-name
This option specifies the name of the program to use to log into remote hosts.
Default: ssh, unless $MPI_SHELL is defined.

-shellx shell-name
This option specifies the name of the program to use to log into remote hosts with X11 forwarding. This option is useful when running with -debug or in an xterm.
Default: ssh, unless $MPI_SHELL_X is defined.

Debug Options

-debug
This option starts all processes under a debugger and waits for the user to set breakpoints and run the program. gdb is used by default, but can be overridden using the -debugger argument. Other supported debuggers are strace and the QLogic debugger pathdb.

-debug-no-pause
This option is similar to -debug, except that it does not pause at the beginning. gdb is used by default.

-debugger gdb|pathdb|strace
This option uses the specified debugger instead of the default gdb.

-display X-server
This option uses the specified X server for invoking remote xterms. (The -debug, -debug-no-pause, and -in-xterm options use this value.)
Default: whatever is set in $DISPLAY

-in-xterm
This option runs each process in an xterm window. This is implied when -debug or -debug-no-pause is used.
Default: write to stdout with no stdin

-psc-debug-level mask
This option controls the verbosity of messages printed by the MPI and InfiniPath protocol layers. The default is 1, which displays error messages. A value of 3 displays short messaging details such as source, destination, size, etc. A value of FFh prints detailed information in the messaging layer for each message. Use this option with care, since too much verbosity will negatively affect application performance.
Default: 1

-xterm xterm
This option specifies the xterm to use.
Default: xterm


Format Options

-l, -label-output
This option labels each line of output on stdout and stderr with the rank of the MPI process that produces the output.

-y, -labelstyle string
This option specifies the label that is prefixed to error messages and statistics. Process rank is the default prefix. The label that is prefixed to each message can be specified as one of the following:

%n  Hostname on which the node process executes
%r  Rank of the node process
%p  Process ID of the node process
%L  LID (InfiniPath local identifier adapter identifier) of the node
%P  InfiniPath port of the node process
%l  Local rank of the node process within a node
%%  Percent sign

Other Options

-h, -help
This option prints a summary of mpirun options, then exits.

-stdin filename
This option specifies the file to be fed as stdin to the node program.
Default: /dev/null

-stdin-target 0..np-1 | -1
This option specifies the process rank that receives the file specified with the -stdin option. Negative one (-1) means all ranks.
Default: -1

-v, -version
This option prints the mpirun version, then exits.

-wdir path-to-working_dir
This option sets the working directory for the node program.
Default: -wdir current-working-dir



B Benchmark Programs

Several MPI performance measurement programs are installed from the mpi-benchmark RPM. This appendix describes these benchmarks and how to run them. These programs are based on code from the group of Dr. Dhabaleswar K. Panda at the Network-Based Computing Laboratory at Ohio State University. For more information, see: http://mvapich.cse.ohio-state.edu/

These programs allow you to measure the MPI latency and bandwidth between two or more nodes in your cluster. Both the executables, and the source for those executables, are shipped. The executables are installed by default under /usr/mpi/qlogic/bin (or under /usr/bin for a non-default installation). The remainder of this chapter assumes that QLogic MPI was installed in the default location of /usr/mpi/qlogic, and that mpi-selector is used to choose the MPI to be used. The source is installed under /usr/mpi/qlogic/share/mpich/examples/performance.

The following examples are intended to show only the syntax for invoking these programs and the meaning of the output. They are not representations of actual TrueScale performance characteristics.

Benchmark 1: Measuring MPI Latency Between Two Nodes

In the MPI community, latency for a message of a given size is the time difference between a node program’s calling MPI_Send and the time that the corresponding MPI_Recv in the receiving node program returns. The term latency, alone, without a qualifying message size, indicates the latency for a message of size zero. This latency represents the minimum overhead for sending messages, due to both software overhead and delays in the electronics of the fabric. To simplify the timing measurement, latencies are usually measured with a ping-pong method, timing a round trip and dividing by two.

The program osu_latency, from Ohio State University, measures the latency for a range of message sizes from 0 to 4 megabytes. It uses a ping-pong method, where the rank zero process initiates a series of sends and the rank one process echoes them back, using the blocking MPI send and receive calls for all operations. Half the time interval observed by the rank zero process for each exchange is a measure of the latency for messages of that size, as previously defined. The program uses a loop, executing many such exchanges for each message size, to get an average. The program defers the timing until the message has been sent and received a number of times, to be sure that all the caches in the pipeline have been filled.

This benchmark always involves two node programs. It can be run with the command:

$ mpirun -H host1,host2 osu_latency

-H (or -hosts) allows the specification of the host list on the command line instead of using a host file (with the -m or -machinefile option). Since only two hosts are listed, two host programs will be started (as if -np 2 were specified). The output of the program looks like:

# OSU MPI Latency Test (Version 2.0)
# Size Latency (us)
0 1.06
1 1.06
2 1.06
4 1.05
8 1.05
16 1.30
32 1.33
64 1.30
128 1.36
256 1.51
512 1.84
1024 2.47
2048 3.79
4096 4.99
8192 7.28
16384 11.75
32768 20.57
65536 58.28
131072 98.59
262144 164.68
524288 299.08
1048576 567.60
2097152 1104.50
4194304 2178.66

The first column displays the message size in bytes. The second column displays the average (one-way) latency in microseconds. This example shows the syntax of the command and the format of the output, and is not meant to represent actual values that might be obtained on any particular TrueScale installation.


Benchmark 2: Measuring MPI Bandwidth Between Two Nodes

The osu_bw benchmark measures the maximum rate at which you can pump data between two nodes. This benchmark also uses a ping-pong mechanism, similar to the osu_latency code, except that in this case, the originator of the messages pumps a number of them (64 in the installed version) in succession using the non-blocking MPI_Isend function, while the receiving node consumes them as quickly as it can using the non-blocking MPI_Irecv function, and then returns a zero-length acknowledgement when all of the sent data has been received.

You can run this program by typing:

$ mpirun -H host1,host2 osu_bw
Typical output might look like:<br />

# OSU MPI Bandwidth Test (Version 2.0)<br />

# Size Bandwidth (MB/s)<br />

1 3.549325<br />

2 7.110873<br />

4 14.253841<br />

8 28.537989<br />

16 42.613030<br />

32 81.144290<br />

64 177.331433<br />

128 348.122982<br />

256 643.742171<br />

512 1055.355552<br />

1024 1566.702234<br />

2048 1807.872057<br />

4096 1865.128035<br />

8192 1891.649180<br />

16384 1898.205188<br />

32768 1888.039542<br />

65536 1931.339589<br />

131072 1942.417733<br />

262144 1950.374843<br />

524288 1954.286981<br />

1048576 1956.301287<br />

2097152 1957.351171<br />

4194304 1957.810999<br />

The increase in measured bandwidth with the messages’ size is because the<br />

latency’s contribution to the measured time interval becomes relatively smaller.<br />


Benchmark 3: Messaging Rate Microbenchmarks

mpi_multibw is the microbenchmark that highlights QLogic’s messaging rate results. This benchmark is a modified form of the OSU Network-Based Computing Lab’s osu_bw benchmark (shown in the previous example). It has been enhanced with the following additional functionality:

• The messaging rate and the bandwidth are reported.
• N/2 is dynamically calculated at the end of the run.
• You can run multiple processes per node and see aggregate bandwidth and messaging rates.

The benchmark has been updated with code to dynamically determine which processes are on which host. Here is an example of running mpi_multibw:

$ mpirun -np 16 -ppn 8 -H host1,host2 ./mpi_multibw

This will run eight processes per node. Typical output might look like:

# PathScale Modified OSU MPI Bandwidth Test
(OSU Version 2.2, PathScale $Revision$)
# Running on 8 procs per node (uni-directional traffic for each process pair)
# Size Aggregate Bandwidth (MB/s) Messages/s
1 26.890668 26890667.530474
2 53.692685 26846342.327320
4 107.662814 26915703.518342
8 214.526573 26815821.579971
16 88.356173 5522260.840754
32 168.514373 5266074.141949
64 503.086611 7860728.303972
128 921.257051 7197320.710406
256 1588.793989 6206226.519112
512 1716.731626 3352991.457783
1024 1872.073401 1828196.680564
2048 1928.774223 941784.288727
4096 1928.763048 470889.416123
8192 1921.127830 234512.674597
16384 1919.122008 117133.911629
32768 1898.415975 57935.057817
65536 1953.063214 29801.379615
131072 1956.731895 14928.679615
262144 1957.544289 7467.438845
524288 1957.952782 3734.498562
1048576 1958.235791 1867.519179
2097152 1958.333161 933.806019
4194304 1958.400649 466.919100
Searching for N/2 bandwidth. Maximum Bandwidth of 1958.400649 MB/s...
Found N/2 bandwidth of 992.943275 MB/s at size 153 bytes

Benchmark 4: Measuring MPI Latency in Host Rings

The program mpi_latency measures latency in a ring of hosts. Its syntax differs from Benchmark 1 in that it takes command line arguments that let you specify the message size and the number of messages over which to average the results. For example, running on the nodes host1, host2, host3, and host4, the command:

$ mpirun -np 4 -H host[1-4] mpi_latency 100 0

might produce output like this:

0 1.760125

This output indicates that it took an average of 1.76 microseconds per hop to send a zero-length message from the first host, to the second, to the third, to the fourth, and then receive replies back in the other direction.



C VirtualNIC Interface Configuration and Administration

The VirtualNIC (VNIC) Upper Layer Protocol (ULP) works in conjunction with firmware running on Virtual Input/Output (VIO) hardware, such as the QLogic Ethernet Virtual I/O Controller (EVIC) or the InfiniBand/Ethernet Bridge Module for IBM® BladeCenter®, to provide virtual Ethernet connectivity.

The VNIC driver, along with the QLogic EVIC’s two 10 Gigabit Ethernet ports, enables InfiniBand clusters to connect to Ethernet networks. This driver also works with the earlier version of the I/O controller, the VEx.

The QLogic VNIC driver creates virtual Ethernet interfaces and tunnels the Ethernet data to/from the EVIC over InfiniBand using an InfiniBand reliable connection.

The virtual Ethernet interface supports any Ethernet protocol. It operates like any other interface: ping, ssh, scp, netperf, etc.

The VNIC interface must be configured before it can be used. Perform the steps in the following sub-sections to set up and configure the VNIC interface.

Getting Information about Ethernet IOCs on the Fabric

When ib_qlgc_vnic_query is executed without any options, it displays detailed information about all of the Virtual I/O Input/Output Controllers (IOCs) present on the fabric, including the EVIC/VEx IOCs.

For writing the configuration file, you will need information about the EVIC/VEx IOCs present on the fabric, such as their IOCGUID, IOCSTRING, etc.


NOTE:
An EVIC has two IOCs, one for each Ethernet port. Each EVIC contains a unique set of IOCGUIDs (e.g., IOC 1 maps to Ethernet Port 1 and IOC 2 maps to Ethernet Port 2).

1. Ensure you are logged in as a root user.
2. Type ib_qlgc_vnic_query.
   This displays detailed information about all the EVIC/VEx IOCs present on the fabric. For example:

# ib_qlgc_vnic_query
HCA No = 0, HCA = mlx4_0, Port = 1, Port GUID = 0x0002c903000010f9, State = Active
IO Unit Info:
    port LID:        0009
    port GID:        fe8000000000000000066a11de000070
    change ID:       0003
    max controllers: 0x02
controller[ 1]
    GUID:      00066a01de000070
    vendor ID: 00066a
    device ID: 000030
    IO class:  2000
    ID:        EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1
    service entries: 2
    service[ 0]: 1000066a00000001 / InfiniNIC.InfiniConSys.Control:01
    service[ 1]: 1000066a00000101 / InfiniNIC.InfiniConSys.Data:01
IO Unit Info:
    port LID:        000b
    port GID:        fe8000000000000000066a21de000070
    change ID:       0003
    max controllers: 0x02
controller[ 2]
    GUID:      00066a02de000070
    vendor ID: 00066a
    device ID: 000030
    IO class:  2000
    ID:        EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2
    service entries: 2
    service[ 0]: 1000066a00000002 / InfiniNIC.InfiniConSys.Control:02
    service[ 1]: 1000066a00000102 / InfiniNIC.InfiniConSys.Data:02
HCA No = 0, HCA = mlx4_0, Port = 2, Port GUID = 0x0002c903000010fa, State = Active
IO Unit Info:
    port LID:        0009
    port GID:        fe8000000000000000066a11de000070
    change ID:       0003
    max controllers: 0x02
controller[ 1]
    GUID:      00066a01de000070
    vendor ID: 00066a
    device ID: 000030
    IO class:  2000
    ID:        EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1
    service entries: 2
    service[ 0]: 1000066a00000001 / InfiniNIC.InfiniConSys.Control:01
    service[ 1]: 1000066a00000101 / InfiniNIC.InfiniConSys.Data:01
IO Unit Info:
    port LID:        000b
    port GID:        fe8000000000000000066a21de000070
    change ID:       0003
    max controllers: 0x02
controller[ 2]
    GUID:      00066a02de000070
    vendor ID: 00066a
    device ID: 000030
    IO class:  2000
    ID:        EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2
    service entries: 2
    service[ 0]: 1000066a00000002 / InfiniNIC.InfiniConSys.Control:02
    service[ 1]: 1000066a00000102 / InfiniNIC.InfiniConSys.Data:02

Editing the VirtualNIC Configuration File

Look at the qlgc_vnic.cfg.sample file to see how VNIC configuration files are written. This file can be found with the OFED documentation, or in the qlgc_vnictools subdirectory of the QLogic OFED+ Host Software download. You can use it as the basis for creating a configuration file by replacing the destination global identifier (DGID), IOCGUID, and IOCSTRING values with those of the EVIC/VEx IOCs present on your fabric.

QLogic recommends using the DGID of the EVIC/VEx IOC, as it ensures the quickest startup of the VNIC service. When DGID is specified, the IOCGUID must also be specified. For more details, see the qlgc_vnic.cfg sample file.

1. Edit the VirtualNIC configuration file, /etc/infiniband/qlgc_vnic.cfg. For each IOC connection, add a CREATE block to the file using the following format:

   {CREATE; NAME="eioc2";
      PRIMARY={IOCGUID=0x66A0130000105; INSTANCE=0; PORT=1; }
      SECONDARY={IOCGUID=0x66A013000010C; INSTANCE=0; PORT=2;}
   }

NOTE:
The qlgc_vnic.cfg file is case and format sensitive.


NOTE:
For the following sections, determine the necessary connection type. To have a host always connect to the same IOC on the same VIO card, regardless of where that card is in the fabric, use Format 1. To have a host always connect to the same IOC in the same chassis and/or slot, use Format 2.

Format 1: Defining an IOC using the IOCGUID

Use the following format to cause the host to connect to a specific VIO hardware card, regardless of which chassis and/or slot the VIO hardware card resides in:

{CREATE;
NAME="eioc1";
IOCGUID=0x66A0137FFFFE7;
}

The following is an example of VIO hardware failover:

{CREATE; NAME="eioc1";
PRIMARY={IOCGUID=0x66a01de000003; INSTANCE=1; PORT=1; }
SECONDARY={IOCGUID=0x66a02de000003; INSTANCE=1; PORT=1;}
}

NOTE:
Do not create EIOC names with similar character strings (for example, eioc3 and eioc30). Certain Linux operating systems cannot distinguish such similar names; the result is that the user will be unable to ping across the network.


Format 2: Defining an IOC using the IOCSTRING

Defining the IOC using the IOCSTRING allows VIO hardware to be hot-swapped in and out of a specific slot. The host attempts to connect to the specified IOC (1 or 2) on the VIO hardware that currently resides in the specified slot of the specified chassis. Use the following format to allow the host to connect to VIO hardware that resides in a specific slot of a specific chassis:

{CREATE;
NAME="eioc1";
IOCSTRING="Chassis 0x00066A0005000001, Slot 1, IOC 1";
RX_CSUM=TRUE;
HEARTBEAT=100; }

NOTE:
The IOCSTRING field is a literal, case-sensitive string whose syntax must exactly match the format shown in the example above, including the placement of commas. To reduce the likelihood of syntax errors, run ib_qlgc_vnic_query -es and copy the reported string. Note that the chassis serial number must match the chassis 0x (hex) value, and the slot number is likewise specific to the line card.

Each CREATE block must specify a unique NAME. The NAME represents the Ethernet interface name that will be registered with the Linux operating system.
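Before editing the file, you can harvest the values the CREATE blocks need directly from the fabric. The following is a minimal sketch; the grep patterns are illustrative and assume the query output format shown earlier in this appendix:

# List EVIC/VEx IOCs with their IOCSTRING values
ib_qlgc_vnic_query -es

# List IOC GUIDs and DGIDs (DGIDs are reported as dgid in the -e output)
ib_qlgc_vnic_query -e | grep -i -e guid -e dgid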

Format 3: Starting VNIC using DGID

NOTE:
It is not always necessary to use DGID. Using DGID is recommended for large clusters.

The following is an example of a DGID and IOCGUID VNIC configuration. This configuration allows for the quickest startup of the VNIC service:

{CREATE; NAME="eioc1";
DGID=0xfe8000000000000000066a0258000001; IOCGUID=0x66a0130000001;
}

This example uses DGID, IOCGUID, and IOCSTRING:

{CREATE; NAME="eioc1";
DGID=0xfe8000000000000000066a0258000001;
IOCGUID=0x66a0130000001;
IOCSTRING="Chassis 0x00066A00010003F2, Slot 1, IOC 1";
}


VirtualNIC Failover Definition

The VirtualNIC configuration file allows the user to define virtual NIC failover as:

1. Failover to a different adapter port on the same adapter.
2. Failover to a port on a different adapter.
3. Failover to a different Ethernet port on the same Ethernet gateway.
4. Failover to a port on a different Ethernet gateway.
5. A combination of scenarios 1 or 2 and 3 or 4.

Failover to a different Adapter port on the same Adapter

In this example, if adapter Port 1 fails (that is, the InfiniBand connection between the adapter and the InfiniBand switch is lost), traffic going over eioc1 will begin using adapter Port 2. The same Ethernet gateway port is used for both the primary and the secondary connection.

{CREATE; NAME="eioc1";
PRIMARY={ DGID=fe8000000000000000066a11de00003c;
IOCGUID=00066a01de00003c; INSTANCE=0; PORT=1; }
SECONDARY={ DGID=fe8000000000000000066a11de00003c;
IOCGUID=00066a01de00003c; INSTANCE=1; PORT=2; }
}

Failover to a different Ethernet port on the same Ethernet gateway

In this example, if the Ethernet port associated with IOCGUID 0x66a01de00003c (that is, Ethernet port 1 on gateway 03c) fails, traffic going over eioc1 will begin using Ethernet port 2 on the same Ethernet gateway.

NOTE:
In this example, the same adapter port is used for both the primary and the secondary connection.

{CREATE; NAME="eioc1";
PRIMARY={ DGID=fe8000000000000000066a11de00003c;
IOCGUID=00066a01de00003c; INSTANCE=0; PORT=1; }
SECONDARY={ DGID=fe8000000000000000066a21de00003c;
IOCGUID=00066a02de00003c; INSTANCE=0; PORT=1; }
}


Failover to a port on a different Ethernet gateway

In this example, if the Ethernet port associated with IOCGUID 0x66a02de000048 (that is, Ethernet port 2 on gateway 48) fails, traffic going over eioc1 will begin using Ethernet port 2 on gateway 03c. This type of failover allows traffic to continue even if the entire gateway card fails or is rebooted, and it requires a second gateway card in the fabric.

NOTE:
In this example, the same adapter port is used for both the primary and the secondary connection.

{CREATE; NAME="eioc1";
PRIMARY={ DGID=fe8000000000000000066a21de000048;
IOCGUID=00066a02de000048; INSTANCE=0; PORT=1; }
SECONDARY={ DGID=fe8000000000000000066a21de00003c;
IOCGUID=00066a02de00003c; INSTANCE=0; PORT=1; }
}

Combination method

In this example, if the Ethernet port, the Ethernet gateway, or the adapter port fails, traffic using eioc1 will begin using Ethernet port 2 on gateway 03c via adapter Port 1.

{CREATE; NAME="eioc1";
PRIMARY={ DGID=fe8000000000000000066a21de000048;
IOCGUID=00066a02de000048; INSTANCE=0; PORT=2; }
SECONDARY={ DGID=fe8000000000000000066a21de00003c;
IOCGUID=00066a02de00003c; INSTANCE=0; PORT=1; }
}

Creating VirtualNIC Ethernet Interface Configuration Files

For each Ethernet interface defined in the /etc/infiniband/qlgc_vnic.cfg file, create an interface configuration file:

For SuSE and SLES OS: /etc/sysconfig/network/ifcfg-NAME
For RHEL OS: /etc/sysconfig/network-scripts/ifcfg-NAME

where NAME is the value of the NAME field specified in the CREATE block.


Example of ifcfg-eiocx setup for Red Hat systems:

DEVICE=eioc1
BOOTPROTO=static
IPADDR=172.26.48.132
BROADCAST=172.26.63.255
NETMASK=255.255.240.0
NETWORK=172.26.48.0
ONBOOT=yes
TYPE=Ethernet

Example of ifcfg-eiocx setup for SuSE and SLES systems:

BOOTPROTO='static'
IPADDR='172.26.48.130'
BROADCAST='172.26.63.255'
NETMASK='255.255.240.0'
NETWORK='172.26.48.0'
STARTMODE='hotplug'
TYPE='Ethernet'

After modifying the /etc/infiniband/qlgc_vnic.cfg file, restart the VirtualNIC driver with the following command:

/etc/init.d/qlgc_vnic restart
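Once the driver has restarted, you can verify that the new interface is up and reachable. A minimal sketch; the interface name eioc1 matches the examples above, and the peer address 172.26.48.1 is illustrative:

# Confirm the interface exists and carries the configured address
/sbin/ifconfig eioc1

# Confirm connectivity to a host on the bridged Ethernet network
ping -c 3 172.26.48.1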

VirtualNIC Multicast

The primary goal of VNIC multicast is to reduce the replication of multicast packet transmission from the EVIC to InfiniBand hosts. Figure C-1 demonstrates the transmission of multicast traffic without using IB_MULTICAST:


Figure C-1. Without IB_Multicast (diagram: the EVIC sends multicast data over per-host RC connections through the IB switch to IB Host 1, IB Host 2, and IB Host 3)

Figure C-1 shows that each multicast packet is sent by the EVIC via a Reliable Connected (RC) queue pair to each and every host. When multicast applications on multiple hosts share the same multicast address, the result is a large amount of traffic, one copy per host, between the EVIC and the switch. This also creates additional workload for the EVIC.

With multicast enabled at the host using IB_MULTICAST (located in the VNIC configuration file) and also enabled at the EVIC, multicast traffic is handled as follows:


Figure C-2. With IB_Multicast (diagram: the EVIC sends a single packet to the IB_multicast group; the IB switch forwards the packet to each host as UD traffic)

Figure C-2 shows that the EVIC sends a multicast packet once to the InfiniBand switch, posting it to a specific InfiniBand multicast group. The switch then forwards the packet to all hosts that have joined that multicast group. Since the EVIC only has to post the packet once instead of once per host, there is a substantial savings. Additionally, the packets are delivered over an Unreliable Datagram (UD) queue pair, which means there is less overhead per packet, per host.

For each IOC, the EVIC creates a unique multicast group. When the VNIC driver provides multicast addresses for a virtual port (viport) to the EVIC, the EVIC returns the multicast group for the IOC used by the viport. The VNIC driver then joins the specified multicast group. When a viport connection is taken down, the host leaves the multicast group. If an EVIC has two IOCs and the VNIC driver on a host is configured to create viports using both IOCs, each viport issues a join for the multicast group of the IOC it is using.


NOTE:
When a viport uses two different IOCs for its primary and secondary paths, the host joins the multicast group for the IOC on the primary path as well as the multicast group for the IOC on the secondary path. The secondary path and its multicast group are not used until a failover occurs.

The create multicast group request issued by the EVIC, along with the join and leave multicast group requests issued by the host, are handled by the subnet manager. For details, refer to the Fabric Manager Users Guide.

On the EVIC, IB_MULTICAST can be enabled or disabled manually using the CLI. For details, refer to the Ethernet section of the Hardware CLI Reference Guide. A reboot of the EVIC is required after IB_MULTICAST has been enabled or disabled.

On the host, IB_MULTICAST is enabled by default. The user must add IB_MULTICAST=FALSE to the VNIC configuration file to disable the feature. To view the current status of the feature, run the following, where <interface> is the VNIC interface name:

cat /sys/class/infiniband_qlgc_vnic/interfaces/<interface>/*_path/multicast_state

For each path there is a primary_path and a secondary_path directory, and the multicast_state file contains one of the following:

feature not enabled - the feature is disabled at the host, at the EVIC, or at both ends
state=Joined & Attached MGID:<mgid> MLID:<mlid>

To disable IB_MULTICAST at the host, edit the /etc/infiniband/qlgc_vnic.cfg file to add IB_MULTICAST=FALSE for the viport inside both PRIMARY and SECONDARY.
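For example, a CREATE block with the feature disabled on both paths might look like the following. This is a sketch only: the GUIDs are reused from the failover examples earlier in this appendix, and the placement of IB_MULTICAST inside each block follows the instruction above rather than a verified sample:

{CREATE; NAME="eioc1";
PRIMARY={ IOCGUID=00066a01de00003c; INSTANCE=0; PORT=1; IB_MULTICAST=FALSE; }
SECONDARY={ IOCGUID=00066a02de00003c; INSTANCE=0; PORT=1; IB_MULTICAST=FALSE; }
}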

Starting, Stopping and Restarting the VirtualNIC Driver

Once you have created a configuration file, you can start the VNIC driver and create the VNIC interfaces specified in the configuration file.

NOTE:
Ensure you are logged in as a root user to start, stop, or restart the VNIC driver.


To start the qlgc_vnic driver and the QLogic VNIC interfaces, use the following command:

/etc/init.d/qlgc_vnic start

To stop the qlgc_vnic driver and bring down the VNIC interfaces, use the following command:

/etc/init.d/qlgc_vnic stop

To restart the qlgc_vnic driver, use the following command:

/etc/init.d/qlgc_vnic restart

If you have not started the InfiniBand network stack (QLogic OFED+ or OFED), then running the /etc/init.d/qlgc_vnic start command also starts the InfiniBand network stack, since the QLogic VNIC service requires the InfiniBand stack.

If you start the InfiniBand network stack separately, the correct starting order is:

• Start the InfiniBand stack.
• Start the QLogic VNIC service.

For example, if you use QLogic OFED+ Host Software, the correct starting order is:

/etc/init.d/openibd start
/etc/init.d/qlgc_vnic start

If you want to restart the QLogic VNIC interfaces, run the following command:

/etc/init.d/qlgc_vnic restart

You can get information about the QLogic VNIC interfaces by using the following script (as a root user):

ib_qlgc_vnic_info

This information is collected from the /sys/class/infiniband_qlgc_vnic/interfaces/ directory, where there is a separate directory corresponding to each VNIC interface.

VNIC interfaces can be deleted by writing the name of the interface to the /sys/class/infiniband_qlgc_vnic/interfaces/delete_vnic file. For example, to delete interface veth0, run the following command (as a root user):

echo -n veth0 > /sys/class/infiniband_qlgc_vnic/interfaces/delete_vnic
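To see which VNIC interfaces currently exist before deleting one, the same sysfs directory can be listed directly (a minimal sketch):

# Each subdirectory corresponds to one active VNIC interface
ls /sys/class/infiniband_qlgc_vnic/interfaces/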


Configuring Link Aggregation

To configure link aggregation for all ports of a VIO hardware card, do the following:

1. Modify the chassis GUID and slot number.
2. Edit the /etc/infiniband/qlgc_vnic.cfg file with information similar to the following:

{CREATE; NAME="eioc1";
PRIMARY={IOCSTRING="Chassis 0x00066A0050000018, Slot 2, IOC 1"; }
SECONDARY={IOCSTRING="Chassis 0x00066A0050000018, Slot 2, IOC 2"; }
}

3. Create an ifcfg-eioc1 file in the /etc/sysconfig/network-scripts directory.
4. Physically connect two or three ports of a VEx to an Ethernet switch.
5. Using the management interface for the Ethernet switch, configure the switch to aggregate the ports that connect to the VEx.
6. Restart the qlgc_vnic module with the following command:

/etc/init.d/qlgc_vnic restart

Troubleshooting

Refer to Appendix G for information about troubleshooting.

VirtualNIC Configuration Variables

NAME
The device name for the interface.

IOCGUID
Defines the port controller GUID used to identify the specific EVIC leaf and port used by the VNIC. All of the IOC GUIDs detected on the fabric can be found by typing ib_qlgc_vnic_query -e and noting the values of the reported IOC GUIDs. This can be used instead of IOCSTRING.

IOCSTRING
Defines the IOC Profile ID String of the IOC to be used. All of the IOC Profile ID Strings detected on the fabric can be found by typing ib_qlgc_vnic_query -s and noting the value of the reported string. This can be used instead of IOCGUID.


DGID
Defines the destination global identifier (DGID) of the IOC to be used. This parameter should be used in all CREATE blocks. The ib_qlgc_vnic_query command shows all the DGIDs that the host can see; the DGID parameter is identified as dgid in the output of the ib_qlgc_vnic_query -e command.

INSTANCE
Defaults to 0. The range is 0-255. If a host connects to the same IOC more than once, each connection must be assigned a unique instance.

RX_CSUM
Defaults to TRUE. When TRUE, RX_CSUM indicates that the receive checksum should be verified by the VIO hardware.

HEARTBEAT
Defaults to 100. Specifies the time, in hundredths of a second, between the heartbeats that occur between a host and an EVIC.

PORT
Alternative specification for a local adapter port. The first port is 1.

NOTE:
If there is only one adapter in the system, use PORT/HCA to specify the adapter port to be used. If there is more than one adapter in the system, it is recommended to use PORTGUID to specify the adapter port to be used.

PORTGUID
The port GUID of the InfiniBand port to be used. Using the PORTGUID parameter to configure the VNIC interface has an advantage on hosts with more than one Host Channel Adapter: PORTGUID is persistent for a given InfiniBand port, so VNIC configurations remain consistent and reliable, unaffected by restarts of the OFED InfiniBand stack on hosts with more than one adapter.

HCA
An optional Host Channel Adapter specification for use with the PORT specification. The first adapter found by the system upon boot is HCA 0. If there is only one adapter in the system, the default value of this parameter (0) can be used, so HCA does not need to be specified in the qlgc_vnic.cfg file. If there is more than one adapter in the system, QLogic recommends using the PORTGUID parameter to specify an adapter port (instead of using HCA/PORT), since there is no guarantee that a system booting up always finds the same adapter first.
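For instance, on a multi-adapter host a CREATE block could pin the VNIC to a port by its GUID rather than by HCA/PORT index. A sketch only; the port GUID is taken from the query output at the beginning of this appendix, and the placement of PORTGUID inside the PRIMARY block is an assumption based on how PORT is used in the examples above:

{CREATE; NAME="eioc1";
PRIMARY={ IOCGUID=00066a01de00003c; INSTANCE=0; PORTGUID=0x0002c903000010fa; }
}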


IB_MULTICAST
Defaults to TRUE. The InfiniBand multicast parameter can be set to either TRUE or FALSE. When the parameter is set to TRUE, it takes effect only if the corresponding CLI command (ethVirtVnic2Mcastset) has been executed on the EVIC. This command is supported on EVIC firmware 4.3 and above.



D
SRP Configuration

SRP Configuration Overview

SRP stands for SCSI RDMA Protocol. It allows the SCSI protocol to run over InfiniBand for storage area network (SAN) usage. SRP interfaces directly to the Linux file system through the SRP Upper Layer Protocol (ULP), so SRP storage can be treated as just another device.

In this release, two versions of SRP are available: QLogic SRP and OFED SRP. QLogic SRP is available as part of the QLogic OFED+ Host Software, QLogic InfiniBand Fabric Suite, Rocks Roll, and Platform PCM downloads.

SRP has been tested on targets from DataDirect Networks and Engenio (now LSI Logic).

NOTE:
Before using SRP, the SRP targets must already be set up by your system administrator.

Important Concepts

• An SRP Initiator Port is an adapter port through which the host communicates with an SRP target device (e.g., a Fibre Channel disk array) via an SRP target port.

• An SRP Target Port is an IOC of the VIO hardware. In the context of VIO hardware, an IOC can be thought of as an SRP target. An FVIC contains two IOCs: IOC 1 maps to the first adapter on the FVIC, and IOC 2 maps to the second adapter on the FVIC. An FCBM also has two IOCs: IOC 1 maps to port 1 of the FCBM adapter, and IOC 2 maps to port 2 of the FCBM adapter.

• A Fibre Channel Target Device is a device containing storage resources that is located remotely from a Fibre Channel host. In the context of SRP/VIO hardware, this is typically an array of disks connected via Fibre Channel to the VIO hardware.


• An SRP Initiator Extension is a 64-bit numeric value that is appended to the port GUID of the SRP initiator port, allowing an SRP initiator port to have multiple SRP maps associated with it. Maps are for the FVIC only; InfiniBand-attached storage uses its own mechanism, as maps are not necessary.

• An SRP Initiator is the combination of an SRP initiator port and an SRP initiator extension.

• An SRP Target is identified by the combination of an SRP target IOC and an SRP target extension.

• An SRP Session defines a connection between an SRP initiator and an SRP target.

• An SRP Map associates an SRP session with a Fibre Channel target device. This mapping is configured on the VIO hardware. Maps are for the FVIC only; InfiniBand-attached storage uses its own mechanism, as maps are not necessary.

NOTE:
• If a device connected to a map is changed, the SRP driver must be restarted.
• If the connected device is unreachable for a period of time, the Linux kernel may set the device offline. If this occurs, the SRP driver must be restarted.

• An SRP Adapter is a collection of SRP sessions. This collection is presented to the Linux kernel as if those sessions were all from a single adapter. All sessions configured for an adapter must ultimately connect to the same target device.

NOTE:
The SRP driver must be stopped before OFED (i.e., openibd) is stopped or restarted. This is because SRP holds references on OFED modules, and the Linux kernel will not allow those OFED modules to be unloaded.

QLogic SRP Configuration

QLogic SRP is installed as part of the QLogic OFED+ Host Software or the QLogic InfiniBand Fabric Suite. The following sections provide procedures to set up and configure QLogic SRP.


Stopping, Starting and Restarting the SRP Driver

To stop the qlgc_srp driver, use the following command:

/etc/init.d/qlgc_srp stop

To start the qlgc_srp driver, use the following command:

/etc/init.d/qlgc_srp start

To restart the qlgc_srp driver, use the following command:

/etc/init.d/qlgc_srp restart

Specifying a Session

In the SRP configuration file, a session command is a block of configuration commands surrounded by begin and end statements. Sessions can be specified in several different ways, but all consist of specifying an SRP initiator and an SRP target port. For example:

session
begin
card: 0
port: 1
targetIOCGuid: 0x00066AXXXXXXXXXX
initiatorExtension: 2
end

The session command has two parts: the part that specifies the SRP initiator and the part that specifies the SRP target port. The SRP initiator itself contains two parts, the SRP initiator port and the SRP initiator extension. The SRP initiator extension is optional and defaults to a value of 1; however, if an SRP initiator extension is not specified, each port on the adapter can use only one SRP map per VIO device. In addition, a targetExtension can be specified (the default is 1).
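For example, to open two sessions from the same adapter port to the same target IOC, each with its own SRP map, only the initiator extensions need to differ. A sketch, using the same placeholder IOC GUID as above:

session
begin
card: 0
port: 1
targetIOCGuid: 0x00066AXXXXXXXXXX
initiatorExtension: 1
end

session
begin
card: 0
port: 1
targetIOCGuid: 0x00066AXXXXXXXXXX
initiatorExtension: 2
end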

The SRP initiator port may be specified in two different ways:

1. By using the port GUID of the adapter port used for the connection, or
2. By specifying the index of the adapter card being used (this is zero-based, so if there is only one adapter card in the system, use a value of 0) and the index of the port number (1 or 2) of the adapter card being used.

The SRP target port may be specified in two different ways:


1. By the port GUID of the IOC, or
2. By the IOC profile string that is created by the VIO device (i.e., a string containing the chassis GUID, the slot number, and the IOC number). The FVIC creates the device in this manner; other devices have their own naming methods.

To specify the host InfiniBand port to use, the user can either specify the port GUID of the local InfiniBand port, or simply use the index numbers of the cards and the ports on the cards. Cards are numbered from 0 on up, based on the order in which they occur on the PCI bus. Ports are numbered in the same way, from first to last. To see which cards and ports are available for use, type the following command:

ib_qlgc_srp_query

The system returns output similar to the following:


st187:~/qlgc-srp-1_3_0_0_1 # ib_qlgc_srp_query
QLogic Corporation. Virtual HBA (SRP) SCSI Query Application, version 1.3.0.0.1
1 IB Host Channel Adapter present in system.
HCA Card 0 : 0x0002c9020026041c
Port 1 GUID : 0x0002c9020026041d
Port 2 GUID : 0x0002c9020026041e
SRP Targets :
SRP IOC Profile : FVIC in Chassis 0x00066a000300012a, Slot 17, Ioc 1
SRP IOC GUID : 0x00066a01dd000021
SRP IU SIZE : 320
SRP IU SG SIZE: 15
SRP IO CLASS : 0xff00
service 0 : name SRP.T10:0000000000000001 id 0x0000494353535250
service 1 : name SRP.T10:0000000000000002 id 0x0000494353535250
service 2 : name SRP.T10:0000000000000003 id 0x0000494353535250
service 3 : name SRP.T10:0000000000000004 id 0x0000494353535250
Target Path(s):
HCA 0 Port 1 0x0002c9020026041d -> Target Port GID 0xfe8000000000000000066a11dd000021
HCA 0 Port 2 0x0002c9020026041e -> Target Port GID 0xfe8000000000000000066a11dd000021
SRP IOC Profile : FVIC in Chassis 0x00066a000300012a, Slot 17, Ioc 2
SRP IOC GUID : 0x00066a02dd000021
SRP IU SIZE : 320
SRP IU SG SIZE: 15
SRP IO CLASS : 0xff00
service 0 : name SRP.T10:0000000000000001 id 0x0000494353535250
service 1 : name SRP.T10:0000000000000002 id 0x0000494353535250
service 2 : name SRP.T10:0000000000000003 id 0x0000494353535250
service 3 : name SRP.T10:0000000000000004 id 0x0000494353535250
Target Path(s):
HCA 0 Port 1 0x0002c9020026041d -> Target Port GID 0xfe8000000000000000066a21dd000021
HCA 0 Port 2 0x0002c9020026041e -> Target Port GID 0xfe8000000000000000066a21dd000021
SRP IOC Profile : Chassis 0x00066A0050000135, Slot 5, IOC 1
SRP IOC GUID : 0x00066a013800016c
SRP IU SIZE : 320
SRP IU SG SIZE: 15
SRP IO CLASS : 0xff00
service 0 : name SRP.T10:0000000000000001 id 0x0000494353535250
service 1 : name SRP.T10:0000000000000002 id 0x0000494353535250
service 2 : name SRP.T10:0000000000000003 id 0x0000494353535250
service 3 : name SRP.T10:0000000000000004 id 0x0000494353535250
Target Path(s):
HCA 0 Port 1 0x0002c9020026041d -> Target Port GID 0xfe8000000000000000066a026000016c
HCA 0 Port 2 0x0002c9020026041e -> Target Port GID 0xfe8000000000000000066a026000016c


SRP IOC Profile : Chassis 0x00066A0050000135, Slot 5, IOC 2
SRP IOC GUID : 0x00066a023800016c
SRP IU SIZE : 320
SRP IU SG SIZE: 15
SRP IO CLASS : 0xff00
service 0 : name SRP.T10:0000000000000001 id 0x0000494353535250
service 1 : name SRP.T10:0000000000000002 id 0x0000494353535250
service 2 : name SRP.T10:0000000000000003 id 0x0000494353535250
service 3 : name SRP.T10:0000000000000004 id 0x0000494353535250
Target Path(s):
HCA 0 Port 1 0x0002c9020026041d -> Target Port GID 0xfe8000000000000000066a026000016c
HCA 0 Port 2 0x0002c9020026041e -> Target Port GID 0xfe8000000000000000066a026000016c
SRP IOC Profile : Chassis 0x00066A0050000135, Slot 8, IOC 1
SRP IOC GUID : 0x00066a0138000174
SRP IU SIZE : 320
SRP IU SG SIZE: 15
SRP IO CLASS : 0xff00
service 0 : name SRP.T10:0000000000000001 id 0x0000494353535250
Target Path(s):
HCA 0 Port 1 0x0002c9020026041d -> Target Port GID 0xfe8000000000000000066a0260000174
HCA 0 Port 2 0x0002c9020026041e -> Target Port GID 0xfe8000000000000000066a0260000174
SRP IOC Profile : Chassis 0x00066A0050000135, Slot 8, IOC 2
SRP IOC GUID : 0x00066a0238000174
SRP IU SIZE : 320
SRP IU SG SIZE: 15
SRP IO CLASS : 0xff00
service 0 : name SRP.T10:0000000000000001 id 0x0000494353535250
Target Path(s):
HCA 0 Port 1 0x0002c9020026041d -> Target Port GID 0xfe8000000000000000066a0260000174
HCA 0 Port 2 0x0002c9020026041e -> Target Port GID 0xfe8000000000000000066a0260000174
st187:~/qlgc-srp-1_3_0_0_1 #

Determining the values to use for the configuration

To build the configuration file, use the ib_qlgc_srp_build_cfg script. Enter:

ib_qlgc_srp_build_cfg

The system provides output similar to the following:


# qlgc_srp.cfg file generated by /usr/sbin/ib_qlgc_srp_build_cfg, version 1.3.0.0.17, on Mon Aug 25 13:42:16 EDT 2008
#Found QLogic OFED SRP
registerAdaptersInOrder: ON
# =============================================================
# IOC Name: BC2FC in Chassis 0x0000000000000000, Slot 6, Ioc 1
# IOC GUID: 0x00066a01e0000149 SRP IU SIZE : 320
# service 0 : name SRP.T10:0000000000000001 id 0x0000494353535250
session
begin
card: 0
port: 1
#portGuid: 0x0002c9030000110d
initiatorExtension: 1
targetIOCGuid: 0x00066a01e0000149
targetIOCProfileIdString: "BC2FC in Chassis 0x0000000000000000, Slot 6, Ioc 1"
targetPortGid: 0xfe8000000000000000066a01e0000149
targetExtension: 0x0000000000000001
SID: 0x0000494353535250
IOClass: 0xff00
end
adapter
begin
adapterIODepth: 1000
lunIODepth: 16
adapterMaxIO: 128
adapterMaxLUNs: 512
adapterNoConnectTimeout: 60
adapterDeviceRequestTimeout: 2
# set to 1 if you want round robin load balancing
roundrobinmode: 0
# set to 1 if you do not want target connectivity verification
noverify: 0
description: "SRP Virtual HBA 0"
end

The ib_qlgc_srp_build_cfg command creates a configuration file based on discovered target devices. By default, the information is sent to stdout; to create a configuration file, redirect the output to a disk file. Enter ib_qlgc_srp_build_cfg -h for a list and description of the option flags.
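For example, the generated configuration can be captured directly into the file the driver reads (a sketch; back up any existing configuration first):

# Write the discovered sessions and adapter defaults to the SRP configuration file
ib_qlgc_srp_build_cfg > /etc/sysconfig/qlgc_srp.cfg

# Restart the driver so the new configuration takes effect
/etc/init.d/qlgc_srp restart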


NOTE:
The default configuration generated by ib_qlgc_srp_build_cfg for OFED is similar to the one generated for the QuickSilver host stack, with the following differences:

• For OFED, the configuration automatically includes IOClass.
• For OFED, the configuration automatically includes SID.
• For OFED, the configuration provides information on targetPortGid instead of targetPortGuid.
• For OFED, the configuration automatically includes targetIOCProfileIdString.

Specifying an SRP Initiator Port of a Session by Card and Port Indexes

The following example specifies a session by card and port indexes. If the system contains only one adapter, use this method.

session
begin
#Specifies the near side by card index
card: 0 #Specifies first HCA
port: 1 #Specifies first port
targetIOCGuid: 0x00066A013800016C
end

Specifying an SRP Initiator Port of a Session by Port GUID

The following example specifies a session by port GUID. If the system contains more than one adapter, use this method.

session
begin
portGuid: 0x00066A00a00001a2 #Specifies port by its GUID
targetIOCGuid: 0x00066A013800016C
end

NOTE:
When using this method, if the port GUIDs are changed, they must also be changed in the configuration file.
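The port GUIDs for each local adapter appear in the ib_qlgc_srp_query output shown earlier. A quick way to pull just those lines (a sketch; the pattern assumes the output format shown above):

ib_qlgc_srp_query | grep "Port . GUID"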


Specifying an SRP Target Port

The SRP target can be specified in two different ways. To connect to a particular SRP target no matter where it is in the fabric, use the first method (by IOCGUID). To connect to an SRP target that is in a certain chassis/slot, no matter which card it is on (for the FVIC, this means the configuration does not have to change if cards are switched in a slot), use the second method.

1. By IOCGUID. For example:

targetIOCGuid: 0x00066A013800016c

2. By target IOC Profile String. For example:

targetIOCProfileIdString: "FVIC in Chassis 0x00066A005000010E, Slot 1, IOC 1"

NOTE:
The targetIOCProfileIdString is case- and format-sensitive. The easiest way to get the correct format is to cut and paste it from the output of the /usr/sbin/ib_qlgc_srp_query program.
program.<br />

NOTE:
For the FVIC, specifying the SRP target port by IOCGUID ensures that the session will always be mapped to the specific port on the specific VIO hardware card, even if the card is moved to a different slot in the same chassis or to a different chassis.

NOTE:
For the FVIC, specifying the SRP target port by Profile String ensures that the session will always be mapped to the VIO hardware card in the specific slot of a chassis, even if the VIO hardware card currently in that slot is replaced by a different VIO hardware card.


Specifying an SRP Target Port of a Session by IOCGUID

The following example specifies a target by IOC GUID:

session
begin
card: 0
port: 1
targetIOCGuid: 0x00066A013800016c #IOC GUID of the InfiniFibre port
end


Specifying an SRP Target Port of a Session by Profile String

The following example specifies a target by Profile String:

session
begin
card: 0
port: 1
# FVIC in Chassis 0x00066A005000010E,
# Slot number 1, port 1
targetIOCProfileIdString: "FVIC in Chassis 0x00066A005000010E, Slot 1, IOC 1"
end

Specifying an Adapter

An adapter is a collection of sessions. This collection is presented to the Linux kernel as if it were a single Fibre Channel adapter. The host system has no information regarding session connectivity; it only sees the end target Fibre Channel devices. The adapter section of the qlgc_srp configuration file contains multiple parameters, which are listed in the adapter section of the ib_qlgc_srp_build_cfg script output shown in "Determining the values to use for the configuration" on page D-6. The following example specifies an adapter:

adapter
begin
description: "Oracle RAID Array"
end

Restarting the SRP Module

For changes to take effect, including changes to the SRP map on the VIO card, SRP must be restarted. To restart the qlgc_srp driver, use the following command:

/etc/init.d/qlgc_srp restart


Configuring an Adapter with Multiple Sessions

Each adapter can have an unlimited number of sessions attached to it. Unless round robin is specified, SRP only uses one session at a time. However, there is still an advantage to configuring an adapter with multiple sessions. For example, if an adapter is configured with only one session and that session fails, all SCSI I/Os on that session will fail and access to the SCSI target devices will be lost. While the qlgc_srp module will attempt to recover the broken session, this may take some time (e.g., if a cable was pulled, the FC port has failed, or an adapter has failed). However, if the host is using an adapter configured with multiple sessions and the current session fails, the host automatically switches to an alternate session. The result is that the host can quickly recover and continue to access the SCSI target devices.

WARNING!!
When using two VIO hardware cards within one adapter, the cards must have identical Fibre Channel configurations and maps. Data corruption can result from using different configurations and/or maps.


When the qlgc_srp module encounters an adapter command, that adapter is assigned all previously defined sessions that have not been assigned to other adapters. This makes it easy to configure a system for multiple SRP adapters. The following is an example configuration that uses multiple sessions and adapters:

session
begin
card: 0
port: 2
targetIOCProfileIdString: "FVIC in Chassis 0x00066A005000011D, Slot 1, IOC 1"
initiatorExtension: 3
end

adapter
begin
description: "Test Device"
end

session
begin
card: 0
port: 1
targetIOCProfileIdString: "FVIC in Chassis 0x00066A005000011D, Slot 2, IOC 1"
initiatorExtension: 2
end

adapter
begin
description: "Test Device 1"
end


session
begin
card: 0
port: 1
targetIOCProfileIdString: "FVIC in Chassis 0x00066A005000011D, Slot 1, IOC 2"
initiatorExtension: 2
end

adapter
begin
description: "Test Device 1"
end

Configuring Fibre Channel Failover

Fibre Channel failover is essentially failing over from one session in an adapter to another session in the same adapter. The following is a list of the different failover scenarios:

• Failing over from one SRP initiator port to another.
• Failing over from a port on a VIO hardware card to another port on the same VIO hardware card.
• Failing over from a port on a VIO hardware card to a port on a different VIO hardware card within the same virtual I/O chassis.
• Failing over from a port on a VIO hardware card to a port on a different VIO hardware card in a different virtual I/O chassis.


Failover Configuration File 1: Failing over from one SRP Initiator port to another

In this failover configuration file, the first session (using adapter Port 1) is used to reach the SRP target port. If a problem is detected in this session (e.g., the InfiniBand cable on Port 1 of the adapter is pulled), the second session (using adapter Port 2) is used.

# service 0: name SRP.T10:0000000000000001 id 0x0000494353535250
session
begin
card: 0
port: 1
#portGuid: 0x0002c903000010f1
initiatorExtension: 1
targetIOCGuid: 0x00066a01e0000149
targetIOCProfileIdString: "BC2FC in Chassis 0x0000000000000000, Slot 6, Ioc 1"
targetPortGid: 0xfe8000000000000000066a01e0000149
targetExtension: 0x0000000000000001
SID: 0x0000494353535250
IOClass: 0xff00
end

session
begin
card: 0
port: 2
#portGuid: 0x0002c903000010f2
initiatorExtension: 1
targetIOCGuid: 0x00066a01e0000149
targetIOCProfileIdString: "BC2FC in Chassis 0x0000000000000000, Slot 6, Ioc 1"
targetPortGid: 0xfe8000000000000000066a01e0000149
targetExtension: 0x0000000000000001
SID: 0x0000494353535250
IOClass: 0xff00
end

adapter
begin
adapterIODepth: 1000
lunIODepth: 16
adapterMaxIO: 128
adapterMaxLUNs: 512
adapterNoConnectTimeout: 60


adapterDeviceRequestTimeout: 2
# set to 1 if you want round robin load balancing
roundrobinmode: 0
# set to 1 if you do not want target connectivity verification
noverify: 0
description: "SRP Virtual HBA 0"
end

Failover Configuration File 2: Failing over from a port on a VIO hardware card to another port on the same VIO hardware card

session
begin
card: 0 (InfiniServ HCA card number)
port: 1 (InfiniServ HCA port number)
targetIOCProfileIdString: "FVIC in Chassis <chassis GUID>, Slot <slot number>, IOC <IOC number>"
initiatorExtension: 1
end

session
begin
card: 0 (InfiniServ HCA card number)
port: 1 (InfiniServ HCA port number)
targetIOCProfileIdString: "FVIC in Chassis <chassis GUID>, Slot <slot number>, IOC <IOC number>"
initiatorExtension: 1 (Here the extension should be different if using the same IOC in this adapter for the FVIC, so that separate maps can be created for each session.)
end

adapter
begin
description: "FC port Failover"
end

On the VIO hardware side, ensure the following:

• The target device is discovered and configured for each of the ports involved in the failover.
• The SRP initiator is discovered and configured once for each different initiatorExtension.
• Each map uses a different configured device (e.g., Configured Device 1 has the target discovered over FC Port 1, and Configured Device 2 has the target discovered over FC Port 2).


Failover Configuration File 3: Failing over from a port on a VIO hardware card to a port on a different VIO hardware card within the same Virtual I/O chassis

session
begin
card: 0 (InfiniServ HCA card number)
port: 1 (InfiniServ HCA port number)
targetIOCProfileIdString: "FVIC in Chassis <chassis GUID>, Slot <slot number>, IOC <IOC number>"
initiatorExtension: 1
end

session
begin
card: 0 (InfiniServ HCA card number)
port: 1 (InfiniServ HCA port number)
targetIOCProfileIdString: "FVIC in Chassis <chassis GUID>, Slot <slot number>, IOC <IOC number>" (Slot number differs to indicate a different VIO card)
initiatorExtension: 1 (Here the initiator extension can be the same as in the previous definition, because the SRP map is being defined on a different FC gateway card)
end

adapter
begin
description: "FC Port Failover"
end

On the VIO hardware side, ensure the following on each FVIC involved in the failover:

• The target device is discovered and configured through the appropriate FC port.
• The SRP initiator is discovered and configured once for the proper initiatorExtension.
• The SRP map created for the initiator connects to the same target.


Failover Configuration File 4: Failing over from a port on a VIO hardware card to a port on a different VIO hardware card in a different Virtual I/O chassis

session
begin
card: 0 (InfiniServ HCA card number)
port: 1 (InfiniServ HCA port number)
targetIOCProfileIdString: "FVIC in Chassis <chassis GUID>, Slot <slot number>, IOC <IOC number>"
initiatorExtension: 1
end

session
begin
card: 0 (InfiniServ HCA card number)
port: 1 (InfiniServ HCA port number)
targetIOCProfileIdString: "FVIC in Chassis <chassis GUID>, Slot <slot number>, IOC <IOC number>" (Chassis GUID differs to indicate a card in a different chassis)
initiatorExtension: 1 (Here the initiator extension can be the same as in the previous definition, because the SRP map is being defined on a different FC gateway card)
end

adapter
begin
description: "FC Port Failover"
end

On the VIO hardware side, ensure the following on each FVIC involved in the failover:

• The target device is discovered and configured through the appropriate FC port.
• The SRP initiator is discovered and configured once for the proper initiatorExtension.
• The SRP map created for the initiator connects to the same target.


Configuring Fibre Channel Load Balancing

The following examples display typical scenarios for configuring Fibre Channel load balancing.

In the first example, traffic going to any Fibre Channel target device for which both ports of the VIO hardware card have a valid map is split between the two ports of the VIO hardware card. If one of the VIO hardware ports goes down, all of the traffic will go over the remaining port.

1 Adapter Port and 2 Ports on a Single VIO

session
begin
card: 0
port: 1
targetIOCProfileIdString: "FVIC in Chassis 0x00066A0050000123, Slot 1, IOC 1"
initiatorExtension: 3
end

session
begin
card: 0
port: 1
targetIOCProfileIdString: "FVIC in Chassis 0x00066A0050000123, Slot 1, IOC 2"
initiatorExtension: 3
end

adapter
begin
description: "Test Device"
roundrobinmode: 1
end


2 Adapter Ports and 2 Ports on a Single VIO Module

In this example, traffic is load balanced between adapter Port 2/VIO hardware Port 1 and adapter Port 1/VIO hardware Port 2. If one of the sessions goes down (due to an InfiniBand cable failure or an FC cable failure), all traffic will begin using the other session.

session
begin
card: 0
port: 2
targetIOCProfileIdString: "FVIC in Chassis 0x00066A0050000123, Slot 1, IOC 1"
initiatorExtension: 3
end

session
begin
card: 0
port: 1
targetIOCProfileIdString: "FVIC in Chassis 0x00066A0050000123, Slot 1, IOC 2"
initiatorExtension: 3
end

adapter
begin
description: "Test Device"
roundrobinmode: 1
end


Using the roundrobinmode Parameter

In this example, the two sessions use different VIO hardware cards as well as different adapter ports. Traffic is load-balanced between the two sessions. If there is a failure in one of the sessions (e.g., one of the VIO hardware cards is rebooted), traffic will begin using the other session.

session
begin
card: 0
port: 2
targetIOCProfileIdString: "FVIC in Chassis 0x00066A005000011D, Slot 1, IOC 1"
initiatorExtension: 2
end

session
begin
card: 0
port: 1
targetIOCProfileIdString: "FVIC in Chassis 0x00066A005000011D, Slot 2, IOC 1"
initiatorExtension: 2
end

adapter
begin
description: "Test Device"
roundrobinmode: 1
end


Configuring SRP for Native InfiniBand Storage

1. Review the output of ib_qlgc_srp_query:

QLogic Corporation. Virtual HBA (SRP) SCSI Query Application, version 1.3.0.0.1
1 IB Host Channel Adapter present in system.
HCA Card 0 : 0x0002c9020026041c
Port 1 GUID : 0x0002c9020026041d
Port 2 GUID : 0x0002c9020026041e
SRP Targets :
SRP IOC Profile : Native IB Storage SRP Driver
SRP IOC GUID : 0x00066a01dd000021
SRP IU SIZE : 320
SRP IU SG SIZE: 15
SRP IO CLASS : 0xff00
service 0 : name SRP.T10:0000000000000001 id 0x0000494353535250
service 1 : name SRP.T10:0000000000000002 id 0x0000494353535250
service 2 : name SRP.T10:0000000000000003 id 0x0000494353535250
service 3 : name SRP.T10:0000000000000004 id 0x0000494353535250
Target Path(s):
HCA 0 Port 1 0x0002c9020026041d -> Target Port GID 0xfe8000000000000000066a11dd000021
HCA 0 Port 2 0x0002c9020026041e -> Target Port GID 0xfe8000000000000000066a11dd000021

2. Edit /etc/sysconfig/qlgc_srp.cfg to add this information:

# service : name SRP.T10:0000000000000001 id 0x0000494353535250
session
begin
card: 0
port: 1
#portGuid: 0x0002c903000010f1
initiatorExtension: 1
targetIOCGuid: 0x00066a01e0000149


targetIOCProfileIdString: "Native IB Storage SRP Driver"
targetPortGid: 0xfe8000000000000000066a01e0000149
targetExtension: 0x0000000000000001
SID: 0x0000494353535250
IOClass: 0x0100
end

session
begin
card: 0
port: 2
#portGuid: 0x0002c903000010f2
initiatorExtension: 1
targetIOCGuid: 0x00066a01e0000149
targetIOCProfileIdString: "Native IB Storage SRP Driver"
targetPortGid: 0xfe8000000000000000066a01e0000149
targetExtension: 0x0000000000000001
SID: 0x0000494353535250
IOClass: 0x0100
end

adapter
begin
adapterIODepth: 1000
lunIODepth: 16
adapterMaxIO: 128
adapterMaxLUNs: 512
adapterNoConnectTimeout: 60
adapterDeviceRequestTimeout: 2
# set to 1 if you want round robin load balancing
roundrobinmode: 0
# set to 1 if you do not want target connectivity verification
noverify: 0
description: "SRP Virtual HBA 0"
end


3. Note the correlation between the output of ib_qlgc_srp_query and qlgc_srp.cfg:

Target Path(s):
HCA 0 Port 1 0x0002c9020026041d -> Target Port GID 0xfe8000000000000000066a11dd000021
HCA 0 Port 2 0x0002c9020026041e -> Target Port GID 0xfe8000000000000000066a11dd000021

qlgc_srp.cfg:

session
begin
. . . .
targetIOCGuid: 0x0002C90200400098
targetExtension: 0x0002C90200400098
end

adapter
begin
description: "Native IB storage"
end

Notes

• There is a sample configuration in qlgc_srp.cfg.
  - The correct targetExtension must be added to the session.
  - It is important to use the IOC GUID method, since most Profile ID strings are not guaranteed to be unique.
• Other possible parameters:
  - initiatorExtension may be used by the storage device to identify the host.

Additional Details

• All LUNs found are reported to the Linux SCSI mid-layer.
• Linux may need the max_scsi_luns (2.4 kernels) or max_luns (2.6 kernels) parameter configured in scsi_mod.
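On a 2.6 kernel, for example, the limit could be raised through the module options (a sketch; the value 512 matches the adapterMaxLUNs setting used in the examples above, and the options file location varies by distribution):

# /etc/modprobe.conf (or a file under /etc/modprobe.d/)
options scsi_mod max_luns=512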

Troubleshooting

For troubleshooting information, refer to "Troubleshooting SRP Issues" on page G-8.


OFED SRP Configuration

To use OFED SRP, follow these steps:

1. Add the line SRP_LOAD=yes to the module list in /etc/infiniband/openib.conf to have the SRP module automatically loaded.

2. Discover the SRP devices on your fabric by running this command (as a root user):

ibsrpdm

In the output, look for lines similar to these:

GUID: 0002c90200402c04
ID: LSI Storage Systems SRP Driver 200400a0b8114527
service entries: 1
service[ 0]: 200400a0b8114527 / SRP.T10:200400A0B8114527

GUID: 0002c90200402c0c
ID: LSI Storage Systems SRP Driver 200500a0b8114527
service entries: 1
service[ 0]: 200500a0b8114527 / SRP.T10:200500A0B8114527

GUID: 21000001ff040bf6
ID: Data Direct Networks SRP Target System
service entries: 1
service[ 0]: f60b04ff01000021 / SRP.T10:21000001ff040bf6

Note that not all of the output is shown here; the key elements are the values matched in Step 3.

3. Choose the device you want to use, and run the command again with the -c option (as a root user):

# ibsrpdm -c
id_ext=200400A0B8114527,ioc_guid=0002c90200402c04,dgid=fe800000000000000002c90200402c05,pkey=ffff,service_id=200400a0b8114527
id_ext=200500A0B8114527,ioc_guid=0002c90200402c0c,dgid=fe800000000000000002c90200402c0d,pkey=ffff,service_id=200500a0b8114527
id_ext=21000001ff040bf6,ioc_guid=21000001ff040bf6,dgid=fe8000000000000021000001ff040bf6,pkey=ffff,service_id=f60b04ff01000021


4. Find the result that corresponds to the target you want, and echo it into the add_target file:

echo "id_ext=21000001ff040bf6,ioc_guid=21000001ff040bf6,dgid=fe8000000000000021000001ff040bf6,pkey=ffff,service_id=f60b04ff01000021,initiator_ext=0000000000000001" > /sys/class/infiniband_srp/srp-ipath0-1/add_target

5. Look for the newly created devices in the /proc/partitions file. The file will look similar to this example (the partition names may vary):

# cat /proc/partitions
major minor #blocks name
8 64 142325760 sde
8 65 142319834 sde1
8 80 71162880 sdf
8 81 71159917 sdf1
8 96 20480 sdg
8 97 20479 sdg1

6. Create a mount point (as root) where you will mount the SRP device. For<br />

example:<br />

mkdir /mnt/targetname<br />

mount /dev/sde1 /mnt/targetname<br />

NOTE:<br />

Use sde1 rather than sde. See the mount(8) man page for more<br />

information on creating mount points.<br />
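The sequence in Steps 3 and 4 can also be scripted. The following is an illustrative sketch only; it assumes a single HCA and port (hence srp-ipath0-1) and the fixed initiator_ext value shown in Step 4, so adapt it to your fabric before use:

#!/bin/sh
# Attach every SRP target that ibsrpdm reports (run as a root user).
for target in `ibsrpdm -c`; do
    echo "$target,initiator_ext=0000000000000001" > \
        /sys/class/infiniband_srp/srp-ipath0-1/add_target
done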



E Integration with a Batch Queuing System

Most cluster systems use some kind of batch queuing system as an orderly way to provide users with access to the resources they need to meet their job’s performance requirements. One task of the cluster administrator is to allow users to submit MPI jobs through these batch queuing systems. Two methods are described in this document:
• Use mpiexec within the Portable Batch System (PBS) environment.
• Invoke a script, similar to mpirun, within the SLURM context to submit MPI jobs. A sample is provided in “Using SLURM for Batch Queuing” on page E-2.

Using mpiexec with PBS

mpiexec can be used as a replacement for mpirun within a PBS cluster environment. The PBS software performs the job scheduling.

For PBS-based batch systems, QLogic MPI processes can be spawned using the mpiexec utility distributed and maintained by the Ohio Supercomputer Center (OSC). Starting with mpiexec version 0.84, MPI applications compiled and linked with QLogic MPI can use mpiexec and PBS’s Task Manager (TM) interface to spawn and correctly terminate parallel jobs.

To download the latest version of mpiexec, go to:
http://www.osc.edu/~pw/mpiexec/

To build mpiexec for QLogic MPI and install it in /usr/local, type:
$ tar zxvf mpiexec-0.84.tgz
$ cd mpiexec-0.84
$ ./configure --enable-default-comm=mpich-psm && gmake all install


NOTE:
This level of support is specific to QLogic MPI, and not to other MPIs that currently support InfiniPath.

For more usage information, see the OSC mpiexec documentation.

For more information on PBS, go to: http://www.pbsgridworks.com/
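As an illustration, a minimal PBS job script for this setup might look like the following sketch. The resource requests, working directory, and program name are hypothetical; consult your site's PBS configuration and the OSC mpiexec documentation for the options that apply to you:

#!/bin/sh
#PBS -N mpi_example
#PBS -l nodes=4:ppn=2
#PBS -l walltime=00:30:00

# mpiexec obtains the node allocation from PBS through the TM
# interface, so no host file or -np argument is needed here.
cd $PBS_O_WORKDIR
mpiexec ./my_mpi_program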

Using SLURM for Batch Queuing

The following is an example of some of the functions that a batch queuing script might perform. The example is in the context of the Simple Linux Utility for Resource Management (SLURM) developed at Lawrence Livermore National Laboratory. These functions assume the use of the bash shell. The following script is called batch_mpirun:

#! /bin/sh
# Very simple example batch script for QLogic MPI, using slurm
# (http://www.llnl.gov/linux/slurm/)
# Invoked as:
# batch_mpirun #cpus mpi_program_name mpi_program_args ...
#
np=$1 mpi_prog="$2" # assume arguments to script are correct
shift 2 # program args are now $@
eval `srun --allocate --ntasks=$np --no-shell`
mpihosts_file=`mktemp -p /tmp mpihosts_file.XXXXXX`
srun --jobid=${SLURM_JOBID} hostname -s | sort | uniq -c \
| awk '{printf "%s:%s\n", $2, $1}' > $mpihosts_file
mpirun -np $np -m $mpihosts_file "$mpi_prog" "$@"
exit_code=$?
scancel ${SLURM_JOBID}
rm -f $mpihosts_file
exit $exit_code

In the following sections, the setup and the various script functions are discussed in more detail.


Allocating Resources

When the mpirun command starts, it requires specification of the number of node programs it must spawn (via the -np option) and specification of an mpihosts file listing the nodes that the node programs run on. (See “Environment for Node Programs” on page 4-19 for more information.) Since performance is usually important, a user might require that the node program be the only application running on each node CPU. In a typical batch environment, the MPI user would still specify the number of node programs, but would depend on the batch system to allocate specific nodes when the required number of CPUs becomes available. Thus, batch_mpirun takes at least an argument specifying the number of node programs and an argument specifying the MPI program to be executed. For example:
$ batch_mpirun -np n my_mpi_program

After parsing the command line arguments, the next step of batch_mpirun is to request an allocation of n processors from the batch system. In SLURM, this uses the command:
eval `srun --allocate --ntasks=$np --no-shell`

Make sure to use back quotes rather than normal single quotes. $np is the shell variable that your script has set from the parsing of its command line options. The --no-shell option to srun prevents SLURM from starting a subshell. The srun command is run with eval to set the SLURM_JOBID shell variable from the output of the srun command.

With these arguments, the SLURM function srun blocks until there are $np processors available to commit to the caller. When the requested resources are available, this command opens a new shell and allocates the requested number of processors to the requestor.

Generating the mpihosts File

Once the batch system has allocated the required resources, your script must generate an mpihosts file that contains a list of the nodes to be used. To do this, the script must determine which nodes the batch system has allocated, and how many processes can be started on each node. This is the part of batch_mpirun that performs these tasks:
mpihosts_file=`mktemp -p /tmp mpihosts_file.XXXXXX`
srun --jobid=${SLURM_JOBID} hostname -s | sort | uniq -c \
| awk '{printf "%s:%s\n", $2, $1}' > $mpihosts_file

The first command creates a temporary hosts file with a random name, and assigns the generated name to the variable mpihosts_file.

The next instance of the SLURM srun command runs hostname -s once for each process slot that SLURM has allocated. If SLURM has allocated two slots on one node, hostname -s is output twice for that node.


The sort | uniq -c component determines the number of times each unique line was printed. The awk command converts the result into the mpihosts file format used by mpirun. Each line consists of a node name, a colon, and the number of processes to start on that node.
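For example, if SLURM allocated two slots on one node and one slot on another (the hostnames here are hypothetical), the generated mpihosts file would contain:
node01:2
node02:1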

NOTE:
This is one of two formats that the file can use. See “Console I/O in MPI Programs” on page 4-18 for more information.

Simple Process Management

At this point, the script has enough information to be able to run an MPI program. The next step is to start the program when the batch system is ready, and notify the batch system when the job completes. This is done in the final part of batch_mpirun:
mpirun -np $np -m $mpihosts_file "$mpi_prog" "$@"
exit_code=$?
scancel ${SLURM_JOBID}
rm -f $mpihosts_file
exit $exit_code

Clean Termination of MPI Processes

The InfiniPath software normally ensures clean termination of all MPI programs when a job ends, but in some rare circumstances an MPI process may remain alive, and potentially interfere with future MPI jobs. To avoid this problem, run a script before and after each batch job that kills all unwanted processes. QLogic does not provide such a script, but it is useful to know how to find out which processes on a node are using the QLogic interconnect. The easiest way to do this is with the fuser command, which is normally installed in /sbin.

Run these commands as a root user to ensure that all processes are reported:
# /sbin/fuser -v /dev/ipath
/dev/ipath: 22648m 22651m

In this example, processes 22648 and 22651 are using the QLogic interconnect. It is also possible to use this command (as a root user):
# lsof /dev/ipath

This command displays a list of processes using InfiniPath. Additionally, to get all processes, including stats programs, ipath_sma, diags, and others, run the program in this way:
# /sbin/fuser -v /dev/ipath*

lsof can also take the same form:
# lsof /dev/ipath*


The following command terminates all processes using the QLogic interconnect:
# /sbin/fuser -k /dev/ipath

For more information, see the man pages for fuser(1) and lsof(8).
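A per-job cleanup script of the kind described above might look like the following minimal sketch. QLogic does not ship such a script, and the policy (what to kill, and when to run it) is site-specific, so treat this only as a starting point:

#!/bin/sh
# Illustrative batch prolog/epilog: kill any processes still holding
# the QLogic interconnect open, then remove stale PSM shared memory
# files (see "Error Creating Shared Memory Object" on page F-23).
# Run as a root user between batch jobs.
/sbin/fuser -k /dev/ipath* 2>/dev/null
rm -rf /dev/shm/psm_shm.*
exit 0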

Note that hard and explicit program termination, such as kill -9 on the mpirun Process ID (PID), may result in QLogic MPI being unable to guarantee that the /dev/shm shared memory file is properly removed. As stale files accumulate on each node, an error message can appear at startup:
node023:6.Error creating shared memory object in shm_open(/dev/shm may have stale shm files that need to be removed):

If this occurs, administrators should clean up all stale files by using this command:
# rm -rf /dev/shm/psm_shm.*

See “Error Creating Shared Memory Object” on page F-23 for more information.

Lock Enough Memory on Nodes when Using SLURM

This section is identical to information provided in “Lock Enough Memory on Nodes When Using a Batch Queuing System” on page F-22. It is repeated here for your convenience.

QLogic MPI requires the ability to lock (pin) memory during data transfers on each compute node. This is normally done via /etc/initscript, which is created or modified during the installation of the infinipath RPM (setting a limit of 128 MB, with the command ulimit -l 131072).

Some batch systems, such as SLURM, propagate the user’s environment from the node where you start the job to all the other nodes. For these batch systems, you may need to make the same change on the node from which you start your batch jobs.

If this file is not present, or the node has not been rebooted after the infinipath RPM has been installed, a failure message similar to one of the following will be generated.

The following message displays during installation:
$ mpirun -np 2 -m ~/tmp/sm mpi_latency 1000 1000000
iqa-19:0.ipath_userinit: mmap of pio buffers at 100000 failed: Resource temporarily unavailable
iqa-19:0.Driver initialization failure on /dev/ipath
iqa-20:1.ipath_userinit: mmap of pio buffers at 100000 failed: Resource temporarily unavailable
iqa-20:1.Driver initialization failure on /dev/ipath


The following message displays after installation:
$ mpirun -m ~/tmp/sm -np 2 mpi_latency 1000 1000000
node-00:1.ipath_update_tid_err: failed: Cannot allocate memory
mpi_latency: /fs2/scratch/infinipath-build-1.3/mpi-1.3/mpich/psm/src/mq_ips.c:691: mq_ipath_sendcts: Assertion `rc == 0' failed.
MPIRUN: Node program unexpectedly quit. Exiting.

You can check the ulimit -l on all the nodes by running ipath_checkout. A warning similar to this displays if ulimit -l is less than 4096:
!!!ERROR!!! Lockable memory less than 4096KB on x nodes

To fix this error, install the infinipath RPM on the node, and reboot it to ensure that /etc/initscript is run.

Alternately, you can create your own /etc/initscript and set the ulimit there.
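If you create the file yourself, a minimal /etc/initscript might look like the sketch below. This is an illustration following the initscript(5) convention, not a copy of the file the infinipath RPM installs, so compare it with your installed version:

# /etc/initscript (illustrative sketch)
# Raise the locked-memory limit to 128 MB for every process
# spawned by init, then execute the requested command.
ulimit -l 131072
eval exec "$4"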



F Troubleshooting

This appendix describes some of the tools you can use to diagnose and fix problems. The following topics are discussed:
• Using LEDs to Check the State of the Adapter
• BIOS Settings
• Kernel and Initialization Issues
• OpenFabrics and InfiniPath Issues
• System Administration Troubleshooting
• Performance Issues
• QLogic MPI Troubleshooting

Troubleshooting information for hardware installation is found in the QLogic InfiniBand Adapter Hardware Installation Guide; troubleshooting information for software installation is found in the QLogic Fabric Software Installation Guide.

Using LEDs to Check the State of the Adapter

The LEDs function as link and data indicators once the InfiniPath software has been installed, the driver has been loaded, and the fabric is being actively managed by a subnet manager.

Table F-1 describes the LED states. The green LED indicates the physical link signal; the amber LED indicates the link. The green LED normally illuminates first. The normal state is Green On, Amber On. The QLE7240 and QLE7280 have an additional state, as shown in Table F-1.

Table F-1. LED Link and Data Indicators

LED States: Green OFF, Amber OFF
Indication: The switch is not powered up. The software is neither installed nor started. Loss of signal. Verify that the software is installed and configured with ipath_control -i. If correct, check both cable connectors.

LED States: Green ON, Amber OFF
Indication: Signal detected and the physical link is up. Ready to talk to an SM to bring the link fully up. If this state persists, the SM may be missing or the link may not be configured. Use ipath_control -i to verify the software state. If all InfiniBand adapters are in this state, then the SM is not running. Check the SM configuration, or install and run opensmd.

LED States: Green ON, Amber ON
Indication: The link is configured, properly connected, and ready. Signal detected; ready to talk to an SM to bring the link fully up.

LED States: Green BLINKING (quickly), Amber ON
Indication: The link is configured, properly connected, and ready to receive data and link packets. The quick green blink indicates traffic.

LED States: Green BLINKING (see note a), Amber BLINKING
Indication: Locates the adapter. This feature is controlled by ipath_control -b [On | Off].

Table Notes
a. This feature is available only on the QLE7340, QLE7342, QLE7240, and QLE7280 adapters.

BIOS Settings

This section covers issues related to BIOS settings. The most important setting is Advanced Configuration and Power Interface (ACPI). This setting must be enabled. If ACPI has been disabled, it may result in initialization problems, as described in “InfiniPath Interrupts Not Working” on page F-3.

You can check and adjust the BIOS settings using the BIOS Setup utility. Check the hardware documentation that came with your system for more information.

Kernel and Initialization Issues

Issues that may prevent the system from coming up properly are described in the following sections.


Driver Load Fails Due to Unsupported Kernel

If you try to load the InfiniPath driver on a kernel that the InfiniPath software does not support, the load fails. Error messages similar to this display:
modprobe: error inserting '/lib/modules/2.6.3-1.1659-smp/updates/kernel/drivers/infiniband/hw/qib/ib_qib.ko': -1 Invalid module format

To correct this problem, install one of the appropriate supported Linux kernel versions, then reload the driver.

Rebuild or Reinstall Drivers if Different Kernel Installed

If you upgrade the kernel, you must reboot and then rebuild or reinstall the InfiniPath kernel modules (drivers). QLogic recommends using the InfiniBand Fabric Suite Software Installation TUI to perform this rebuild or reinstall. Refer to the QLogic Fabric Software Installation Guide for more information.

InfiniPath Interrupts Not Working

The InfiniPath driver cannot configure the InfiniPath link to a usable state unless interrupts are working. Check for this problem with the command:
$ grep ib_qib /proc/interrupts

Normal output is similar to this:
CPU0 CPU1
185: 364263 0 IO-APIC-level ib_qib

NOTE:
The output you see may vary depending on board type, distribution, or update level.

If there is no output at all, the driver initialization failed. For more information on driver problems, see “Driver Load Fails Due to Unsupported Kernel” on page F-3 or “InfiniPath ib_qib Initialization Failure” on page F-5.

If the output is similar to one of these lines, then interrupts are not being delivered to the driver:
66: 0 0 PCI-MSI ib_qib
185: 0 0 IO-APIC-level ib_qib

The following message appears when the driver has initialized successfully, but no interrupts are seen within 5 seconds:
ib_qib 0000:82:00.0: No interrupts detected.


A zero count in all CPU columns means that no InfiniPath interrupts have been delivered to the processor.

The possible causes of this problem are:
• Booting the Linux kernel with ACPI disabled, either on the boot command line or in the BIOS configuration
• Other InfiniPath initialization failures

To check if the kernel was booted with the noacpi or pci=noacpi option, use this command:
$ grep -i acpi /proc/cmdline

If output is displayed, fix the kernel boot command line so that ACPI is enabled. This command line can be set in various ways, depending on your distribution. If no output is displayed, check that ACPI is enabled in your BIOS settings.

To track down other initialization failures, see “InfiniPath ib_qib Initialization Failure” on page F-5.

The program ipath_checkout can also help flag these kinds of problems. See “ipath_checkout” on page I-10 for more information.
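The checks above can be combined into one short script. This is an illustrative sketch, not a shipped tool:

#!/bin/sh
# Quick interrupt sanity check for the ib_qib driver.
if ! grep -q ib_qib /proc/interrupts; then
    echo "ib_qib has no interrupt line; driver initialization failed"
elif grep ib_qib /proc/interrupts | \
     awk '{for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) s += $i} END {exit s ? 0 : 1}'; then
    echo "ib_qib interrupts are being delivered"
else
    echo "ib_qib interrupt count is zero; check ACPI settings"
    grep -i acpi /proc/cmdline && echo "kernel was booted with an ACPI option"
fi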

OpenFabrics Load Errors if ib_qib Driver Load Fails

When the ib_qib driver fails to load, the other OpenFabrics drivers/modules will load and be shown by lsmod, but commands like ibstatus, ibv_devinfo, and ipath_control -i will fail as follows:
# ibstatus
Fatal error: device '*': sys files not found (/sys/class/infiniband/*/ports)
# ibv_devinfo
libibverbs: Fatal: couldn't read uverbs ABI version.
No IB devices found
# ipath_control -i
InfiniPath driver not loaded
No InfiniPath info available


InfiniPath ib_qib Initialization Failure

There may be cases where ib_qib was not properly initialized. Symptoms of this may show up in error messages from an MPI job or another program. Here is a sample command and error message (<nodename> stands for the name of the affected node):
$ mpirun -np 2 -m ~/tmp/mbu13 osu_latency
<nodename>:ipath_userinit: assign_port command failed: Network is down
<nodename>:can't open /dev/ipath, network down

This will be followed by messages of this type after 60 seconds:
MPIRUN: 1 rank has not yet exited 60 seconds after rank 0 (node <nodename>) exited without reaching MPI_Finalize().
MPIRUN: Waiting at most another 60 seconds for the remaining ranks to do a clean shutdown before terminating 1 node processes.

If this error appears, check to see if the InfiniPath driver is loaded by typing:
$ lsmod | grep ib_qib

If no output is displayed, the driver did not load for some reason. In this case, try the following commands (as root):
# modprobe -v ib_qib
# lsmod | grep ib_qib
# dmesg | grep -i ib_qib | tail -25

The output will indicate whether the driver has loaded. Printing out messages using dmesg may help to locate any problems with ib_qib.

If the driver loaded, but MPI or other programs are not working, check to see if problems were detected during the driver and QLogic hardware initialization with the command:
$ dmesg | grep -i ib_qib

This command may generate more than one screen of output.

Also, check the link status with the command:
$ cat /sys/class/infiniband/ipath*/device/status_str

These commands are normally executed by the ipathbug-helper script, but running them separately may help locate the problem.

See also “status_str” on page I-19 and “ipath_checkout” on page I-10.


MPI Job Failures Due to Initialization Problems

If one or more nodes do not have the interconnect in a usable state, messages similar to the following appear when the MPI program is started:
userinit: userinit ioctl failed: Network is down [1]: device init failed
userinit: userinit ioctl failed: Fatal Error in keypriv.c(520): device init failed

These messages may indicate that a cable is not connected, the switch is down, the SM is not running, or a hardware error occurred.

OpenFabrics and InfiniPath Issues

The following sections cover issues related to OpenFabrics (including Subnet Managers) and InfiniPath.

Stop InfiniPath Services Before Stopping/Restarting InfiniPath

The following InfiniPath services must be stopped before stopping/starting/restarting InfiniPath:
• QLogic Fabric Manager
• OpenSM
• QLogic MPI
• VNIC
• SRP

Here is a sample command and the corresponding error messages:
# /etc/init.d/openibd stop
Unloading infiniband modules: sdp cm umad uverbs ipoib sa ipath mad coreFATAL: Module ib_umad is in use.
Unloading infinipath modules FATAL: Module ib_qib is in use.
[FAILED]


Manual Shutdown or Restart May Hang if NFS in Use

If you are using NFS over IPoIB and use the manual /etc/init.d/openibd stop (or restart) command, the shutdown process may silently hang on the fuser command contained within the script. This happens because fuser cannot traverse down the tree from the mount point once the mount point has disappeared. To remedy this problem, the fuser process itself needs to be killed. Run the following command, either as a root user or as the user who is running the fuser process, where <fuser_pid> is the process ID of the hung fuser process (find it with ps):
# kill -9 <fuser_pid>

The shutdown will then continue.

This problem is not seen if the system is rebooted, or if the filesystem has already been unmounted before stopping infinipath.

Load and Configure IPoIB Before Loading SDP

SDP generates Connection Refused errors if it is loaded before IPoIB has been loaded and configured. To solve the problem, load and configure IPoIB first.
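Normally /etc/init.d/openibd handles this ordering for you. If you are loading modules by hand, the sequence would look like the following sketch; the module names are the standard OFED ones, and the interface address is hypothetical:

# Load and configure IPoIB first, then load SDP.
# modprobe ib_ipoib
# ifconfig ib0 192.168.100.2 netmask 255.255.255.0
# modprobe ib_sdp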

Set $IBPATH for OpenFabrics Scripts

The environment variable $IBPATH must be set to /usr/bin. If this has not been set, or if you have it set to a location other than the installed location, you may see error messages similar to the following when running some OpenFabrics scripts:
/usr/bin/ibhosts: line 30: /usr/local/bin/ibnetdiscover: No such file or directory

For the OpenFabrics commands supplied with this InfiniPath release, set the variable (if it has not been set already) to /usr/bin, as follows:
$ export IBPATH=/usr/bin

SDP Module Not Loading

If the settings for the debug level and the zero copy threshold from InfiniPath release 2.0 are present in the release 2.2 /etc/modprobe.conf (RHEL) or /etc/modprobe.conf.local (SLES) file, the SDP module may not load:
options ib_sdp sdp_debug_level=4 sdp_zcopy_thrsh_src_default=10000000

To solve the problem, remove this line.


ibsrpdm Command Hangs when Two Host Channel Adapters are Installed but Only Unit 1 is Connected to the Switch

If multiple InfiniBand adapters (unit 0 and unit 1) are installed and only unit 1 is connected to the switch, the ibsrpdm command (used to set up an SRP target) can hang. If unit 0 is connected and unit 1 is disconnected, the problem does not occur.

When only unit 1 is connected to the switch, use the -d option with ibsrpdm. Then, using the output from the ibsrpdm command, echo the new target information into /sys/class/infiniband_srp/srp-ipath1-1/add_target. For example:
# ibsrpdm -d /dev/infiniband/umad1 -c
# echo id_ext=21000001ff040bf6,ioc_guid=21000001ff040bf6,dgid=fe8000000000000021000001ff040bf6,pkey=ffff,service_id=f60b04ff01000021 > /sys/class/infiniband_srp/srp-ipath1-1/add_target

Outdated ipath_ether Configuration Setup Generates Error

Ethernet emulation (ipath_ether) has been removed in this release; as a result, an error may be seen if the user still has an alias set previously by modprobe.conf (for example, alias eth2 ipath_ether).

When ifconfig or ifup is run, the error will look similar to this (assuming ipath_ether was used for eth2):
eth2: error fetching interface information: Device not found

To prevent the error message, remove the following files (assuming ipath_ether was used for eth2):
/etc/sysconfig/network-scripts/ifcfg-eth2 (for RHEL)
/etc/sysconfig/network/ifcfg-eth2 (for SLES)

QLogic recommends using the IP over InfiniBand protocol (IPoIB-CM), included in the standard OpenFabrics software releases, as a replacement for ipath_ether.

System Administration Troubleshooting

The following sections provide details on locating problems related to system administration.

Broken Intermediate Link

Sometimes message traffic passes through the fabric while other traffic appears to be blocked. In this case, MPI jobs fail to run.


In large cluster configurations, switches may be attached to other switches to supply the necessary inter-node connectivity. Problems with these inter-switch (or intermediate) links are sometimes more difficult to diagnose than the failure of the final link between a switch and a node. The failure of an intermediate link may allow some traffic to pass through the fabric while other traffic is blocked or degraded.

If you notice this behavior in a multi-layer fabric, check that all switch cable connections are correct. Statistics for managed switches are available on a per-port basis, and may help with debugging. See your switch vendor for more information.

QLogic recommends using FastFabric to help diagnose this problem. If FastFabric is not installed in the fabric, there are two diagnostic tools, ibhosts and ibtracert, that may also be helpful. The tool ibhosts lists all the InfiniBand nodes that the subnet manager recognizes. To check the InfiniBand path between two nodes, use the ibtracert command.
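For example, to trace the route between two nodes whose LIDs are known (the LID values 4 and 16 here are hypothetical; obtain the real values with a tool such as ibstatus or from your subnet manager):
$ ibtracert 4 16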

Performance Issues

The following sections discuss known performance issues.

Large Message Receive Side Bandwidth Varies with Socket Affinity on Opteron Systems

On Opteron systems, when using the QLE7240 or QLE7280 in DDR mode, there is a receive side bandwidth bottleneck for CPUs that are not adjacent to the PCI Express root complex. This may cause performance to vary. The bottleneck is most obvious when using SendDMA with large messages on the farthest sockets. The best case for SendDMA is when both sender and receiver are on the closest sockets. Overall performance for PIO (and smaller messages) is better than with SendDMA.

Erratic Performance

Sometimes erratic performance is seen on applications that use interrupts. An example is inconsistent SDP latency when running a program such as netperf. This may be seen on AMD-based systems using the QLE7240 or QLE7280 adapters. If this happens, check to see if the program irqbalance is running. This program is a Linux daemon that distributes interrupts across processors. However, it may interfere with prior interrupt request (IRQ) affinity settings, introducing timing anomalies. After stopping this process (as a root user), bind the IRQ to a CPU for more consistent performance. First, stop irqbalance:
# /sbin/chkconfig irqbalance off
# /etc/init.d/irqbalance stop


Next, find the IRQ number and bind it to a CPU. The IRQ number can be found in one of two ways, depending on the system used. Both methods are described in the following paragraphs.

NOTE:
Take care when cutting and pasting commands from PDF documents, as quotes are special characters and may not be translated correctly.

Method 1

Check to see if the IRQ number is found in /proc/irq/xxx, where xxx is the IRQ number in /sys/class/infiniband/ipath*/device/irq. Do this as a root user. For example:
# my_irq=`cat /sys/class/infiniband/ipath*/device/irq`
# ls /proc/irq

If $my_irq can be found under /proc/irq/, then type:
# echo 01 > /proc/irq/$my_irq/smp_affinity

Method 2

If the command from Method 1 (ls /proc/irq) cannot find $my_irq, then use the following commands instead:
# my_irq=`cat /proc/interrupts|grep ib_qib|awk '{print $1}'|sed -e 's/://'`
# echo 01 > /proc/irq/$my_irq/smp_affinity

This method is not the first choice because, on some systems, there may be two rows of ib_qib output, and you will not know which of the two numbers to choose. However, if you cannot find $my_irq listed under /proc/irq (Method 1), this type of system most likely has only one line for ib_qib listed in /proc/interrupts, so you can use Method 2.

Here is an example:
# cat /sys/class/infiniband/ipath*/device/irq
98
# ls /proc/irq
0 10 11 13 15 233 4 50 7 8 90
1 106 12 14 2 3 5 58 66 74 9
(Note that 98 is not listed.)
# cat /proc/interrupts|grep ib_qib|awk '{print $1}'|sed -e 's/://'
106
# echo 01 > /proc/irq/106/smp_affinity

Using the echo command immediately changes the processor affinity of an IRQ.


NOTE:
• The contents of the smp_affinity file may not reflect the expected values, even though the affinity change has taken place.
• If the driver is reloaded, the affinity assignment will revert to the default, so you will need to reset it to the desired value.

You can look at the stats in /proc/interrupts while the adapter is active to observe which CPU is fielding ib_qib interrupts.
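Because the assignment reverts on a driver reload, you may want to keep the two methods in a small script and re-run it after each reload. The following sketch is illustrative only and assumes a single ib_qib interrupt line; on systems with two ib_qib rows in /proc/interrupts, pick the IRQ by hand as described above:

#!/bin/sh
# Re-bind the ib_qib interrupt to CPU 0 (mask 01). Run as a root user.
my_irq=`cat /sys/class/infiniband/ipath*/device/irq`
if [ ! -d /proc/irq/$my_irq ]; then
    # Method 2 fallback: take the IRQ number from /proc/interrupts
    my_irq=`cat /proc/interrupts|grep ib_qib|awk '{print $1}'|sed -e 's/://'`
fi
echo 01 > /proc/irq/$my_irq/smp_affinity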

Performance Warning if ib_qib Shares Interrupts with eth0

When ib_qib shares interrupts with eth0, the performance of OFED ULPs, such as IPoIB, may be affected. A warning message appears in syslog, and also on the console or tty session where /etc/init.d/openibd start is run (if messages are set up to be displayed). Messages are in this form:
Nov 5 14:25:43 infinipath: Shared interrupt will affect performance: vector 169: devices eth0, ib_qib

Check /proc/interrupts: "169" is in the first column, and the devices are shown in the last column.

You can also contact your system vendor to see if the BIOS settings can be changed to avoid the problem.

QLogic MPI Troubleshooting

Problems specific to compiling and running MPI programs are described in the following sections.


Mixed Releases of MPI RPMs

Make sure that all of the MPI RPMs are from the same release. When using mpirun, an error message will occur if different components of the MPI RPMs are from different releases. In the following example, an mpirun from an older release is being used with newer libraries:
$ mpirun -np 2 -m ~/tmp/x2 osu_latency
MPI_runscript-xqa-14.0: ssh -x> Cannot detect InfiniPath interconnect.
MPI_runscript-xqa-14.0: ssh -x> Seek help on loading InfiniPath interconnect driver.
MPI_runscript-xqa-15.1: ssh -x> Cannot detect InfiniPath interconnect.
MPI_runscript-xqa-15.1: ssh -x> Seek help on loading InfiniPath interconnect driver.
MPIRUN: Node program(s) exitted during connection setup
$ mpirun -v
MPIRUN:Infinipath Release2.3: Built on Wed Nov 6 17:28:58 PDT 2008 by mee

The following example shows the error that occurs when a newer mpirun is used with older libraries:
$ mpirun-ipath-ssh -np 2 -ppn 1 -m ~/tmp/idev osu_latency
MPIRUN: mpirun from the 2.3 software distribution requires all node processes to be running 2.3 software. At least node <nodename> uses non-2.3 MPI libraries

The following string means that either an incompatible non-QLogic mpirun binary has been found or that the binary is from an InfiniPath release prior to 2.3:
Found incompatible non-InfiniPath or pre-2.3 InfiniPath mpirun-ipath-ssh (exec=/usr/bin/mpirun-ipath-ssh)

Missing mpirun Executable

When the mpirun executable is missing, the following error appears:
Please install mpirun on <node> or provide a path to mpirun-ipath-ssh (not found in $MPICH_ROOT/bin, $PATH or path/to/mpirun-ipath-ssh/on/the/head/node) or run with mpirun -distributed=off

This error string means that an mpirun executable (mpirun-ipath-ssh) was not found on the computation nodes. Make sure that the mpi-frontend-* RPM is installed on all nodes that will use mpirun.


Resolving Hostname with Multi-Homed Head Node

By default, mpirun assumes that ranks can independently resolve the hostname obtained on the head node with gethostname. However, the hostname of a multi-homed head node may not resolve on the compute nodes. To address this problem, the following option has been added to mpirun:
-listen-addr <hostname|ipv4>

This address will be forwarded to the ranks. To change the default, put this option in the global mpirun.defaults file or in a user-local file.
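For example, a run that forwards an explicit head-node address to the ranks might look like this (the address and file names are hypothetical):
$ mpirun -np 4 -m ~/tmp/hostfile -listen-addr 192.0.2.10 ./my_mpi_program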

If the address on the frontend cannot be resolved, then a warning is sent to the console and to syslog. If you use the following command line, you may see messages similar to this:
% mpirun-ipath-ssh -np 2 -listen-addr foo -m ~/tmp/hostfile-idev osu_bcast
MPIRUN.<node>: Warning: Couldn't resolve listen address 'foo' on head node (Unknown host), using it anyway...
MPIRUN.<node>: No node programs have connected within 60 seconds.

This message occurs if none of the ranks can connect back to the head node. The following message may appear if some, but not all, ranks cannot connect back:
MPIRUN.<node>: Not all node programs have connected within 60 seconds.
MPIRUN.<node>: No connection received from 1 node process on node <nodename>

Cross-Compilation Issues

The GNU 4.x environment is supported in the PathScale Compiler Suite 3.x release. However, the 2.x QLogic PathScale compilers are not currently supported on SLES 10 systems that use the GNU 4.x compilers and compiler environment (header files and libraries).

QLogic recommends installing the PathScale 3.1 release.


Compiler/Linker Mismatch

If the compiler and linker do not match in C and C++ programs, the following error message appears:
$ export MPICH_CC=gcc
$ mpicc mpiworld.c
/usr/bin/ld: cannot find -lmpichabiglue_gcc3
collect2: ld returned 1 exit status

Compiler Cannot Find Include, Module, or Library Files

RPMs can be installed in any location by using the --prefix option. This can introduce errors when compiling if the compiler cannot find the include files (and module files for Fortran 90 and Fortran 95) from mpi-devel*, and the libraries from mpi-libs*, in the new locations. Compiler errors similar to the following appear:
$ mpicc myprogram.c
/usr/bin/ld: cannot find -lmpich
collect2: ld returned 1 exit status

NOTE:
As noted in the QLogic Fabric Software Installation Guide, all development files now reside in specific *-Devel subdirectories.

On development nodes, programs must be compiled with the appropriate options so that the include files and the libraries can be found in the new locations. In addition, when running programs on compute nodes, you need to ensure that the run-time library path is the same as the path that was used to compile the program.

The following examples show what compiler options to use for include files and libraries on the development nodes, and how to specify the new library path on the compute nodes for the runtime linker. The affected RPMs are:
• mpi-devel* (on the development nodes)
• mpi-libs* (on the development or compute nodes)

For the examples in “Compiling on Development Nodes” on page F-15, it is assumed that the new locations are:
/path/to/devel (for mpi-devel-*)
/path/to/libs (for mpi-libs-*)


Compiling on Development Nodes

If the mpi-devel-* RPM is installed with the --prefix /path/to/devel option, then mpicc, etc. must be passed -I/path/to/devel/include for the compiler to find the MPI include files, as in this example:
$ mpicc myprogram.c -I/path/to/devel/include

If you are using Fortran 90 or Fortran 95, a similar option is needed for the compiler to find the module files:
$ mpif90 myprogramf90.f90 -I/path/to/devel/include

If the mpi-lib-* RPM is installed on these development nodes with the --prefix /path/to/libs option, then the compiler needs the -L/path/to/libs option so it can find the libraries. Here is the example for mpicc:
$ mpicc myprogram.c -L/path/to/libs/lib (for 32 bit)
$ mpicc myprogram.c -L/path/to/libs/lib64 (for 64 bit)

To find both the include files and the libraries with these non-standard locations, type:
$ mpicc myprogram.c -I/path/to/devel/include -L/path/to/libs/lib

Specifying the Run-time Library Path

There are several ways to specify the run-time library path so that when the programs are run, the appropriate libraries are found in the new location. There are three different ways to do this:
• Use the -Wl,-rpath, option when compiling on the development node.
• Update the /etc/ld.so.conf file on the compute nodes to include the path.
• Export the path in the .mpirunrc file.

These methods are explained in more detail in the following paragraphs.

An additional linker option, -Wl,-rpath, supplies the run-time library path when compiling on the development node. The compiler options now look like this:
$ mpicc myprogram.c -I/path/to/devel/include -L/path/to/libs/lib -Wl,-rpath,/path/to/libs/lib

This compiler command ensures that the program will run using this path on any machine.


For the second option, change the file /etc/ld.so.conf on the compute nodes rather than using the -Wl,-rpath, option when compiling on the development node. It is assumed that the mpi-lib-* RPM is installed on the compute nodes with the same --prefix /path/to/libs option as on the development nodes. Then, on the compute nodes, add the following lines to the file /etc/ld.so.conf:
/path/to/libs/lib
/path/to/libs/lib64

To make sure that the changes take effect, run (as a root user):
# /sbin/ldconfig

The libraries can now be found by the runtime linker on the compute nodes. The advantage of this method is that it works for all InfiniPath programs, without having to remember to change the compile/link lines.

Instead of either of the two previous mechanisms, you can also put the following line in the ~/.mpirunrc file:
export LD_LIBRARY_PATH=/path/to/libs/{lib,lib64}

See “Environment for Node Programs” on page 4-19 for more information on using the -rcfile option with mpirun.

The choice between these options is left up to the cluster administrator and the MPI developer. See the documentation for your compiler for more information on the compiler options.

Problem with Shell Special Characters and Wrapper Scripts

Be careful when dealing with shell special characters, especially when using the mpicc, etc. wrapper scripts. These characters must be escaped to avoid the shell interpreting them.

For example, when compiling code using the -D compiler flag, mpicc (and other wrapper scripts) will fail if the defined variable contains a space, even when surrounded by double quotes. In the following example, the result of the -show option reveals what happens to the variable:
$ mpicc -show -DMYDEFINE="some value" test.c
gcc -c -DMYDEFINE=some value test.c
gcc -Wl,--export-dynamic,--allow-shlib-undefined test.o -lmpich


The shell strips off the double quotes before handing the arguments to the mpicc script, thus causing the problem. The workaround is to escape the double quotes and white space by using backslashes, so that the shell does not process them. (Also note the single quotes around the -D, since the scripts do an eval rather than directly invoking the underlying compiler.) Use this command instead:
$ mpicc -show -DMYDEFINE=\"some\ value\" test.c
gcc -c '-DMYDEFINE="some value"' test.c
gcc -Wl,--export-dynamic,--allow-shlib-undefined test.o -lmpich

Run Time Errors with Different MPI Implementations

It is now possible to run different implementations of MPI, such as HP-MPI, over InfiniPath. Many of these implementations share command names (such as mpirun) and library names, so it is important to distinguish which MPI version is in use. This is done primarily through careful programming practices.

Examples are provided in the following paragraphs.

In the following command, the HP-MPI version of mpirun is invoked by the full path name. However, the program mpi_nxnlatbw was compiled with the QLogic version of mpicc. The mismatch produces errors similar to this:
$ /opt/hpmpi/bin/mpirun -hostlist "bbb-01,bbb-02,bbb-03,bbb-04" -np 4 /usr/bin/mpi_nxnlatbw
bbb-02: Not running from mpirun.
MPI Application rank 1 exited before MPI_Init() with status 1
bbb-03: Not running from mpirun.
MPI Application rank 2 exited before MPI_Init() with status 1
bbb-01: Not running from mpirun.
bbb-04: Not running from mpirun.
MPI Application rank 3 exited before MPI_Init() with status 1
MPI Application rank 0 exited before MPI_Init() with status 1


In the next case, mpi_nxnlatbw.c is compiled with the HP-MPI version of mpicc, and given the name hpmpi-mpi_nxnlatbw, so that it is easy to see which version was used. However, it is run with the QLogic mpirun, which produces errors similar to this:
$ /opt/hpmpi/bin/mpicc /usr/share/mpich/examples/performance/mpi_nxnlatbw.c -o hpmpi-mpi_nxnlatbw
$ mpirun -m ~/host-bbb -np 4 ./hpmpi-mpi_nxnlatbw
./hpmpi-mpi_nxnlatbw: error while loading shared libraries: libmpio.so.1: cannot open shared object file: No such file or directory
./hpmpi-mpi_nxnlatbw: error while loading shared libraries: libmpio.so.1: cannot open shared object file: No such file or directory
./hpmpi-mpi_nxnlatbw: error while loading shared libraries: libmpio.so.1: cannot open shared object file: No such file or directory
./hpmpi-mpi_nxnlatbw: error while loading shared libraries: libmpio.so.1: cannot open shared object file: No such file or directory
MPIRUN: Node program(s) exitted during connection setup

The following two commands will work properly.

QLogic mpirun and executable used together:
$ mpirun -m ~/host-bbb -np 4 /usr/bin/mpi_nxnlatbw

The HP-MPI mpirun and executable used together:
$ /opt/hpmpi/bin/mpirun -hostlist "bbb-01,bbb-02,bbb-03,bbb-04" -np 4 ./hpmpi-mpi_nxnlatbw

Hints
• Use the rpm command to find out which RPM is installed in the standard installed layout. For example:
# rpm -qf /usr/bin/mpirun
mpi-frontend-2.3-5314.919_sles10_qlc
• Check all rcfiles and /opt/infinipath/etc/mpirun.defaults to make sure that the paths for binaries and libraries ($PATH and $LD_LIBRARY_PATH) are consistent.
• When compiling, use descriptive names for the object files.


See “Compiler Cannot Find Include, Module, or Library Files” on page F-14, “Compiling on Development Nodes” on page F-15, and “Specifying the Run-time Library Path” on page F-15 for additional information.

Process Limitation with ssh

MPI jobs that use more than eight processes per node may encounter an ssh throttling mechanism that limits the number of concurrent per-node connections to 10. If you have this problem, a message similar to this appears when using mpirun:
$ mpirun -m tmp -np 11 ~/mpi/mpiworld/mpiworld
ssh_exchange_identification: Connection closed by remote host
MPIRUN: Node program(s) exitted during connection setup

If you encounter a message like this, you or your system administrator should increase the value of MaxStartups in your sshd configuration.

NOTE:
This limitation applies only if -distributed=off is specified. By default, with -distributed=on, you will not normally have this problem.
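For reference, the change is a single line in the sshd configuration file (the value 100 below is only an illustration; choose a limit that matches your per-node process count), followed by a restart of sshd:

# In /etc/ssh/sshd_config, raise the limit on concurrent
# unauthenticated connections (the default is 10):
MaxStartups 100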

Number of Processes Exceeds ulimit for Number of Open Files

When users scale up the number of processes beyond the number of open files allowed by ulimit, mpirun will print an error message. The ulimit for the number of open files is typically 1024 on both Red Hat and SLES systems. The message will look similar to this:
MPIRUN.up001: Warning: ulimit for the number of open files is only 1024, but this mpirun request requires at least <n> open files (sockets). The shell ulimit for open files needs to be increased.
This is due to limit: descriptors 1024

The ulimit can be increased; QLogic recommends an increase of approximately 20 percent over the number of CPUs. For example, in the case of 2048 CPUs, ulimit can be increased to 2500:
ulimit -n 2500

The ulimit needs to be increased only on the host where mpirun was started, unless the mode of operation allows mpirun from any node.
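A batch script can derive the limit from the job size instead of hard-coding it; the following fragment is an illustrative sketch of the 20 percent rule:

# Raise the open-file limit to ~20% above the number of CPUs in the job.
ncpus=2048                         # hypothetical job size
ulimit -n $((ncpus + ncpus / 5))   # 2457 here; the guide rounds up to 2500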


Using MPI.mod Files

MPI.mod (or mpi.mod) files are the Fortran 90/Fortran 95 MPI module files. These files contain the Fortran 90/Fortran 95 interface to the platform-specific MPI library. The module file is invoked by 'USE MPI' or 'use mpi' in your application. If the application has an argument list that does not match what mpi.mod expects, errors such as this can occur:
$ mpif90 -O3 -OPT:fast_math -c communicate.F
call mpi_recv(nrecv,1,mpi_integer,rpart(nswap),0,
^
pathf95-389 pathf90: ERROR BORDERS, File = communicate.F, Line = 407, Column = 18
No specific match can be found for the generic subprogram call "MPI_RECV".

If it is necessary to use a non-standard argument list, create your own MPI module file and compile the application with it, rather than using the standard MPI module file that is shipped in the mpi-devel-* RPM.

The default search path for the module file is:
/usr/include

To include your own MPI.mod rather than the standard version, use -I/your/search/directory, which causes /your/search/directory to be checked before /usr/include. For example:
$ mpif90 -I/your/search/directory myprogram.f90

Usage for Fortran 95 is similar to the example for Fortran 90.

Extending MPI Modules

MPI implementations provide procedures that accept an argument having any data type, any precision, and any rank. However, it is not practical for an MPI module to enumerate every possible combination of type, kind, and rank. Therefore, the strict type checking required by Fortran 90 may generate errors.

For example, if the MPI module tells the compiler that mpi_bcast can operate on an integer but does not also say that it can operate on a character string, you may see a message similar to the following:
pathf95: ERROR INPUT, File = input.F, Line = 32, Column = 14
No specific match can be found for the generic subprogram call "MPI_BCAST".


If you know that an argument can accept a data type that the MPI module does not explicitly allow, you can extend the interface for yourself. For example, the following program shows how to extend the interface for mpi_bcast so that it accepts a character type as its first argument, without losing the ability to accept an integer type as well:

module additional_bcast
use mpi
implicit none
interface mpi_bcast
module procedure additional_mpi_bcast_for_character
end interface mpi_bcast
contains

subroutine additional_mpi_bcast_for_character(buffer, count, datatype, &
root, comm, ierror)
character*(*) buffer
integer count, datatype, root, comm, ierror
! Call the Fortran 77 style implicit interface to "mpi_bcast"
external mpi_bcast
call mpi_bcast(buffer, count, datatype, root, comm, ierror)
end subroutine additional_mpi_bcast_for_character
end module additional_bcast

program myprogram
use mpi
use additional_bcast
implicit none
character*4 c
integer master, ierr, i

! Explicit integer version obtained from module "mpi"
call mpi_bcast(i, 1, MPI_INTEGER, master, MPI_COMM_WORLD, ierr)

! Explicit character version obtained from module "additional_bcast"
call mpi_bcast(c, 4, MPI_CHARACTER, master, MPI_COMM_WORLD, ierr)
end program myprogram


This is equally applicable if the module mpi provides only a lower-rank interface and you want to add a higher-rank interface, for example, when the module explicitly provides for 1-D and 2-D integer arrays, but you need to pass a 3-D integer array. Add a higher-rank interface only under the following conditions:
• The module mpi provides an explicit Fortran 90 style interface for mpi_bcast. If the module mpi does not have this interface, the program uses an implicit Fortran 77 style interface, which does not perform any type checking. Adding an interface will cause type-checking error messages where there previously were none.
• The underlying function accepts any data type. It is appropriate for the first argument of mpi_bcast because the function operates on the underlying bits, without attempting to interpret them as integer or character data.

Lock Enough Memory on Nodes When Using a Batch Queuing System

QLogic MPI requires the ability to lock (pin) memory during data transfers on each compute node. This is normally done via /etc/initscript, which is created or modified during the installation of the infinipath RPM (setting a limit of 128 MB, with the command ulimit -l 131072).

Some batch systems, such as SLURM, propagate the user’s environment from the node where you start the job to all the other nodes. For these batch systems, you may need to make the same change on the node from which you start your batch jobs.

If this file is not present, or the node has not been rebooted after the infinipath RPM has been installed, a failure message similar to one of the following will be generated.

The following message displays during installation:
$ mpirun -np 2 -m ~/tmp/sm mpi_latency 1000 1000000
iqa-19:0.ipath_userinit: mmap of pio buffers at 100000 failed: Resource temporarily unavailable
iqa-19:0.Driver initialization failure on /dev/ipath
iqa-20:1.ipath_userinit: mmap of pio buffers at 100000 failed: Resource temporarily unavailable
iqa-20:1.Driver initialization failure on /dev/ipath


The following message displays after installation:

$ mpirun -m ~/tmp/sm -np 2 mpi_latency 1000 1000000
node-00:1.ipath_update_tid_err: failed: Cannot allocate memory
mpi_latency:
/fs2/scratch/infinipath-build-1.3/mpi-1.3/mpich/psm/src
mq_ips.c:691:
mq_ipath_sendcts: Assertion 'rc == 0' failed. MPIRUN: Node program
unexpectedly quit. Exiting.

You can check the ulimit -l on all the nodes by running ipath_checkout. A warning similar to this displays if ulimit -l is less than 4096:

!!!ERROR!!! Lockable memory less than 4096KB on x nodes

To fix this error, install the infinipath RPM on the node, and reboot it to ensure that /etc/initscript is run.

Alternately, you can create your own /etc/initscript and set the ulimit there.
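For reference, a minimal hand-written /etc/initscript might look like the sketch below. The ulimit value is the one used by the infinipath RPM; the final line is the standard way an initscript hands control back to init (see the initscript(5) man page):

# /etc/initscript (sketch): raise the locked-memory limit for every
# process spawned by init, then run the command init intended to run.
ulimit -l 131072
eval exec "$4"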

Error Creating Shared Memory Object

QLogic MPI (and PSM) use Linux's shared memory mapped files to share memory within a node. When an MPI job is started, a shared memory file is created on each node for all MPI ranks sharing memory on that one node. During job execution, the shared memory file remains in /dev/shm. At program exit, the file is removed automatically by the operating system when the QLogic MPI (InfiniPath) library properly exits. Also, as an additional backup in the sequence of commands invoked by mpirun during every MPI job launch, the file is explicitly removed at program termination.

However, under circumstances such as hard and explicit program termination (i.e., kill -9 on the mpirun process PID), QLogic MPI cannot guarantee that the /dev/shm file is properly removed. As many stale files accumulate on each node, an error message like the following can appear at startup:

node023:6.Error creating shared memory object in shm_open(/dev/shm
may have stale shm files that need to be removed):

If this occurs, administrators should clean up all stale files by running this command (as a root user):

# rm -rf /dev/shm/psm_shm.*

You can also selectively identify stale files by using a combination of the fuser, ps, and rm commands (all files start with the psm_shm prefix). Once identified, you can issue rm commands on the stale files that you own.
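One possible sequence for that selective cleanup is sketched below; the file name in the last step is a placeholder for whichever files turn out to be unused:

$ ls /dev/shm/psm_shm.*
$ fuser -v /dev/shm/psm_shm.*     # lists any PIDs still using each file
$ rm /dev/shm/psm_shm.<unused-id> # remove only files with no live user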


NOTE:
It is important that /dev/shm be writable by all users, or else error messages like the ones in this section can be expected. Also, non-QLogic MPIs that use PSM may be more prone to stale shared memory files when processes are abnormally terminated.

gdb Gets SIG32 Signal Under mpirun -debug with the PSM Receive Progress Thread Enabled

When you run mpirun -debug and the PSM receive progress thread is enabled, gdb (the GNU debugger) reports the following error:

(gdb) run
Starting program: /usr/bin/osu_bcast < /dev/null [Thread debugging
using libthread_db enabled] [New Thread 46912501386816 (LWP
13100)] [New Thread 1084229984 (LWP 13103)] [New Thread 1094719840
(LWP 13104)]
Program received signal SIG32, Real-time event 32.
[Switching to Thread 1084229984 (LWP 22106)] 0x00000033807c0930 in
poll () from /lib64/libc.so.6

This signal is generated when the main thread cancels the progress thread. To fix this problem, disable the receive progress thread when debugging an MPI program. Add the following line to $HOME/.mpirunrc:

export PSM_RCVTHREAD=0

NOTE:
Remove the above line from $HOME/.mpirunrc after you debug an MPI program. If this line is not removed, the PSM receive progress thread will be permanently disabled. To check if the receive progress thread is enabled, look for output similar to the following when using the mpirun -verbose flag:

idev-17:0.env PSM_RCVTHREAD Recv thread flags (0 disables thread) => 0x1

The value 0x1 indicates that the receive thread is currently enabled. A value of 0x0 indicates that the receive thread is disabled.


General Error Messages

The following message may be generated by ipath_checkout or mpirun:

PSM found 0 available contexts on InfiniPath device

The most likely cause is that the cluster has processes using all the available PSM contexts.
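A simple, non-authoritative way to look for such processes on a node is an ordinary ps scan; the pattern below is illustrative and should be adjusted to the names of your own MPI jobs:

$ ps -ef | grep -v grep | grep mpi

Waiting out or terminating any leftover jobs found this way frees their contexts.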

Error Messages Generated by mpirun

The following sections describe the mpirun error messages. These messages are in one of these categories:

• Messages from the QLogic MPI (InfiniPath) library
• MPI messages
• Messages relating to the InfiniPath driver and InfiniBand links

Messages generated by mpirun follow this format:

program_name: message
function_name: message

Messages can also have different prefixes, such as ipath_ or psm_, which indicate where in the software the errors are occurring.

Messages from the QLogic MPI (InfiniPath) Library

Messages from the QLogic MPI (InfiniPath) library appear in the mpirun output. The following example contains rank values received during connection setup that were higher than the number of ranks (as indicated in the mpirun startup code):

sender rank rank is out of range (notification)
sender rank rank is out of range (ack)

The following are error messages that indicate internal problems and must be reported to Technical Support:

unknown frame type type
[n] Src lid error: sender: x, exp send: y
Frame receive from unknown sender. exp. sender = x, came from y
Failed to allocate memory for eager buffer addresses: str

The following error messages usually indicate a hardware or connectivity problem:

Failed to get IB Unit LID for any unit
Failed to get our IB LID
Failed to get number of Infinipath units

In these cases, try to reboot. If that does not work, call Technical Support.


The following message indicates a mismatch between the QLogic interconnect hardware in use and the version where the software was compiled:

Number of buffer avail registers is wrong; have n, expected m
build mismatch, tidmap has n bits, ts_map m

These messages indicate a mismatch between the InfiniPath software and hardware versions. Consult Technical Support after verifying that current drivers and libraries are installed.

The following examples are all informative messages about driver initialization problems. They are not necessarily fatal themselves, but may indicate problems that interfere with the application. In the actual printed output, all of the messages are prefixed with the name of the function that produced them.

assign_port command failed: str
Failed to get LID for unit u: str
Failed to get number of units: str
GETPORT ioctl failed: str
can't allocate memory for ipath_ctrl: str
can't stat infinipath device to determine type: str
file descriptor is not for a real device, failing
get info ioctl failed: str
ipath_get_num_units called before init
ipath_get_unit_lid called before init
mmap of egr bufs from h failed: str
mmap of pio buffers at %llx failed: str
mmap of pioavail registers (%llx) failed: str
mmap of rcvhdr q failed: str
mmap of user registers at %llx failed: str
userinit command failed: str
Failed to set close on exec for device: str

NOTE:
These messages should never occur. If they do, notify Technical Support.

The following message indicates that a node program may not be processing incoming packets, perhaps due to a very high system load:

eager array full after overflow, flushing (head h, tail t)

The following error messages should rarely occur; they indicate internal software problems:

ExpSend opcode h tid=j, rhf_error k: str
Asked to set timeout w/delay l, gives time in past (t2 < t1)
Error in sending packet: str

In this case, str can give additional information about why the failure occurred.


The following message usually indicates a node failure or malfunctioning link in the fabric:

Couldn't connect to <IP> (LID=<lid>:<port>:<subport>). Time
elapsed 00:00:30. Still trying...

<IP> is the MPI rank's IP address, and <lid>, <port>, and <subport> are the rank's lid, port, and subport.

If messages similar to the following display, it may mean that the program is trying to receive to an invalid (unallocated) memory address, perhaps due to a logic error in the program, usually related to malloc/free:

ipath_update_tid_err: Failed TID update for rendezvous, allocation
problem
kernel: infinipath: get_user_pages (0x41 pages starting at
0x2aaaaeb50000
kernel: infinipath: Failed to lock addr 0002aaaaeb50000, 65 pages:
errno 12

TID is short for Token ID, and is part of the QLogic hardware. This error indicates a failure of the program, not the hardware or driver.

MPI Messages

Some MPI error messages are issued from the parts of the code inherited from the MPICH implementation. See the MPICH documentation for message descriptions. This section discusses the error messages specific to the QLogic MPI implementation.

These messages appear in the mpirun output. Most are followed by an abort, and possibly a backtrace. Each is preceded by the name of the function where the exception occurred.

The following message is always followed by an abort. The processlabel is usually in the form of the host name followed by process rank:

processlabel Fatal Error in filename line_no: error_string

At the time of publication, the possible error_strings are:

Illegal label format character.
Memory allocation failed.
Error creating shared memory object.
Error setting size of shared memory object.
Error mmapping shared memory.
Error opening shared memory object.
Error attaching to shared memory.
Node table has inconsistent len! Hdr claims %d not %d
Timeout waiting %d seconds to receive peer node table from mpirun


The following indicates an unknown host:

$ mpirun -np 2 -m ~/tmp/q mpi_latency 100 100
MPIRUN: Cannot obtain IP address of <hostname>: Unknown host
15:35_~.1019

The following indicates that there is no route to a valid host:

$ mpirun -np 2 -m ~/tmp/q mpi_latency 100 100
ssh: connect to host <hostname> port 22: No route to host
MPIRUN: Some node programs ended prematurely without connecting to
mpirun.
MPIRUN: No connection received from 1 node process on node
<hostname>

The following indicates that there is no route to any host:

$ mpirun -np 2 -m ~/tmp/q mpi_latency 100 100
ssh: connect to host <hostname> port 22: No route to host
ssh: connect to host <hostname> port 22: No route to host
MPIRUN: All node programs ended prematurely without connecting to
mpirun.

The following indicates that node jobs have started, but one host could not connect back to mpirun:

$ mpirun -np 2 -m ~/tmp/q mpi_latency 100 100
9139.psc_skt_connect: Error connecting to socket: No route to host
<hostname>. Cannot connect to spawner on host %s port %d
within 60 seconds.
MPIRUN: Some node programs ended prematurely without connecting to
mpirun.
MPIRUN: No connection received from 1 node process on node
<hostname>

The following indicates that node jobs have started, but both hosts could not connect back to mpirun:

$ mpirun -np 2 -m ~/tmp/q mpi_latency 100 100
9158.psc_skt_connect: Error connecting to socket: No route to host
<hostname>. Cannot connect to spawner on host %s port %d
within 60 seconds.
6083.psc_skt_connect: Error connecting to socket: No route to host
<hostname>. Cannot connect to spawner on host %s port %d
within 60 seconds.
MPIRUN: All node programs ended prematurely without connecting to
mpirun.



The following indicates that one program on one node died:

$ mpirun -np 2 -m ~/tmp/q mpi_latency 100000 1000000
MPIRUN: node program unexpectedly quit: Exiting.

The quiescence detected message is printed when an MPI job is not making progress. The default timeout is 900 seconds. After this length of time, all the node processes are terminated. This timeout can be extended or disabled with the -quiescence-timeout option in mpirun.

$ mpirun -np 2 -m ~/tmp/q -q 60 mpi_latency 1000000 1000000
MPIRUN: MPI progress Quiescence Detected after 9000 seconds.
MPIRUN: 2 out of 2 ranks showed no MPI send or receive progress.
MPIRUN: Per-rank details are the following:
MPIRUN: Rank 0 (<hostname>) caused MPI progress Quiescence.
MPIRUN: Rank 1 (<hostname>) caused MPI progress Quiescence.
MPIRUN: both MPI progress and Ping Quiescence Detected after 120
seconds.

Occasionally, a stray process will continue to exist out of its context. mpirun checks for stray processes; they are killed after detection. The following is an example of the type of message that displays in this case:

$ mpirun -np 2 -ppn 1 -m ~/tmp/mfast mpi_latency 500000 2000
iqa-38: Received 1 out-of-context eager message(s) from stray
process PID=29745
running on host 192.168.9.218
iqa-35: PSM pid 10513 on host IP 192.168.9.221 has detected that I
am a stray process, exiting.
2000 5.222116
iqa-38:1.ips_ptl_report_strays: Process PID=29745 on host
IP=192.168.9.218 sent
1 stray message(s) and was told so 1 time(s) (first stray message
at 0.7s (13%), last at 0.7s (13%) into application run)

The following message should never occur. If it does, notify Technical Support:

Internal Error: NULL function/argument found:func_ptr(arg_ptr)

Driver and Link Error Messages Reported by MPI Programs

The following driver and link error messages are reported by MPI programs. When the InfiniBand link fails during a job, a message is reported once per occurrence. The message will be similar to:

ipath_check_unit_status: IB Link is down


This message occurs when a cable is disconnected, a switch is rebooted, or when there are other problems with the link. The job continues retrying until the quiescence interval expires. See the mpirun -q option for information on quiescence.

If a hardware problem occurs, an error similar to this displays:

infinipath: [error strings] Hardware error

In this case, the MPI program terminates. The error string may provide additional information about the problem. To further determine the source of the problem, examine syslog on the node reporting the problem.

MPI Stats

Using the -print-stats option to mpirun provides a listing to stderr of various MPI statistics. Here is example output for the -print-stats option when used with an eight-rank run of the HPCC benchmark, using the following command:

$ mpirun -np 8 -ppn 1 -m machinefile -M ./hpcc

STATS: MPI Statistics Summary (max,min @ rank)
STATS: Eager count sent (max=171.94K @ 0, min=170.10K @ 3, med=170.20K @ 5)
STATS: Eager bytes sent (max=492.56M @ 5, min=491.35M @ 0, med=491.87M @ 1)
STATS: Rendezvous count sent (max= 5735 @ 0, min= 5729 @ 3, med= 5731 @ 7)
STATS: Rendezvous bytes sent (max= 1.21G @ 4, min= 1.20G @ 2, med= 1.21G @ 0)
STATS: Expected count received(max=173.18K @ 4, min=169.46K @ 1, med=172.71K @ 7)
STATS: Expected bytes received(max= 1.70G @ 1, min= 1.69G @ 2, med= 1.70G @ 7)
STATS: Unexpect count received(max= 6758 @ 0, min= 2996 @ 4, med= 3407 @ 2)
STATS: Unexpect bytes received(max= 1.48M @ 0, min=226.79K @ 5, med=899.08K @ 2)

By default, -M assumes -M=mpi and that the user wants only MPI-level statistics. The man page shows various other low-level categories of statistics that are provided. Here is another example:

$ mpirun -np 8 -ppn 1 -m machinefile -M=mpi,ipath hpcc
STATS: MPI Statistics Summary (max,min @ rank)
STATS: Eager count sent (max=171.94K @ 0, min=170.10K @ 3, med=170.22K @ 1)
STATS: Eager bytes sent (max=492.56M @ 5, min=491.35M @ 0, med=491.87M @ 1)
STATS: Rendezvous count sent (max= 5735 @ 0, min= 5729 @ 3, med= 5731 @ 7)
STATS: Rendezvous bytes sent (max= 1.21G @ 4, min= 1.20G @ 2, med= 1.21G @ 0)
STATS: Expected count received(max=173.18K @ 4, min=169.46K @ 1, med=172.71K @ 7)
STATS: Expected bytes received(max= 1.70G @ 1, min= 1.69G @ 2, med= 1.70G @ 7)
STATS: Unexpect count received(max= 6758 @ 0, min= 2996 @ 4, med= 3407 @ 2)
STATS: Unexpect bytes received(max= 1.48M @ 0, min=226.79K @ 5, med=899.08K @ 2)
STATS: InfiniPath low-level protocol stats
STATS: pio busy count (max=190.01K @ 0, min=155.60K @ 1, med=160.76K @ 5)
STATS: scb unavail exp count (max= 9217 @ 0, min= 7437 @ 7, med= 7727 @ 4)
STATS: tid update count (max=292.82K @ 6, min=290.59K @ 2, med=292.55K @ 4)
STATS: interrupt thread count (max= 941 @ 0, min= 335 @ 7, med= 439 @ 2)
STATS: interrupt thread success(max= 0.00 @ 3, min= 0.00 @ 1, med= 0.00 @ 0)


Statistics other than MPI-level statistics are fairly low level; most users will not understand them. Contact QLogic Technical Support for more information.

Message statistics are available for transmitted and received messages. In all cases, the MPI rank responsible for a minimum or maximum value is reported along with the relevant value. For application runs of at least three ranks, a median is also available.

Since transmitted messages employ either an Eager or a Rendezvous protocol, results are available for both message count and aggregated bytes. Message count represents the number of messages transmitted by each protocol on a per-rank basis. Aggregated message bytes indicate the total amount of data that was moved on each rank by a particular protocol.

On the receive side, messages are split into expected and unexpected messages. Unexpected messages cause the MPI implementation to buffer the transmitted data until the receiver can produce a matching MPI receive buffer. Expected messages refer to the inverse case, which is the common case in most MPI applications. An additional metric, Unexpected count %, representing the proportion of unexpected messages in relation to the total number of messages received, is also shown because of the notable effect unexpected messages have on performance.

For more detailed information, use MPI profilers such as mpiP. For more information on mpiP, see: http://mpip.sourceforge.net/

For information about the HPCC benchmark, see: http://icl.cs.utk.edu/hpcc/



G ULP Troubleshooting

Troubleshooting VirtualNIC and VIO Hardware Issues

To verify that an InfiniBand host can access an Ethernet system through the EVIC, issue a ping command to the Ethernet system from the InfiniBand host. Make certain that the route to the Ethernet system is using the VIO hardware by using the Linux route command on the InfiniBand host, then verify that the route to the subnet is using one of the virtual Ethernet interfaces (i.e., an EIOC). A sketch of this check follows.
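The commands below sketch that verification; the Ethernet system address 172.26.48.10 and the interface name eioc1 are illustrative values, not taken from this guide:

$ ping -c 3 172.26.48.10
$ route -n | grep eioc1
172.26.48.0     0.0.0.0         255.255.240.0   U     0      0        0 eioc1

The route output should list the EIOC interface for the destination subnet; if another interface appears instead, traffic is not using the VIO hardware.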

NOTE:
If the ping command fails, check the following:

• The logical connection between the InfiniBand host and the EVIC (see "Checking the logical connection between the InfiniBand Host and the VIO hardware").

• The interface definitions on the host (see "Checking the interface definitions on the host").

• The physical connection between the VIO hardware and the Ethernet network (see "Verify the physical connection between the VIO hardware and the Ethernet network").

Checking the logical connection between the InfiniBand Host and the VIO hardware

To determine if the logical connection between the InfiniBand host and the VIO hardware is correct, check the following:

• The correct VirtualNIC driver is running.
• The /etc/infiniband/qlgc_vnic.cfg file contains the desired information.
• The host can communicate with the I/O Controllers (IOCs) of the VIO hardware.


Verify that the proper VirtualNIC driver is running

Check that a VirtualNIC driver is running by issuing an lsmod command on the InfiniBand host. Make sure that qlgc_vnic is displayed in the list of modules. Following is an example:

st186:~ # lsmod
Module                  Size  Used by
cpufreq_ondemand       25232  1
cpufreq_userspace      23552  0
cpufreq_powersave      18432  0
powernow_k8            30720  2
freq_table             22400  1 powernow_k8
qlgc_srp               93876  0
qlgc_vnic             116300  0
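On a busy system it can be quicker to filter the listing; the one-liner below is a sketch using the module names shown above:

st186:~ # lsmod | egrep 'qlgc_vnic|qlgc_srp'

If nothing is printed, the VirtualNIC driver is not loaded.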

Verifying that the qlgc_vnic.cfg file contains the correct information

Use the following scenarios to verify that the qlgc_vnic.cfg file contains a definition for the applicable virtual interface:

Issue the command ib_qlgc_vnic_query to get the list of IOCs the host can see.

If the list is empty, there may be a syntax error in the qlgc_vnic.cfg file (e.g., a missing semicolon). Look in /var/log/messages at the time qlgc_vnic was last started to see if any error messages were put in the log at that time.

If the qlgc_vnic.cfg file has been edited since the last time the VirtualNIC driver was started, the driver needs to be restarted. To restart the driver so that it uses the current qlgc_vnic.cfg file, issue /etc/init.d/qlgc_vnic restart.


Verifying that the host can communicate with the I/O Controllers (IOCs) of the VIO hardware

To display the Ethernet VIO cards that the host can see and communicate with, issue the command ib_qlgc_vnic_query. The system returns information similar to the following:

IO Unit Info:
    port LID:        0003
    port GID:        fe8000000000000000066a0258000001
    change ID:       0009
    max controllers: 0x03

    controller[ 1]
        GUID:      00066a0130000001
        vendor ID: 00066a
        device ID: 000030
        IO class:  2000
        ID:        Chassis 0x00066A00010003F2, Slot 1, IOC 1
        service entries: 2
            service[ 0]: 1000066a00000001 / InfiniNIC.InfiniConSys.Control:01
            service[ 1]: 1000066a00000101 / InfiniNIC.InfiniConSys.Data:01

    controller[ 2]
        GUID:      00066a0230000001
        vendor ID: 00066a
        device ID: 000030
        IO class:  2000
        ID:        Chassis 0x00066A00010003F2, Slot 1, IOC 2
        service entries: 2
            service[ 0]: 1000066a00000002 / InfiniNIC.InfiniConSys.Control:02
            service[ 1]: 1000066a00000102 / InfiniNIC.InfiniConSys.Data:02

    controller[ 3]
        GUID:      00066a0330000001
        vendor ID: 00066a
        device ID: 000030
        IO class:  2000
        ID:        Chassis 0x00066A00010003F2, Slot 1, IOC 3
        service entries: 2
            service[ 0]: 1000066a00000003 / InfiniNIC.InfiniConSys.Control:03
            service[ 1]: 1000066a00000103 / InfiniNIC.InfiniConSys.Data:03

When ib_qlgc_vnic_query is run with the -e option, it reports the IOCGUID information. With the -s option, it reports the IOCSTRING information for the Virtual I/O hardware IOCs present on the fabric. Following is an example:

# ib_qlgc_vnic_query -e
ioc_guid=00066a0130000001,dgid=fe8000000000000000066a0258000001,pkey=ffff
ioc_guid=00066a0230000001,dgid=fe8000000000000000066a0258000001,pkey=ffff
ioc_guid=00066a0330000001,dgid=fe8000000000000000066a0258000001,pkey=ffff

# ib_qlgc_vnic_query -s
"Chassis 0x00066A00010003F2, Slot 1, IOC 1"
"Chassis 0x00066A00010003F2, Slot 1, IOC 2"
"Chassis 0x00066A00010003F2, Slot 1, IOC 3"

# ib_qlgc_vnic_query -es
ioc_guid=00066a0130000001,dgid=fe8000000000000000066a0258000001,pkey=ffff,"Chassis 0x00066A00010003F2, Slot 1, IOC 1"
ioc_guid=00066a0230000001,dgid=fe8000000000000000066a0258000001,pkey=ffff,"Chassis 0x00066A00010003F2, Slot 1, IOC 2"
ioc_guid=00066a0330000001,dgid=fe8000000000000000066a0258000001,pkey=ffff,"Chassis 0x00066A00010003F2, Slot 1, IOC 3"


If the host cannot see the applicable IOCs, there are two things to check. First, verify that the adapter port specified in the eioc definition of the /etc/infiniband/qlgc_vnic.cfg file is active. This is done using the ibv_devinfo command on the host, then checking the value of state. If the state is not PORT_ACTIVE, the adapter port is not logically connected to the fabric. It is possible that one of the adapter ports is not physically connected to an InfiniBand switch. For example:

st139:~ # ibv_devinfo
hca_id: mlx4_0
        fw_ver:          2.2.000
        node_guid:       0002:c903:0000:0f80
        sys_image_guid:  0002:c903:0000:0f83
        vendor_id:       0x02c9
        vendor_part_id:  25418
        hw_ver:          0xA0
        board_id:        MT_04A0110002
        phys_port_cnt:   2
        port: 1
                state:       PORT_ACTIVE (4)
                max_mtu:     2048 (4)
                active_mtu:  2048 (4)
                sm_lid:      1
                port_lid:    8
                port_lmc:    0x00
        port: 2
                state:       PORT_ACTIVE (4)
                max_mtu:     2048 (4)
                active_mtu:  2048 (4)
                sm_lid:      1
                port_lid:    9
                port_lmc:    0x00

Second, verify that the adapter port specified in the EIOC definition is the correct port. It may be that the host sees the IOCs, but not over the adapter port named in the definition of the IOC. For example, the host may see the IOCs over adapter Port 1, but the eioc definition in the /etc/infiniband/qlgc_vnic.cfg file specifies PORT=2.


Another reason why the host might not be able to see the necessary IOCs is that the subnet manager has gone down. Issue an iba_saquery command to make certain that the response shows all of the nodes in the fabric. If an error is returned and the adapter is physically connected to the fabric, then the subnet manager has gone down, and this situation needs to be corrected.

Checking the interface definitions on the host

If it is not possible to ping from an InfiniBand host to the Ethernet host, and the ViPort State of the interface is VIPORT_CONNECTED, then issue an ifconfig command. The interfaces defined in the configuration files in the /etc/sysconfig/network directory (for SLES hosts) or the /etc/sysconfig/network-scripts directory (for Red Hat hosts) should be displayed in the list of interfaces in the ifconfig output. For example, the ifconfig output should show an interface for each EIOC configuration file in the following list:

# ls /etc/sysconfig/network-scripts
ifcfg-eioc1
ifcfg-eioc2
ifcfg-eioc3
ifcfg-eioc4
ifcfg-eioc5
ifcfg-eioc6

Interface does not show up in output of 'ifconfig'

If an interface is not displayed in the output of an ifconfig command, there is most likely a problem in the definition of that interface in the /etc/sysconfig/network-scripts/ifcfg-<NAME> (for Red Hat systems) or /etc/sysconfig/network/ifcfg-<NAME> (for SuSE systems) file, where <NAME> is the name of the virtual interface (e.g., eioc1).

NOTE:
For the remainder of this section, ifcfg directory refers to /etc/sysconfig/network-scripts/ on Red Hat systems, and /etc/sysconfig/network on SuSE systems.

Issue an ifup command for the interface. If the interface is then displayed when issuing an ifconfig command, there may be a problem with the way the interface startup is defined in the ifcfg directory's ifcfg-<NAME> file that is preventing the interface from coming up automatically.

If the interface does not come up, check the interface definitions in the ifcfg directory. Make certain that there are no misspellings in the ifcfg-<NAME> file.


Example of ifcfg-eiocx setup for Red Hat systems:

DEVICE=eioc1
BOOTPROTO=static
IPADDR=172.26.48.132
BROADCAST=172.26.63.130
NETMASK=255.255.240.0
NETWORK=172.26.48.0
ONBOOT=yes
TYPE=Ethernet

Example of ifcfg-eiocx setup for SuSE and SLES systems:

BOOTPROTO='static'
IPADDR='172.26.48.130'
BROADCAST='172.26.63.255'
NETMASK='255.255.240.0'
NETWORK='172.26.48.0'
STARTMODE='hotplug'
TYPE='Ethernet'
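After correcting a definition, the interface can be brought up by hand and confirmed; eioc1 below is an illustrative name:

# ifup eioc1
# ifconfig eioc1

If ifconfig now shows the interface but it still fails to come up at boot, revisit the ONBOOT (Red Hat) or STARTMODE (SuSE) setting shown in the examples above.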

Verify the physical connection between the VIO hardware and the Ethernet network

If the interface is displayed in an ifconfig and a ping between the InfiniBand host and the Ethernet host is still unsuccessful, verify that the VIO hardware Ethernet ports are physically connected to the correct Ethernet network. Verify that the Ethernet port corresponding to the IOCGUID for the interface to be used is connected to the expected Ethernet network.

There are up to six IOC GUIDs on each VIO hardware module (six for the IB/Ethernet Bridge Module, two for the EVIC), one for each Ethernet port. If a VIO hardware module can be seen from a host, the ib_qlgc_vnic_query -s command displays information similar to:

EVIC in Chassis 0x00066a000300012a, Slot 19, Ioc 1
EVIC in Chassis 0x00066a000300012a, Slot 19, Ioc 2
EVIC in Chassis 0x00066a000300012a, Slot 8, Ioc 1
EVIC in Chassis 0x00066a000300012a, Slot 8, Ioc 2
EVIC in Chassis 0x00066a00da000100, Slot 2, Ioc 1
EVIC in Chassis 0x00066a00da000100, Slot 2, Ioc 2


Troubleshooting SRP Issues

ib_qlgc_srp_stats showing session in disconnected state

Problem:

If the session is part of a multi-session adapter, ib_qlgc_srp_stats will show it to be in the disconnected state. For example:

SCSI Host #        : 17                 | Mode            : ROUNDROBIN
Trgt Adapter Depth : 1000               | Verify Target   : Yes
Rqst Adapter Depth : 1000               | Rqst LUN Depth  : 16
Tot Adapter Depth  : 1000               | Tot LUN Depth   : 16
Act Adapter Depth  : 998                | Act LUN Depth   : 16
Max LUN Scan       : 512                | Max IO          : 131072 (128 KB)
Max Sectors        : 256                | Max SG Depth    : 33
Session Count      : 2                  | No Connect T/O  : 60 Second(s)
Register In Order  : ON                 | Dev Reqst T/O   : 2 Second(s)
Description        : SRP Virtual HBA 1
Session            : Session 1          | State           : Disconnected
Source GID         : 0xfe8000000000000000066a000100d051
Destination GID    : 0xfe8000000000000000066a0260000165
SRP IOC Profile    : Chassis 0x00066A0001000481, Slot 1, IOC 1
SRP Target IOClass : 0xFF00             | SRP Target SID  : 0x0000494353535250
SRP IPI Guid       : 0x00066a000100d051 | SRP IPI Extnsn  : 0x0000000000000001
SRP TPI Guid       : 0x00066a0138000165 | SRP TPI Extnsn  : 0x0000000000000001
Source LID         : 0x000b             | Dest LID        : 0x0004
Completed Sends    : 0x00000000000002c0 | Send Errors     : 0x0000000000000000
Completed Receives : 0x00000000000002c0 | Receive Errors  : 0x0000000000000000
Connect Attempts   : 0x0000000000000000 | Test Attempts   : 0x0000000000000000
Total SWUs         : 0x00000000000003e8 | Available SWUs  : 0x00000000000003e8
Busy SWUs          : 0x0000000000000000 | SRP Req Limit   : 0x00000000000003e8
SRP Max ITIU       : 0x0000000000000140 | SRP Max TIIU    : 0x0000000000000140
Host Busys         : 0x0000000000000000 | SRP Max SG Used : 0x000000000000000f
Session            : Session 2          | State           : Disconnected
Source GID         : 0xfe8000000000000000066a000100d052
Destination GID    : 0xfe8000000000000000066a0260000165
SRP IOC Profile    : Chassis 0x00066A0001000481, Slot 1, IOC 2
SRP Target IOClass : 0xFF00             | SRP Target SID  : 0x0000494353535250
SRP IPI Guid       : 0x00066a000100d052 | SRP IPI Extnsn  : 0x0000000000000001
SRP TPI Guid       : 0x00066a0238000165 | SRP TPI Extnsn  : 0x0000000000000001
Source LID         : 0x000c             | Dest LID        : 0x0004
Completed Sends    : 0x00000000000001c8 | Send Errors     : 0x0000000000000000
Completed Receives : 0x00000000000001c8 | Receive Errors  : 0x0000000000000000
Connect Attempts   : 0x0000000000000000 | Test Attempts   : 0x0000000000000000
Total SWUs         : 0x00000000000003e8 | Available SWUs  : 0x00000000000003e8
Busy SWUs          : 0x0000000000000000 | SRP Req Limit   : 0x00000000000003e8
SRP Max ITIU       : 0x0000000000000140 | SRP Max TIIU    : 0x0000000000000140
Host Busys         : 0x0000000000000000 | SRP Max SG Used : 0x000000000000000f


Solution:

Perhaps an interswitch cable has been disconnected, or the VIO hardware is offline, or the Chassis/Slot does not contain a VIO hardware card. Instead of relying on the ib_qlgc_srp_stats output, use the ib_qlgc_srp_query command to verify that the desired adapter port is in the active state.

NOTE:
It is normal to see the "Can not find a path" message when the system first boots up. Sometimes SRP comes up before the subnet manager has brought the port state of the adapter port to active. If the adapter port is not active, SRP will not be able to find the VIO hardware card. Use the appropriate OFED command to show the port state.

Session in 'Connection Rejected' state

Problem:

The session is in the 'Connection Rejected' state according to /var/log/messages. If the session is part of a multi-session adapter, ib_qlgc_srp_stats shows it in the "Connection Rejected" state.

A host displays:

"Connection Failed for Session X: IBT Code = 0x0"
"Connection Failed for Session X: SRP Code = 0x1003"
"Connection Rejected"

D000046-005 B G-9


G–ULP Troubleshooting<br />

Troubleshooting SRP Issues<br />

In this case, the ib_qlgc_srp_stats output is identical to the example shown in "ib_qlgc_srp_stats showing session in disconnected state" above.

AND

The VIO hardware displays "Initiator Not Configured within IOU: initiator <initiator>. Initiator port identifier (<identifier>) is invalid/not allowed to use this FCIOU".


Solution 1:

The host initiator has not been configured as an SRP initiator on the VIO hardware SRP Initiator Discovery screen. Via Chassis Viewer, bring up the SRP Initiator Discovery screen and either:

• Click 'Add New' to add a wildcarded entry with the initiator extension set to match what is in the session entry in the qlgc_srp.cfg file, or

• Click the Start button to discover the adapter port GUID, then click 'Configure' on the row containing the adapter port GUID and give the entry a name.

Solution 2:

Check the SRP map on the VIO hardware specified in the failing Session block of the qlgc_srp.cfg file. Make certain there is a map defined for the row specified by either the initiatorExtension or the adapter port GUID given in the failing Session block of the qlgc_srp.cfg file. Additionally, make certain that the map in that row is in the column of the IOC specified in the failing Session block of the qlgc_srp.cfg file.

Attempts to read or write to disk are unsuccessful

Problem:

Attempts to read or write to the disk are unsuccessful when SRP comes up. About every five seconds the VIO hardware displays:

"fcIOStart Failed"
"CMDisconnect called for Port: xxxxxxxx Initiator: Target: "
"Target Port Deleted for Port: xxxxxxxx Initiator: Target: "

The host log shows a session transitioning between Connected and Down. The host log also displays "Test Unit Ready has FAILED", "Abort Task has SUCCEEDED", and "Clear Task has FAILED".

Solution:

This indicates a problem in the path between the VIO hardware and the target storage device. After an SRP host has connected to the VIO hardware successfully, the host sends a "Test Unit Ready" command to the storage device. If that command is not responded to within five seconds, the SRP host brings down the session and retries in five seconds. Verify that the status of the connection between the appropriate VIO hardware port and the target device is UP on the FCP Device Discovery screen.


Problem:

Attempts to read or write to the disk are unsuccessful when they were previously successful. The host displays 'Sense Data indicates recovery is necessary on Session' along with the "Test Unit Ready has FAILED", "Abort Task has SUCCEEDED", and "Clear Task has FAILED" messages.

Solution:

If there is a problem with communication between the VIO hardware and the storage device (e.g., the cable between the storage device and the Fibre Channel switch was pulled), the VIO hardware log will display a "Connection Lost to NPort Id" message. The next time the host tries to do an input/output (I/O), the 'Sense Data indicates recovery is necessary' message appears. Then SRP will recycle the session. As part of trying to move the session from 'Connected' to 'Active', SRP will issue the 'Test Unit Ready' command.

Verify that the status of the connection between the appropriate VIO hardware port and the target device is UP on the FCP Device Discovery screen.

Additionally, there may occasionally be messages in the log such as:

Connection Failed for Session X: IBT Code = 0x0
Connection Failed for Session X: SRP Code = 0x0

These may indicate a problem in the path between the VIO hardware and the target storage device.

Four sessions in a round-robin configuration are active

Problem:

Four sessions in a round-robin configuration are active according to ib_qlgc_srp_stats. However, only one disk can be seen, although five should be seen.

Solution 1:

Make certain that Max LUN Scan is reporting the same value that adapterMaxLUNs is set to in qlgc_srp.cfg.

Solution 2:

Make certain that all sessions have a map to the same disk defined. The fact that the session is active means that the session can see a disk. However, if one of the sessions is using a map with the 'wrong' disk, then the round-robin method could lead to a disk or disks not being seen.

Which port does a port GUID refer to?

Solution:

A QLogic Host Channel Adapter Port GUID is of the form 00066appa0iiiiii, where pp gives the port number (0 relative) and iiiiii gives the individual ID number of the adapter. So 00066a00a0iiiiii is the port GUID of the first port of the adapter, and 00066a01a0iiiiii is the port GUID of the second port of the adapter.

Similarly, a VFx Port GUID is of the form 00066app38iiiiii, where pp gives the IOC number (1 or 2) and iiiiii gives the individual ID number of the VIO hardware. So 00066a0138iiiiii is the port GUID of IOC 1 of VIO hardware iiiiii, and 00066a0238iiiiii is the port GUID of IOC 2 of VIO hardware iiiiii. A worked decoding follows.
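As a worked decoding of this layout, applied to the TPI GUID that appears in the earlier ib_qlgc_srp_stats output (a reading of the format described above, not additional command output):

0x00066a0138000165
  00066a -> QLogic vendor prefix
  01     -> pp: IOC 1
  38     -> VFx port GUID marker
  000165 -> individual ID of the VIO hardware

so this GUID belongs to IOC 1 of the VIO hardware whose individual ID is 000165.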

NOTE:
After a virtual adapter has been successfully added (meaning at least one session belonging to the adapter has gone to the Active state), the SRP module indicates what type of session was created in the mode variable of the ib_qlgc_srp_stats output, depending on whether "roundrobinmode: 1" is set in the qlgc_srp.cfg file. In this case "X" is the virtual adapter number, with number 0 being the first one created.

If no sessions were successfully brought to the Active state, then the roundrobin_X or failover_X file will not be created.

In a round-robin configuration, if everything is configured correctly, all sessions will be Active.

In a failover configuration, if everything is configured correctly, one session will be Active and the rest will be Connected. The transition of a session from Connected to Active will not be attempted until that session needs to become Active, due to the failure of the previously Active session.

How does the user find a Host Channel Adapter port GUID?

Solution:

A Host Channel Adapter Port GUID is displayed by entering the following at any host prompt:

ibv_devinfo -i 1 (for port 1)
ibv_devinfo -i 2 (for port 2)


The system displays information similar to the following:

st106:~ # ibv_devinfo -i 1
hca_id: mthca0
        fw_ver:          5.1.9301
        node_guid:       0006:6a00:9800:6c9f
        sys_image_guid:  0006:6a00:9800:6c9f
        vendor_id:       0x066a
        vendor_part_id:  25218
        hw_ver:          0xA0
        board_id:        SS_0000000005
        phys_port_cnt:   2
        port: 1
                state:       PORT_ACTIVE (4)
                max_mtu:     2048 (4)
                active_mtu:  2048 (4)
                sm_lid:      71
                port_lid:    60
                port_lmc:    0x00

st106:~ # ibv_devinfo -i 2
hca_id: mthca0
        fw_ver:          5.1.9301
        node_guid:       0006:6a00:9800:6c9f
        sys_image_guid:  0006:6a00:9800:6c9f
        vendor_id:       0x066a
        vendor_part_id:  25218
        hw_ver:          0xA0
        board_id:        SS_0000000005
        phys_port_cnt:   2
        port: 2
                state:       PORT_ACTIVE (4)
                max_mtu:     2048 (4)
                active_mtu:  2048 (4)
                sm_lid:      71
                port_lid:    64
                port_lmc:    0x00


Need to determine the SRP driver version

Solution:

To determine the SRP driver version number, enter the command modinfo -d qlgc-srp, which returns information similar to the following:

st159:~ # modinfo -d qlgc-srp
QLogic Corp. Virtual HBA (SRP) SCSI Driver, version 1.0.0.0.3



H Write Combining

Introduction

Write combining improves write bandwidth to the QLogic chip by writing multiple words in a single bus transaction (typically 64 bytes). Write combining applies only to x86_64 systems.

The x86 Page Attribute Table (PAT) mechanism that allocates Write Combining (WC) mappings for the PIO buffers has been added and is now the default.

If PAT is unavailable or PAT initialization fails, the code will generate a message in the log and fall back to the Memory Type Range Registers (MTRR) mechanism.

If write combining is not working properly, lower than expected bandwidth may occur.

The following sections provide instructions for checking write combining and for using PAT and MTRR.

Verify Write Combining is Working

To see if write combining is working correctly and to check the bandwidth, run the following command:

$ ipath_pkt_test -B

With write combining enabled, the QLE7140 and QLE7240 report in the range of 1150–1500 MBps. The QLE7280 reports in the range of 1950–3000 MBps. You can also use ipath_checkout (use option 5) to check bandwidth.

Although the PAT mechanism should work correctly by default, increased latency and low bandwidth may indicate a problem. If so, the interconnect operates, but in a degraded performance mode, with latency increasing to several microseconds, and bandwidth decreasing to as little as 200 MBps.

Upon driver startup, you may see these errors:

ib_qib 0000:04:01.0: infinqib0: Performance problem: bandwidth to
PIO buffers is only 273 MiB/sec
...


If you do not see any of these messages on your console, but suspect this problem, check the /var/log/messages file. Some systems suppress driver load messages but still output them to the log file.

Methods for enabling and disabling the two write combining mechanisms are described in the following sections. There are no conflicts between the two methods.

PAT and Write Combining

This is the default mechanism for allocating Write Combining (WC) mappings for the PIO buffers. It is set as a parameter in /etc/modprobe.conf (on Red Hat systems) or /etc/modprobe.conf.local (on SLES systems). The default is:

options ib_qib wc_pat=1

If PAT is unavailable or PAT initialization fails, the code generates a message in the log and falls back to the Memory Type Range Registers (MTRR) mechanism. To use MTRR, disable PAT by setting this module parameter to 0 (as a root user):

options ib_qib wc_pat=0

Then, revert to using MTRR-only behavior by following one of the two suggestions in "MTRR Mapping and Write Combining" on page H-2.

The driver must be restarted after the changes have been made.

NOTE:
There will be no WC entry in /proc/mtrr when using PAT.
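Conversely, when the MTRR mechanism is in use, a write-combining entry should be visible in /proc/mtrr. The command and the sample entry below illustrate the standard /proc/mtrr format; the register number, base, and size are examples, not values from this guide:

$ cat /proc/mtrr
reg05: base=0xd8000000 (3456MB), size=16MB, count=1: write-combining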

MTRR Mapping and Write Combining

Two suggestions for properly enabling MTRR mapping for write combining are described in the following sections.

See "Performance Issues" on page F-9 for more details on a related performance issue.

Edit BIOS Settings to Fix MTRR Issues

You can edit the BIOS setting for MTRR mapping. The BIOS setting looks similar to:

MTRR Mapping [Discrete]

For systems with very large amounts of memory (32GB or more), it may also be necessary to adjust the BIOS setting for the PCI hole granularity to 2GB. This setting allows the memory to be mapped with fewer MTRRs, so that there will be one or more unused MTRRs for the InfiniPath driver.


Some BIOSes do not have the MTRR mapping option. It may have a different name, depending on the chipset, vendor, BIOS, or other factors. For example, it is sometimes referred to as 32 bit memory hole. This setting must be enabled.

If there is no setting for MTRR mapping or 32 bit memory hole, and you have problems with degraded performance, contact your system or motherboard vendor and ask how to enable write combining.

Use the ipath_mtrr Script to Fix MTRR Issues

QLogic also provides a script, ipath_mtrr, which sets the MTRR registers, enabling maximum performance from the InfiniPath driver. This Python script is available as a part of the InfiniPath software download, and is contained in the infinipath* RPM. It is installed in /bin.

To diagnose the machine, run it with no arguments (as a root user):

# ipath_mtrr

The test results will list any problems, if they exist, and provide suggestions on what to do.

To fix the MTRR registers, use:

# ipath_mtrr -w

Restart the driver after fixing the registers.

This script needs to be run after each system reboot. It can be set to run automatically upon restart by adding this line in /etc/sysconfig/infinipath:

IPATH_MTRR_ACTIVE=1

See the ipath_mtrr(8) man page for more information on other options.



I Useful Programs and Files

The most useful programs and files for debugging, and commands for common tasks, are presented in the following sections. Many of these programs and files have been discussed elsewhere in the documentation. This information is summarized and repeated here for your convenience.

Check Cluster Homogeneity with ipath_checkout

Many problems can be attributed to the lack of homogeneity in the cluster environment. Use the following items as a checklist for verifying homogeneity. A difference in any one of these items in your cluster may cause problems:

• Kernels
• Distributions
• Versions of the QLogic boards
• Runtime and build environments
• .o files from different compilers
• Libraries
• Processor/link speeds
• PIO bandwidth
• MTUs

With the exception of finding any differences between the runtime and build environments, ipath_checkout will pick up information on all the above items. Other programs useful for verifying homogeneity are listed in Table I-1. More details on ipath_checkout are in "ipath_checkout" on page I-10.

Restarting InfiniPath

When the driver status appears abnormal on any node, you can try restarting (as a root user). Type:

# /etc/init.d/openibd restart

These two commands perform the same function as restart:

# /etc/init.d/openibd stop
# /etc/init.d/openibd start

Also check the /var/log/messages file for any abnormal activity.
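A quick way to scan the log after a restart is sketched below; the driver-name pattern is illustrative and should be adjusted to the messages your system actually logs:

# egrep -i 'infinipath|ib_qib' /var/log/messages | tail -20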


Summary and Descriptions of Useful Programs

Useful programs are summarized in Table I-1. Names in blue text are linked to a corresponding section that provides further details. Check the man pages for more information on the programs.

Table I-1. Useful Programs

chkconfig
    Checks the configuration state and enables/disables services, including drivers. Can be useful for checking homogeneity.

dmesg
    Prints out bootup messages. Useful for checking for initialization problems.

ibhosts [a]
    Checks that all hosts in the fabric are up and visible to the subnet manager and to each other.

ibstatus [a]
    Checks the status of InfiniBand devices when OpenFabrics is installed.

ibtracert [a]
    Determines the path that InfiniBand packets travel between two nodes.

ibv_devinfo [a]
    Lists information about InfiniBand devices in use. Use when OpenFabrics is enabled.

ident [b]
    Identifies RCS keyword strings in files. Can check for dates, release versions, and other identifying information.

ipathbug-helper [c]
    A shell script that gathers status and history information for use in analyzing InfiniPath problems.

ipath_checkout [c]
    A bash shell script that performs sanity testing on a cluster using QLogic hardware and InfiniPath software. When the program runs without errors, the node is properly configured.

ipath_control [c]
    A shell script that manipulates various parameters for the InfiniPath driver. This script gathers the same information contained in boardversion, status_str, and version.

ipath_mtrr [c]
    A Python script that sets the MTRR registers.

ipath_pkt_test [c]
    Tests the InfiniBand link and bandwidth between two QLogic InfiniBand adapters, or, using an InfiniBand loopback connector, tests within a single QLogic InfiniBand adapter.

ipathstats [c]
    Displays driver statistics and hardware counters, including performance and "error" (including status) counters.

lsmod
    Shows status of modules in the Linux kernel. Use to check whether drivers are loaded.

modprobe
    Adds or removes modules from the Linux kernel.

mpi_stress
    An MPI stress test program designed to load up an MPI interconnect with point-to-point messages while optionally checking for data integrity.

mpirun [d]
    A front end program that starts an MPI job on an InfiniPath cluster. Use to check the origin of the drivers.

ps
    Displays information on current active processes. Use to check whether all necessary processes have been started.

rpm
    Package manager to install, query, verify, update, or erase software packages. Use to check the contents of a package.

strings [e]
    Prints the strings of printable characters in a file. Useful for determining contents of non-text files such as date and version.

Table Notes
[a] These programs are contained in the OpenFabrics openib-diags RPM.
[b] These programs are contained within the rcs RPM for your distribution.
[c] These programs are contained in the infinipath RPM. To use these programs, install the infinipath RPM on the nodes where you install the mpi-frontend RPM.
[d] These programs are contained in the QLogic mpi-frontend RPM.
[e] These programs are contained within the binutils RPM for your distribution.

dmesg

dmesg prints out bootup messages. It is useful for checking for initialization problems. You can check to see if problems were detected during the driver and QLogic hardware initialization with the command:

$ dmesg | egrep -i 'infinipath|qib'

This command may generate more than one screen of output.
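To page through long output, you can pipe it through less:

$ dmesg | egrep -i 'infinipath|qib' | less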


iba_opp_query

This command retrieves path records from the Distributed SA and is somewhat similar to iba_saquery. It is intended for testing the Distributed SA (qlogic_sa) and for verifying connectivity between nodes in the fabric. For information on configuring and using the Distributed SA, refer to "QLogic Distributed Subnet Administration" on page 3-13.

iba_opp_query does not access the SM when doing queries; it only accesses the local Distributed SA database. For that reason, the kinds of queries that can be done are much more limited than with iba_saquery. In particular, it can only find paths that start on the machine where the command is run. (In other words, the source LID or source GID must be on the local node.) In addition, queries must supply either a source and destination LID, or a source and destination GID; they cannot be mixed. You will also usually need to provide either a SID that was specified in the Distributed SA configuration file, or a pkey that matches such a SID.

Usage

iba_opp_query [-v level] [-hca hca] [-p port] [-s LID] [-d LID] [-S GID] [-D GID] [-k pkey] [-i sid] [-H]

Options

-v/--verbose level — Debug level. Should be a number between 1 and 7. Default is 5.
-s/--slid LID — Source LID. Can be in decimal, hex (0x##), or octal (0##).
-d/--dlid LID — Destination LID. Can be in decimal, hex (0x##), or octal (0##).
-S/--sgid GID — Source GID. Can be in GID ("0x########:0x########") or inet6 format ("##:##:##:##:##:##:##:##").
-D/--dgid GID — Destination GID. Can be in GID ("0x########:0x########") or inet6 format ("##:##:##:##:##:##:##:##").
-k/--pkey pkey — Partition Key.
-i/--sid sid — Service ID.
-h/--hca hca — The Host Channel Adapter to use. (Defaults to the first Host Channel Adapter.) The Host Channel Adapter can be identified by name ("mthca0", "qib1", et cetera) or by number (1, 2, 3, et cetera).
-p/--port port — The port to use. (Defaults to the first port.)
-H/--help — Provides this help text.

All arguments are optional, but ill-formed queries can be expected to fail. You must provide at least a pair of LIDs, or a pair of GIDs.


Sample output:

# iba_opp_query --slid 0x31 --dlid 0x75 --sid 0x107
Query Parameters:
      resv1        0x0000000000000107
      dgid         ::
      sgid         ::
      dlid         0x75
      slid         0x31
      hop          0x0
      flow         0x0
      tclass       0x0
      num_path     0x0
      pkey         0x0
      qos_class    0x0
      sl           0x0
      mtu          0x0
      rate         0x0
      pkt_life     0x0
      preference   0x0
      resv2        0x0
      resv3        0x0
Using HCA qib0
Result:
      resv1        0x0000000000000107
      dgid         fe80::11:7500:79:e54a
      sgid         fe80::11:7500:79:e416
      dlid         0x75
      slid         0x31
      hop          0x0
      flow         0x0
      tclass       0x0
      num_path     0x0
      pkey         0xffff
      qos_class    0x0
      sl           0x1
      mtu          0x4
      rate         0x6
      pkt_life     0x10
      preference   0x0
      resv2        0x0
      resv3        0x0

Explanation of Sample Output:

This is a simple query, specifying the source and destination LIDs and the desired SID. The first half of the output shows the full "query" that will be sent to the Distributed SA. Unused fields are set to zero or are blank.

In the center, the line "Using HCA qib0" tells us that, because we did not specify which Host Channel Adapter to query against, the tool chose one for us. (Normally, the user will never have to specify which Host Channel Adapter to use. This is only relevant in the case where a single node is connected to multiple physical IB fabrics.)

Finally, the bottom half of the output shows the result of the query. Note that if the query had failed (because the destination does not exist or because the SID is not found in the Distributed SA), you will receive an error instead:

# iba_opp_query --slid 0x31 --dlid 0x75 --sid 0x108
Query Parameters:
      resv1        0x0000000000000108
      dgid         ::
      sgid         ::
      dlid         0x75
      slid         0x31
      hop          0x0
      flow         0x0
      tclass       0x0
      num_path     0x0
      pkey         0x0
      qos_class    0x0
      sl           0x0
      mtu          0x0
      rate         0x0
      pkt_life     0x0
      preference   0x0
      resv2        0x0
      resv3        0x0
Using HCA qib0
******
Error: Get Path returned 22 for query: Invalid argument
******


Examples:

Query by LID and SID:

iba_opp_query -s 0x31 -d 0x75 -i 0x107
iba_opp_query --slid 0x31 --dlid 0x75 --sid 0x107

Queries using octal or decimal numbers:

iba_opp_query --slid 061 --dlid 0165 --sid 0407 (using octal numbers)
iba_opp_query --slid 49 --dlid 113 --sid 263 (using decimal numbers)

Note that these queries are the same as the first two; only the base of the numbers has changed.

Query by LID and PKEY:

iba_opp_query --slid 0x31 --dlid 0x75 --pkey 0x8002

Query by GID:

iba_opp_query -S fe80::11:7500:79:e416 -D fe80::11:7500:79:e54a --sid 0x107
iba_opp_query -S 0xfe80000000000000:0x001175000079e416 -D 0xfe80000000000000:0x001175000079e394 --sid 0x107

As before, these queries are identical to the first two queries; they are just using the GIDs instead of the LIDs to specify the ports involved.

ibhosts

This tool determines if all the hosts in your InfiniBand fabric are up and visible to the subnet manager and to each other. It is installed from the openib-diag RPM. Running ibhosts (as a root user) produces output similar to this when run from a node on the InfiniBand fabric:

# ibhosts
Ca : 0x0008f10001280000 ports 2 "Voltaire InfiniBand Fiber-Channel Router"
Ca : 0x0011750000ff9869 ports 1 "idev-11"
Ca : 0x0011750000ff9878 ports 1 "idev-05"
Ca : 0x0011750000ff985c ports 1 "idev-06"
Ca : 0x0011750000ff9873 ports 1 "idev-04"

ibstatus

This program displays basic information on the status of InfiniBand devices that are currently in use when OpenFabrics RPMs are installed. It is installed from the openib-diag RPM.


Following is a sample output for the SDR adapters:

$ ibstatus
Infiniband device 'qib0' port 1 status:
      default gid:    fe80:0000:0000:0000:0011:7500:0005:602f
      base lid:       0x35
      sm lid:         0x2
      state:          4: ACTIVE
      phys state:     5: LinkUp
      rate:           10 Gb/sec (4X)

Following is a sample output for the DDR adapters; note the difference in rate:

$ ibstatus
Infiniband device 'qib0' port 1 status:
      default gid:    fe80:0000:0000:0000:0011:7500:00ff:9608
      base lid:       0xb
      sm lid:         0x1
      state:          4: ACTIVE
      phys state:     5: LinkUp
      rate:           20 Gb/sec (4X DDR)

ibtracert

The tool ibtracert determines the path that InfiniBand packets travel between two nodes. It is installed from the openib-diag RPM. The InfiniBand LIDs of the two nodes in this example are determined by using the ipath_control -i command on each node. The ibtracert tool produces output similar to the following when run (as a root user) from a node on the InfiniBand fabric:

# ibtracert 0xb9 0x9a
From ca {0x0011750000ff9886} portnum 1 lid 0xb9-0xb9 "iqa-37"
[1] -> switch port {0x0002c9010a19bea0}[1] lid 0x14-0x14 "MT47396 Infiniscale-III"
[24] -> switch port {0x00066a0007000333}[8] lid 0xc-0xc "SilverStorm 9120 GUID=0x00066a000200016c Leaf 6, Chip A"
[6] -> switch port {0x0002c90000000000}[15] lid 0x9-0x9 "MT47396 Infiniscale-III"
[7] -> ca port {0x0011750000ff9878}[1] lid 0x9a-0x9a "idev-05"
To ca {0x0011750000ff9878} portnum 1 lid 0x9a-0x9a "idev-05"


ibv_devinfo

This program displays information about InfiniBand devices, including various kinds of identification and status data. It is installed from the openib-diag RPM. Use this program when OpenFabrics is enabled. ibv_devinfo queries RDMA devices. Use the -v option to see more information. For example:

$ ibv_devinfo
hca_id: qib0
      fw_ver:             0.0.0
      node_guid:          0011:7500:00ff:89a6
      sys_image_guid:     0011:7500:00ff:89a6
      vendor_id:          0x1175
      vendor_part_id:     29216
      hw_ver:             0x2
      board_id:           InfiniPath_QLE7280
      phys_port_cnt:      1
      port: 1
            state:        PORT_ACTIVE (4)
            max_mtu:      4096 (5)
            active_mtu:   4096 (5)
            sm_lid:       1
            port_lid:     31
            port_lmc:     0x00

ident

The ident strings are available in ib_qib.ko. Running ident provides driver information similar to the following. For QLogic RPMs, it will look like the following example:

$ ident /lib/modules/$(uname -r)/updates/kernel/drivers/infiniband/hw/ib_qib.ko
/lib/modules/$(uname -r)/updates/kernel/drivers/infiniband/hw/ib_qib.ko:
$Id: QLogic OFED Release 1.5 $
$Date: 2010-02-17-18:51 $

If the /lib/modules/$(uname -r)/updates directory is not present, then the driver in use is the one that comes with the core kernel. In this case, either the kernel-ib RPM is not installed or it is not configured for the current running kernel.

If the updates directory is present, but empty except for the subdirectory kernel, then an OFED install is probably being used, and the ident string will be empty. For example:

$ cd /lib/modules/$(uname -r)/updates
$ ls
kernel
$ cd kernel/drivers/infiniband/hw/qib/
lib/modules/2.6.18-8.el5/updates/kernel/drivers/infiniband/hw/qib
$ ident ib_qib.ko
ib_qib.ko:
ident warning: no id keywords in ib_qib.ko

NOTE:
ident is in the optional rcs RPM, and is not always installed.

ipathbug-helper

The tool ipathbug-helper is useful for verifying homogeneity. It is installed from the infinipath RPM. Before contacting QLogic Technical Support, run this script on the head node of your cluster and the compute nodes that you suspect are having problems. Looking at the output often helps you find the problem. Run ipathbug-helper on several nodes and examine the output for differences.

It is best to run ipathbug-helper with root privilege, since some of the queries it makes require this level of privilege. There is also a --verbose parameter that increases the amount of gathered information.

If you cannot see the problem, send the stdout output to your reseller, along with information on the version of the InfiniPath software you are using.

ipath_checkout

The ipath_checkout tool is a bash script that verifies that the installation is correct and that all the nodes of the network are functioning and mutually connected by the InfiniPath fabric. It is installed from the infinipath RPM. It must be run on a front end node, and requires specification of a nodefile. For example:

$ ipath_checkout [options] nodefile

The nodefile lists the hostnames of the nodes of the cluster, one hostname per line. The format of nodefile is as follows:

hostname1
hostname2
...


NOTE:
• The hostnames in the nodefile are Ethernet hostnames, not IPv4 addresses.
• To create a nodefile, use the ibhosts program. It will generate a list of available nodes that are already connected to the switch; one way to extract the hostnames is sketched below.
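For example, since the quoted description at the end of each ibhosts line is typically the hostname, one possible way to build a nodefile is the following (a sketch; review the result, since switch or router entries may also appear in the ibhosts output):

# ibhosts | sed -n 's/.*"\([^"]*\)".*/\1/p' > nodefile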

ipath_checkout performs the following seven tests on the cluster:

1. Executes the ping command to all nodes to verify that they all are reachable from the front end.
2. Executes the ssh command to each node to verify correct configuration of ssh.
3. Gathers and analyzes system configuration from the nodes.
4. Gathers and analyzes RPMs installed on the nodes.
5. Verifies InfiniPath hardware and software status and configuration, including tests for link speed, PIO bandwidth (incorrect MTRR settings), and MTU size.
6. Verifies the ability to mpirun jobs on the nodes.
7. Runs a bandwidth and latency test on every pair of nodes and analyzes the results.

The options available with ipath_checkout are shown in Table I-2.

Table I-2. ipath_checkout Options

Command               Meaning

-h, --help            These options display help messages describing how a
                      command is used.

-v, --verbose         These options specify three successively higher
-vv, --vverbose       levels of detail in reporting test results. There
-vvv, --vvverbose     are four levels of detail in all, including the case
                      where none of these options are given.

-c, --continue        When this option is not specified, the test
                      terminates when any test fails. When specified, the
                      tests continue after a failure, with failing nodes
                      excluded from subsequent tests.

-k, --keep            This option keeps intermediate files that were
                      created while performing tests and compiling
                      reports. Results are saved in a directory created by
                      mktemp and named infinipath_XXXXXX or in the
                      directory name given to --workdir.

--workdir=DIR         Use DIR to hold intermediate files created while
                      running tests. DIR must not already exist.

--run=LIST            This option runs only the tests in LIST. See the
                      seven tests listed previously. For example,
                      --run=123 will run only tests 1, 2, and 3.

--skip=LIST           This option skips the tests in LIST. See the seven
                      tests listed previously. For example, --skip=2457
                      will skip tests 2, 4, 5, and 7.

-d, --debug           This option turns on the -x and -v flags in bash(1).
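For example, a run that reports extra detail, continues past individual failures, and keeps its intermediate files in a named directory (the directory name here is illustrative, and it must not already exist):

$ ipath_checkout -v -c -k --workdir=/tmp/ipc_results nodefile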

In most cases of failure, the script suggests recommended actions. Also refer to the ipath_checkout man page.

ipath_control

The ipath_control tool is a shell script that manipulates various parameters for the InfiniPath driver. It is installed from the infinipath RPM. Many of the parameters are used only when diagnosing problems, and may require special system configurations. Using these options may require restarting the driver or utility programs to recover from incorrect parameters.

Most of the functionality is accessed via the /sys filesystem. This shell script gathers the same information contained in these files:

/sys/class/infiniband/qib0/device/boardversion
/sys/class/infiniband/qib0/ports/1/linkcontrol/status_str
/sys/class/infiniband/qib0/device/driver/version

These files are also documented in Table I-4 and Table I-5.

Other than the -i option, this script must be run with root permissions. See the man pages for ipath_control for more details.


Here is sample usage and output:

% ipath_control -i
$Id: QLogic OFED Release 1.5 $ $Date: 2010-03-01-23:28 $
0: Version: ChipABI 2.0, InfiniPath_QLE7342, InfiniPath1 6.1, SW Compat 2
0: Serial: RIB0941C00005 LocalBus: PCIe,5000MHz,x8
0,1: Status: 0xe1 Initted Present IB_link_up IB_configured
0,1: LID=0x1 GUID=0011:7500:0079:e574
0,1: HRTBT:Auto LINK:40 Gb/sec (4X QDR)
0,2: Status: 0x21 Initted Present [IB link not Active]
0,2: LID=0xffff GUID=0011:7500:0079:e575

The -i option combined with the -v option is very useful for looking at the InfiniBand width/rate and PCIe lanes/rate. For example:

% ipath_control -iv
$Id: QLogic OFED Release 1.5 $ $Date: 2010-03-01-23:28 $
0: Version: ChipABI 2.0, InfiniPath_QLE7342, InfiniPath1 6.1, SW Compat 2
0: Serial: RIB0941C00005 LocalBus: PCIe,5000MHz,x8
0,1: Status: 0xe1 Initted Present IB_link_up IB_configured
0,1: LID=0x1 GUID=0011:7500:0079:e574
0,1: HRTBT:Auto LINK:40 Gb/sec (4X QDR)
0,2: Status: 0x21 Initted Present [IB link not Active]
0,2: LID=0xffff GUID=0011:7500:0079:e575
0,2: HRTBT:Auto LINK:10 Gb/sec (4X)

NOTE:
On the first line, Release version refers to the current software release. The second line contains chip architecture version information.

Another useful option blinks the LED on the InfiniPath adapter (QLE7240 and QLE7280 adapters). This is useful for finding an adapter within a cluster. Run the following as a root user:

# ipath_control -b [On|Off]

ipath_mtrr

NOTE:
Use ipath_mtrr if you are not using the default PAT mechanism to enable write combining.


MTRR is used by the InfiniPath driver to enable write combining to the QLogic on-chip transmit buffers. This option improves write bandwidth to the QLogic chip by writing multiple words in a single bus transaction (typically 64 bytes). This option applies only to x86_64 systems. It can often be set in the BIOS.

However, some BIOSes do not have the MTRR mapping option. It may have a different name, depending on the chipset, vendor, BIOS, or other factors. For example, it is sometimes referred to as 32 bit memory hole. This setting must be enabled.

If there is no setting for MTRR mapping or 32 bit memory hole, contact your system or motherboard vendor and ask how to enable write combining.

You can check and adjust these BIOS settings using the BIOS Setup utility. For specific instructions, follow the hardware documentation that came with your system.

QLogic also provides a script, ipath_mtrr, which sets the MTRR registers, enabling maximum performance from the InfiniPath driver. This Python script is available as a part of the InfiniPath software download, and is contained in the infinipath* RPM. It is installed in /bin.

To diagnose the machine, run it with no arguments (as a root user):

# ipath_mtrr

The test results will list any problems, if they exist, and provide suggestions on what to do.

To fix the MTRR registers, use:

# ipath_mtrr -w

Restart the driver after fixing the registers.

This script needs to be run after each system reboot. It can be set to run automatically upon restart by adding this line to /etc/sysconfig/infinipath:

IPATH_MTRR_ACTIVE=1
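For example, the line can be appended from a root shell (assuming it is not already present in the file):

# echo 'IPATH_MTRR_ACTIVE=1' >> /etc/sysconfig/infinipath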

See the ipath_mtrr(8) man page for more information on other options.

ipath_pkt_test

This program is installed from the infinipath RPM. Use ipath_pkt_test to do one of the following:

• Test the InfiniBand link and bandwidth between two InfiniPath InfiniBand adapters.
• Using an InfiniBand loopback connector, test the link and bandwidth within a single InfiniPath InfiniBand adapter.


The ipath_pkt_test program runs in either ping-pong mode (send a packet, wait for a reply, repeat) or in stream mode (send packets as quickly as possible, receive responses as they come back).

Upon completion, the sending side prints statistics on the packet bandwidth, showing both the payload bandwidth and the total bandwidth (including InfiniBand and InfiniPath headers). See the man page for more information.

ipathstats

The ipathstats program is useful for diagnosing InfiniPath problems, particularly those that are performance related. It is installed from the infinipath RPM. It displays both driver statistics and hardware counters, including both performance and "error" (including status) counters.

Running ipathstats -c 10, for example, displays the number of packets and 32-bit words of data being transferred on a node in each 10-second interval. This output may show differences in traffic patterns on different nodes, or at different stages of execution. See the man page for more information.

lsmod

When you need to find which InfiniPath and OpenFabrics modules are running, type the following command:

# lsmod | egrep 'ipath_|ib_|rdma_|findex'
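To check for the qib driver module specifically, a simple variant is:

# lsmod | grep -w ib_qib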

modprobe

Use this program to load/unload the drivers. You can check to see if the driver has loaded by using this command:

# modprobe -v ib_qib

The -v option typically only prints messages if there are problems.

The configuration file that modprobe uses is /etc/modprobe.conf (/etc/modprobe.conf.local on SLES). In this file, various options and naming aliases can be set.
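To unload the driver, modprobe's standard remove flag can be used (make sure nothing is using the adapter first):

# modprobe -r ib_qib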

mpirun

mpirun determines whether the program is being run against a QLogic or non-QLogic driver. It is installed from the mpi-frontend RPM. Sample commands and results are shown in the following paragraphs.

QLogic-built:

$ mpirun -np 2 -m /tmp/id1 -d0x101 mpi_latency 1 0
asus-01:0.ipath_setaffinity: Set CPU affinity to 1, port 0:2:0 (1 active chips)
asus-01:0.ipath_userinit: Driver is QLogic-built


Non-QLogic built:

$ mpirun -np 2 -m /tmp/id1 -d0x101 mpi_latency 1 0
asus-01:0.ipath_setaffinity: Set CPU affinity to 1, port 0:2:0 (1 active chips)
asus-01:0.ipath_userinit: Driver is not QLogic-built

mpi_stress

This is an MPI stress test program designed to load up an MPI interconnect with point-to-point messages while optionally checking for data integrity. By default, it runs with all-to-all traffic patterns, optionally including oneself and one's local shared memory (shm) peers. It can also be set up with multi-dimensional grid traffic patterns; this can be parameterized to run rings, open 2D grids, closed 2D grids, cubic lattices, hypercubes, and so on.

Optionally, the message data can be randomized and checked using CRC checksums (strong but slow) or XOR checksums (weak but fast). The communication kernel is built out of non-blocking point-to-point calls to load up the interconnect. The program is not designed to exhaustively test out different MPI primitives. Performance metrics are displayed, but should be carefully interpreted in terms of the features enabled.

This is an MPI application and should be run under mpirun or its equivalent.

The following example runs 16 processes and a specified hosts file using the default options (all-to-all connectivity, 64 to 4 MB messages in powers of two, one iteration, no data integrity checking):

$ mpirun -np 16 -m hosts mpi_stress

There are a number of options for mpi_stress; this one may be particularly useful:

-P
This option poisons receive buffers at initialization and after each receive; that is, buffers are pre-initialized with random data so that any parts that are not correctly updated with received data can be observed later.
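For example, the default run shown earlier can be combined with this option (a sketch; all other options are left at their defaults):

$ mpirun -np 16 -m hosts mpi_stress -P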

See the mpi_stress(1) man page for more information.

rpm

To check the contents of an installed RPM, use these commands:

$ rpm -qa infinipath\* mpi-\*
$ rpm -q --info infinipath # (etc)

The option -q queries. The option -qa queries all. To query a package that has not yet been installed, use the -qpl option.
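For example, to list the files inside a downloaded, not-yet-installed package file (the filename here is illustrative):

$ rpm -qpl infinipath-*.rpm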


strings

Use the strings command to determine the content of, and extract text from, a binary file. For example, the command:

$ strings -a /usr/lib/libinfinipath.so.4.0 | grep Date:

produces this output:

$Date: 2009-02-26 12:05 Release2.3 InfiniPath $

NOTE:
The strings command is part of binutils (a development RPM), and may not be available on all machines.

Common Tasks and Commands

Table I-3 lists some common commands that help with administration and troubleshooting. Note that mpirun in nonmpi mode can perform a number of checks.

Table I-3. Common Tasks and Commands Summary

Function                          Command

Check the system state            ipath_checkout [options] hostsfile
                                  ipathbug-helper -m hostsfile \
                                    > ipath-info-allhosts
                                  mpirun -m hostsfile -ppn 1 \
                                    -np numhosts -nonmpi ipath_control -i
                                  Also see the file:
                                  /sys/class/infiniband/ipath*/device/status_str
                                  where * is the unit number. This file provides
                                  information about the link state, possible
                                  cable/switch problems, and hardware errors.

Verify hosts via an               ipath_checkout --run=1 hostsfile
Ethernet ping

Verify ssh                        ipath_checkout --run=2 hostsfile

Show uname -a for all hosts       mpirun -m hostsfile -ppn 1 \
                                    -np numhosts -nonmpi uname -a

Reboot hosts                      As a root user:
                                  mpirun -m hostsfile -ppn 1 \
                                    -np numhosts -nonmpi reboot

Run a command on all hosts        mpirun -m hostsfile -ppn 1 \
                                    -np numhosts -nonmpi <command>
                                  Examples:
                                  mpirun -m hostsfile -ppn 1 \
                                    -np numhosts -nonmpi hostname
                                  mpirun -m hostsfile -ppn 1 \
                                    -np numhosts -nonmpi date

Copy a file to all hosts          Using bash:
                                  $ for i in $( cat hostsfile )
                                    do
                                      scp <file> $i:
                                    done

Summarize the fabric              ipathbug-helper -m hostsfile \
components                          > ipath-info-allhosts

Show the status of host           ipathbug-helper -m hostsfile \
InfiniBand ports                    > ipath-info-allhosts
                                  mpirun -m hostsfile -ppn 1 \
                                    -np numhosts -nonmpi ipath_control -i

Verify that the hosts see         ipath_checkout --run=5 hostsfile
each other

Check MPI performance             ipath_checkout --run=7 hostsfile

Generate all hosts problem        ipathbug-helper -m hostsfile \
report information                  > ipath-info-allhosts

Table Notes
The " \ " indicates commands that are broken across multiple lines.

Summary and Descriptions of Useful Files

Useful files are summarized in Table I-4. Names in blue text are linked to a corresponding section that provides further details.

Table I-4. Useful Files

File Name             Function

boardversion          File that shows the version of the chip
                      architecture.

status_str            File that verifies that the InfiniPath software is
                      loaded and functioning.

/var/log/messages     Logfile where various programs write messages.
                      Tracks activity on your system.

version               File that provides version information of installed
                      software/drivers.

boardversion

It is useful to keep track of the current version of the chip architecture. You can check the version by looking in this file:

/sys/class/infiniband/qib0/device/boardversion
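For example, you can display the file with cat:

$ cat /sys/class/infiniband/qib0/device/boardversion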

Example contents are:

ChipABI 2.0,InfiniPath_QLE7280,InfiniPath1 5.2,PCI 2,SW Compat 2

This information is useful for reporting problems to Technical Support.

status_str

NOTE:
This file returns information on which form factor adapter is installed. The PCIe half-height, short form factor adapters are referred to as the QLE7140, QLE7240, QLE7280, QLE7340, or QLE7342.

Check the file status_str to verify that the InfiniPath software is loaded and functioning. The file is located here:

/sys/class/infiniband/qib/device/status_str

Table I-5 shows the possible contents of the file, with brief explanations of the entries.

Table I-5. status_str File Contents

File Contents           Description

Initted                 The driver has loaded and successfully initialized
                        the IBA6110 or IBA7220 ASIC.

Present                 The IBA6110 or IBA7220 ASIC has been detected (but
                        not initialized unless Initted is also present).

IB_link_up              The InfiniBand link has been configured and is in
                        the active state; packets can be sent and
                        received.

IB_configured           The InfiniBand link has been configured. It may or
                        may not be up and usable.

NOIBcable               Unable to detect link present. This problem can be
                        caused by one of the following problems with the
                        QLE7140, QLE7240, or QLE7280 adapters:
                        • No cable is plugged into the adapter.
                        • The adapter is connected to something other than
                          another InfiniBand device, or the connector is
                          not fully seated.
                        • The switch where the adapter is connected is
                          down.

Fatal_Hardware_Error    Check the system log (default is
                        /var/log/messages) for more information, then call
                        Technical Support.

This same directory contains other files with information related to status. These files are summarized in Table I-6.

Table I-6. Status—Other Files

File Name     Contents

lid           InfiniBand LID. The address on the InfiniBand fabric,
              similar conceptually to an IP address for TCP/IP. Local
              refers to it being unique only within a single InfiniBand
              fabric.

mlid          The Multicast Local ID (MLID), for InfiniBand multicast.
              Used for InfiniPath ether broadcasts, since InfiniBand has
              no concept of broadcast.

guid          The GUID for the InfiniPath chip; it is equivalent to a MAC
              address.

nguid         The number of GUIDs that are used. If nguids == 2 and two
              chips are discovered, the first chip is assigned the
              requested GUID (from eeprom, or ipath_sma), and the second
              chip is assigned GUID+1.

serial        The serial number of the QLE7140, QLE7240, or QLE7280
              adapter.

unit          A unique number for each card or chip in a system.

status        The numeric version of the status_str file, described in
              Table I-5.

version

You can check the version of the installed InfiniPath software by looking in:

/sys/class/infiniband/qib0/device/driver/version

QLogic-built drivers have contents similar to:

$Id: QLogic OFED Release 1.4.2 $ $Date: Fri Feb 27 16:14:31 PST 2009 $

Non-QLogic-built drivers (in this case kernel.org) have contents similar to:

$Id: QLogic kernel.org driver $

Summary of Configuration Files

Table I-7 contains descriptions of the configuration and configuration template files used by the InfiniPath and OpenFabrics software.

Table I-7. Configuration Files

Configuration File Name                Description

/etc/infiniband/qlgc_vnic.cfg          VirtualNIC configuration file. Create
                                       this file after running
                                       ib_qlgc_vnic_query to get the
                                       information you need. This file was
                                       named
                                       /etc/infiniband/qlogic_vnic.cfg or
                                       /etc/sysconfig/ics_inic.cfg in
                                       previous releases. See the sample
                                       file qlgc_vnic.cfg.sample (described
                                       later) to see how it should be set
                                       up.

/etc/modprobe.conf                     Specifies options for modules when
                                       added or removed by the modprobe
                                       command. Also used for creating
                                       aliases. The PAT write-combining
                                       option is set here.
                                       For Red Hat systems.

/etc/modprobe.conf.local               Specifies options for modules when
                                       added or removed by the modprobe
                                       command. Also used for creating
                                       aliases. The PAT write-combining
                                       option is set here.
                                       For SLES systems.

/etc/infiniband/openib.conf            The primary configuration file for
                                       InfiniPath, OFED modules, and other
                                       modules and associated daemons.
                                       Automatically loads additional
                                       modules or changes IPoIB transport
                                       type.

/etc/sysconfig/infinipath              Contains settings, including the one
                                       that sets the ipath_mtrr script to
                                       run on reboot.

/etc/sysconfig/network/                Network configuration file for
ifcfg-<NAME>                           network interfaces. When used for
                                       VNIC configuration, <NAME> is in the
                                       form eiocX, where X is the device
                                       number. There will be one interface
                                       configuration file for each
                                       interface defined in
                                       /etc/infiniband/qlgc_vnic.cfg.
                                       For SLES systems.

/etc/sysconfig/network-scripts/        Network configuration file for
ifcfg-<NAME>                           network interfaces. When used for
                                       VNIC configuration, <NAME> is in the
                                       form eiocX, where X is the device
                                       number. There will be one interface
                                       configuration file for each
                                       interface defined in
                                       /etc/infiniband/qlgc_vnic.cfg.
                                       For Red Hat systems.

Sample and Template Files              Description

qlgc_vnic.cfg.sample                   Sample VNIC config file. It can be
                                       found with the OFED documentation,
                                       or in the qlgc_vnictools
                                       subdirectory of the QLogic OFED Host
                                       Software download. It is also
                                       installed in /etc/infiniband.

/usr/share/doc/initscripts-*/          File that explains many of the
sysconfig.txt                          entries in the configuration files.
                                       For Red Hat systems.
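As an illustration of the ifcfg-<NAME> convention described above, a minimal Red Hat-style file for a first VNIC interface might look like the following (the interface name and addresses are placeholders, not a tested configuration):

# /etc/sysconfig/network-scripts/ifcfg-eioc1
DEVICE=eioc1
ONBOOT=yes
BOOTPROTO=static
IPADDR=172.26.48.132
NETMASK=255.255.255.0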



J Recommended Reading

Reference material for further reading is provided in this appendix.

References for MPI

The MPI Standard specification documents are located at:
http://www.mpi-forum.org/docs

The MPICH implementation of MPI and its documentation are located at:
http://www-unix.mcs.anl.gov/mpi/mpich/

The ROMIO distribution and its documentation are located at:
http://www.mcs.anl.gov/romio

Books for Learning MPI Programming

Gropp, William, Ewing Lusk, and Anthony Skjellum, Using MPI, Second Edition, 1999, MIT Press, ISBN 0-262-57134-X

Gropp, William, Ewing Lusk, and Rajeev Thakur, Using MPI-2, 1999, MIT Press, ISBN 0-262-57133-1

Pacheco, Peter S., Parallel Programming with MPI, 1997, Morgan Kaufmann Publishers, ISBN 1-55860

Reference and Source for SLURM

The open-source resource manager designed for Linux clusters is located at:
http://www.llnl.gov/linux/slurm/

InfiniBand

The InfiniBand specification can be found at the InfiniBand Trade Association site:
http://www.infinibandta.org/

OpenFabrics

Information about the Open InfiniBand Alliance is located at:
http://www.openfabrics.org


Clusters

Gropp, William, Ewing Lusk, and Thomas Sterling, Beowulf Cluster Computing with Linux, Second Edition, 2003, MIT Press, ISBN 0-262-69292-9

Networking

The Internet Frequently Asked Questions (FAQ) archives contain an extensive Request for Comments (RFC) section. Numerous documents on networking and configuration can be found at:
http://www.faqs.org/rfcs/index.html

Rocks

Extensive documentation on installing Rocks and custom Rolls can be found at:
http://www.rocksclusters.org/

Other Software Packages

Environment Modules is a popular package to maintain multiple concurrent versions of software packages and is available from:
http://modules.sourceforge.net/
J-2 D000046-005 B


Corporate Headquarters: QLogic Corporation, 26650 Aliso Viejo Parkway, Aliso Viejo, CA 92656, 949.389.6000, www.qlogic.com
International Offices: UK | Ireland | Germany | India | Japan | China | Hong Kong | Singapore | Taiwan

© 2010 QLogic Corporation. Specifications are subject to change without notice. All rights reserved worldwide. QLogic and the QLogic logo are registered trademarks of QLogic Corporation. All other brand and product names are trademarks or registered trademarks of their respective owners. Information supplied by QLogic Corporation is believed to be accurate and reliable. QLogic Corporation assumes no responsibility for any errors in this brochure. QLogic Corporation reserves the right, without notice, to make changes in product design or specifications.
