QLogic OFED+ Host Software User Guide, Rev. B
QLogic OFED+ Host Software User Guide
QLogic OFED+ Version 1.5
D000046-005 B
Information furnished in this manual is believed to be accurate and reliable. However, QLogic Corporation assumes no responsibility for its use, nor for any infringements of patents or other rights of third parties which may result from its use. QLogic Corporation reserves the right to change product specifications at any time without notice. Applications described in this document for any of these products are for illustrative purposes only. QLogic Corporation makes no representation or warranty that such applications are suitable for the specified use without further testing or modification. QLogic Corporation assumes no responsibility for any errors that may appear in this document.

No part of this document may be copied or reproduced by any means, or translated or transmitted to any magnetic medium, without the express written consent of QLogic Corporation. In accordance with the terms of their valid QLogic agreements, customers are permitted to make electronic and paper copies of this document for their own exclusive use.
Rev. B, November 2010

Document Revision History

Changes                                            Sections Affected
Added IB Bonding sub-section                       Section 3, page 3-6
Added IPATH_HCA_SELECTION_ALG to                   "Environment Variables" on page 4-20
Table 4-7, Environment Variables
Table of Contents

1 Introduction
    How this Guide is Organized . . . . . . . . . . 1-1
    Overview . . . . . . . . . . 1-2
    Interoperability . . . . . . . . . . 1-3

2 Step-by-Step Cluster Setup and MPI Usage Checklists
    Cluster Setup . . . . . . . . . . 2-1
    Using MPI . . . . . . . . . . 2-2

3 TrueScale Cluster Setup and Administration
    Introduction . . . . . . . . . . 3-1
    Installed Layout . . . . . . . . . . 3-2
    TrueScale and OpenFabrics Driver Overview . . . . . . . . . . 3-3
    IPoIB Network Interface Configuration . . . . . . . . . . 3-4
    IPoIB Administration . . . . . . . . . . 3-5
        Administering IPoIB . . . . . . . . . . 3-5
            Stopping, Starting and Restarting the IPoIB Driver . . . . . . . . . . 3-5
        Configuring IPoIB . . . . . . . . . . 3-6
            Editing the IPoIB Configuration File . . . . . . . . . . 3-6
    IB Bonding . . . . . . . . . . 3-6
        Interface Configuration Scripts . . . . . . . . . . 3-7
            Red Hat EL4 Update 8 . . . . . . . . . . 3-7
            Red Hat EL5, All Updates . . . . . . . . . . 3-8
            SuSE Linux Enterprise Server (SLES) 10 and 11 . . . . . . . . . . 3-9
        Verify IB Bonding is Configured . . . . . . . . . . 3-10
    Subnet Manager Configuration . . . . . . . . . . 3-12
    QLogic Distributed Subnet Administration . . . . . . . . . . 3-13
        Applications that use Distributed SA . . . . . . . . . . 3-14
        Virtual Fabrics and the Distributed SA . . . . . . . . . . 3-14
        Configuring the Distributed SA . . . . . . . . . . 3-14
        Default Configuration . . . . . . . . . . 3-14
        Multiple Virtual Fabrics Example . . . . . . . . . . 3-15
        Virtual Fabrics with Overlapping Definitions . . . . . . . . . . 3-16
        Distributed SA Configuration File . . . . . . . . . . 3-19
            SID . . . . . . . . . . 3-19
            ScanFrequency . . . . . . . . . . 3-20
            LogFile . . . . . . . . . . 3-20
            Dbg . . . . . . . . . . 3-20
            Other Settings . . . . . . . . . . 3-21
    MPI over uDAPL . . . . . . . . . . 3-21
    Changing the MTU Size . . . . . . . . . . 3-21
    Managing the TrueScale Driver . . . . . . . . . . 3-22
        Configure the TrueScale Driver State . . . . . . . . . . 3-23
        Start, Stop, or Restart TrueScale . . . . . . . . . . 3-23
        Unload the Driver/Modules Manually . . . . . . . . . . 3-24
        TrueScale Driver Filesystem . . . . . . . . . . 3-24
    More Information on Configuring and Loading Drivers . . . . . . . . . . 3-25
    Performance Settings and Management Tips . . . . . . . . . . 3-25
        Homogeneous Nodes . . . . . . . . . . 3-26
        Adapter and Other Settings . . . . . . . . . . 3-26
        Remove Unneeded Services . . . . . . . . . . 3-28
        Disable Powersaving Features . . . . . . . . . . 3-29
        Hyper-Threading . . . . . . . . . . 3-29
    Host Environment Setup for MPI . . . . . . . . . . 3-29
        Configuring for ssh . . . . . . . . . . 3-30
            Configuring ssh and sshd Using shosts.equiv . . . . . . . . . . 3-30
            Configuring for ssh Using ssh-agent . . . . . . . . . . 3-32
        Process Limitation with ssh . . . . . . . . . . 3-33
    Checking Cluster and Software Status . . . . . . . . . . 3-34
        ipath_control . . . . . . . . . . 3-34
        iba_opp_query . . . . . . . . . . 3-35
        ibstatus . . . . . . . . . . 3-36
        ibv_devinfo . . . . . . . . . . 3-36
        ipath_checkout . . . . . . . . . . 3-37

4 Running QLogic MPI on QLogic Adapters
    Introduction . . . . . . . . . . 4-1
        QLogic MPI . . . . . . . . . . 4-1
        PSM . . . . . . . . . . 4-2
        Other MPIs . . . . . . . . . . 4-2
        Linux File I/O in MPI Programs . . . . . . . . . . 4-2
        MPI-IO with ROMIO . . . . . . . . . . 4-3
    Getting Started with MPI . . . . . . . . . . 4-3
        Copy Examples . . . . . . . . . . 4-3
        Create the mpihosts File . . . . . . . . . . 4-3
        Compile and Run an Example C Program . . . . . . . . . . 4-4
        Examples Using Other Programming Languages . . . . . . . . . . 4-5
    QLogic MPI Details . . . . . . . . . . 4-6
        Use Wrapper Scripts for Compiling and Linking . . . . . . . . . . 4-7
        Configuring MPI Programs for QLogic MPI . . . . . . . . . . 4-8
        To Use Another Compiler . . . . . . . . . . 4-9
            Compiler and Linker Variables . . . . . . . . . . 4-10
        Process Allocation . . . . . . . . . . 4-11
            TrueScale Hardware Contexts on the DDR and QDR InfiniBand Adapters . . . . . . . . . . 4-12
            Enabling and Disabling Software Context Sharing . . . . . . . . . . 4-13
            Restricting TrueScale Hardware Contexts in a Batch Environment . . . . . . . . . . 4-13
            Context Sharing Error Messages . . . . . . . . . . 4-14
            Running in Shared Memory Mode . . . . . . . . . . 4-14
        mpihosts File Details . . . . . . . . . . 4-15
        Using mpirun . . . . . . . . . . 4-16
        Console I/O in MPI Programs . . . . . . . . . . 4-18
        Environment for Node Programs . . . . . . . . . . 4-19
        Environment Variables . . . . . . . . . . 4-20
        Running Multiple Versions of TrueScale or MPI . . . . . . . . . . 4-22
        Job Blocking in Case of Temporary InfiniBand Link Failures . . . . . . . . . . 4-23
    Performance Tuning . . . . . . . . . . 4-23
        CPU Affinity . . . . . . . . . . 4-23
        mpirun Tunable Options . . . . . . . . . . 4-24
    MPD . . . . . . . . . . 4-24
        MPD Description . . . . . . . . . . 4-24
        Using MPD . . . . . . . . . . 4-25
    QLogic MPI and Hybrid MPI/OpenMP Applications . . . . . . . . . . 4-25
    Debugging MPI Programs . . . . . . . . . . 4-26
        MPI Errors . . . . . . . . . . 4-26
        Using Debuggers . . . . . . . . . . 4-27
    QLogic MPI Limitations . . . . . . . . . . 4-28

5 Using Other MPIs
    Introduction . . . . . . . . . . 5-1
    Installed Layout . . . . . . . . . . 5-2
    Open MPI . . . . . . . . . . 5-3
        Installation . . . . . . . . . . 5-3
        Setup . . . . . . . . . . 5-3
        Compiling Open MPI Applications . . . . . . . . . . 5-3
        Running Open MPI Applications . . . . . . . . . . 5-4
        Further Information on Open MPI . . . . . . . . . . 5-5
    MVAPICH . . . . . . . . . . 5-5
        Installation . . . . . . . . . . 5-5
        Setup . . . . . . . . . . 5-5
        Compiling MVAPICH Applications . . . . . . . . . . 5-5
        Running MVAPICH Applications . . . . . . . . . . 5-6
        Further Information on MVAPICH . . . . . . . . . . 5-6
    Managing Open MPI, MVAPICH, and QLogic MPI with the mpi-selector Utility . . . . . . . . . . 5-6
    HP-MPI and Platform MPI 7 . . . . . . . . . . 5-8
        Installation . . . . . . . . . . 5-8
        Setup . . . . . . . . . . 5-8
        Compiling Platform MPI 7 Applications . . . . . . . . . . 5-8
        Running Platform MPI 7 Applications . . . . . . . . . . 5-9
        More Information on Platform MPI 7 . . . . . . . . . . 5-9
    Platform (Scali) MPI 5.6 . . . . . . . . . . 5-9
        Installation . . . . . . . . . . 5-9
        Setup . . . . . . . . . . 5-9
        Compiling Platform MPI 5.6 Applications . . . . . . . . . . 5-9
        Running Platform MPI 5.6 Applications . . . . . . . . . . 5-10
        Further Information on Platform MPI 5.6 . . . . . . . . . . 5-10
    Intel MPI . . . . . . . . . . 5-11
        Installation . . . . . . . . . . 5-11
        Setup . . . . . . . . . . 5-11
        Compiling Intel MPI Applications . . . . . . . . . . 5-13
        Running Intel MPI Applications . . . . . . . . . . 5-13
        Further Information on Intel MPI . . . . . . . . . . 5-14
    Improving Performance of Other MPIs Over InfiniBand Verbs . . . . . . . . . . 5-14

6 Performance Scaled Messaging
    Introduction . . . . . . . . . . 6-1
    Virtual Fabric Support . . . . . . . . . . 6-2
    Using SL and PKeys . . . . . . . . . . 6-2
    Using Service ID . . . . . . . . . . 6-3
    SL2VL mapping from the Fabric Manager . . . . . . . . . . 6-3
    Verifying SL2VL tables on QLogic 7300 Series Adapters . . . . . . . . . . 6-4
7 Dispersive Routing

8 gPXE
    gPXE Setup . . . . . . . . . . 8-1
        Required Steps . . . . . . . . . . 8-2
    Preparing the DHCP Server in Linux . . . . . . . . . . 8-2
        Installing DHCP . . . . . . . . . . 8-2
        Configuring DHCP . . . . . . . . . . 8-3
    Netbooting Over InfiniBand . . . . . . . . . . 8-4
        Prerequisites . . . . . . . . . . 8-5
        Boot Server Setup . . . . . . . . . . 8-5
        Steps on the gPXE Client . . . . . . . . . . 8-14
    HTTP Boot Setup . . . . . . . . . . 8-14

A mpirun Options Summary
    Job Start Options . . . . . . . . . . A-1
    Essential Options . . . . . . . . . . A-1
    Spawn Options . . . . . . . . . . A-2
    Quiescence Options . . . . . . . . . . A-3
    Verbosity Options . . . . . . . . . . A-3
    Startup Options . . . . . . . . . . A-3
    Stats Options . . . . . . . . . . A-4
    Tuning Options . . . . . . . . . . A-5
    Shell Options . . . . . . . . . . A-6
    Debug Options . . . . . . . . . . A-6
    Format Options . . . . . . . . . . A-7
    Other Options . . . . . . . . . . A-7

B Benchmark Programs
    Benchmark 1: Measuring MPI Latency Between Two Nodes . . . . . . . . . . B-1
    Benchmark 2: Measuring MPI Bandwidth Between Two Nodes . . . . . . . . . . B-3
    Benchmark 3: Messaging Rate Microbenchmarks . . . . . . . . . . B-4
    Benchmark 4: Measuring MPI Latency in Host Rings . . . . . . . . . . B-5

C VirtualNIC Interface Configuration and Administration
    VirtualNIC Interface Configuration and Administration . . . . . . . . . . C-1
        Getting Information about Ethernet IOCs on the Fabric . . . . . . . . . . C-1
        Editing the VirtualNIC Configuration File . . . . . . . . . . C-4
            Format 1: Defining an IOC using the IOCGUID . . . . . . . . . . C-5
            Format 2: Defining an IOC using the IOCSTRING . . . . . . . . . . C-6
            Format 3: Starting VNIC using DGID . . . . . . . . . . C-6
        VirtualNIC Failover Definition . . . . . . . . . . C-7
            Failover to a different Adapter port on the same Adapter . . . . . . . . . . C-7
            Failover to a different Ethernet port on the same Ethernet gateway . . . . . . . . . . C-7
            Failover to a port on a different Ethernet gateway . . . . . . . . . . C-8
            Combination method . . . . . . . . . . C-8
        Creating VirtualNIC Ethernet Interface Configuration Files . . . . . . . . . . C-8
        VirtualNIC Multicast . . . . . . . . . . C-9
        Starting, Stopping and Restarting the VirtualNIC Driver . . . . . . . . . . C-12
        Configuring Link Aggregation . . . . . . . . . . C-14
        Troubleshooting . . . . . . . . . . C-14
        VirtualNIC Configuration Variables . . . . . . . . . . C-14

D SRP Configuration
    SRP Configuration Overview . . . . . . . . . . D-1
        Important Concepts . . . . . . . . . . D-1
    QLogic SRP Configuration . . . . . . . . . . D-2
        Stopping, Starting and Restarting the SRP Driver . . . . . . . . . . D-3
        Specifying a Session . . . . . . . . . . D-3
            Determining the values to use for the configuration . . . . . . . . . . D-6
            Specifying an SRP Initiator Port of a Session by Card and Port Indexes . . . . . . . . . . D-8
            Specifying an SRP Initiator Port of a Session by Port GUID . . . . . . . . . . D-8
        Specifying an SRP Target Port . . . . . . . . . . D-9
            Specifying an SRP Target Port of a Session by IOCGUID . . . . . . . . . . D-10
            Specifying an SRP Target Port of a Session by Profile String . . . . . . . . . . D-10
        Specifying an Adapter . . . . . . . . . . D-10
        Restarting the SRP Module . . . . . . . . . . D-10
        Configuring an Adapter with Multiple Sessions . . . . . . . . . . D-11
        Configuring Fibre Channel Failover . . . . . . . . . . D-13
            Failover Configuration File 1: Failing over from one SRP Initiator port to another . . . . . . . . . . D-14
            Failover Configuration File 2: Failing over from a port on the VIO hardware card to another port on the VIO hardware card . . . . . . . . . . D-15
            Failover Configuration File 3: Failing over from a port on a VIO hardware card to a port on a different VIO hardware card within the same Virtual I/O chassis . . . . . . . . . . D-16
            Failover Configuration File 4: Failing over from a port on a VIO hardware card to a port on a different VIO hardware card in a different Virtual I/O chassis . . . . . . . . . . D-17
        Configuring Fibre Channel Load Balancing . . . . . . . . . . D-18
            1 Adapter Port and 2 Ports on a Single VIO . . . . . . . . . . D-18
            2 Adapter Ports and 2 Ports on a Single VIO Module . . . . . . . . . . D-19
            Using the roundrobinmode Parameter . . . . . . . . . . D-20
        Configuring SRP for Native InfiniBand Storage . . . . . . . . . . D-21
            Notes . . . . . . . . . . D-23
            Additional Details . . . . . . . . . . D-23
            Troubleshooting . . . . . . . . . . D-23
    OFED SRP Configuration . . . . . . . . . . D-24
E Integration with a Batch Queuing System
    Using mpiexec with PBS . . . . . . . . . . E-1
    Using SLURM for Batch Queuing . . . . . . . . . . E-2
        Allocating Resources . . . . . . . . . . E-3
        Generating the mpihosts File . . . . . . . . . . E-3
        Simple Process Management . . . . . . . . . . E-4
        Clean Termination of MPI Processes . . . . . . . . . . E-4
    Lock Enough Memory on Nodes when Using SLURM . . . . . . . . . . E-5

F Troubleshooting
    Using LEDs to Check the State of the Adapter . . . . . . . . . . F-1
    BIOS Settings . . . . . . . . . . F-2
    Kernel and Initialization Issues . . . . . . . . . . F-2
        Driver Load Fails Due to Unsupported Kernel . . . . . . . . . . F-3
        Rebuild or Reinstall Drivers if Different Kernel Installed . . . . . . . . . . F-3
        InfiniPath Interrupts Not Working . . . . . . . . . . F-3
        OpenFabrics Load Errors if ib_qib Driver Load Fails . . . . . . . . . . F-4
        InfiniPath ib_qib Initialization Failure . . . . . . . . . . F-5
        MPI Job Failures Due to Initialization Problems . . . . . . . . . . F-6
    OpenFabrics and InfiniPath Issues . . . . . . . . . . F-6
        Stop InfiniPath Services Before Stopping/Restarting InfiniPath . . . . . . . . . . F-6
        Manual Shutdown or Restart May Hang if NFS in Use . . . . . . . . . . F-7
        Load and Configure IPoIB Before Loading SDP . . . . . . . . . . F-7
        Set $IBPATH for OpenFabrics Scripts . . . . . . . . . . F-7
        SDP Module Not Loading . . . . . . . . . . F-7
        ibsrpdm Command Hangs when Two Host Channel Adapters are Installed but Only Unit 1 is Connected to the Switch . . . . . . . . . . F-8
        Outdated ipath_ether Configuration Setup Generates Error . . . . . . . . . . F-8
    System Administration Troubleshooting . . . . . . . . . . F-8
        Broken Intermediate Link . . . . . . . . . . F-8
    Performance Issues . . . . . . . . . . F-9
        Large Message Receive Side Bandwidth Varies with Socket Affinity on Opteron Systems . . . . . . . . . . F-9
        Erratic Performance . . . . . . . . . . F-9
            Method 1 . . . . . . . . . . F-10
            Method 2 . . . . . . . . . . F-10
        Performance Warning if ib_qib Shares Interrupts with eth0 . . . . . . . . . . F-11
    QLogic MPI Troubleshooting . . . . . . . . . . F-11
        Mixed Releases of MPI RPMs . . . . . . . . . . F-12
        Missing mpirun Executable . . . . . . . . . . F-12
        Resolving Hostname with Multi-Homed Head Node . . . . . . . . . . F-13
        Cross-Compilation Issues . . . . . . . . . . F-13
        Compiler/Linker Mismatch . . . . . . . . . . F-14
        Compiler Cannot Find Include, Module, or Library Files . . . . . . . . . . F-14
            Compiling on Development Nodes . . . . . . . . . . F-15
            Specifying the Run-time Library Path . . . . . . . . . . F-15
        Problem with Shell Special Characters and Wrapper Scripts . . . . . . . . . . F-16
        Run Time Errors with Different MPI Implementations . . . . . . . . . . F-17
        Process Limitation with ssh . . . . . . . . . . F-19
        Number of Processes Exceeds ulimit for Number of Open Files . . . . . . . . . . F-19
        Using MPI.mod Files . . . . . . . . . . F-20
        Extending MPI Modules . . . . . . . . . . F-20
        Lock Enough Memory on Nodes When Using a Batch Queuing System . . . . . . . . . . F-22
        Error Creating Shared Memory Object . . . . . . . . . . F-23
        gdb Gets SIG32 Signal Under mpirun -debug with the PSM Receive Progress Thread Enabled . . . . . . . . . . F-24
        General Error Messages . . . . . . . . . . F-25
        Error Messages Generated by mpirun . . . . . . . . . . F-25
            Messages from the QLogic MPI (InfiniPath) Library . . . . . . . . . . F-25
            MPI Messages . . . . . . . . . . F-27
            Driver and Link Error Messages Reported by MPI Programs . . . . . . . . . . F-29
        MPI Stats . . . . . . . . . . F-30
G<br />
ULP Troubleshooting<br />
Troubleshooting VirtualNIC and VIO Hardware Issues . . . . . . . . . . . . . . . . G-1<br />
x D000046-005
<strong>QLogic</strong> <strong>OFED+</strong> <strong>Host</strong> <strong>Software</strong> <strong>User</strong> <strong>Guide</strong><br />
<strong>QLogic</strong> <strong>OFED+</strong> Version 1.5<br />
Checking the logical connection between the InfiniBand <strong>Host</strong><br />
and the VIO hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-1<br />
Verify that the proper VirtualNIC driver is running . . . . . . . . . . . G-2<br />
Verifying that the qlgc_vnic.cfg file contains the correct<br />
information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-2<br />
Verifying that the host can communicate with the I/O<br />
Controllers (IOCs) of the VIO hardware . . . . . . . . . . . . . . . . . . G-3<br />
Checking the interface definitions on the host. . . . . . . . . . . . . . . . . . . G-6<br />
Interface does not show up in output of 'ifconfig' . . . . . . . . . . . . G-6<br />
Verify the physical connection between the VIO hardware and<br />
the Ethernet network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-7<br />
Troubleshooting SRP Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-8<br />
ib_qlgc_srp_stats showing session in disconnected state . . . . . G-8<br />
Session in 'Connection Rejected' state . . . . . . . . . . . . . . . . . . . . . . . . G-9<br />
Attempts to read or write to disk are unsuccessful . . . . . . . . . . . . . . . G-11<br />
Four sessions in a round-robin configuration are active . . . . . . . . . . . G-12<br />
Which port does a port GUID refer to . . . . . . . . . . . . . . . . . . . . . . . . G-12<br />
How does the user find a <strong>Host</strong> Channel Adapter port GUID . . . . . . . G-13<br />
Need to determine the SRP driver version . . . . . . . . . . . . . . . . . . . . . . . G-15<br />
H<br />
Write Combining<br />
Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . H-1<br />
Verify Write Combining is Working . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . H-1<br />
PAT and Write Combining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . H-2<br />
MTRR Mapping and Write Combining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . H-2<br />
Edit BIOS Settings to Fix MTRR Issues . . . . . . . . . . . . . . . . . . . . . . . H-2<br />
Use the ipath_mtrr Script to Fix MTRR Issues. . . . . . . . . . . . . . . . H-3<br />
I<br />
Useful Programs and Files<br />
Check Cluster Homogeneity with ipath_checkout . . . . . . . . . . . . . . . . . I-1<br />
Restarting InfiniPath . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-1<br />
Summary and Descriptions of Useful Programs . . . . . . . . . . . . . . . . . . . . . I-2<br />
dmesg . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-3<br />
iba_opp_query. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-4<br />
ibhosts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-7<br />
ibstatus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-7<br />
ibtracert . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-8<br />
ibv_devinfo. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-9<br />
ident . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-9<br />
ipathbug-helper . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-10<br />
ipath_checkout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-10<br />
ipath_control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-12<br />
ipath_mtrr. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-13<br />
ipath_pkt_test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-14<br />
ipathstats. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-15<br />
lsmod . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-15<br />
modprobe . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-15<br />
mpirun . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-15<br />
mpi_stress. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-16<br />
rpm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-16<br />
strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-17<br />
Common Tasks and Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-17<br />
Summary and Descriptions of Useful Files . . . . . . . . . . . . . . . . . . . . . . . . . I-18<br />
boardversion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-19<br />
status_str. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-19<br />
version . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-21<br />
Summary of Configuration Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-21<br />
J<br />
Recommended Reading<br />
References for MPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . J-1<br />
Books for Learning MPI Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . J-1<br />
Reference and Source for SLURM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . J-1<br />
InfiniBand . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . J-1<br />
OpenFabrics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . J-1<br />
Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . J-2<br />
Networking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . J-2<br />
Rocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . J-2<br />
Other <strong>Software</strong> Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . J-2<br />
List of Figures<br />
Figure<br />
Page<br />
3-1 <strong>QLogic</strong> <strong>OFED+</strong> <strong>Software</strong> Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-1<br />
3-2 Distributed SA Default Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-16<br />
3-3 Distributed SA Multiple Virtual Fabrics Example . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-17<br />
3-4 Distributed SA Multiple Virtual Fabrics Configured Example . . . . . . . . . . . . . . . . . . 3-17<br />
3-5 Virtual Fabrics with Overlapping Definitions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-18<br />
3-6 Virtual Fabrics with PSM_MPI Virtual Fabric Enabled . . . . . . . . . . . . . . . . . . . . . . . 3-18<br />
3-7 Virtual Fabrics with all SIDs assigned to PSM_MPI Virtual Fabric. . . . . . . . . . . . . . 3-19<br />
3-8 Virtual Fabrics with Unique Numeric Indexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-19<br />
C-1 Without IB_Multicast . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-10<br />
C-2 With IB_Multicast . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-11<br />
List of Tables<br />
Table<br />
Page<br />
4-1 <strong>QLogic</strong> MPI Wrapper Scripts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-7<br />
4-2 Command Line Options for Scripts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-7<br />
4-3 Intel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-9<br />
4-4 Portland Group (PGI) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-9<br />
4-5 PathScale Compiler Suite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-10<br />
4-6 Available Hardware and <strong>Software</strong> Contexts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-11<br />
4-7 Environment Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-20<br />
5-1 Other Supported MPI Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-1<br />
5-2 Open MPI Wrapper Scripts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-3<br />
5-3 MVAPICH Wrapper Scripts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-5<br />
5-4 Platform MPI 7 Wrapper Scripts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-8<br />
5-5 Platform MPI Wrapper Scripts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-10<br />
5-6 Intel MPI Wrapper Scripts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-13<br />
F-1 LED Link and Data Indicators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . F-1<br />
I-1 Useful Programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-2<br />
I-2 ipath_checkout Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-11<br />
I-3 Common Tasks and Commands Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-17<br />
I-4 Useful Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-19<br />
I-5 status_str File Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-19<br />
I-6 Status—Other Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-20<br />
I-7 Configuration Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-21<br />
Preface<br />
The <strong>QLogic</strong> <strong>OFED+</strong> <strong>Host</strong> <strong>Software</strong> <strong>User</strong> <strong>Guide</strong> shows end users how to use the<br />
installed software to set up the fabric. End users include both the cluster<br />
administrator and the Message-Passing Interface (MPI) application programmers,<br />
who have different but overlapping interests in the details of the technology.<br />
For specific instructions about installing the <strong>QLogic</strong> QLE7140, QLE7240,<br />
QLE7280, QLE7340, QLE7342, QMH7342, and QME7342 PCI Express ® (PCIe ® )<br />
adapters, see the <strong>QLogic</strong> InfiniBand Adapter Hardware Installation <strong>Guide</strong>; for<br />
the initial installation of the Fabric <strong>Software</strong>, see the <strong>QLogic</strong> Fabric <strong>Software</strong><br />
Installation <strong>Guide</strong>.<br />
Intended Audience<br />
This guide is intended for end users responsible for administration of a cluster<br />
network as well as for end users who want to use that cluster.<br />
This guide assumes that all users are familiar with cluster computing, that the<br />
cluster administrator is familiar with Linux ® administration, and that the application<br />
programmer is familiar with MPI, vFabrics, VNIC, SRP, and Distributed SA.<br />
Related Materials<br />
• <strong>QLogic</strong> InfiniBand Adapter Hardware Installation <strong>Guide</strong><br />
• <strong>QLogic</strong> Fabric <strong>Software</strong> Installation <strong>Guide</strong><br />
• Release Notes<br />
Documentation Conventions<br />
This guide uses the following documentation conventions:<br />
• NOTE: provides additional information.<br />
• CAUTION! indicates the presence of a hazard that has the potential of<br />
causing damage to data or equipment.<br />
• WARNING!! indicates the presence of a hazard that has the potential of<br />
causing personal injury.<br />
• Text in blue font indicates a hyperlink (jump) to a figure, table, or section in<br />
this guide, and links to Web sites are shown in underlined blue. For<br />
example:<br />
Table 9-2 lists problems related to the user interface and remote agent.<br />
See “Installation Checklist” on page 3-6.<br />
For more information, visit www.qlogic.com.<br />
• Text in bold font indicates user interface elements such as a menu items,<br />
buttons, check boxes, or column headings. For example:<br />
Click the Start button, point to Programs, point to Accessories, and<br />
then click Command Prompt.<br />
Under Notification Options, select the Warning Alarms check box.<br />
• Text in Courier font indicates a file name, directory path, or command line<br />
text. For example:<br />
To return to the root directory from anywhere in the file structure:<br />
Type cd /root and press ENTER.<br />
Enter the following command: sh ./install.bin<br />
• Key names and key strokes are indicated with UPPERCASE:<br />
Press CTRL+P.<br />
Press the UP ARROW key.<br />
• Text in italics indicates terms, emphasis, variables, or document titles. For<br />
example:<br />
For a complete listing of license agreements, refer to the <strong>QLogic</strong><br />
<strong>Software</strong> End <strong>User</strong> License Agreement.<br />
What are shortcut keys?<br />
To enter the date, type mm/dd/yyyy (where mm is the month, dd is the<br />
day, and yyyy is the year).<br />
• Topic titles between quotation marks identify related topics either within this<br />
manual, or in the online help that is referred to as the help system<br />
throughout this document.<br />
License Agreements<br />
Refer to the <strong>QLogic</strong> <strong>Software</strong> End <strong>User</strong> License Agreement for a complete listing<br />
of all license agreements affecting this product.<br />
Technical Support<br />
Availability<br />
Customers should contact their authorized maintenance provider for technical<br />
support of their <strong>QLogic</strong> InfiniBand products. <strong>QLogic</strong>-direct customers may contact<br />
<strong>QLogic</strong> Technical Support; others will be redirected to their authorized<br />
maintenance provider.<br />
Visit the <strong>QLogic</strong> support Web site listed in Contact Information for the latest<br />
firmware and software updates.<br />
<strong>QLogic</strong> Technical Support for products under warranty is available during local<br />
standard working hours excluding <strong>QLogic</strong> Observed Holidays.<br />
Training<br />
<strong>QLogic</strong> offers training for technical professionals for all iSCSI, InfiniBand, and<br />
Fibre Channel products. From the main <strong>QLogic</strong> web page at www.qlogic.com,<br />
click the Education and Resources tab at the top, then click the Education &<br />
Training tab on the left. The <strong>QLogic</strong> Global Training Portal offers online courses,<br />
certification exams, and scheduling of in-person training.<br />
Technical Certification courses include installation, maintenance, and<br />
troubleshooting of <strong>QLogic</strong> SAN products. Upon demonstrating knowledge using live<br />
equipment, <strong>QLogic</strong> awards a certificate identifying the student as a Certified<br />
Professional. The training professionals at <strong>QLogic</strong> may be reached by e-mail at<br />
training@qlogic.com.<br />
Contact Information<br />
Please feel free to contact your <strong>QLogic</strong> approved reseller or <strong>QLogic</strong> Technical<br />
Support at any phase of integration for assistance. <strong>QLogic</strong> Technical Support can<br />
be reached by the following methods:<br />
Web: http://support.qlogic.com<br />
Email: support@qlogic.com<br />
Knowledge Database<br />
The <strong>QLogic</strong> knowledge database is an extensive collection of <strong>QLogic</strong> product<br />
information that you can search for specific solutions. We are constantly adding to<br />
the collection of information in our database to provide answers to your most<br />
urgent questions. Access the database from the <strong>QLogic</strong> Support Center:<br />
http://support.qlogic.com.<br />
1 Introduction<br />
How this <strong>Guide</strong> is Organized<br />
The <strong>QLogic</strong> <strong>OFED+</strong> <strong>Host</strong> <strong>Software</strong> <strong>User</strong> <strong>Guide</strong> is organized into these sections:<br />
• Section 1, provides an overview and describes interoperability.<br />
• Section 2, describes how to set up your cluster to run high-performance MPI<br />
jobs.<br />
• Section 3, describes the lower levels of the supplied <strong>QLogic</strong> <strong>OFED+</strong> <strong>Host</strong><br />
software. This section is of interest to a TrueScale cluster administrator.<br />
• Section 4, helps the Message Passing Interface (MPI) programmer make the<br />
best use of the <strong>QLogic</strong> MPI implementation. Examples are provided for<br />
compiling and running MPI programs.<br />
• Section 5, gives examples for compiling and running MPI programs with<br />
other MPI implementations.<br />
• Section 6, describes <strong>QLogic</strong> Performance Scaled Messaging (PSM) that<br />
provides support for full Virtual Fabric (vFabric) integration, allowing users to<br />
specify InfiniBand Service Level (SL) and Partition Key (PKey), or to provide<br />
a configured Service ID (SID) to target a vFabric.<br />
• Section 7, describes dispersive routing in the InfiniBand ® fabric to avoid<br />
congestion hotspots by “spraying” messages across the multiple potential<br />
paths.<br />
• Section 8, describes open-source Preboot Execution Environment (gPXE)<br />
boot including installation and setup.<br />
• Appendix A, describes the most commonly used options to mpirun.<br />
• Appendix B, describes how to run <strong>QLogic</strong>’s performance measurement<br />
programs.<br />
• Appendix C, describes the VirtualNIC interface configuration and<br />
administration, providing virtual Ethernet connectivity.<br />
• Appendix D, describes SCSI RDMA Protocol (SRP) configuration that allows<br />
the SCSI protocol to run over InfiniBand for Storage Area Network (SAN)<br />
usage.<br />
• Appendix E, describes two methods the administrator can use to allow users<br />
to submit MPI jobs through batch queuing systems.<br />
• Appendix F, provides information for troubleshooting installation, cluster<br />
administration, and MPI.<br />
• Appendix G, provides information for troubleshooting the upper layer<br />
protocol utilities in the fabric.<br />
• Appendix H, provides instructions for checking write combining and for using<br />
the Page Attribute Table (PAT) and Memory Type Range Registers (MTRR).<br />
• Appendix I, contains useful programs and files for debugging, as well as<br />
commands for common tasks.<br />
• Appendix J, contains a list of useful web sites and documents for a further<br />
understanding of the InfiniBand fabric, and related information.<br />
In addition, the <strong>QLogic</strong> InfiniBand Adapter Hardware Installation <strong>Guide</strong> contains<br />
information on <strong>QLogic</strong> hardware installation and the <strong>QLogic</strong> Fabric <strong>Software</strong><br />
Installation <strong>Guide</strong> contains information on <strong>QLogic</strong> software installation.<br />
Overview<br />
The material in this documentation pertains to a <strong>QLogic</strong> OFED cluster. A cluster is<br />
defined as a collection of nodes, each attached to an InfiniBand-based fabric<br />
through the <strong>QLogic</strong> interconnect.<br />
The <strong>QLogic</strong> InfiniBand adapters are InfiniBand 4X adapters. The quad data rate<br />
(QDR) adapters (QLE7340, QLE7342, QMH7342, and QME7342) have a raw<br />
data rate of 40Gbps (data rate of 32Gbps). The double data rate (DDR) adapters<br />
(QLE7240 and QLE7280) have a raw data rate of 20Gbps (data rate of 16Gbps).<br />
The single data rate (SDR) adapters (QLE7140) have a raw data rate of 10Gbps<br />
(data rate of 8Gbps). The QLE7340, QLE7342, QMH7342, and QME7342<br />
adapters can also run in DDR or SDR mode, and the QLE7240 and QLE7280 can<br />
also run in SDR mode.<br />
The <strong>QLogic</strong> adapters utilize standard, off-the-shelf InfiniBand 4X switches and<br />
cabling. The <strong>QLogic</strong> interconnect is designed to work with all InfiniBand-compliant<br />
switches.<br />
NOTE:<br />
If you are using the QLE7240 or QLE7280, and want to use DDR mode,<br />
then DDR-capable switches must be used. Likewise, when using the<br />
QLE7300 series adapters in QDR mode, a QDR switch must be used.<br />
<strong>QLogic</strong> <strong>OFED+</strong> software is interoperable with other vendors’ IBTA compliant<br />
InfiniBand adapters running compatible OFED releases. There are several<br />
options for subnet management in your cluster:<br />
• An embedded subnet manager can be used in one or more managed<br />
switches. <strong>QLogic</strong> offers the <strong>QLogic</strong> Embedded Fabric Manager (FM) for<br />
both DDR and QDR switch product lines supplied by your InfiniBand switch<br />
vendor.<br />
• A host-based subnet manager can be used. <strong>QLogic</strong> provides the <strong>QLogic</strong><br />
Fabric Manager (FM), as a part of the <strong>QLogic</strong> InfiniBand Fabric Suite.<br />
Interoperability<br />
<strong>QLogic</strong> <strong>OFED+</strong> participates in the standard InfiniBand subnet management<br />
protocols for configuration and monitoring. Note that:<br />
• <strong>QLogic</strong> <strong>OFED+</strong> (including Internet Protocol over InfiniBand (IPoIB)) is<br />
interoperable with other vendors’ InfiniBand adapters running compatible<br />
OFED releases.<br />
• The <strong>QLogic</strong> MPI stack is not interoperable with other InfiniBand adapters<br />
and target channel adapters. Instead, it uses an InfiniBand-compliant,<br />
vendor-specific protocol that is highly optimized for <strong>QLogic</strong> MPI and the<br />
<strong>QLogic</strong> PSM API.<br />
• In addition to supporting MPI over verbs, <strong>QLogic</strong> provides a<br />
high-performance, InfiniBand-compliant, vendor-specific protocol known as<br />
PSM. MPIs run over PSM will not interoperate with other adapters.<br />
NOTE:<br />
See the OpenFabrics web site at www.openfabrics.org for more information<br />
on the OpenFabrics Alliance.<br />
2 Step-by-Step Cluster Setup<br />
and MPI Usage Checklists<br />
Cluster Setup<br />
This section describes how to set up your cluster to run high-performance<br />
Message Passing Interface (MPI) jobs.<br />
Perform the following tasks when setting up the cluster. These include BIOS,<br />
adapter, and system settings.<br />
1. Make sure that hardware installation has been completed according to the<br />
instructions in the <strong>QLogic</strong> InfiniBand Adapter Hardware Installation <strong>Guide</strong><br />
and software installation and driver configuration has been completed<br />
according to the instructions in the <strong>QLogic</strong> Fabric <strong>Software</strong> Installation<br />
<strong>Guide</strong>. To minimize management problems, the compute nodes of the<br />
cluster must have very similar hardware configurations and identical<br />
software installations. See “Homogeneous Nodes” on page 3-26 for more<br />
information.<br />
2. Check that the BIOS is set properly according to the instructions in the<br />
<strong>QLogic</strong> InfiniBand Adapter Hardware Installation <strong>Guide</strong>.<br />
3. Set up the Distributed Subnet Administration (SA) to correctly synchronize<br />
your virtual fabrics. See “<strong>QLogic</strong> Distributed Subnet Administration” on<br />
page 3-13.<br />
4. Adjust settings, including setting the appropriate MTU size. See “Adapter<br />
and Other Settings” on page 3-26.<br />
5. Remove unneeded services. <strong>QLogic</strong> recommends turning irqbalance off.<br />
See “Remove Unneeded Services” on page 3-28.<br />
6. Disable powersaving features. See “Disable Powersaving Features” on<br />
page 3-29.<br />
7. Check other performance tuning settings. See “Performance Settings and<br />
Management Tips” on page 3-25.<br />
8. If using Intel ® processors, turn off Hyper-Threading. See “Hyper-Threading”<br />
on page 3-29.<br />
9. Set up the host environment to use ssh. Two methods are discussed in<br />
“<strong>Host</strong> Environment Setup for MPI” on page 3-29.<br />
10. Verify the cluster setup. See “Checking Cluster and <strong>Software</strong> Status” on<br />
page 3-34.<br />
Using MPI<br />
1. Verify that the <strong>QLogic</strong> hardware and software have been installed on all the<br />
nodes you will be using, and that ssh is set up on your cluster (see all the<br />
steps in the Cluster Setup checklist).<br />
2. Copy the examples to your working directory. See “Copy Examples” on<br />
page 4-3.<br />
3. Make an mpihosts file that lists the nodes where your programs will run.<br />
See “Create the mpihosts File” on page 4-3.<br />
4. Compile the example C program using the default wrapper script mpicc.<br />
Use mpirun to run it. See “Compile and Run an Example C Program” on<br />
page 4-4.<br />
5. Try the examples with other programming languages, C++, Fortran 77, and<br />
Fortran 90 in “Examples Using Other Programming Languages” on<br />
page 4-5.<br />
6. To test using other MPIs that run over PSM, such as MVAPICH, Open MPI,<br />
HP ® -MPI, Platform MPI, and Intel MPI, see Section 5 Using Other MPIs.<br />
7. To switch between multiple versions of Open MPI, MVAPICH, and <strong>QLogic</strong><br />
MPI, use the mpi-selector. See “Managing Open MPI, MVAPICH, and<br />
<strong>QLogic</strong> MPI with the mpi-selector Utility” on page 5-6.<br />
8. Refer to “<strong>QLogic</strong> MPI Details” on page 4-6 for more information about<br />
<strong>QLogic</strong> MPI, and to “Performance Tuning” on page 4-23 to read more about<br />
runtime performance tuning.<br />
9. Refer to Section 5 Using Other MPIs to learn about using other MPI<br />
implementations.<br />
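The checklist above can be sketched as a single shell session. The example file name cpi.c, the host names node01/node02, and the rank count are illustrative assumptions, not fixed values; adjust them for your cluster (see “Create the mpihosts File” on page 4-3):<br />

```shell
# Copy a bundled example to a working directory (cpi.c is an assumed
# example name; use whatever sources are present on your system):
cp /usr/mpi/qlogic/share/mpich/examples/cpi.c . 2>/dev/null || true

# List the nodes that will run the job, one per line, in an mpihosts file:
cat > mpihosts <<EOF
node01
node02
EOF

# Compile with the QLogic MPI wrapper script and launch two ranks
# (these steps require a node with QLogic MPI installed):
if command -v mpicc >/dev/null 2>&1; then
    mpicc -o cpi cpi.c
    mpirun -np 2 -m mpihosts ./cpi
fi
```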
3 TrueScale Cluster Setup<br />
and Administration<br />
Introduction<br />
This section describes what the cluster administrator needs to know about the<br />
<strong>QLogic</strong> <strong>OFED+</strong> software and system administration.<br />
The TrueScale driver ib_qib, <strong>QLogic</strong> Performance Scaled Messaging (PSM),<br />
accelerated Message-Passing Interface (MPI) stack, the protocol and MPI support<br />
libraries, and other modules are components of the <strong>QLogic</strong> <strong>OFED+</strong> software. This<br />
software provides the foundation that supports the MPI implementation.<br />
Figure 3-1 illustrates these relationships. Note that HP-MPI, Scali, MVAPICH,<br />
MVAPICH2, and Open MPI can run either over PSM or OpenFabrics ® <strong>User</strong> Verbs.<br />
The <strong>QLogic</strong> Virtual Network Interface Controller (VNIC) driver module is also<br />
illustrated in the figure.<br />
[Figure 3-1 is a block diagram: MPI applications in user space run over <strong>QLogic</strong> MPI, HP-MPI, Scali, MVAPICH, MVAPICH2, and Open MPI, which use either the <strong>QLogic</strong> <strong>OFED+</strong> communication library (PSM) or <strong>User</strong> Verbs; HP-MPI and Intel MPI can also use uDAPL, and the <strong>QLogic</strong> FM uses the uMAD API. In kernel space, TCP/IP, IPoIB, VNIC, and the common InfiniBand/OpenFabrics stack sit above the <strong>QLogic</strong> <strong>OFED+</strong> driver and the <strong>QLogic</strong> InfiniBand adapter hardware.]<br />
Figure 3-1. <strong>QLogic</strong> <strong>OFED+</strong> <strong>Software</strong> Structure<br />
Installed Layout<br />
This section describes the default installed layout for the <strong>QLogic</strong> <strong>OFED+</strong> software<br />
and <strong>QLogic</strong>-supplied MPIs.<br />
The <strong>QLogic</strong> MPI is installed in:<br />
/usr/mpi/qlogic<br />
The shared libraries are installed in:<br />
/usr/mpi/qlogic/lib for 32-bit applications<br />
/usr/mpi/qlogic/lib64 for 64-bit applications<br />
MPI include files are in:<br />
/usr/mpi/qlogic/include<br />
MPI programming examples and the source for several MPI benchmarks are in:<br />
/usr/mpi/qlogic/share/mpich/examples<br />
NOTE:<br />
If <strong>QLogic</strong> MPI is installed in an alternate location, the argument passed to<br />
--prefix (Your location) replaces the default /usr/mpi/qlogic<br />
prefix. <strong>QLogic</strong> MPI binaries, documentation, and libraries are installed under<br />
that prefix. However, a few configuration files are installed in /etc<br />
regardless of the desired --prefix.<br />
If you have installed the software into an alternate location, the<br />
$MPICH_ROOT environment variable needs to match --prefix.<br />
<strong>QLogic</strong> <strong>OFED+</strong> utility programs are installed in:<br />
/usr/bin<br />
Documentation is found in:<br />
/usr/share/man<br />
/usr/share/doc/infinipath<br />
/usr/share/doc/mpich-infinipath<br />
License information is found only in /usr/share/doc/infinipath. <strong>QLogic</strong><br />
<strong>OFED+</strong> <strong>Host</strong> <strong>Software</strong> user documentation can be found on the <strong>QLogic</strong> web site<br />
on the software download page for your distribution.<br />
Configuration files are found in:<br />
/etc/sysconfig<br />
Init scripts are found in:<br />
/etc/init.d<br />
The TrueScale driver modules in this release are installed in:<br />
/lib/modules/$(uname -r)/<br />
updates/kernel/drivers/infiniband/hw/qib<br />
Most of the other OFED modules are installed under the infiniband<br />
subdirectory. Other modules are installed under:<br />
/lib/modules/$(uname -r)/updates/kernel/drivers/net<br />
The RDS modules are installed under:<br />
/lib/modules/$(uname -r)/updates/kernel/net/rds<br />
<strong>QLogic</strong>-supplied OpenMPI and MVAPICH RPMs with PSM support and compiled<br />
with GCC, PathScale, PGI, and the Intel compilers are now installed in directories<br />
using this format:<br />
/usr/mpi/&lt;compiler&gt;/&lt;mpi&gt;-&lt;mpi_version&gt;-qlc<br />
For example:<br />
/usr/mpi/gcc/openmpi-1.4-qlc<br />
TrueScale and OpenFabrics Driver Overview<br />
The TrueScale ib_qib module provides low-level <strong>QLogic</strong> hardware support, and<br />
is the base driver for both MPI/PSM programs and general OpenFabrics protocols<br />
such as IPoIB and the Sockets Direct Protocol (SDP). The driver also supplies the Subnet<br />
Management Agent (SMA) component.<br />
Optional configurable OpenFabrics components and their default settings at<br />
startup are:<br />
• IPoIB network interface. This component is required for TCP/IP networking<br />
over the TrueScale link. It is not running until it is<br />
configured.<br />
• VNIC. It is not running until it is configured.<br />
• OpenSM. This component is disabled at startup. It can be enabled on one<br />
node as a master, with another node as a standby, or disabled on all<br />
nodes except the one that will act as the SM.<br />
• SRP (OFED and <strong>QLogic</strong> modules). SRP is not running until the module is<br />
loaded and the SRP devices on the fabric have been discovered.<br />
• MPI over uDAPL (can be used by Intel MPI or HP®-MPI). IPoIB must be<br />
configured before MPI over uDAPL can be set up.<br />
Other optional drivers can now be configured and enabled, as described in “IPoIB<br />
Network Interface Configuration” on page 3-4.<br />
Complete information about starting, stopping, and restarting the <strong>QLogic</strong> <strong>OFED+</strong><br />
services is in “Managing the TrueScale Driver” on page 3-22.<br />
IPoIB Network Interface Configuration<br />
The following instructions show you how to manually configure your OpenFabrics<br />
IPoIB network interface. <strong>QLogic</strong> recommends using the <strong>QLogic</strong> <strong>OFED+</strong> <strong>Host</strong><br />
<strong>Software</strong> Installation package that automatically installs the IPoIB Network<br />
Interface configuration. This example assumes that you are using sh or bash as<br />
your shell, all required <strong>QLogic</strong> <strong>OFED+</strong> and OpenFabrics RPMs are installed, and<br />
your startup scripts have been run (either manually or at system boot).<br />
For this example, the IPoIB network is 10.1.17.0 (one of the networks reserved for<br />
private use, and thus not routable on the Internet), with an 8-bit host portion. In this<br />
case, the netmask must be specified.<br />
This example assumes that no hosts files exist, the host being configured has the<br />
IP address 10.1.17.3, and DHCP is not used.<br />
NOTE:<br />
Instructions are only for this static IP address case. Configuration methods<br />
for using DHCP will be supplied in a later release.<br />
1. Type the following command (as a root user):<br />
ifconfig ib0 10.1.17.3 netmask 0xffffff00<br />
2. To verify the configuration, type:<br />
ifconfig ib0<br />
ifconfig ib1<br />
The output from this command will be similar to:<br />
ib0 Link encap:InfiniBand HWaddr<br />
00:00:00:02:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00<br />
inet addr:10.1.17.3 Bcast:10.1.17.255 Mask:255.255.255.0<br />
UP BROADCAST RUNNING MULTICAST MTU:4096 Metric:1<br />
RX packets:0 errors:0 dropped:0 overruns:0 frame:0<br />
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0<br />
collisions:0 txqueuelen:128<br />
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)<br />
3. Type:<br />
ping -c 2 -b 10.1.17.255<br />
The output of the ping command will be similar to the following, with a line<br />
for each host already configured and connected:<br />
WARNING: pinging broadcast address<br />
PING 10.1.17.255 (10.1.17.255) 56(84) bytes of data.<br />
64 bytes from 10.1.17.3: icmp_seq=0 ttl=64 time=0.022 ms<br />
64 bytes from 10.1.17.1: icmp_seq=0 ttl=64 time=0.070 ms (DUP!)<br />
64 bytes from 10.1.17.7: icmp_seq=0 ttl=64 time=0.073 ms (DUP!)<br />
The IPoIB network interface is now configured.<br />
4. Restart (as a root user) by typing:<br />
/etc/init.d/openibd restart<br />
NOTE:<br />
• The configuration must be repeated each time the system is rebooted.<br />
• IPoIB-CM (Connected Mode) is enabled by default. The setting in<br />
/etc/infiniband/openib.conf is SET_IPOIB_CM=yes. To use<br />
datagram mode, change the setting to SET_IPOIB_CM=no.<br />
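The mode switch described in the note amounts to a one-line edit followed by a driver restart. The sketch below runs against a scratch copy of the file so it is safe to try anywhere; on a real node, edit /etc/infiniband/openib.conf itself and then restart openibd:<br />

```shell
# Sketch: flip IPoIB from connected mode to datagram mode.
# A scratch file stands in for /etc/infiniband/openib.conf here.
conf=$(mktemp)
echo 'SET_IPOIB_CM=yes' > "$conf"   # stand-in for openib.conf
sed -i 's/^SET_IPOIB_CM=yes$/SET_IPOIB_CM=no/' "$conf"
cat "$conf"
# on the real system, follow with: /etc/init.d/openibd restart
```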
IPoIB Administration<br />
Administering IPoIB<br />
Stopping, Starting and Restarting the IPoIB Driver<br />
<strong>QLogic</strong> recommends using the <strong>QLogic</strong> IFS Installer TUI to stop, start, and restart<br />
the IPoIB driver. Refer to the <strong>QLogic</strong> Fabric <strong>Software</strong> Installation <strong>Guide</strong> for more<br />
information. To stop, start, and restart the IPoIB driver from the command line,<br />
use the following commands.<br />
To stop the IPoIB driver, use the following command:<br />
/etc/init.d/openibd stop<br />
To start the IPoIB driver, use the following command:<br />
/etc/init.d/openibd start<br />
To restart the IPoIB driver, use the following command:<br />
/etc/init.d/openibd restart<br />
Configuring IPoIB<br />
<strong>QLogic</strong> recommends using the <strong>QLogic</strong> IFS Installer TUI to configure the IPoIB<br />
driver. Refer to the <strong>QLogic</strong> Fabric <strong>Software</strong> Installation <strong>Guide</strong> for more<br />
information. To configure the IPoIB driver from the command line, use the<br />
following commands.<br />
Editing the IPoIB Configuration File<br />
1. For each IP Link Layer interface, create an interface configuration file,<br />
/etc/sysconfig/network-scripts/ifcfg-NAME, where NAME is the<br />
value of the NAME field specified in the CREATE block. The following is a<br />
sample /etc/sysconfig/network-scripts/ifcfg-ib1 file:<br />
DEVICE=ib1<br />
BOOTPROTO=static<br />
BROADCAST=192.168.18.255<br />
IPADDR=192.168.18.120<br />
NETMASK=255.255.255.0<br />
ONBOOT=yes<br />
NOTE:<br />
For IPoIB, the INSTALL script for the adapter now helps the user<br />
create the ifcfg files.<br />
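The sample file from step 1 can also be generated from a script. In this sketch a scratch directory stands in for /etc/sysconfig/network-scripts so it can be run anywhere; the addresses are the sample values shown above:<br />

```shell
# Sketch: generate the sample ifcfg-ib1 file from step 1.
# Replace $dir with /etc/sysconfig/network-scripts on a real node.
dir=$(mktemp -d)
cat > "$dir/ifcfg-ib1" <<'EOF'
DEVICE=ib1
BOOTPROTO=static
BROADCAST=192.168.18.255
IPADDR=192.168.18.120
NETMASK=255.255.255.0
ONBOOT=yes
EOF
```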
2. After modifying the /etc/sysconfig/ipoib.cfg file, restart the IPoIB driver<br />
with the following:<br />
/etc/init.d/openibd restart<br />
IB Bonding<br />
IB bonding is a high availability solution for IPoIB interfaces. It is based on the<br />
Linux Ethernet Bonding Driver and was adapted to work with IPoIB. Support for<br />
IPoIB interfaces is limited to active-backup mode; other modes should not be<br />
used. <strong>QLogic</strong> only supports bonding across <strong>Host</strong> Channel Adapter ports at this<br />
time. Bonding port 1 and port 2 on the same <strong>Host</strong> Channel Adapter is not<br />
supported in this release.<br />
Interface Configuration Scripts<br />
Create interface configuration scripts for the ibX and bondX interfaces. Once the<br />
configurations are in place, perform a server reboot, or a service network restart.<br />
For SLES operating systems (OS), a server reboot is required. Refer to the<br />
following standard syntax for bonding configuration by the OS.<br />
NOTE:<br />
For all of the following OS configuration script examples that set MTU,<br />
MTU=65520 is valid only if all IPoIB slaves operate in connected mode and<br />
are configured with the same value. For IPoIB slaves that work in datagram<br />
mode, use MTU=2044. If the MTU is not set correctly or the MTU is not set<br />
at all (set to the default value), performance of the interface may be lower.<br />
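The MTU consistency rule in the note can be checked from sysfs. The function below takes the sysfs root as a parameter so it can be exercised against a scratch tree; on a real node you would pass /sys/class/net and the ib interface names (ib0 and ib1 here are assumptions):<br />

```shell
# Sketch: verify every bonding slave reports the expected IPoIB MTU
# (65520 for connected mode, 2044 for datagram mode).
check_mtu() {   # usage: check_mtu <expected> <sysfs-root> <dev>...
  expected=$1; root=$2; shift 2
  for dev in "$@"; do
    mtu=$(cat "$root/$dev/mtu") || return 1
    if [ "$mtu" != "$expected" ]; then
      echo "$dev: MTU $mtu, expected $expected"
      return 1
    fi
  done
}
# on a real node: check_mtu 65520 /sys/class/net ib0 ib1
```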
Red Hat EL4 Update 8<br />
The following is an example for bond0 (master). The file is named<br />
/etc/sysconfig/network-scripts/ifcfg-bond0:<br />
DEVICE=bond0<br />
IPADDR=192.168.1.1<br />
NETMASK=255.255.255.0<br />
NETWORK=192.168.1.0<br />
BROADCAST=192.168.1.255<br />
ONBOOT=yes<br />
BOOTPROTO=none<br />
USERCTL=no<br />
TYPE=Bonding<br />
MTU=65520<br />
BONDING_OPTS="primary=ib0 updelay=0 downdelay=0"<br />
The following is an example for ib0 (slave). The file is named<br />
/etc/sysconfig/network-scripts/ifcfg-ib0:<br />
DEVICE=ib0<br />
USERCTL=no<br />
ONBOOT=yes<br />
MASTER=bond0<br />
SLAVE=yes<br />
BOOTPROTO=none<br />
TYPE=InfiniBand<br />
PRIMARY=yes<br />
The following is an example for ib1 (slave 2). The file is named<br />
/etc/sysconfig/network-scripts/ifcfg-ib1:<br />
DEVICE=ib1<br />
USERCTL=no<br />
ONBOOT=yes<br />
MASTER=bond0<br />
SLAVE=yes<br />
BOOTPROTO=none<br />
TYPE=InfiniBand<br />
Add the following lines to the file /etc/modprobe.conf:<br />
alias bond0 bonding<br />
options bond0 miimon=100 mode=1 max_bonds=1<br />
Red Hat EL5, All Updates<br />
The following is an example for bond0 (master). The file is named<br />
/etc/sysconfig/network-scripts/ifcfg-bond0:<br />
DEVICE=bond0<br />
IPADDR=192.168.1.1<br />
NETMASK=255.255.255.0<br />
NETWORK=192.168.1.0<br />
BROADCAST=192.168.1.255<br />
ONBOOT=yes<br />
BOOTPROTO=none<br />
USERCTL=no<br />
MTU=65520<br />
BONDING_OPTS="primary=ib0 updelay=0 downdelay=0"<br />
The following is an example for ib0 (slave). The file is named<br />
/etc/sysconfig/network-scripts/ifcfg-ib0:<br />
DEVICE=ib0<br />
USERCTL=no<br />
ONBOOT=yes<br />
MASTER=bond0<br />
SLAVE=yes<br />
BOOTPROTO=none<br />
TYPE=InfiniBand<br />
PRIMARY=yes<br />
The following is an example for ib1 (slave 2). The file is named<br />
/etc/sysconfig/network-scripts/ifcfg-ib1:<br />
DEVICE=ib1<br />
USERCTL=no<br />
ONBOOT=yes<br />
MASTER=bond0<br />
SLAVE=yes<br />
BOOTPROTO=none<br />
TYPE=InfiniBand<br />
Add the following lines to the file /etc/modprobe.conf:<br />
alias bond0 bonding<br />
options bond0 miimon=100 mode=1 max_bonds=1<br />
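Because /etc/modprobe.conf is shared with other module settings, the two bonding lines above are best added only when missing so the edit can be repeated safely. This sketch uses a scratch file standing in for /etc/modprobe.conf:<br />

```shell
# Sketch: append the bonding alias/options lines only if absent,
# making the edit idempotent. $conf stands in for /etc/modprobe.conf.
conf=$(mktemp)
add_bonding() {
  grep -q '^alias bond0 bonding$' "$1" || {
    echo 'alias bond0 bonding' >> "$1"
    echo 'options bond0 miimon=100 mode=1 max_bonds=1' >> "$1"
  }
}
add_bonding "$conf"
add_bonding "$conf"   # second call makes no change
```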
SuSE Linux Enterprise Server (SLES) 10 and 11<br />
The following is an example for bond0 (master). The file is named<br />
/etc/sysconfig/network/ifcfg-bond0:<br />
DEVICE="bond0"<br />
TYPE="Bonding"<br />
IPADDR="192.168.1.1"<br />
NETMASK="255.255.255.0"<br />
NETWORK="192.168.1.0"<br />
BROADCAST="192.168.1.255"<br />
BOOTPROTO="static"<br />
USERCTL="no"<br />
STARTMODE="onboot"<br />
BONDING_MASTER="yes"<br />
BONDING_MODULE_OPTS="mode=active-backup miimon=100<br />
primary=ib0 updelay=0 downdelay=0"<br />
BONDING_SLAVE0=ib0<br />
BONDING_SLAVE1=ib1<br />
MTU=65520<br />
The following is an example for ib0 (slave). The file is named<br />
/etc/sysconfig/network/ifcfg-ib0:<br />
DEVICE='ib0'<br />
BOOTPROTO='none'<br />
STARTMODE='off'<br />
WIRELESS='no'<br />
ETHTOOL_OPTIONS=''<br />
NAME=''<br />
USERCONTROL='no'<br />
IPOIB_MODE='connected'<br />
The following is an example for ib1 (slave 2). The file is named<br />
/etc/sysconfig/network/ifcfg-ib1:<br />
DEVICE='ib1'<br />
BOOTPROTO='none'<br />
STARTMODE='off'<br />
WIRELESS='no'<br />
ETHTOOL_OPTIONS=''<br />
NAME=''<br />
USERCONTROL='no'<br />
IPOIB_MODE='connected'<br />
Verify that the following line in /etc/sysconfig/boot is set to yes:<br />
RUN_PARALLEL="yes"<br />
Verify IB Bonding is Configured<br />
After the configuration scripts are updated and the network service is restarted<br />
(or the server is rebooted), use the following CLI commands to verify that IB<br />
bonding is configured.<br />
• cat /proc/net/bonding/bond0<br />
• ifconfig<br />
Example of cat /proc/net/bonding/bond0 output:<br />
# cat /proc/net/bonding/bond0<br />
Ethernet Channel Bonding Driver: v3.2.3 (December 6, 2007)<br />
Bonding Mode: fault-tolerance (active-backup) (fail_over_mac)<br />
Primary Slave: ib0<br />
Currently Active Slave: ib0<br />
MII Status: up<br />
MII Polling Interval (ms): 100<br />
Up Delay (ms): 0<br />
Down Delay (ms): 0<br />
Slave Interface: ib0<br />
MII Status: up<br />
Link Failure Count: 0<br />
Permanent HW addr: 80:00:04:04:fe:80<br />
Slave Interface: ib1<br />
MII Status: up<br />
Link Failure Count: 0<br />
Permanent HW addr: 80:00:04:05:fe:80<br />
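Fields in the bonding status file can also be checked programmatically. The sketch below embeds a shortened sample of the output shown above so it is runnable anywhere; on a real node you would read /proc/net/bonding/bond0 directly:<br />

```shell
# Sketch: extract the currently active slave from bonding status
# output. An embedded sample stands in for /proc/net/bonding/bond0.
status=$(mktemp)
cat > "$status" <<'EOF'
Bonding Mode: fault-tolerance (active-backup) (fail_over_mac)
Primary Slave: ib0
Currently Active Slave: ib0
MII Status: up
EOF
active=$(awk -F': ' '/^Currently Active Slave/ {print $2}' "$status")
echo "$active"
```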
Example of ifconfig output:<br />
st2169:/etc/sysconfig # ifconfig<br />
bond0 Link encap:InfiniBand HWaddr<br />
80:00:00:02:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00<br />
inet addr:192.168.1.1 Bcast:192.168.1.255 Mask:255.255.255.0<br />
inet6 addr: fe80::211:7500:ff:909b/64 Scope:Link<br />
UP BROADCAST RUNNING MASTER MULTICAST MTU:65520 Metric:1<br />
RX packets:120619276 errors:0 dropped:0 overruns:0 frame:0<br />
TX packets:120619277 errors:0 dropped:137 overruns:0 carrier:0<br />
collisions:0 txqueuelen:0<br />
RX bytes:10132014352 (9662.6 Mb) TX bytes:10614493096 (10122.7 Mb)<br />
ib0 Link encap:InfiniBand HWaddr<br />
80:00:00:02:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00<br />
UP BROADCAST RUNNING SLAVE MULTICAST MTU:65520 Metric:1<br />
RX packets:118938033 errors:0 dropped:0 overruns:0 frame:0<br />
TX packets:118938027 errors:0 dropped:41 overruns:0 carrier:0<br />
collisions:0 txqueuelen:256<br />
RX bytes:9990790704 (9527.9 Mb) TX bytes:10466543096 (9981.6 Mb)<br />
ib1 Link encap:InfiniBand HWaddr<br />
80:00:00:02:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00<br />
UP BROADCAST RUNNING SLAVE MULTICAST MTU:65520 Metric:1<br />
RX packets:1681243 errors:0 dropped:0 overruns:0 frame:0<br />
TX packets:1681250 errors:0 dropped:96 overruns:0 carrier:0<br />
collisions:0 txqueuelen:256<br />
RX bytes:141223648 (134.6 Mb) TX bytes:147950000 (141.0 Mb)<br />
Subnet Manager Configuration<br />
<strong>QLogic</strong> recommends using the <strong>QLogic</strong> Fabric Manager to manage your fabric.<br />
Refer to the <strong>QLogic</strong> Fabric Manager <strong>User</strong> <strong>Guide</strong> for information on configuring the<br />
<strong>QLogic</strong> Fabric Manager.<br />
OpenSM is an optional component of the OpenFabrics project that provides a<br />
Subnet Manager (SM) for InfiniBand networks. This package can be installed on<br />
all machines, but only needs to be enabled on the machine in the cluster that will<br />
act as a subnet manager. You cannot use OpenSM if any of your InfiniBand<br />
switches provide a subnet manager, or if you are running a host-based SM.<br />
WARNING!!<br />
Do not run OpenSM and <strong>QLogic</strong> FM in the same fabric.<br />
If you are using the Installer tool, you can set the OpenSM default behavior at the<br />
time of installation.<br />
OpenSM only needs to be enabled on the node that acts as the subnet manager,<br />
so use the chkconfig command (as a root user) to enable it on the node where it<br />
will be run:<br />
chkconfig opensmd on<br />
The command to disable it on reboot is:<br />
chkconfig opensmd off<br />
You can start opensmd without rebooting your machine by typing:<br />
/etc/init.d/opensmd start<br />
You can stop opensmd again by typing:<br />
/etc/init.d/opensmd stop<br />
If you want to pass any arguments to the OpenSM program, modify the following<br />
file, and add the arguments to the OPTIONS variable:<br />
/etc/init.d/opensmd<br />
For example, to use the UPDN algorithm instead of the Min Hop algorithm:<br />
OPTIONS="-R updn"<br />
For more information on OpenSM, see the OpenSM man pages, or look on the<br />
OpenFabrics web site.<br />
<strong>QLogic</strong> Distributed Subnet Administration<br />
As InfiniBand clusters are scaled into the petaflop range and beyond, a more<br />
efficient method for handling queries to the Fabric Manager is required. Although<br />
the Fabric Manager can configure and operate that many nodes, under certain<br />
conditions it can become overloaded with queries from those same nodes.<br />
For example, consider an InfiniBand fabric consisting of 1,000 nodes, each with 4<br />
processors. When a large MPI job is started across the entire fabric, each process<br />
needs to collect InfiniBand path records for every other node in the fabric - and<br />
every single process is going to be querying the subnet manager for these path<br />
records at roughly the same time. This amounts to a total of 3.9 million path<br />
queries just to start the job!<br />
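The arithmetic behind that figure can be sketched directly: 1,000 nodes with 4 processes each, and every process querying a path record for each of the other 999 nodes:<br />

```shell
# Sketch of the path-query arithmetic for the example above.
nodes=1000
procs_per_node=4
queries=$(( nodes * procs_per_node * (nodes - 1) ))
echo "$queries"   # 3996000, the roughly 3.9 million quoted above
```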
In the past, MPI implementations have side-stepped this problem by hand-crafting<br />
path records themselves, but this solution cannot be used if advanced fabric<br />
management techniques such as virtual fabrics and mesh/torus configurations are<br />
being used. In such cases, only the subnet manager itself has enough information<br />
to correctly build a path record between two nodes.<br />
The Distributed Subnet Administration (SA) solves this problem by allowing each<br />
node to locally replicate the path records needed to reach the other nodes on the<br />
fabric. At boot time, each Distributed SA queries the subnet manager for<br />
information about the relevant parts of the fabric, backing off whenever the subnet<br />
manager indicates that it is busy. Once this information is in the Distributed SA's<br />
database, it is ready to answer local path queries from MPI or other InfiniBand<br />
applications. If the fabric changes (due to a switch failure or a node being added<br />
or removed from the fabric) the Distributed SA updates the affected portions of the<br />
database. The Distributed SA is installed and runs on every node in the fabric,<br />
except the management node running <strong>QLogic</strong> FM.<br />
Applications that use Distributed SA<br />
The <strong>QLogic</strong> PSM Library has been extended to take advantage of Distributed SA.<br />
Therefore, all MPIs that use the <strong>QLogic</strong> PSM library can take advantage of the<br />
Distributed SA. Other applications must be modified specifically to take advantage<br />
of it. For developers writing applications that use the Distributed SA, please refer<br />
to the header file /usr/include/Infiniband/ofedplus_path.h for information on<br />
modifying applications to use the Distributed SA. This file can be found on any<br />
node where the Distributed SA is installed. For further assistance please contact<br />
<strong>QLogic</strong> Support.<br />
Virtual Fabrics and the Distributed SA<br />
The Distributed SA is designed to be aware of Virtual Fabrics, but to only store<br />
records for those Virtual Fabrics that match Service ID records in the Distributed<br />
SA's configuration file. In addition, the Distributed SA recognizes when multiple<br />
SIDs match the same Virtual Fabric and will only store one copy of each path<br />
record within a Virtual Fabric. Next, SIDs that match more than one Virtual Fabric<br />
will be assigned to a single Virtual Fabric. Finally, Virtual Fabrics that do not<br />
match SIDs in the Distributed SA's database will be ignored.<br />
Configuring the Distributed SA<br />
To minimize the number of queries made by the Distributed SA,<br />
it is important to configure it correctly, both to match the configuration of the Fabric<br />
Manager and to exclude those portions of the fabric that will not be used by<br />
applications using the Distributed SA. The configuration file for the Distributed SA<br />
is named /etc/sysconfig/iba/qlogic_sa.conf.<br />
Default Configuration<br />
As shipped, the <strong>QLogic</strong> Fabric Manager creates a single virtual fabric, called<br />
“Default” and maps all nodes and Service IDs to it, and the Distributed SA ships<br />
with a configuration that lists a set of thirty-one Service IDs (SID<br />
0x1000117500000000 through 0x100011750000000f and 0x1 through 0xf). This<br />
results in an arrangement like the one shown in Figure 3-2.<br />
[Figure content: the InfiniBand fabric side shows Virtual Fabric “Default” (Pkey 0xffff,<br />
SID range 0x0-0xffffffffffffffff); the Distributed SA side shows Virtual Fabric “Default”<br />
(Pkey 0xffff, SID ranges 0x1-0xf and 0x1000117500000000-0x100011750000000f).]<br />
Figure 3-2. Distributed SA Default Configuration<br />
If you are using the <strong>QLogic</strong> FM in its default configuration, and you are using the<br />
standard <strong>QLogic</strong> PSM Service IDs, this arrangement will work fine and you will not<br />
need to modify the Distributed SA's configuration file - but notice that the<br />
Distributed SA has restricted the range of Service IDs it cares about to those that<br />
were defined in its configuration file. Attempts to get path records using other SIDs<br />
will not work, even if those other SIDs are valid for the fabric.<br />
Multiple Virtual Fabrics Example<br />
A person configuring the physical InfiniBand fabric may want to limit how much<br />
InfiniBand bandwidth MPI applications are permitted to consume. In that case,<br />
they may re-configure the <strong>QLogic</strong> Fabric Manager, turning off the "Default" Virtual<br />
Fabric and replacing it with several other Virtual Fabrics.<br />
In Figure 3-3, the administrator has divided the physical fabric into four virtual<br />
fabrics: "Admin" (used to communicate with the Fabric Manager), "Storage" (used<br />
by SRP), "PSM_MPI" (used by regular MPI jobs) and a special "Reserved" fabric<br />
for special high-priority jobs.<br />
[Figure content: the InfiniBand fabric side shows Virtual Fabrics “Admin” (Pkey 0x7fff),<br />
“Reserved” (Pkey 0x8002, SID range 0x10-0x1f), “Storage” (Pkey 0x8001, SID<br />
0x0000494353535250), and “PSM_MPI” (Pkey 0x8003, SID ranges 0x1-0xf and<br />
0x1000117500000000-0x100011750000000f); the Distributed SA side shows only<br />
“PSM_MPI” (Pkey 0x8003) with those same SID ranges.]<br />
Figure 3-3. Distributed SA Multiple Virtual Fabrics Example<br />
Because the Distributed SA was not configured to include the SID<br />
range 0x10 through 0x1f, it has simply ignored the "Reserved" VF. Adding those<br />
SIDs to the qlogic_sa.conf file solves the problem as shown in Figure 3-4.<br />
[Figure content: the InfiniBand fabric side shows Virtual Fabrics “Admin” (Pkey 0x7fff),<br />
“Reserved” (Pkey 0x8002, SID range 0x10-0x1f), “Storage” (Pkey 0x8001, SID<br />
0x0000494353535250), and “PSM_MPI” (Pkey 0x8003, SID ranges 0x1-0xf and<br />
0x1000117500000000-0x100011750000000f); the Distributed SA side now shows both<br />
“Reserved” (Pkey 0x8002, SID range 0x10-0x1f) and “PSM_MPI” (Pkey 0x8003) with<br />
its SID ranges.]<br />
Figure 3-4. Distributed SA Multiple Virtual Fabrics Configured Example<br />
Virtual Fabrics with Overlapping Definitions<br />
As defined, SIDs should never be shared between Virtual Fabrics. Unfortunately,<br />
it is very easy to accidentally create such overlaps. Figure 3-5 shows an example<br />
with overlapping definitions.<br />
[Figure content: the InfiniBand fabric side shows Virtual Fabrics “Default” (Pkey 0xffff,<br />
SID range 0x0-0xffffffffffffffff) and “PSM_MPI” (Pkey 0x8002, SID ranges 0x1-0xf and<br />
0x1000117500000000-0x100011750000000f); the Distributed SA is looking for SID ranges<br />
0x1-0xf and 0x1000117500000000-0x100011750000000f.]<br />
Figure 3-5. Virtual Fabrics with Overlapping Definitions<br />
In Figure 3-5, the fabric administrator enabled the "PSM_MPI" Virtual Fabric<br />
without modifying the "Default" Virtual Fabric. As a result, the Distributed SA sees<br />
two different virtual fabrics that match its configuration file.<br />
In Figure 3-6, the person administering the fabric has created two different Virtual<br />
Fabrics without turning off the Default - and two of the new fabrics have<br />
overlapping SID ranges.<br />
[Figure content: the InfiniBand fabric side shows Virtual Fabrics “Reserved” (ID 2,<br />
Pkey 0x8003, SID range 0x1-0xf), “Default” (Pkey 0xffff, SID range<br />
0x0-0xffffffffffffffff), and “PSM_MPI” (ID 1, Pkey 0x8002, SID ranges 0x1-0xf and<br />
0x1000117500000000-0x100011750000000f); the Distributed SA is looking for SID ranges<br />
0x1-0xf and 0x1000117500000000-0x100011750000000f.]<br />
Figure 3-6. Virtual Fabrics with PSM_MPI Virtual Fabric Enabled<br />
In Figure 3-6, the administrator enabled the "PSM_MPI" fabric, and then added a<br />
new "Reserved" fabric that uses one of the SID ranges that "PSM_MPI" uses.<br />
When a path query has been received, the Distributed SA deals with these<br />
conflicts as follows:<br />
First, any virtual fabric with a pkey of 0xffff is declared to be the "Default". The<br />
"Default" Virtual Fabric is treated as a special case by the Distributed SA. The<br />
"Default" Virtual Fabric is used only as a last resort. Stored SIDs are only mapped<br />
to the default if they do not match any other Virtual Fabrics. Thus, in the first<br />
example, Figure 3-5, the Distributed SA will assign all the SIDs in its configuration<br />
file to the "PSM_MPI" Virtual Fabric as shown in Figure 3-7.<br />
[Figure content: the InfiniBand fabric side shows Virtual Fabrics “Default” (Pkey 0xffff,<br />
SID range 0x0-0xffffffffffffffff) and “PSM_MPI” (Pkey 0x8002, SID ranges 0x1-0xf and<br />
0x1000117500000000-0x100011750000000f); the Distributed SA side shows “PSM_MPI”<br />
(Pkey 0x8002) with both SID ranges assigned to it.]<br />
Figure 3-7. Virtual Fabrics with all SIDs assigned to PSM_MPI Virtual Fabric<br />
Second, the Distributed SA handles overlaps by taking advantage of the fact that<br />
Virtual Fabrics have unique numeric indexes. (These IDs can be seen by using<br />
the command "iba_saquery -o vfinfo".) The Distributed SA will always<br />
assign a SID to the Virtual Fabric with the lowest ID number, as shown in<br />
Figure 3-8. This ensures that all copies of the Distributed SA in the InfiniBand<br />
fabric will make the same decisions about assigning SIDs. However, it also means<br />
that the behavior of your fabric can be affected by the order you configured the<br />
virtual fabrics.<br />
[Figure content: the InfiniBand fabric side shows Virtual Fabrics “Reserved” (ID 2,<br />
Pkey 0x8003, SID range 0x1-0xf), “Default” (Pkey 0xffff, SID range<br />
0x0-0xffffffffffffffff), and “PSM_MPI” (ID 1, Pkey 0x8002, SID ranges 0x1-0xf and<br />
0x1000117500000000-0x100011750000000f); the Distributed SA side shows “PSM_MPI”<br />
(Pkey 0x8002) with both SID ranges assigned to it.]<br />
Figure 3-8. Virtual Fabrics with Unique Numeric Indexes<br />
In Figure 3-8, the Distributed SA assigns all overlapping SIDs to the "PSM_MPI"<br />
fabric because it has the lowest index.<br />
NOTE:<br />
The Distributed SA makes these assignments not because they are right,<br />
but because they allow the fabric to work even though there are errors. The<br />
correct solution in these cases is to redefine the fabric so that no node will<br />
ever be a member of two Virtual Fabrics that service the same SID.<br />
Distributed SA Configuration File<br />
The Distributed SA configuration file is<br />
/etc/sysconfig/iba/qlogic_sa.conf. It has several settings, but normally<br />
administrators will only need to deal with two or three of them.<br />
SID<br />
The SID is the primary configuration setting for the Distributed SA, and it can be<br />
specified multiple times. When the Distributed SA starts, it loads information about<br />
the Virtual Fabrics specified by the Fabric Manager and each SID is mapped to a<br />
single virtual fabric. Multiple SIDs can be mapped to the same virtual fabric. The<br />
default configuration for the Distributed SA includes all the SIDs defined in the<br />
default Qlogic FM configuration for use by MPI.<br />
The SID arguments have a very particular logic that must be understood for<br />
correct operation. A SID= argument defines one Service ID that is associated with<br />
a single virtual fabric. In addition, multiple SID= arguments can point to a single<br />
virtual fabric. For example, a virtual fabric has three sets of SIDs associated with<br />
it: 0x0a1 through 0x0a3, 0x1a1 through 0x1a3, and 0x2a1 through 0x2a3. You<br />
would define this as:<br />
SID=0x0a1<br />
SID=0x0a2<br />
SID=0x0a3<br />
SID=0x1a1<br />
SID=0x1a2<br />
SID=0x1a3<br />
SID=0x2a1<br />
SID=0x2a2<br />
SID=0x2a3<br />
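The nine SID= lines above follow a simple pattern; for larger ranges they can be generated with a short shell loop (a sketch — the output must still be pasted into qlogic_sa.conf by hand):

```shell
# Generate SID= lines for the three example ranges above:
# 0x0a1-0x0a3, 0x1a1-0x1a3, and 0x2a1-0x2a3.
for base in 0x0a0 0x1a0 0x2a0; do
  for i in 1 2 3; do
    printf 'SID=0x%03x\n' $(( base + i ))
  done
done
```

Running this prints the same nine SID= lines shown above, one per line.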
NOTE:<br />
A SID of zero is not supported at this time. Instead, the OPP libraries treat<br />
zero values as "unspecified".<br />
ScanFrequency<br />
Periodically, the Distributed SA will completely resynchronize its database. This<br />
also occurs if the Fabric Manager is restarted. ScanFrequency defines the<br />
minimum number of seconds between complete resynchronizations. It defaults to<br />
600 seconds, or 10 minutes. On very large fabrics, increasing this value can help<br />
reduce the total amount of SM traffic. For example, to set the interval to 15<br />
minutes, add this line to the bottom of the qlogic_sa.conf file:<br />
ScanFrequency=900<br />
LogFile<br />
Normally, the Distributed SA logs special events to /var/log/messages. This<br />
parameter allows you to specify a different destination for the log messages. For<br />
example, to direct Distributed SA messages to their own log, add this line to the<br />
bottom of the qlogic_sa.conf file:<br />
LogFile=/var/log/SAReplica.log<br />
Dbg<br />
This parameter controls how much logging the Distributed SA will do. It can be<br />
set to a number between one and seven, where one restricts logging to alerts<br />
and critical errors and seven includes informational and debugging messages.<br />
To change the Dbg setting for<br />
Distributed SA, find the line in qlogic_sa.conf that reads Dbg=5 and change it to a<br />
different value, between 1 and 7. The value of Dbg changes the amount of logging<br />
that the Distributed SA generates as follows:<br />
• Dbg=1 or Dbg=2: Alerts and Critical Errors<br />
Only errors that will cause the Distributed SA to terminate will be<br />
reported.<br />
• Dbg=3: Errors<br />
Errors will be reported, but nothing else. (Includes Dbg=1 and Dbg=2)<br />
• Dbg=4: Warnings<br />
Errors and warnings will be reported. (Includes Dbg=3)<br />
• Dbg=5: Normal<br />
Some normal events will be reported along with errors and warnings.<br />
(Includes Dbg=4)<br />
• Dbg=6: Informational Messages<br />
In addition to the normal logging, Distributed SA will report detailed<br />
information about its status and operation. Generally, this will produce<br />
too much information for normal use. (Includes Dbg=5)<br />
• Dbg=7: Debugging<br />
This should only be turned on at the request of <strong>QLogic</strong> Support. This<br />
will generate so much information that system operation will be<br />
impacted. (Includes Dbg=6)<br />
Other Settings<br />
The remaining configuration settings for the Distributed SA are generally only<br />
useful in special circumstances and are not needed in normal operation. The<br />
sample qlogic_sa.conf configuration file contains a brief description of each.<br />
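Putting the preceding settings together, a minimal qlogic_sa.conf fragment might look like the following sketch. The values are illustrative, and the sketch writes to a scratch file rather than /etc/sysconfig/iba/qlogic_sa.conf so it is safe to experiment with:

```shell
# Illustrative Distributed SA settings; copy the lines you need into
# /etc/sysconfig/iba/qlogic_sa.conf on each node.
CONF=$(mktemp)              # scratch stand-in for qlogic_sa.conf
cat >> "$CONF" <<'EOF'
SID=0x0a1
ScanFrequency=900
LogFile=/var/log/SAReplica.log
Dbg=5
EOF
grep -c '=' "$CONF"         # count the settings written
```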
MPI over uDAPL<br />
Intel MPI can be run over uDAPL, which uses InfiniBand Verbs. uDAPL is the user<br />
mode version of the Direct Access Provider Library (DAPL), and is provided as a<br />
part of the OFED packages. You will also have to have IPoIB configured.<br />
The setup for Intel MPI is described in the following steps:<br />
1. Make sure that DAPL 1.2 (not version 2.0) is installed on every node. In this<br />
release the packages are called compat-dapl; they can be installed with the<br />
<strong>QLogic</strong> <strong>OFED+</strong> <strong>Host</strong> <strong>Software</strong> package.<br />
2. Verify that there is a /etc/dat.conf file. The file dat.conf contains a list of<br />
interface adapters supported by uDAPL service providers. In particular, it<br />
must contain mapping entries for OpenIB-cma for dapl 1.2.x, in a form<br />
similar to this (all on one line):<br />
OpenIB-cma u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2<br />
"ib0 0" ""<br />
3. On every node, type the following command (as a root user):<br />
# modprobe rdma_ucm<br />
To ensure that the module is loaded when the driver is loaded, add<br />
RDMA_UCM_LOAD=yes to the /etc/infiniband/openib.conf file. (Note that<br />
rdma_cm is also used, but it is loaded automatically.)<br />
4. Bring up an IPoIB interface on every node, for example, ib0. See the<br />
instructions for configuring IPoIB for more details.<br />
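The dat.conf check in step 2 can be sketched as a small script. This is illustrative only: DAT_CONF defaults to a scratch file here so the sketch is safe to run anywhere; point it at /etc/dat.conf on a real node.

```shell
# Ensure the dapl 1.2 OpenIB-cma mapping exists, appending it if missing.
DAT_CONF=${DAT_CONF:-$(mktemp)}     # stand-in for /etc/dat.conf
ENTRY='OpenIB-cma u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib0 0" ""'
grep -q '^OpenIB-cma ' "$DAT_CONF" || printf '%s\n' "$ENTRY" >> "$DAT_CONF"
grep -c '^OpenIB-cma ' "$DAT_CONF"  # 1 once the entry is present
```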
For more information on using Intel MPI, see Section 5.<br />
Changing the MTU Size<br />
The Maximum Transfer Unit (MTU) is set to 4K and enabled in the driver by<br />
default. To see the current MTU size, and the maximum supported by the adapter,<br />
type the command:<br />
$ ibv_devinfo<br />
To change the driver default back to 2K MTU, add this line (as a root user) into<br />
/etc/modprobe.conf (or /etc/modprobe.conf.local); the value 4 is the<br />
InfiniBand MTU code for 2048 bytes:<br />
options ib_qib ibmtu=4<br />
Restart the driver as described in “Managing the TrueScale Driver” on page 3-22.<br />
NOTE:<br />
To use 4K MTU, set the switch to have the same 4K default. If you are using<br />
<strong>QLogic</strong> switches, the following applies:<br />
• For the Externally Managed 9024, use 4.2.2.0.3 firmware<br />
(9024DDR4KMTU_firmware.emfw) for the 9024 EM. This has the 4K MTU<br />
default, for use on fabrics where 4K MTU is required. If 4K MTU support<br />
is not required, then use the 4.2.2.0.2 DDR *.emfw file for DDR<br />
externally-managed switches. Use FastFabric (FF) to load the firmware<br />
on all the 9024s on the fabric.<br />
• For the 9000 chassis, use the most recent 9000 code 4.2.4.0.1. The 4K<br />
MTU support is in 9000 chassis version 4.2.1.0.2 and later. For the 9000<br />
chassis, when the FastFabric 4.3 (or later) chassis setup tool is used,<br />
the user is asked to select an MTU. FastFabric can then set that MTU in<br />
all the 9000 internally managed switches. The change will take effect on<br />
the next reboot. Alternatively, for the internally managed 9000s, the<br />
ismChassisSetMtu Command Line Interface (CLI) command can be<br />
used. This should be executed on every switch and both hemispheres of<br />
the 9240s.<br />
• For the 12000 switches, refer to the <strong>QLogic</strong> FastFabric <strong>User</strong> <strong>Guide</strong> for<br />
externally managed switches, and to the <strong>QLogic</strong> FastFabric CLI<br />
Reference <strong>Guide</strong> for the internally managed switches.<br />
For reference, see the <strong>QLogic</strong> FastFabric <strong>User</strong> <strong>Guide</strong> and the <strong>QLogic</strong> 9000<br />
CLI Reference <strong>Guide</strong>. Both are available from the <strong>QLogic</strong> web site.<br />
For other switches, see the vendors’ documentation.<br />
Managing the TrueScale Driver<br />
The startup script for ib_qib is installed automatically as part of the software<br />
installation, and normally does not need to be changed. It runs as a system<br />
service.<br />
The primary configuration file for the TrueScale driver ib_qib and other modules<br />
and associated daemons is /etc/infiniband/openib.conf.<br />
Normally, this configuration file is set up correctly at installation and the drivers are<br />
loaded automatically during system boot once the RPMs have been installed.<br />
However, the ib_qib driver has several configuration variables that set reserved<br />
buffers for the software, define events to create trace records, and set the debug<br />
level.<br />
If you are upgrading, your existing configuration files will not be overwritten.<br />
See the ib_qib man page for more details.<br />
Configure the TrueScale Driver State<br />
Use the following commands to check or configure the state. These methods will<br />
not reboot the system.<br />
To check the configuration state, use this command. You do not need to be a root<br />
user:<br />
$ chkconfig --list openibd<br />
To enable the driver (at run levels 2, 3, 4, and 5), use the following command (as<br />
a root user):<br />
# chkconfig --level 2345 openibd on<br />
To disable the driver on the next system boot, use the following command (as a<br />
root user):<br />
# chkconfig openibd off<br />
NOTE:<br />
This command does not stop and unload the driver if the driver is already<br />
loaded.<br />
Start, Stop, or Restart TrueScale<br />
Restart the software if you install a new <strong>QLogic</strong> <strong>OFED+</strong> <strong>Host</strong> <strong>Software</strong> release,<br />
change driver options, or do manual testing.<br />
<strong>QLogic</strong> recommends using the <strong>QLogic</strong> IFS Installer TUI to stop, start, and restart<br />
the IPoIB driver. Refer to the <strong>QLogic</strong> Fabric <strong>Software</strong> Installation <strong>Guide</strong> for more<br />
information. To stop, start, or restart TrueScale support from the command line<br />
(as a root user), use the following syntax:<br />
# /etc/init.d/openibd [start | stop | restart]<br />
WARNING!!<br />
If the <strong>QLogic</strong> Fabric Manager or OpenSM is configured and running on the<br />
node, it must be stopped before using the openibd stop command, and<br />
may be restarted after using the openibd start command.<br />
WARNING!!<br />
Stopping or restarting openibd terminates any <strong>QLogic</strong> MPI, VNIC, and SRP<br />
processes, as well as any OpenFabrics processes that are running at the<br />
time. <strong>QLogic</strong> recommends stopping all processes prior to using the openibd<br />
command.<br />
This method will not reboot the system. The following set of commands shows<br />
how to use this script.<br />
When you need to determine which TrueScale and OpenFabrics modules are<br />
running, use the following command. You do not need to be a root user.<br />
$ lsmod | egrep 'ipath_|ib_|rdma_|findex'<br />
You can check to see if opensmd is running by using the following command (as a<br />
root user); if there is no output, opensmd is not configured to run:<br />
# /sbin/chkconfig --list opensmd | grep -w on<br />
Unload the Driver/Modules Manually<br />
You can also unload the driver/modules manually without using<br />
/etc/init.d/openibd. Use the following series of commands (as a root user):<br />
# umount /ipathfs<br />
# fuser -k /dev/ipath* /dev/infiniband/*<br />
# lsmod | egrep '^ib_|^rdma_|^iw_' | xargs modprobe -r<br />
TrueScale Driver Filesystem<br />
The TrueScale driver supplies a filesystem for exporting certain binary statistics to<br />
user applications. By default, this filesystem is mounted in the /ipathfs directory<br />
when the TrueScale script is invoked with the start option (for example, at<br />
system startup). The filesystem is unmounted when the TrueScale script is<br />
invoked with the stop option (for example, at system shutdown).<br />
Here is a sample layout of a system with two cards:<br />
/ipathfs/0/flash<br />
/ipathfs/0/port2counters<br />
/ipathfs/0/port1counters<br />
/ipathfs/0/portcounter_names<br />
/ipathfs/0/counter_names<br />
/ipathfs/0/counters<br />
/ipathfs/driver_stats_names<br />
/ipathfs/driver_stats<br />
/ipathfs/1/flash<br />
/ipathfs/1/port2counters<br />
/ipathfs/1/port1counters<br />
/ipathfs/1/portcounter_names<br />
/ipathfs/1/counter_names<br />
/ipathfs/1/counters<br />
The driver_stats file contains general driver statistics. There is one numbered<br />
subdirectory per TrueScale device on the system. Each numbered subdirectory<br />
contains the following per-device files:<br />
• port1counters<br />
• port2counters<br />
• flash<br />
The port1counters and port2counters files contain counters for the<br />
device, for example, interrupts received, bytes and packets in and out, etc. The<br />
flash file is an interface for internal diagnostic commands.<br />
The file counter_names provides the names associated with each of the counters<br />
in the binary port#counters files, and the file driver_stats_names provides the<br />
names for the stats in the binary driver_stats files.<br />
More Information on Configuring and Loading<br />
Drivers<br />
See the modprobe(8), modprobe.conf(5), and lsmod(8) man pages for more<br />
information. Also see the file /usr/share/doc/initscripts-*/sysconfig.txt<br />
for more general information on configuration files.<br />
Performance Settings and Management Tips<br />
The following sections provide suggestions for improving performance and<br />
simplifying cluster management. Many of these settings will be done by the<br />
system administrator. <strong>User</strong> level runtime performance settings are shown in<br />
“Performance Tuning” on page 4-23.<br />
Homogeneous Nodes<br />
To minimize management problems, the compute nodes of the cluster should<br />
have very similar hardware configurations and identical software installations. A<br />
mismatch between the TrueScale software versions can also cause problems. Old<br />
and new libraries must not be run within the same job. It may also be useful to<br />
distinguish between the TrueScale-specific drivers and those that are associated<br />
with kernel.org, OpenFabrics, or are distribution-built. The most useful tools are:<br />
• ident (see “ident” on page I-9)<br />
• ipathbug-helper (see “ipathbug-helper” on page I-10)<br />
• ipath_checkout (see “ipath_checkout” on page I-10)<br />
• ipath_control (see “ipath_control” on page I-12)<br />
• mpirun (see “mpirun” on page I-15)<br />
• rpm (see “rpm” on page I-16)<br />
• strings (see “strings” on page I-17)<br />
NOTE:<br />
Run these tools to gather information before reporting problems and<br />
requesting support.<br />
Adapter and Other Settings<br />
The following adapter and other settings can be adjusted for better performance.<br />
• Use an InfiniBand MTU of 4096 bytes instead of 2048 bytes, if available,<br />
with the QLE7340, QLE7342, QLE7240, QLE7280, and QLE7140. 4K<br />
MTU is enabled in the TrueScale driver by default. To change this setting for<br />
the driver, see “Changing the MTU Size” on page 3-21.<br />
• Use a PCIe Max Read Request size of at least 512 bytes with the<br />
QLE7340, QLE7342, QLE7240 and QLE7280. QLE7240 and QLE7280<br />
adapters can support sizes from 128 bytes to 4096 bytes, in powers of two.<br />
This value is typically set by the BIOS.<br />
For <strong>QLogic</strong> 7300 and 7200 Series <strong>Host</strong> Channel Adapters, to improve peak<br />
IB bandwidth on Nehalem and Harpertown CPU systems, set PCIe<br />
parameters as follows:<br />
• Set PCIe Max Read Request to 4096 bytes.<br />
• Set PCIe Max Payload to 256 bytes by adding, as root, the following<br />
line to the /etc/modprobe.conf file:<br />
options ib_qib pcie_caps=0x51<br />
The above should be sufficient on Intel Nehalem CPUs or newer.<br />
On Intel Harpertown CPUs, it may be beneficial to add a<br />
pcie_coalesce=1 parameter to this line.<br />
On AMD CPUs (PCIe Gen1) no ib_qib parameter changes are<br />
recommended.<br />
Alternatively, these PCIe parameters can also be set in the BIOS on some<br />
systems.<br />
• Use PCIe Max Payload size of 256, where available, with the QLE7340,<br />
QLE7342, QLE7240 and QLE7280. The QLE7240 and QLE7280 adapters<br />
can support 128, 256, or 512 bytes. This value is typically set by the BIOS<br />
as the minimum value supported both by the PCIe card and the PCIe root<br />
complex.<br />
• Make sure that write combining is enabled. The x86 Page Attribute Table<br />
(PAT) mechanism that allocates Write Combining (WC) mappings for the<br />
PIO buffers has been added and is now the default. If PAT is unavailable or<br />
PAT initialization fails for some reason, the code will generate a message in<br />
the log and fall back to the MTRR mechanism. See Appendix H Write<br />
Combining for more information.<br />
• Check the PCIe bus width. If slots have a smaller electrical width than<br />
mechanical width, lower than expected performance may occur. Use this<br />
command to check PCIe Bus width:<br />
$ ipath_control -iv<br />
This command also shows the link speed.<br />
• Experiment with non-default CPU affinity while running<br />
single-process-per-node latency or bandwidth benchmarks. Latency<br />
may be slightly lower when using different CPUs (cores) from the default. On<br />
some chipsets, bandwidth may be higher when run from a non-default CPU<br />
or core. See “Performance Tuning” on page 4-23 for more information on<br />
using taskset with <strong>QLogic</strong> MPI. With another MPI, look at its documentation<br />
to see how to force a benchmark to run with a different CPU affinity than the<br />
default. With OFED micro benchmarks such as from the qperf or perftest<br />
suites, taskset will work for setting CPU affinity.<br />
Turn C-state Off to improve MPI latency (ping-pong) benchmarks on<br />
Nehalem systems. In the BIOS, look for advanced CPU settings, and set the<br />
C-State parameter to "disable."<br />
• Allocate all chip resources to a single port on a dual port card. The<br />
singleport parameter, when set to a non-zero value at driver load, causes<br />
dual port TrueScale cards to act as single port cards, with only InfiniBand<br />
port 1 enabled. The board identification string is not affected (that is, a<br />
QLE7342 with singleport set will still be identified as a QLE7342, not a<br />
QLE7340). The default value of this parameter is 0; however, it may be set<br />
to 1 during the IFS installation process.<br />
By default, the <strong>QLogic</strong> Installation TUI will configure a dual-port card so all<br />
chip resources are directed to port 1 to maximize performance. This also<br />
saves power by not enabling the second port. If you want to utilize both<br />
ports, specify dual ports when asked about single port operation.<br />
Remove Unneeded Services<br />
The cluster administrator can enhance application performance by minimizing the<br />
set of system services running on the compute nodes. Since these are presumed<br />
to be specialized computing appliances, they do not need many of the service<br />
daemons normally running on a general Linux computer.<br />
Following are several groups constituting a minimal necessary set of services.<br />
These are all services controlled by chkconfig. To see the list of services that are<br />
enabled, use the command:<br />
$ /sbin/chkconfig --list | grep -w on<br />
Basic network services are:<br />
• network<br />
• ntpd<br />
• syslog<br />
• xinetd<br />
• sshd<br />
For system housekeeping, use:<br />
• anacron<br />
• atd<br />
• crond<br />
If you are using Network File System (NFS) or yellow pages (yp) passwords:<br />
• rpcidmapd<br />
• ypbind<br />
• portmap<br />
• nfs<br />
• nfslock<br />
• autofs<br />
To watch for disk problems, use:<br />
• smartd<br />
• readahead<br />
The service comprising the TrueScale driver and SMA is:<br />
• openibd<br />
Other services may be required by your batch queuing system or user community.<br />
If your system is running the daemon irqbalance, <strong>QLogic</strong> recommends turning it<br />
off. Disabling irqbalance will enable more consistent performance with programs<br />
that use interrupts. Use this command:<br />
# /sbin/chkconfig irqbalance off<br />
See “Erratic Performance” on page F-9 for more information.<br />
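The service review described above can be sketched as a small loop that flags enabled services outside a site allowlist. The service names below are examples only, not a definitive list for any distribution:

```shell
# ALLOWED: a site allowlist built from the groups above. ENABLED would
# normally come from: /sbin/chkconfig --list | grep -w on
# (hard-coded here so the sketch is deterministic).
ALLOWED="network ntpd syslog xinetd sshd anacron atd crond openibd"
ENABLED="network sshd crond irqbalance openibd"
for svc in $ENABLED; do
  case " $ALLOWED " in
    *" $svc "*) ;;                          # on the allowlist, keep it
    *) echo "consider disabling: $svc" ;;   # candidate for: chkconfig <svc> off
  esac
done
```

With the sample lists above, the loop flags only irqbalance.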
Disable Powersaving Features<br />
If you are running benchmarks or large numbers of short jobs, it is beneficial to<br />
disable the powersaving features, since these features may be slow to respond to<br />
changes in system load.<br />
For RHEL4 and RHEL5, run this command as a root user:<br />
# /sbin/chkconfig --level 12345 cpuspeed off<br />
For SLES 10 and SLES 11, run this command as a root user:<br />
# /sbin/chkconfig --level 12345 powersaved off<br />
After running either of these commands, reboot the system for the changes to<br />
take effect.<br />
Hyper-Threading<br />
If you are using Intel NetBurst® processors that support Hyper-Threading, <strong>QLogic</strong><br />
recommends turning off Hyper-Threading in the BIOS, which will provide more<br />
consistent performance. You can check and adjust this setting using the BIOS<br />
Setup utility. For specific instructions, follow the hardware documentation that<br />
came with your system.<br />
<strong>Host</strong> Environment Setup for MPI<br />
After the <strong>QLogic</strong> <strong>OFED+</strong> <strong>Host</strong> software and the GNU (GCC) compilers have been<br />
installed on all the nodes, the host environment can be set up for running MPI<br />
programs.<br />
Configuring for ssh<br />
Running MPI programs with the command mpirun on a TrueScale cluster<br />
depends, by default, on secure shell ssh to launch node programs on the nodes.<br />
In <strong>QLogic</strong> MPI, mpirun uses the secure shell command ssh to start instances of<br />
the given MPI program on the remote compute nodes without the need for<br />
interactive password entry on every node.<br />
To use ssh, you must have generated Rivest, Shamir, Adleman (RSA) or Digital<br />
Signal Algorithm (DSA) keys, public and private. The public keys must be<br />
distributed and stored on all the compute nodes so that connections to the remote<br />
machines can be established without supplying a password.<br />
You or your administrator must set up the ssh keys and associated files on the<br />
cluster. There are two methods for setting up ssh on your cluster. The first<br />
method, the shosts.equiv mechanism, is typically set up by the cluster<br />
administrator. The second method, using ssh-agent, is more easily<br />
accomplished by an individual user.<br />
NOTE:<br />
• rsh can be used instead of ssh. To use rsh, set the environment<br />
variable MPI_SHELL=rsh. See “Environment Variables” on page 4-20 for<br />
information on setting environment variables. Also see “Shell Options”<br />
on page A-6 for information on setting shell options in mpirun.<br />
• rsh has a limit on the number of concurrent connections it can have,<br />
typically 255, which may limit its use on larger clusters.<br />
Configuring ssh and sshd Using shosts.equiv<br />
This section describes how the cluster administrator can set up ssh and sshd<br />
through the shosts.equiv mechanism. This method is recommended, provided<br />
that your cluster is behind a firewall and accessible only to trusted users.<br />
“Configuring for ssh Using ssh-agent” on page 3-32 shows how an individual user<br />
can accomplish the same thing using ssh-agent.<br />
The example in this section assumes the following:<br />
• Both the cluster nodes and the front end system are running the openssh<br />
package as distributed in current Linux systems.<br />
• All cluster end users have accounts with the same account name on the<br />
front end and on each node, by using Network Information Service (NIS) or<br />
another means of distributing the password file.<br />
• The front end used in this example is called ip-fe.<br />
• Root or superuser access is required on ip-fe and on each node to<br />
configure ssh.<br />
• ssh, including the host’s key, has already been configured on the system<br />
ip-fe. See the sshd and ssh-keygen man pages for more information.<br />
To use shosts.equiv to configure ssh and sshd:<br />
1. On the system ip-fe (the front end node), change the<br />
/etc/ssh/ssh_config file to allow host-based authentication. Specifically,<br />
this file must contain the following four lines, all set to yes. If the lines are<br />
already there but commented out (with an initial #), remove the #.<br />
RhostsAuthentication yes<br />
RhostsRSAAuthentication yes<br />
HostbasedAuthentication yes<br />
EnableSSHKeysign yes<br />
2. On each of the TrueScale node systems, create or edit the file<br />
/etc/ssh/shosts.equiv, adding the name of the front end system. Add the<br />
line:<br />
ip-fe<br />
Change the file to mode 600 when you are finished editing.<br />
3. On each of the TrueScale node systems, create or edit the file<br />
/etc/ssh/ssh_known_hosts. You will need to copy the contents of the file<br />
/etc/ssh/ssh_host_dsa_key.pub from ip-fe to this file (as a single line),<br />
and then edit that line to insert ip-fe ssh-dss at the beginning of the line.<br />
This is very similar to the standard known_hosts file for ssh. An example<br />
line might look like this (displayed as multiple lines, but a single line in the<br />
file):<br />
ip-fe ssh-dss<br />
AAzAB3NzaC1kc3MAAACBAPoyES6+Akk+z3RfCkEHCkmYuYzqL2+1nwo4LeTVW<br />
pCD1QsvrYRmpsfwpzYLXiSJdZSA8hfePWmMfrkvAAk4ueN8L3ZT4QfCTwqvHV<br />
vSctpibf8n<br />
aUmzloovBndOX9TIHyP/Ljfzzep4wL17+5hr1AHXldzrmgeEKp6ect1wxAAAA<br />
FQDR56dAKFA4WgAiRmUJailtLFp8swAAAIBB1yrhF5P0jO+vpSnZrvrHa0Ok+<br />
Y9apeJp3sessee30NlqKbJqWj5DOoRejr2VfTxZROf8LKuOY8tD6I59I0vlcQ<br />
812E5iw1GCZfNefBmWbegWVKFwGlNbqBnZK7kDRLSOKQtuhYbGPcrVlSjuVps<br />
fWEju64FTqKEetA8l8QEgAAAIBNtPDDwdmXRvDyc0gvAm6lPOIsRLmgmdgKXT<br />
GOZUZ0zwxSL7GP1nEyFk9wAxCrXv3xPKxQaezQKs+KL95FouJvJ4qrSxxHdd1<br />
NYNR0DavEBVQgCaspgWvWQ8cL<br />
0aUQmTbggLrtD9zETVU5PCgRlQL6I3Y5sCCHuO7/UvTH9nneCg==<br />
Change the file to mode 600 when you are finished editing.<br />
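Step 3 can be sketched as follows. A scratch file stands in for /etc/ssh/ssh_known_hosts, and the key material is a placeholder for the contents copied from ip-fe:

```shell
# Build the ssh_known_hosts entry: front-end name, key type, then the key
# material copied from ip-fe's /etc/ssh/ssh_host_dsa_key.pub (placeholder here).
KNOWN_HOSTS=$(mktemp)                    # stand-in for /etc/ssh/ssh_known_hosts
KEY_MATERIAL='AAzAB3Nz...placeholder...' # real base64 key material goes here
printf 'ip-fe ssh-dss %s\n' "$KEY_MATERIAL" >> "$KNOWN_HOSTS"
chmod 600 "$KNOWN_HOSTS"                 # step 3 also requires mode 600
grep -c '^ip-fe ssh-dss ' "$KNOWN_HOSTS"
```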
4. On each node, the system file /etc/ssh/sshd_config must be edited, so<br />
that the following four lines are uncommented (no # at the start of the line)<br />
and set to yes. (These lines are usually there, but are commented out and<br />
set to no by default.)<br />
RhostsAuthentication yes<br />
RhostsRSAAuthentication yes<br />
HostbasedAuthentication yes<br />
PAMAuthenticationViaKbdInt yes<br />
5. After creating or editing the three files in Steps 2, 3, and 4, sshd must be<br />
restarted on each system. If you are already logged in via ssh (or any other<br />
user is logged in via ssh), their sessions or programs will be terminated, so<br />
restart only on idle nodes. Type the following (as root) to notify sshd to use<br />
the new configuration files:<br />
# killall -HUP sshd<br />
NOTE:<br />
This command terminates all ssh sessions into that system. Run from<br />
the console, or have a way to log into the console in case of any<br />
problem.<br />
At this point, any end user should be able to login to the ip-fe front end system<br />
and use ssh to login to any TrueScale node without being prompted for a<br />
password or pass phrase.<br />
Configuring for ssh Using ssh-agent<br />
The ssh-agent, a daemon that caches decrypted private keys, can be used to<br />
store the keys. Use ssh-add to add your private keys to ssh-agent’s cache.<br />
When ssh establishes a new connection, it communicates with ssh-agent to<br />
acquire these keys, rather than prompting you for a passphrase.<br />
The process is described in the following steps:<br />
1. Create a key pair. Use the default file name, and be sure to enter a<br />
passphrase.<br />
$ ssh-keygen -t rsa<br />
2. Enter a passphrase for your key pair when prompted. Note that the key<br />
agent does not survive X11 logout or system reboot:<br />
$ ssh-add<br />
3. The following command tells ssh that your key pair should let you in:<br />
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys<br />
Edit the ~/.ssh/config file so that it reads like this:<br />
Host *<br />
ForwardAgent yes<br />
ForwardX11 yes<br />
CheckHostIP no<br />
StrictHostKeyChecking no<br />
This file forwards the key agent requests back to your desktop. When you<br />
log into a front end node, you can use ssh to compute nodes without<br />
passwords.<br />
4. Follow your administrator’s cluster policy for setting up ssh-agent on the<br />
machine where you will be running ssh commands. Alternatively, you can<br />
start the ssh-agent by adding the following line to your ~/.bash_profile<br />
(or equivalent in another shell):<br />
eval `ssh-agent`<br />
Use back quotes rather than single quotes. Programs started in your login<br />
shell can then locate the ssh-agent and query it for keys.<br />
5. Finally, test by logging into the front end node, and from the front end node<br />
to a compute node, as follows:<br />
$ ssh frontend_node_name<br />
$ ssh compute_node_name<br />
For more information, see the man pages for ssh(1), ssh-keygen(1),<br />
ssh-add(1), and ssh-agent(1).<br />
Process Limitation with ssh<br />
Process limitation with ssh is primarily an issue when using the mpirun option<br />
-distributed=off. The default setting is now -distributed=on; therefore, in<br />
most cases, ssh process limitations will not be encountered. This limitation for the<br />
-distributed=off case is described in the following paragraph. See “Process<br />
Limitation with ssh” on page F-19 for an example of an error message associated<br />
with this limitation.<br />
MPI jobs that use more than 10 processes per node may encounter an ssh<br />
throttling mechanism that limits the number of concurrent per-node connections<br />
to 10. If you need to use more processes, you or your system administrator must<br />
increase the value of MaxStartups in your /etc/ssh/sshd_config file.<br />
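Raising the throttle might look like the following sketch. A scratch file stands in for /etc/ssh/sshd_config, and 32 is an illustrative value; sshd must be restarted afterward for the change to take effect:

```shell
# Raise the ssh concurrent-connection limit (illustrative value).
SSHD_CONFIG=${SSHD_CONFIG:-$(mktemp)}   # stand-in for /etc/ssh/sshd_config
echo 'MaxStartups 32' >> "$SSHD_CONFIG"
grep '^MaxStartups' "$SSHD_CONFIG"
```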
Checking Cluster and <strong>Software</strong> Status<br />
ipath_control<br />
InfiniBand status, link speed, and PCIe bus width can be checked by running the<br />
program ipath_control. Sample usage and output are as follows:<br />
$ ipath_control -iv<br />
$Id: <strong>QLogic</strong> OFED Release 1.4.2 $ $Date: 2009-03-10-10:15 $<br />
0: Version: ChipABI 2.0, InfiniPath_QLE7280, InfiniPath1 5.2,<br />
PCI 2, SW Compat 2<br />
0: Status: 0xe1 Initted Present IB_link_up IB_configured<br />
0: LID=0x1f MLID=0xc042 GUID=00:11:75:00:00:ff:89:a6 Serial:<br />
AIB0810A30297<br />
0: HRTBT:Auto RX_polarity_invert:Auto RX_lane_reversal: Auto<br />
0: LinkWidth:4X of 1X|4X Speed:DDR of SDR|DDR<br />
0: LocalBus: PCIe,2500MHz,x16<br />
iba_opp_query<br />
iba_opp_query is used to check the operation of the Distributed SA. You can run it<br />
from any node to verify that the replica on that node is working correctly. See<br />
“iba_opp_query” on page I-4 for detailed usage information.<br />
# iba_opp_query --slid 0x31 --dlid 0x75 --sid 0x107<br />
Query Parameters:<br />
resv1       0x0000000000000107<br />
dgid        ::<br />
sgid        ::<br />
dlid        0x75<br />
slid        0x31<br />
hop         0x0<br />
flow        0x0<br />
tclass      0x0<br />
num_path    0x0<br />
pkey        0x0<br />
qos_class   0x0<br />
sl          0x0<br />
mtu         0x0<br />
rate        0x0<br />
pkt_life    0x0<br />
preference  0x0<br />
resv2       0x0<br />
resv3       0x0<br />
Using HCA qib0<br />
Result:<br />
resv1       0x0000000000000107<br />
dgid        fe80::11:7500:79:e54a<br />
sgid        fe80::11:7500:79:e416<br />
dlid        0x75<br />
slid        0x31<br />
hop         0x0<br />
flow        0x0<br />
tclass      0x0<br />
num_path    0x0<br />
pkey        0xffff<br />
qos_class   0x0<br />
sl          0x1<br />
mtu         0x4<br />
rate        0x6<br />
pkt_life    0x10<br />
preference  0x0<br />
resv2       0x0<br />
resv3       0x0<br />
ibstatus<br />
Another useful program is ibstatus. Sample usage and output are as follows:<br />
$ ibstatus<br />
Infiniband device 'qib0' port 1 status:<br />
default gid:  fe80:0000:0000:0000:0011:7500:00ff:89a6<br />
base lid:     0x1f<br />
sm lid:       0x1<br />
state:        4: ACTIVE<br />
phys state:   5: LinkUp<br />
rate:         20 Gb/sec (4X DDR)<br />
ibv_devinfo<br />
ibv_devinfo queries RDMA devices. Use the -v option to see more information.<br />
Sample usage:<br />
$ ibv_devinfo<br />
hca_id: qib0<br />
fw_ver:          0.0.0<br />
node_guid:       0011:7500:00ff:89a6<br />
sys_image_guid:  0011:7500:00ff:89a6<br />
vendor_id:       0x1175<br />
vendor_part_id:  29216<br />
hw_ver:          0x2<br />
board_id:        InfiniPath_QLE7280<br />
phys_port_cnt:   1<br />
port: 1<br />
state:       PORT_ACTIVE (4)<br />
max_mtu:     4096 (5)<br />
active_mtu:  4096 (5)<br />
sm_lid:      1<br />
port_lid:    31<br />
port_lmc:    0x00<br />
ipath_checkout<br />
ipath_checkout is a bash script that verifies that the installation is correct and<br />
that all the nodes of the network are functioning and mutually connected by the<br />
TrueScale fabric. It must be run on a front end node, and requires specification of<br />
a nodefile. For example:<br />
$ ipath_checkout [options] nodefile<br />
The nodefile lists the hostnames of the nodes of the cluster, one hostname per<br />
line. The format of nodefile is as follows:<br />
hostname1<br />
hostname2<br />
...<br />
For more information on these programs, see “ipath_control” on page I-12,<br />
“ibstatus” on page I-7, and “ipath_checkout” on page I-10.<br />
4 Running <strong>QLogic</strong> MPI on<br />
<strong>QLogic</strong> Adapters<br />
Introduction<br />
<strong>QLogic</strong> MPI<br />
This section provides information on using the <strong>QLogic</strong> Message-Passing Interface<br />
(MPI). Examples are provided for setting up the user environment, and for<br />
compiling and running MPI programs.<br />
The MPI standard is a message-passing library or collection of routines used in<br />
distributed-memory parallel programming. It is used in data exchange and task<br />
synchronization between processes. The goal of MPI is to provide portability and<br />
efficient implementation across different platforms and architectures.<br />
<strong>QLogic</strong>’s implementation of the MPI standard is derived from the MPICH<br />
reference implementation version 1.2.7. The <strong>QLogic</strong> MPI (TrueScale) libraries<br />
have been highly tuned for the <strong>QLogic</strong> interconnect, and will not run over other<br />
interconnects.<br />
<strong>QLogic</strong> MPI is an implementation of the original MPI 1.2 standard. The MPI-2<br />
standard provides several enhancements of the original standard. Of the MPI-2<br />
features, <strong>QLogic</strong> MPI includes only the MPI-IO features implemented in ROMIO<br />
version 1.2.6 and the generalized MPI_Alltoallw communication exchange.<br />
The <strong>QLogic</strong> MPI implementation in this release supports hybrid MPI/OpenMP and<br />
other multi-threaded programs, as long as only one thread uses MPI. For more<br />
information, see “<strong>QLogic</strong> MPI and Hybrid MPI/OpenMP Applications” on<br />
page 4-25.<br />
PSM<br />
The PSM TrueScale Messaging API, or PSM API, is <strong>QLogic</strong>'s low-level user-level<br />
communications interface for the TrueScale family of products. Other than using<br />
some environment variables with the PSM prefix, MPI users typically need not<br />
interact directly with PSM. The PSM environment variables apply to other MPI<br />
implementations as long as the environment with the PSM variables is correctly<br />
forwarded. See “Environment Variables” on page 4-20 for a summary of the<br />
commonly used environment variables.<br />
For more information on PSM, email <strong>QLogic</strong> at support@qlogic.com.<br />
Other MPIs<br />
In addition to <strong>QLogic</strong> MPI, other high-performance MPIs such as HP-MPI version<br />
2.3, Open MPI version 1.4, Ohio State University MVAPICH version 1.2,<br />
MVAPICH2 version 1.4, and Scali (Platform) MPI, have been ported to the PSM<br />
interface.<br />
Open MPI, MVAPICH, HP-MPI, and Scali also run over InfiniBand Verbs (the<br />
Open Fabrics Alliance API that provides support for user-level upper-layer<br />
protocols like MPI). Intel MPI, although not ported to the PSM interface, is<br />
supported over uDAPL, which uses InfiniBand Verbs. For more information, see<br />
Section 5 Using Other MPIs.<br />
Linux File I/O in MPI Programs<br />
MPI node programs are Linux programs that can execute file I/O operations to<br />
local or remote files in the usual ways through APIs of the language in use.<br />
Remote files are accessed via a network file system, typically NFS. Parallel<br />
programs usually need to have some data in files to be shared by all of the<br />
processes of an MPI job. Node programs can also use non-shared, node-specific<br />
files, such as for scratch storage for intermediate results or for a node’s share of a<br />
distributed database.<br />
There are different ways of handling file I/O of shared data in parallel<br />
programming. One process, typically on the front end node or on a file server,<br />
may be the only process that touches the shared files, passing data to and from<br />
the other processes via MPI messages. Alternately, the shared data files<br />
can be accessed directly by each node program. In this case, the shared files are<br />
available through some network file support, such as NFS. Also, in this case, the<br />
application programmer is responsible for ensuring file consistency, either through<br />
proper use of file locking mechanisms offered by the operating system and the<br />
programming language, such as fcntl in C, or by using MPI synchronization<br />
operations.<br />
MPI-IO with ROMIO<br />
MPI-IO is part of the MPI-2 standard, supporting collective and parallel file I/O<br />
operations. One advantage of using MPI-IO is that it can take care of managing<br />
file locks when file data is shared among nodes.<br />
<strong>QLogic</strong> MPI includes ROMIO version 1.2.6. ROMIO is a high-performance,<br />
portable implementation of MPI-IO from Argonne National Laboratory. ROMIO<br />
includes everything defined in the MPI-2 I/O chapter of the MPI-2 standard except<br />
support for file interoperability and user-defined error handlers for files. Of the<br />
MPI-2 features, <strong>QLogic</strong> MPI includes only the MPI-IO features implemented in<br />
ROMIO version 1.2.6 and the generalized MPI_Alltoallw communication<br />
exchange. See the ROMIO documentation at http://www.mcs.anl.gov/romio for<br />
details.<br />
NFS, PanFS, and local (UFS) support is enabled.<br />
Getting Started with MPI<br />
Copy Examples<br />
This section shows how to compile and run some simple example programs that<br />
are included in the <strong>QLogic</strong> <strong>OFED+</strong> <strong>Host</strong> software product. Compiling and running<br />
these examples enables you to verify that <strong>QLogic</strong> MPI and its components have<br />
been properly installed on the cluster. See “<strong>QLogic</strong> MPI Troubleshooting” on<br />
page F-11 if you have problems compiling or running these examples.<br />
These examples assume that your cluster’s policy allows you to use the mpirun<br />
script directly, without having to submit the job to a batch queuing system.<br />
Start by copying the examples to your working directory:<br />
$ cp /usr/mpi/qlogic/share/mpich/examples/basic/* .<br />
or<br />
$ cp /usr/share/mpich/examples/basic/* .<br />
Create the mpihosts File<br />
Next, create an MPI hosts file in the same working directory. It contains the host<br />
names of the nodes in your cluster that run the examples, with one host name per<br />
line. Name this file mpihosts. The contents can be in the following format:<br />
hostname1<br />
hostname2<br />
...<br />
More details on the mpihosts file can be found in “mpihosts File Details” on<br />
page 4-15.<br />
Compile and Run an Example C Program<br />
In this step you will compile and run your MPI program.<br />
<strong>QLogic</strong> MPI uses some shell scripts to find the appropriate include files and<br />
libraries for each supported language. Use the script mpicc to compile an MPI<br />
program in C and the script mpirun to execute the file.<br />
The supplied example program cpi.c computes an approximation to pi. First,<br />
compile it to an executable named cpi. For example:<br />
$ mpicc -o cpi cpi.c<br />
By default, mpicc runs the GNU gcc compiler and is used for both compiling and<br />
linking, performing the same function as the gcc command.<br />
NOTE:<br />
For information on using other compilers, see “To Use Another Compiler” on<br />
page 4-9.<br />
Then, run the program with several different specifications for the number of<br />
processes:<br />
$ mpirun -np 2 -m mpihosts ./cpi<br />
Process 0 on hostname1<br />
Process 1 on hostname2<br />
pi is approximately 3.1416009869231241,<br />
Error is 0.0000083333333309<br />
wall clock time = 0.000149<br />
In this example, ./cpi designates the executable of the example program in the<br />
working directory. The -np parameter to mpirun defines the number of<br />
processes to be used in the parallel computation. Here is an example with four<br />
processes, using the same two hosts in the mpihosts file:<br />
$ mpirun -np 4 -m mpihosts ./cpi<br />
Process 3 on hostname1<br />
Process 0 on hostname2<br />
Process 2 on hostname2<br />
Process 1 on hostname1<br />
pi is approximately 3.1416009869231249,<br />
Error is 0.0000083333333318<br />
wall clock time = 0.000603<br />
Generally, mpirun tries to distribute the specified number of processes evenly<br />
among the nodes listed in the mpihosts file. However, if the number of<br />
processes exceeds the number of nodes listed in the mpihosts file, then some<br />
nodes will be assigned more than one instance of the program.<br />
When you run the program several times with the same value of the -np<br />
parameter, the output lines may display in different orders. This is because they<br />
are issued by independent asynchronous processes, so their order is<br />
non-deterministic.<br />
Details on other ways of specifying the mpihosts file are provided in “mpihosts<br />
File Details” on page 4-15.<br />
More information on the mpirun options is in “Using mpirun” on page 4-16 and<br />
Appendix A mpirun Options Summary. “Process Allocation” on page 4-11<br />
explains how processes are allocated by using hardware and software contexts.<br />
Examples Using Other Programming Languages<br />
This section gives similar examples for computing pi for Fortran 77 and<br />
Fortran 90. Fortran 95 usage is similar to Fortran 90. The C++ example uses the<br />
traditional “Hello, World” program. All programs are located in the same directory.<br />
fpi.f is a Fortran 77 program that computes pi in a way similar to cpi.c.<br />
Compile and link, and run it as follows:<br />
$ mpif77 -o fpi fpi.f<br />
$ mpirun -np 2 -m mpihosts ./fpi<br />
pi3f90.f90 is a Fortran 90 program that does the same computation. Compile<br />
and link, and run it as follows:<br />
$ mpif90 -o pi3f90 pi3f90.f90<br />
$ mpirun -np 2 -m mpihosts ./pi3f90<br />
The C++ program hello++.cc is a parallel processing version of the traditional<br />
“Hello, World” program. Notice that this version makes use of the external C<br />
bindings of the MPI functions if the C++ bindings are not present.<br />
Compile and run it as follows:<br />
$ mpicxx -o hello hello++.cc<br />
$ mpirun -np 10 -m mpihosts ./hello<br />
Hello World! I am 9 of 10<br />
Hello World! I am 2 of 10<br />
Hello World! I am 4 of 10<br />
Hello World! I am 1 of 10<br />
Hello World! I am 7 of 10<br />
Hello World! I am 6 of 10<br />
Hello World! I am 3 of 10<br />
Hello World! I am 0 of 10<br />
Hello World! I am 5 of 10<br />
Hello World! I am 8 of 10<br />
Each of the scripts invokes the GNU compiler for the respective language and the<br />
linker. See “To Use Another Compiler” on page 4-9 for an example of how to use<br />
other compilers. The use of mpirun is the same for programs in all languages.<br />
<strong>QLogic</strong> MPI Details<br />
The following sections provide more details on the use of <strong>QLogic</strong> MPI. These<br />
sections assume that you are familiar with standard MPI. For more information,<br />
see the references in “References for MPI” on page J-1. This implementation<br />
includes the man pages from the MPICH implementation for the numerous MPI<br />
functions.<br />
Use Wrapper Scripts for Compiling and Linking<br />
The scripts in Table 4-1 invoke the compiler and linker for programs in each of the<br />
respective languages, and take care of referring to the correct include files and<br />
libraries in each case.<br />
Table 4-1. <strong>QLogic</strong> MPI Wrapper Scripts<br />
Wrapper Script Name   Language<br />
mpicc                 C<br />
mpicxx                C++<br />
mpif77                Fortran 77<br />
mpif90                Fortran 90<br />
mpif95                Fortran 95<br />
On x86_64, these scripts (by default) call the GNU compiler and linker. To use<br />
other compilers, see “To Use Another Compiler” on page 4-9.<br />
These scripts all provide the command line options listed in Table 4-2.<br />
Table 4-2. Command Line Options for Scripts<br />
Command         Meaning<br />
-help           Provides help<br />
-show           Lists each of the compiling and linking commands that would be<br />
                called without actually calling them<br />
-echo           Gets verbose output of all the commands in the script<br />
-compile_info   Shows how to compile a program<br />
-link_info      Shows how to link a program<br />
In addition, each of these scripts allows a command line option for specifying a<br />
different compiler/linker as an alternative to the GNU Compiler Collection (GCC).<br />
For more information, see “To Use Another Compiler” on page 4-9.<br />
Most other command line options are passed on to the invoked compiler and<br />
linker. The GNU compiler and alternative compilers all accept numerous<br />
command line options. See the GCC compiler documentation and the man pages<br />
for gcc and gfortran for complete information on available options. See the<br />
corresponding documentation for any other compiler/linker you may call for its<br />
options. Man pages for mpif90(1), mpif77(1), mpicc(1), and mpiCC(1) are<br />
available.<br />
Configuring MPI Programs for <strong>QLogic</strong> MPI<br />
When configuring an MPI program (generating header files and/or Makefiles) for<br />
<strong>QLogic</strong> MPI, you usually need to specify mpicc, mpicxx, and so on as the<br />
compiler, rather than gcc, g++, etc.<br />
Specifying the compiler is typically done with commands similar to the following,<br />
assuming that you are using sh or bash as the shell:<br />
$ export CC=mpicc<br />
$ export CXX=mpicxx<br />
$ export F77=mpif77<br />
$ export F90=mpif90<br />
$ export F95=mpif95<br />
The shell variables will vary with the program being configured, but these<br />
examples show frequently used variable names. If you use csh, use commands<br />
similar to the following:<br />
$ setenv CC mpicc<br />
You may need to pass arguments to configure directly, for example:<br />
$ ./configure -cc=mpicc -fc=mpif77 -c++=mpicxx<br />
-c++linker=mpicxx<br />
You may also need to edit a Makefile to achieve this result, adding lines similar to:<br />
CC=mpicc<br />
F77=mpif77<br />
F90=mpif90<br />
F95=mpif95<br />
CXX=mpicxx<br />
In some cases, the configuration process may specify the linker. <strong>QLogic</strong><br />
recommends that the linker be specified as mpicc, mpif90, etc. in these cases.<br />
This specification automatically includes the correct flags and libraries, rather than<br />
trying to configure to pass the flags and libraries explicitly. For example:<br />
LD=mpicc<br />
LD=mpif90<br />
These scripts pass appropriate options to the various compiler passes to include<br />
header files, required libraries, etc. While the same effect can be achieved by<br />
passing the arguments explicitly as flags, the required arguments may vary from<br />
release to release, so it is good practice to use the provided scripts.<br />
To Use Another Compiler<br />
<strong>QLogic</strong> MPI, and all other MPIs that run on TrueScale, support a number of<br />
compilers in addition to the default GNU Compiler Collection (GCC versions 3.3<br />
and later, including gcc, g++, and gfortran; g77 is not supported). These include<br />
the PathScale Compiler Suite 3.0, 3.1, and 3.2; PGI 5.2, 6.0, 7.1, 8.0, and 9.0;<br />
and Intel 9.x, 10.1, and 11.x.<br />
NOTE:<br />
The PathScale compiler suite is not supported on the RHEL 4 U8 or the<br />
SLES 11 distribution.<br />
These compilers can be invoked on the command line by passing options to the<br />
wrapper scripts. Command line options override environment variables, if set.<br />
Tables 4-3, 4-4, and 4-5 show the options for each of the compilers.<br />
In each case, ..... stands for the remaining options to the script: the options to the<br />
compiler in question, and the names of the files on which it operates.<br />
Table 4-3. Intel<br />
Compiler        Command<br />
C               $ mpicc -cc=icc .....<br />
C++             $ mpicc -CC=icpc .....<br />
Fortran 77      $ mpif77 -fc=ifort .....<br />
Fortran 90/95   $ mpif90 -f90=ifort .....<br />
                $ mpif95 -f95=ifort .....<br />
Table 4-4. Portland Group (PGI)<br />
Compiler        Command<br />
C               mpicc -cc=pgcc .....<br />
C++             mpicc -CC=pgCC .....<br />
Fortran 77      mpif77 -fc=pgf77 .....<br />
Fortran 90/95   mpif90 -f90=pgf90 .....<br />
                mpif95 -f95=pgf95 .....<br />
Table 4-5. PathScale Compiler Suite<br />
Compiler        Command<br />
C               mpicc -cc=pathcc .....<br />
C++             mpicc -CC=pathCC .....<br />
Fortran 77      mpif77 -fc=pathf95 .....<br />
Fortran 90/95   mpif90 -f90=pathf95 .....<br />
                mpif90 -f95=pathf95 .....<br />
NOTE: pathf95 invokes the Fortran 77, Fortran 90, and Fortran 95 compilers.<br />
Also, use mpif77, mpif90, or mpif95 for linking; otherwise, .true. may have<br />
the wrong value.<br />
If you are not using the provided scripts for linking, link a sample program using<br />
the -show option as a test (without the actual build) to see what libraries to add to<br />
your link line. Some examples of using the PGI compilers follow.<br />
For Fortran 90 programs:<br />
$ mpif90 -f90=pgf90 -show pi3f90.f90 -o pi3f90<br />
pgf90 -I/usr/include/mpich/pgi5/x86_64 -c -I/usr/include<br />
pi3f90.f90 -c<br />
pgf90 pi3f90.o -o pi3f90 -lmpichf90 -lmpich -lmpichabiglue_pgi5<br />
Fortran 95 programs will be similar to the above.<br />
For C programs:<br />
$ mpicc -cc=pgcc -show cpi.c<br />
pgcc -c cpi.c<br />
pgcc cpi.o -lmpich -lpgftnrtl -lmpichabiglue_pgi5<br />
Compiler and Linker Variables<br />
When you use environment variables (e.g., $MPICH_CC) to select the compiler<br />
mpicc (and others) will use, the scripts will also set the matching linker variable<br />
(for example, $MPICH_CLINKER), if it is not already set. When both the<br />
environment variable and a command line option are used (e.g., -cc=gcc), the<br />
command line option takes precedence.<br />
When both the compiler and linker variables are set, and they do not match for the<br />
compiler you are using, the MPI program may fail to link; or, if it links, it may not<br />
execute correctly. For a sample error message, see “Compiler/Linker Mismatch”<br />
on page F-14.<br />
Process Allocation<br />
Normally MPI jobs are run with each node program (process) being associated<br />
with a dedicated <strong>QLogic</strong> InfiniBand adapter hardware context that is mapped to a<br />
CPU.<br />
If the number of node programs is greater than the available number of hardware<br />
contexts, software context sharing increases the number of node programs that<br />
can be run. Each adapter supports four software contexts per hardware context,<br />
so up to four node programs (from the same MPI job) can share that hardware<br />
context. There is a small additional overhead for each shared context.<br />
Table 4-6 shows the maximum number of contexts available for each adapter.<br />
Table 4-6. Available Hardware and <strong>Software</strong> Contexts<br />
Adapter            Available Hardware Contexts          Available Contexts when <strong>Software</strong><br />
                   (same as number of supported CPUs)   Context Sharing is Enabled<br />
QLE7140            4                                    16<br />
QLE7240/QLE7280    16                                   64<br />
QLE7342/QLE7340    16                                   64<br />
The default hardware context/CPU mappings can be changed on the TrueScale<br />
DDR and QDR InfiniBand Adapters (QLE72x0 and QLE734x). See “TrueScale<br />
Hardware Contexts on the DDR and QDR InfiniBand Adapters” on page 4-12 for<br />
more details.<br />
Context sharing is enabled by default. How the system behaves when context<br />
sharing is enabled or disabled is described in “Enabling and Disabling <strong>Software</strong><br />
Context Sharing” on page 4-13.<br />
When running a job in a batch system environment where multiple jobs may be<br />
running simultaneously, it is useful to restrict the number of TrueScale contexts<br />
that are made available on each node of an MPI job. See “Restricting TrueScale<br />
Hardware Contexts in a Batch Environment” on page 4-13.<br />
Errors that may occur with context sharing are covered in “Context Sharing Error<br />
Messages” on page 4-14.<br />
There are multiple ways of specifying how processes are allocated. You can use<br />
the mpihosts file, the -np and -ppn options with mpirun, and the<br />
MPI_NPROCS and PSM_SHAREDCONTEXTS_MAX environment variables. How<br />
to set these is covered later in this document.<br />
TrueScale Hardware Contexts on the DDR and QDR<br />
InfiniBand Adapters<br />
On the QLE7240 and QLE7280 DDR adapters, adapter receive resources are<br />
statically partitioned across the TrueScale contexts according to the number of<br />
TrueScale contexts enabled. The following defaults are automatically set<br />
according to the number of online CPUs in the node:<br />
For four or fewer CPUs: 5 (4 + 1 for kernel)<br />
For five to eight CPUs: 9 (8 + 1 for kernel)<br />
For nine or more CPUs: 17 (16 + 1 for kernel)<br />
On the QLE7340 and QLE7342 QDR adapters, adapter receive resources are<br />
statically partitioned across the TrueScale contexts according to the number of<br />
TrueScale contexts enabled. The following defaults are automatically set<br />
according to the number of online CPUs in the node:<br />
For four or fewer CPUs: 6 (4 + 2)<br />
For five to eight CPUs: 10 (8 + 2)<br />
For nine or more CPUs: 18 (16 + 2)<br />
The two additional contexts on QDR adapters support the kernel, one on each<br />
port.<br />
Performance can be improved in some cases by disabling TrueScale hardware<br />
contexts when they are not required so that the resources can be partitioned more<br />
effectively.<br />
To do this, explicitly configure the number of contexts you want to use with<br />
the cfgctxts module parameter in the file /etc/modprobe.conf (or<br />
/etc/modprobe.conf.local on SLES).<br />
The maximum that can be set is 17 on DDR InfiniBand Adapters and 18 on QDR<br />
InfiniBand Adapters.<br />
The driver must be restarted if this default is changed. See “Managing the<br />
TrueScale Driver” on page 3-22.<br />
NOTE:<br />
In rare cases, setting contexts automatically on DDR and QDR InfiniBand<br />
Adapters can lead to sub-optimal performance where one or more<br />
TrueScale hardware contexts have been disabled and a job is run that<br />
requires software context sharing. Since the algorithm ensures that there is<br />
at least one TrueScale context per online CPU, this case occurs only if the<br />
CPUs are over-subscribed with processes (which is not normally<br />
recommended). In this case, it is best to override the default to use as many<br />
TrueScale contexts as are available, which minimizes the amount of<br />
software context sharing required.<br />
Enabling and Disabling <strong>Software</strong> Context Sharing<br />
By default, context sharing is enabled; it can also be specifically disabled.<br />
Context Sharing Enabled: The MPI library provides PSM the local process<br />
layout so that TrueScale contexts available on each node can be shared if<br />
necessary; for example, when running more node programs than contexts. All<br />
PSM jobs assume that they can make use of all available TrueScale contexts to<br />
satisfy the job requirement and try to give a context to each process.<br />
When context sharing is enabled on a system with multiple <strong>QLogic</strong> adapter<br />
(TrueScale) boards (units) and the IPATH_UNIT environment variable is set, the<br />
number of TrueScale contexts made available to MPI jobs is restricted to the<br />
number of contexts available on that unit. When multiple TrueScale devices are<br />
present, it restricts the use to a specific TrueScale unit. By default, all configured<br />
units are used in round robin order.<br />
Context Sharing Disabled: Each node program tries to obtain exclusive access<br />
to a TrueScale hardware context. If no hardware contexts are available, the job<br />
aborts.<br />
To explicitly disable context sharing, set this environment variable in one of the<br />
two following ways:<br />
PSM_SHAREDCONTEXTS=0<br />
PSM_SHAREDCONTEXTS=NO<br />
The default value of PSM_SHAREDCONTEXTS is 1 (enabled).<br />
Restricting TrueScale Hardware Contexts<br />
in a Batch Environment<br />
If required for resource sharing between multiple jobs in batch systems, you can<br />
restrict the number of TrueScale hardware contexts that are made available on<br />
each node of an MPI job by setting that number in the<br />
PSM_SHAREDCONTEXTS_MAX environment variable.<br />
For example, if you are running two different jobs on nodes using the QLE7140,<br />
set PSM_SHAREDCONTEXTS_MAX to 2 instead of the default 4. Each job would<br />
then have at most two of the four available hardware contexts. Both of the jobs<br />
that want to share a node would have to set PSM_SHAREDCONTEXTS_MAX=2 on<br />
that node before sharing begins.<br />
However, note that setting PSM_SHAREDCONTEXTS_MAX=2 as a clusterwide<br />
default would unnecessarily penalize nodes that are dedicated to running single<br />
jobs. So a per-node setting, or some level of coordination with the job scheduler<br />
with setting the environment variable, is recommended.<br />
If some nodes have more cores than others, then the setting must be adjusted<br />
properly for the number of cores on each node.<br />
Additionally, you can explicitly configure the number of contexts you want to use<br />
with the cfgctxts module parameter. This overrides the default settings<br />
(on the QLE7240 and QLE7280) based on the number of CPUs present on each<br />
node. See “TrueScale Hardware Contexts on the DDR and QDR InfiniBand<br />
Adapters” on page 4-12.<br />
Context Sharing Error Messages<br />
The error message when the context limit is exceeded is:<br />
No free InfiniPath contexts available on /dev/ipath<br />
This message appears when the application starts.<br />
Error messages related to contexts may also be generated by ipath_checkout<br />
or mpirun. For example:<br />
PSM found 0 available contexts on InfiniPath device<br />
The most likely cause is that the cluster has processes using all the available<br />
PSM contexts. Clean up these processes before restarting the job.<br />
Running in Shared Memory Mode<br />
<strong>QLogic</strong> MPI supports running exclusively in shared memory mode; no <strong>QLogic</strong><br />
adapter is required for this mode of operation. This mode is used for running<br />
applications on a single node rather than on a cluster of nodes.<br />
To enable shared memory mode, use either a single node in the mpihosts file or<br />
use these options with mpirun:<br />
$ mpirun -np=N -ppn=N<br />
N needs to be equal in both cases.<br />
NOTE:<br />
For this release, N must be ≤ 64.<br />
When you are using a non-<strong>QLogic</strong> MPI that uses the TrueScale PSM layer,<br />
ensure that the parallel job is contained on a single node and set the<br />
PSM_DEVICES environment variable:<br />
PSM_DEVICES="shm,self"<br />
If you are using <strong>QLogic</strong> MPI, you do not need to set this environment variable; it is<br />
set automatically if np == ppn.<br />
When running on a single node with <strong>QLogic</strong> MPI, no InfiniBand adapter hardware<br />
is required if -disable-dev-check is passed to mpirun.<br />
mpihosts File Details<br />
As noted in “Create the mpihosts File” on page 4-3, an mpihosts file (also called<br />
a machines file, nodefile, or hostsfile) has been created in your current working<br />
directory. This file names the nodes on which the node programs may run.<br />
The two supported formats for the mpihosts file are:<br />
hostname1<br />
hostname2<br />
...<br />
or<br />
hostname1:process_count<br />
hostname2:process_count<br />
...<br />
In the first format, if the -np count (number of processes to spawn in the mpirun<br />
command) is greater than the number of lines in the machine file, the hostnames<br />
will be repeated (in order) as many times as necessary for the requested number<br />
of node programs.<br />
In the second format, process_count can be different for each host, and is<br />
normally the number of available processors on the node. When not specified, the<br />
default value is one. The value of process_count determines how many node<br />
programs will be started on that host before using the next entry in the mpihosts<br />
file. When the full mpihosts file is processed, and there are additional processes<br />
requested, processing starts again at the start of the file.<br />
NOTE:<br />
To create an mpihosts file, use the ibhosts program. It will generate a<br />
list of available nodes that are already connected to the switch.<br />
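The round-robin placement rules described above can be modeled with a short shell function. This is an illustrative sketch of the documented behavior, not mpirun's actual implementation; the function name place is hypothetical.<br />

```shell
# Simulate the documented round-robin placement for an mpihosts file in
# hostname[:process_count] format (process_count defaults to one).
# Usage: place <np> <mpihosts-file>; prints one "rank host" line per process.
place() {
    awk -v np="$1" '
        { n = split($0, f, ":"); hosts[NR] = f[1]; cnt[NR] = (n > 1 ? f[2] : 1) }
        END {
            if (NR == 0) exit 1                 # empty hosts file
            rank = 0
            while (rank < np)                   # cycle through the file until
                for (i = 1; i <= NR && rank < np; i++)    # np processes are placed
                    for (j = 0; j < cnt[i] && rank < np; j++)
                        print rank++, hosts[i]
        }' "$2"
}
```

For example, with a file containing node1:2 and node2, place 5 assigns ranks 0 and 1 to node1, rank 2 to node2, and then wraps around to place ranks 3 and 4 on node1.<br />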
There are several alternative ways of specifying the mpihosts file:<br />
• As noted in “Compile and Run an Example C Program” on page 4-4, you<br />
can use the command line option -m:<br />
$ mpirun -np n -m mpihosts [other options] program-name<br />
In this case, if the named file cannot be opened, the MPI job fails.<br />
An alternate mechanism to -m for specifying hosts is the -H (or -hosts)<br />
option followed by a host list. The host list can take either of the following<br />
forms:<br />
host-[01-02,04,06-08]<br />
or<br />
host-01,host-02,host-04,host-06,host-07,host-08<br />
• When neither the -m nor the -H option is used, mpirun checks the<br />
environment variable MPIHOSTS for the name of the MPI hosts file. If this<br />
variable is defined and the file it names cannot be opened, the MPI job fails.<br />
• In the absence of the -m option, the -H option, and the MPIHOSTS<br />
environment variable, mpirun uses the file ./mpihosts, if it exists.<br />
• If none of these four methods of specifying the hosts file are used, mpirun<br />
looks for the file ~/.mpihosts.<br />
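The file-lookup order in the last three bullets can be sketched as a shell function. This is a simplified model for illustration only (the -m and -H options, which take precedence, are handled by mpirun itself; the function name is hypothetical):<br />

```shell
# Resolve the mpihosts file the way the bullets above describe:
# 1. the MPIHOSTS environment variable (fails if it names an unreadable file),
# 2. ./mpihosts in the current directory,
# 3. ~/.mpihosts in the home directory.
find_mpihosts() {
    if [ -n "$MPIHOSTS" ]; then
        [ -r "$MPIHOSTS" ] || return 1   # named file cannot be opened: job fails
        echo "$MPIHOSTS"
        return 0
    fi
    if [ -r ./mpihosts ]; then
        echo ./mpihosts
        return 0
    fi
    if [ -r "$HOME/.mpihosts" ]; then
        echo "$HOME/.mpihosts"
        return 0
    fi
    return 1
}
```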
If you are working in the context of a batch queuing system, it may provide a job<br />
submission script that generates an appropriate mpihosts file.<br />
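The -H host-range syntax shown earlier (for example, host-[01-02,04,06-08]) can be expanded into individual hostnames with a small script. The sketch below is illustrative only and may not match mpirun's actual parser:<br />

```shell
# Expand a bracketed host range such as host-[01-02,04,06-08] into one
# hostname per line, preserving zero padding. Plain comma-separated lists
# without brackets are not handled by this sketch.
expand_hosts() {
    echo "$1" | awk '{
        prefix = $0; sub(/\[.*/, "", prefix)                  # text before "["
        ranges = $0; sub(/.*\[/, "", ranges); sub(/\].*/, "", ranges)
        n = split(ranges, part, ",")
        for (p = 1; p <= n; p++) {
            if (split(part[p], r, "-") == 2) {
                fmt = "%s%0" length(r[1]) "d\n"               # keep zero padding
                for (i = r[1] + 0; i <= r[2] + 0; i++)
                    printf fmt, prefix, i
            } else
                printf "%s%s\n", prefix, part[p]
        }
    }'
}
```

Running expand_hosts 'host-[01-02,04,06-08]' prints the six hostnames of the equivalent comma-separated form.<br />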
Using mpirun<br />
The script mpirun is a front end program that starts a parallel MPI job on a set of<br />
nodes in a TrueScale cluster. mpirun may be run on any i386 or x86_64<br />
machine inside or outside the cluster, as long as it is on a supported Linux<br />
distribution, and has TCP connectivity to all TrueScale cluster machines to be<br />
used in a job.<br />
The script starts, monitors, and terminates the node programs. mpirun uses ssh<br />
(secure shell) to log in to individual cluster machines and prints any messages<br />
that the node program prints on stdout or stderr, on the terminal where<br />
mpirun is invoked.<br />
NOTE:<br />
The mpi-frontend-* RPM needs to be installed on all nodes that will be<br />
using mpirun. Alternatively, you can use the mpirun option<br />
-distributed=off, which requires that only the mpi-frontend RPM is<br />
installed on the node where mpirun is invoked. Using -distributed=off<br />
can have a negative impact on mpirun’s performance when running<br />
large-scale jobs. More specifically, this option increases the memory usage<br />
on the host where mpirun is started and will slow down the job startup,<br />
since it will spawn MPI processes serially.<br />
The general syntax is:<br />
$ mpirun [mpirun_options...] program-name [program options]<br />
program-name is usually the pathname to the executable MPI program. When<br />
the MPI program resides in the current directory and the current directory is not in<br />
your search path, then program-name must begin with ‘./’, for example:<br />
./program-name<br />
Unless you want to run only one instance of the program, use the -np option, for<br />
example:<br />
$ mpirun -np n [other options] program-name<br />
This option spawns n instances of program-name. These instances are called<br />
node programs.<br />
Generally, mpirun tries to distribute the specified number of processes evenly<br />
among the nodes listed in the mpihosts file. However, if the number of<br />
processes exceeds the number of nodes listed in the mpihosts file, then some<br />
nodes will be assigned more than one instance of the program.<br />
Another command line option, -ppn, instructs mpirun to assign a fixed number p<br />
of node programs (processes) to each node, as it distributes n instances among<br />
the nodes:<br />
$ mpirun -np n -m mpihosts -ppn p program-name<br />
This option overrides the :process_count specifications, if any, in the lines of<br />
the mpihosts file. As a general rule, mpirun distributes the n node programs<br />
among the nodes without exceeding, on any node, the maximum number of<br />
instances specified by the :process_count option. The value of<br />
the :process_count option is specified either by the -ppn command line<br />
option or in the mpihosts file.<br />
NOTE:<br />
When the -np value is larger than the number of nodes in the mpihosts file<br />
times the -ppn value, mpirun cycles back through the hostsfile, assigning<br />
additional node programs per host.<br />
Typically, the number of node programs should not be larger than the number of<br />
processor cores, at least not for compute-bound programs.<br />
The -np option specifies the number of processes to spawn. If this option is not<br />
set, then the environment variable MPI_NPROCS is checked. If MPI_NPROCS is not<br />
set, the default is to determine the number of processes based on the number of<br />
hosts in the machines file (-m) or the list of hosts (-H).<br />
-ppn processes-per-node<br />
This option creates up to the specified number of processes per node.<br />
Each node program is started as a process on one node. While a node program<br />
may fork child processes, the children themselves must not call MPI functions.<br />
The -distributed=on|off option has been added to mpirun. This option<br />
reduces overhead by enabling mpirun to start processes in parallel on multiple<br />
nodes. Initially, mpirun spawns one mpirun child per node from the root node,<br />
each of which in turn spawns the number of local processes for that particular<br />
node. Control the use of the distributed mpirun job-spawning mechanism with this<br />
option:<br />
-distributed [=on|off]<br />
The default is on. To change the default, put this option in the global<br />
mpirun.defaults file or a user-local file. See “Environment for Node<br />
Programs” on page 4-19 and “Environment Variables” on page 4-20 for details.<br />
mpirun monitors the parallel MPI job, terminating when all the node programs in<br />
that job exit normally, or if any of them terminates abnormally.<br />
Killing the mpirun program kills all the processes in the job. Use CTRL+C to kill<br />
mpirun.<br />
Console I/O in MPI Programs<br />
mpirun sends any output printed to stdout or stderr by any node program to<br />
the terminal. This output is line-buffered, so the lines output from the various node<br />
programs will be non-deterministically interleaved on the terminal. Using the -l<br />
option to mpirun will label each line with the rank of the node program where it<br />
was produced.<br />
Node programs do not normally use interactive input on stdin, and by default,<br />
stdin is bound to /dev/null. However, for applications that require standard<br />
input redirection, <strong>QLogic</strong> MPI supports two mechanisms to redirect stdin:<br />
• When mpirun is run from the same node as MPI rank 0, all input piped to<br />
the mpirun command is redirected to rank 0.<br />
• When mpirun is not run from the same node as MPI rank 0, or if the input<br />
must be redirected to all or specific MPI processes, the -stdin option can<br />
redirect a file as standard input to all nodes (or to a particular node) as<br />
specified by the -stdin-target option.<br />
Environment for Node Programs<br />
TrueScale-related environment variables are propagated to node programs.<br />
These include environment variables that begin with the prefix IPATH_, PSM_,<br />
MPI_ or LD_. Some other variables (such as HOME) are set or propagated by<br />
ssh(1).<br />
NOTE:<br />
The environment variable LD_BIND_NOW is not supported for <strong>QLogic</strong> MPI<br />
programs. Not all symbols referenced in the shared libraries can be resolved<br />
on all installations. (They provide a variety of compatible behaviors for<br />
different compilers, etc.) Therefore, the libraries are built to run in lazy<br />
binding mode; the dynamic linker evaluates and binds to symbols only when<br />
needed by the application in a given runtime environment.<br />
mpirun checks for these environment variables in the shell where it is invoked,<br />
and then propagates them correctly. The environment on each node is whatever it<br />
would be for the user’s login via ssh, unless you are using a Multi-Purpose<br />
Daemon (MPD) (see “MPD” on page 4-24).<br />
Environment variables are specified in descending order, as follows:<br />
1. Set in the default shell environment on a remote node, e.g., ~/.bashrc or<br />
equivalents.<br />
2. Set in -rcfile.<br />
3. Set in the current shell environment for the mpirun command.<br />
4. If nothing has been set (none of the previous sets have been performed),<br />
the default value of the environment variable is used.<br />
As noted in the above list, using an mpirunrc file overrides any environment<br />
variables already set by the user. You can set environment variables for the node<br />
programs with the -rcfile option of mpirun with the following command:<br />
$ mpirun -np n -m mpihosts -rcfile mpirunrc program_name<br />
In the absence of this option, mpirun checks to see if a file called<br />
$HOME/.mpirunrc exists in the user's home directory. In either case, the file is<br />
sourced by the shell on each node when the node program starts.<br />
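Because the rcfile is sourced after the login environment is established, its settings override values already present in the environment. The following self-contained sketch demonstrates the effect with an illustrative variable name (MYLIB_PATH is hypothetical, not an InfiniPath variable):<br />

```shell
# The node's shell first has its login-environment values, then sources the
# rcfile; whatever the rcfile sets wins.
rcfile=$(mktemp)
echo 'MYLIB_PATH=/opt/override' > "$rcfile"   # stands in for ~/.mpirunrc
MYLIB_PATH=/usr/lib                           # value from the login environment
. "$rcfile"                                   # node-program startup sources it
echo "$MYLIB_PATH"                            # prints /opt/override
rm -f "$rcfile"
```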
The .mpirunrc file must not contain any interactive commands, but it may<br />
contain commands that write to stdout or stderr.<br />
There is a global options file for mpirun arguments. The default location of this<br />
file is:<br />
/opt/infinipath/etc/mpirun.defaults<br />
You can use an alternate file by setting the environment variable<br />
$PSC_MPIRUN_DEFAULTS_PATH. See the mpirun man page for more<br />
information.<br />
Environment Variables<br />
Table 4-7 contains a summary of the environment variables that are used by<br />
TrueScale and mpirun.<br />
Table 4-7. Environment Variables<br />
Name<br />
Description<br />
MPICH_ROOT<br />
This variable is used by mpirun to find the<br />
mpirun-ipath-ssh executable, set up<br />
LD_LIBRARY_PATH, and set up a prefix for all<br />
InfiniPath pathnames. This variable corresponds to the --prefix argument<br />
(it should be set to the same value as --prefix) when installing TrueScale<br />
RPMs in an alternate location.<br />
Default: Unset<br />
IPATH_PORT Specifies the port to use for the job, 1 or 2.<br />
Specifying 0 will autoselect the port.<br />
Default: Unset<br />
IPATH_SL<br />
Specifies the InfiniBand Service Level for QDR adapters; used to work with<br />
the switch's vFabric feature.<br />
Default: Unset<br />
IPATH_UNIT<br />
This variable is for context sharing. When multiple TrueScale devices are<br />
present, this variable restricts the use to a specific TrueScale unit. By<br />
default, all configured units are used in round-robin order.<br />
Default: Unset<br />
LD_LIBRARY_PATH<br />
This variable specifies the path to the run-time library. It is often set<br />
in the .mpirunrc file.<br />
Default: Unset<br />
MPICH_CC<br />
This variable selects the compiler to use for mpicc, and others.<br />
MPICH_CCC<br />
This variable selects the compiler to use for mpicxx, and others.<br />
MPICH_F90<br />
This variable selects the compiler to use for mpif90, and others.<br />
MPIHOSTS<br />
This variable sets the name of the machines (mpihosts) file.<br />
Default: Unset<br />
MPI_NPROCS<br />
This variable specifies the number of MPI processes to spawn.<br />
Default: Unset<br />
MPI_SHELL<br />
Specifies the name of the program used to log in to remote hosts.<br />
Default: ssh unless MPI_SHELL is defined.<br />
PSM_DEVICES<br />
Non-<strong>QLogic</strong> MPI users can set this variable to enable running in<br />
shared memory mode on a single node. This variable is automatically set for<br />
<strong>QLogic</strong> MPI.<br />
Default: PSM_DEVICES="self,ipath"<br />
PSC_MPIRUN_DEFAULTS_PATH<br />
This variable sets the path to a user-local mpirun defaults file.<br />
Default: /opt/infinipath/etc/mpirun.defaults<br />
PSM_SHAREDCONTEXTS<br />
This variable overrides automatic context sharing behavior. YES is<br />
equivalent to 1 (see Default).<br />
Default: PSM_SHAREDCONTEXTS=1<br />
PSM_SHAREDCONTEXTS_MAX<br />
This variable restricts the number of TrueScale contexts that are made<br />
available on each node of an MPI job.<br />
Default: PSM_SHAREDCONTEXTS_MAX=4 (QLE7140); up to 16 (QLE7240 and<br />
QLE7280; set automatically based on the number of CPUs on the node)<br />
IPATH_HCA_SELECTION_ALG<br />
This variable specifies the <strong>Host</strong> Channel Adapter/port selection<br />
algorithm. The default option is Round Robin, which allocates the <strong>Host</strong><br />
Channel Adapters in a round-robin fashion. The older mechanism option is<br />
Packed, which fills all contexts on a <strong>Host</strong> Channel Adapter before<br />
allocating from the next <strong>Host</strong> Channel Adapter.<br />
For example, when using two single-port <strong>Host</strong> Channel Adapters, the<br />
default IPATH_HCA_SELECTION_ALG="Round Robin" setting allows two or more<br />
MPI processes per node to use both <strong>Host</strong> Channel Adapters and to<br />
achieve performance improvements compared to what can be achieved with one<br />
<strong>Host</strong> Channel Adapter.<br />
Running Multiple Versions of TrueScale or MPI<br />
The variable MPICH_ROOT sets a root prefix for all InfiniPath-related paths. It is<br />
used by mpirun to try to find the mpirun-ipath-ssh executable, and it also<br />
sets up the LD_LIBRARY_PATH for new programs. Consequently, multiple<br />
versions of the TrueScale software releases can be installed on some or all<br />
nodes, and <strong>QLogic</strong> MPI and other versions of MPI can be installed at the same<br />
time. It may be set in the environment, in mpirun.defaults, or in an rcfile (such<br />
as .mpirunrc, .bashrc, or .cshrc) that will be invoked on remote nodes.<br />
If you have installed the software into an alternate location using the --prefix<br />
option with rpm, set MPICH_ROOT to the same value that was given to --prefix.<br />
If MPICH_ROOT is not set, the normal PATH is used unless mpirun is invoked with<br />
a full pathname.<br />
NOTE:<br />
mpirun-ssh was renamed mpirun-ipath-ssh to avoid name conflicts<br />
with other MPI implementations.<br />
Job Blocking in Case of Temporary InfiniBand Link Failures<br />
By default, as controlled by mpirun’s quiescence parameter -q, an MPI job is<br />
killed for quiescence in the event of an InfiniBand link failure (or unplugged cable).<br />
This quiescence timeout occurs under one of the following conditions:<br />
• A remote rank’s process cannot reply to out-of-band process checks.<br />
• MPI is inactive on the InfiniBand link for more than 15 minutes.<br />
To keep remote process checks but disable triggering quiescence for temporary<br />
InfiniBand link failures, use the -disable-mpi-progress-check option with a<br />
nonzero -q option. To disable quiescence triggering altogether, use -q 0. No<br />
matter how these options are used, link failures (temporary or other) are always<br />
logged to syslog.<br />
If the link is down when the job starts and you want the job to continue blocking<br />
until the link comes up, use the -t -1 option.<br />
Performance Tuning<br />
These methods may be used at runtime. Performance settings that are typically<br />
set by the system administrator are listed in “Performance Settings and<br />
Management Tips” on page 3-25.<br />
CPU Affinity<br />
InfiniPath attempts to run each node program with CPU affinity set to a separate<br />
logical processor, up to the number of available logical processors. If CPU affinity<br />
is already set (with sched_setaffinity() or with the taskset utility), then<br />
InfiniPath will not change the setting.<br />
Use the taskset utility with mpirun to specify the mapping of MPI processes to<br />
logical processors. This combination makes the best use of available memory<br />
bandwidth or cache locality when running on dual-core Symmetric<br />
MultiProcessing (SMP) cluster nodes.<br />
The following example uses the NASA Advanced Supercomputing (NAS) Parallel<br />
Benchmark’s Multi-Grid (MG) benchmark and the -c option to taskset.<br />
$ mpirun -np 4 -ppn 2 -m $hosts taskset -c 0,2 bin/mg.B.4<br />
$ mpirun -np 4 -ppn 2 -m $hosts taskset -c 1,3 bin/mg.B.4<br />
The first command forces the programs to run on CPUs (or cores) 0 and 2. The<br />
second command forces the programs to run on CPUs 1 and 3. See the taskset<br />
man page for more information on usage.<br />
To turn off CPU affinity, set the environment variable IPATH_NO_CPUAFFINITY.<br />
This environment variable is propagated to node programs by mpirun.<br />
mpirun Tunable Options<br />
There are some mpirun options that can be adjusted to optimize communication.<br />
The most important one is:<br />
-long-len, -L [default: 64000]<br />
This option determines the message length at or above which the rendezvous<br />
protocol (instead of the eager protocol) is used. The default value for -L was chosen for<br />
optimal unidirectional communication. Applications that have this kind of traffic<br />
pattern benefit from this higher default value. Other values for -L are appropriate<br />
for different communication patterns and data size. For example, applications that<br />
have bidirectional traffic patterns may benefit from using a lower value.<br />
Experimentation is recommended.<br />
Two other options that are useful are:<br />
-long-len-shmem, -s [default: 16000]<br />
This option determines the message length at or above which the rendezvous<br />
protocol (instead of the eager protocol) is used for intra-node communications. This<br />
option is for messages going through shared memory. The InfiniPath rendezvous<br />
messaging protocol uses a two-way handshake (with MPI synchronous send<br />
semantics) and receive-side DMA.<br />
-rndv-window-size, -W [default: 262144]<br />
When sending a large message using the rendezvous protocol, <strong>QLogic</strong> MPI splits<br />
it into a number of fragments at the source and recombines them at the<br />
destination. Each fragment is sent as a single rendezvous stage. This option<br />
specifies the maximum length of each fragment. The default is 262144 bytes.<br />
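The fragment count for a given message follows directly from the window size; a small sketch (the helper name is illustrative, not part of mpirun):<br />

```shell
# Number of rendezvous fragments for a message: ceiling of the message
# length divided by the -W window size (262144 bytes by default).
fragments() {
    msg=$1
    window=${2:-262144}
    echo $(( (msg + window - 1) / window ))   # integer ceiling division
}
```

For example, a 1,000,000-byte message is sent in 4 fragments at the default window size.<br />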
For more information on tunable options, type:<br />
$ mpirun -h<br />
The complete list of options is contained in Appendix A.<br />
MPD<br />
MPD Description<br />
The Multi-Purpose Daemon (MPD) is an alternative to mpirun for launching MPI<br />
jobs. It is described briefly in the following sections.<br />
MPD was developed by Argonne National Laboratory (ANL) as part of the<br />
MPICH-2 system. While the ANL MPD had some advantages over the use of their<br />
mpirun (faster launching, better cleanup after crashes, better tolerance of node<br />
failures), the <strong>QLogic</strong> mpirun offers the same advantages.<br />
The disadvantage of MPD is reduced security, since it does not use ssh to launch<br />
node programs. It is also more complex to use than mpirun because it requires<br />
starting a ring of MPD daemons on the nodes. Therefore, <strong>QLogic</strong> recommends<br />
using the normal mpirun mechanism for starting jobs, as described in the<br />
previous chapter. However, if you want to use MPD, it is included in the InfiniPath<br />
software.<br />
Using MPD<br />
To start an MPD environment, use the mpdboot program. You must provide<br />
mpdboot with a file that lists the machines that will run the mpd daemon. The<br />
format of this file is the same as for the mpihosts file in the mpirun command.<br />
Here is an example of how to run mpdboot:<br />
$ mpdboot -f hostsfile<br />
After mpdboot has started the MPD daemons, it will print a status message and<br />
drop into a new shell.<br />
To leave the MPD environment, exit from this shell. This will terminate the<br />
daemons.<br />
To use rsh instead of ssh with mpdboot, set the environment variable MPD_RSH<br />
to the pathname of the desired remote shell. For example:<br />
MPD_RSH=`which rsh` mpdboot -n 16 -f hosts<br />
To run an MPI program from within the MPD environment, use the mpirun<br />
command. You do not need to provide an mpihosts file or a count of CPUs; by<br />
default, mpirun uses all nodes and CPUs available within the MPD environment.<br />
To check the status of the MPD daemons, use the mpdping command.<br />
NOTE:<br />
To use MPD, the software package mpi-frontend-*.rpm and python<br />
(available with your distribution) must be installed on all nodes. See the<br />
<strong>QLogic</strong> Fabric <strong>Software</strong> Installation <strong>Guide</strong> for more details on software<br />
installation.<br />
<strong>QLogic</strong> MPI and Hybrid MPI/OpenMP<br />
Applications<br />
<strong>QLogic</strong> MPI supports hybrid MPI/OpenMP applications, provided that MPI<br />
routines are called only by the master OpenMP thread. This approach is called<br />
the funneled thread model. Instead of MPI_Init/MPI_INIT (for C/C++ and<br />
Fortran respectively), the program can call<br />
MPI_Init_thread/MPI_INIT_THREAD to determine the level of thread<br />
support, and the value MPI_THREAD_FUNNELED will be returned.<br />
To use this feature, the application must be compiled with both OpenMP and MPI<br />
code enabled. To do this, use the -mp flag on the mpicc compile line.<br />
As mentioned previously, MPI routines can be called only by the master OpenMP<br />
thread. The hybrid executable is executed as usual using mpirun, but typically<br />
only one MPI process is run per node and the OpenMP library will create<br />
additional threads to utilize all CPUs on that node. If there are sufficient CPUs on<br />
a node, you may want to run multiple MPI processes and multiple OpenMP<br />
threads per node.<br />
The number of OpenMP threads is typically controlled by the OMP_NUM_THREADS<br />
environment variable in the .mpirunrc file. (OMP_NUM_THREADS is used by<br />
other compilers’ OpenMP products, but is not a <strong>QLogic</strong> MPI environment<br />
variable.) Use this variable to adjust the split between MPI processes and<br />
OpenMP threads. Usually, the number of MPI processes (per node) times the<br />
number of OpenMP threads will be set to match the number of CPUs per node. An<br />
example case would be a node with four CPUs, running one MPI process and four<br />
OpenMP threads. In this case, OMP_NUM_THREADS is set to four.<br />
OMP_NUM_THREADS is on a per-node basis.<br />
See “Environment for Node Programs” on page 4-19 for information on setting<br />
environment variables.<br />
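The split described above can be computed mechanically; this sketch (with illustrative variable names) picks OMP_NUM_THREADS so that processes times threads matches the cores on a node:<br />

```shell
# Match (MPI processes per node) x (OpenMP threads) to the cores per node,
# as in the four-CPU example above: one MPI process, four OpenMP threads.
cores_per_node=4
mpi_procs_per_node=1
OMP_NUM_THREADS=$((cores_per_node / mpi_procs_per_node))
export OMP_NUM_THREADS
echo "$OMP_NUM_THREADS"   # prints 4
```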
At the time of publication, the MPI_THREAD_SERIALIZED and<br />
MPI_THREAD_MULTIPLE models are not supported.<br />
NOTE:<br />
When there are more threads than CPUs, both MPI and OpenMP<br />
performance can be significantly degraded due to over-subscription of the<br />
CPUs.<br />
Debugging MPI Programs<br />
Debugging parallel programs is substantially more difficult than debugging serial<br />
programs. Thoroughly debugging the serial parts of your code before parallelizing<br />
is good programming practice.<br />
MPI Errors<br />
Almost all MPI routines (except MPI_Wtime and MPI_Wtick) return an error<br />
code; either as the function return value in C functions or as the last argument in a<br />
Fortran subroutine call. Before the value is returned, the current MPI error handler<br />
is called. By default, this error handler aborts the MPI job. Therefore, you can get<br />
information about MPI exceptions in your code by providing your own handler for<br />
MPI_ERRORS_RETURN. See the man page for MPI_Errhandler_set for<br />
details.<br />
NOTE:<br />
MPI does not guarantee that an MPI program can continue past an error.<br />
See the standard MPI documentation referenced in Appendix J for details on the<br />
MPI error codes.<br />
Using Debuggers<br />
The InfiniPath software supports the use of multiple debuggers, including<br />
pathdb, gdb, and the system call tracing utility strace. These debuggers let you<br />
set breakpoints in a running program, and examine and set its variables.<br />
Symbolic debugging is easier than machine language debugging. To enable<br />
symbolic debugging, you must have compiled with the -g option to mpicc so that<br />
the compiler will have included symbol tables in the compiled object code.<br />
To run your MPI program with a debugger, use the -debug or<br />
-debug-no-pause and -debugger options for mpirun. See the man pages for<br />
pathdb, gdb, and strace for details. When running under a debugger, you get<br />
an xterm window on the front end machine for each node process. Therefore, you<br />
can control the different node processes as desired.<br />
To use strace with your MPI program, the syntax is:<br />
$ mpirun -np n -m mpihosts strace program-name<br />
The following features of <strong>QLogic</strong> MPI facilitate debugging:<br />
• Stack backtraces are provided for programs that crash.<br />
• The -debug and -debug-no-pause options are provided for mpirun.<br />
These options make each node program start with debugging enabled. The<br />
-debug option allows you to set breakpoints, and start running programs<br />
individually. The -debug-no-pause option allows postmortem inspection.<br />
Be sure to set -q 0 when using -debug.<br />
• Communication between mpirun and node programs can be printed by<br />
specifying the mpirun -verbose option.<br />
• MPI implementation debug messages can be printed by specifying the<br />
mpirun -psc-debug-level option. This option can substantially impact<br />
the performance of the node program.<br />
• Support is provided for progress timeout specifications, deadlock detection,<br />
and generating information about where a program is stuck.<br />
• Several misconfigurations (such as mixed use of 32-bit/64-bit executables)<br />
are detected by the runtime.<br />
• A formatted list containing information useful for high-level MPI application<br />
profiling is provided by using the -print-stats option with mpirun.<br />
Statistics include minimum, maximum, and median values for message<br />
transmission protocols as well as more detailed information for expected<br />
and unexpected message reception. See “MPI Stats” on page F-30 for more<br />
information and a sample output listing.<br />
NOTE:<br />
The TotalView ® debugger can be used with the Open MPI supplied in this<br />
release. Consult the TotalView documentation for more information.<br />
<strong>QLogic</strong> MPI Limitations<br />
The current version of <strong>QLogic</strong> MPI has the following limitations:<br />
• There are no C++ bindings to MPI; use the extern C MPI function calls.<br />
• In MPI-IO file I/O calls in the Fortran binding, offset, or displacement<br />
arguments are limited to 32 bits. Thus, for example, the second argument of<br />
MPI_File_seek must be between -2^31 and 2^31-1, and the argument to<br />
MPI_File_read_at must be between 0 and 2^32-1.<br />
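An application can guard against this limitation before calling the Fortran MPI-IO bindings; the helper below is an illustrative sketch of the 32-bit range check, not part of the MPI library:<br />

```shell
# Check that a signed offset fits the Fortran binding's 32-bit limit:
# -2^31 <= offset <= 2^31 - 1 (as for the second argument of MPI_File_seek).
fits_seek_offset() {
    lim=$((1 << 31))
    [ "$1" -ge "$((0 - lim))" ] && [ "$1" -lt "$lim" ]
}
```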
5 Using Other MPIs<br />
Introduction<br />
This section provides information on using other MPI implementations.<br />
Support for multiple high-performance MPI implementations has been added.<br />
Most implementations run over both PSM and OpenFabrics Verbs (see<br />
Table 5-1).<br />
Table 5-1. Other Supported MPI Implementations<br />
• Open MPI 1.5. Runs over: PSM, Verbs. Compiled with: GCC, Intel, PGI,<br />
PathScale. Provides some MPI-2 functionality (one-sided operations and<br />
dynamic processes). Available as part of the <strong>QLogic</strong> download. Can be<br />
managed by mpi-selector.<br />
• MVAPICH version 1.2. Runs over: PSM, Verbs. Compiled with: GCC, Intel,<br />
PGI, PathScale. Provides MPI-1 functionality. Available as part of the<br />
<strong>QLogic</strong> download. Can be managed by mpi-selector.<br />
• MVAPICH2 version 1.4. Runs over: PSM, Verbs. Compiled with: GCC, Intel,<br />
PGI, PathScale. Provides MPI-2 functionality. Can be managed by<br />
mpi-selector.<br />
• Platform MPI 7 and HP-MPI 2.3. Runs over: PSM, Verbs. Compiled with:<br />
GCC (default). Provides some MPI-2 functionality (one-sided operations).<br />
Available for purchase from HP.<br />
D000046-005 B 5-1
5–Using Other MPIs<br />
Installed Layout<br />
Table 5-1. Other Supported MPI Implementations (Continued)<br />
MPI<br />
Implementation<br />
Runs Over<br />
Compiled<br />
With<br />
Comments<br />
Platform (Scali) 5.6<br />
PSM<br />
GCC (default)<br />
Provides MPI-1 functionality.<br />
Verbs<br />
Available for purchase from<br />
Platform.<br />
Intel MPI version 4.0<br />
TMI/PSM,<br />
uDAPL<br />
GCC (default)<br />
Provides MPI-1 and MPI-2<br />
functionality.<br />
Available for purchase from<br />
Intel.<br />
Table Notes<br />
MVAPICH and Open MPI have been compiled for PSM to support the following<br />
versions of the compilers:<br />
• (GNU) gcc 4.1.0<br />
• (PathScale) pathcc 3.2<br />
• (PGI) pgcc 9.0<br />
• (Intel) icc 11.1<br />
These MPI implementations run on multiple interconnects, and have their own<br />
mechanisms for selecting the interconnect on which they run. Basic information about<br />
using these MPIs is provided in this section. However, for more detailed<br />
information, see the documentation provided with the version of MPI that you want<br />
to use.<br />
Installed Layout<br />
By default, the MVAPICH and Open MPI MPIs are installed in this directory tree:<br />
/usr/mpi/&lt;compiler&gt;/&lt;mpi&gt;-&lt;mpi_version&gt;<br />
The <strong>QLogic</strong>-supplied MPIs precompiled with the GCC, PathScale, PGI, and<br />
Intel compilers will also have -qlc appended after &lt;mpi_version&gt;.<br />
For example:<br />
/usr/mpi/gcc/openmpi-1.5-qlc<br />
If a prefixed installation location is used, /usr is replaced by $prefix.<br />
The following examples assume that the default path for each MPI implementation<br />
to mpirun is:<br />
/usr/mpi/&lt;compiler&gt;/&lt;mpi&gt;/bin/mpirun<br />
Again, /usr may be replaced by $prefix. This path is sometimes referred to as<br />
$mpi_home/bin/mpirun in the following sections.<br />
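The layout can be enumerated with a short shell sketch; the root is a parameter (defaulting to /usr/mpi) so the function is illustrative rather than tied to one install:

```shell
# Print compiler/mpi-version pairs found under an MPI install root,
# one per line, e.g. "gcc/openmpi-1.5-qlc".
list_mpis() {
  local root=${1:-/usr/mpi} d
  for d in "$root"/*/*/; do
    [ -d "$d" ] || continue
    d=${d%/}                  # strip trailing slash
    echo "${d#"$root"/}"      # strip the root prefix
  done
}
```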
See the documentation for HP-MPI, Intel MPI, and Platform MPI for their default<br />
installation directories.<br />
Open MPI<br />
Open MPI is an open source MPI-2 implementation from the Open MPI Project.<br />
Pre-compiled versions of Open MPI version 1.5 that run over PSM and are built<br />
with the GCC, PGI, PathScale, and Intel compilers are available with the <strong>QLogic</strong><br />
download.<br />
Open MPI that runs over Verbs and is pre-compiled with the GNU compiler is also<br />
available.<br />
Open MPI can be managed with the mpi-selector utility, as described in<br />
“Managing Open MPI, MVAPICH, and <strong>QLogic</strong> MPI with the mpi-selector Utility” on<br />
page 5-6.<br />
Installation<br />
Follow the instructions in the <strong>QLogic</strong> Fabric <strong>Software</strong> Installation <strong>Guide</strong> for<br />
installing Open MPI.<br />
Newer versions than the one supplied with this release can be installed after<br />
<strong>QLogic</strong> OFED 1.4.2 has already been installed; these may be downloaded from<br />
the Open MPI web site. Note that versions that are released after the <strong>QLogic</strong><br />
OFED 1.4.2 release will not be supported.<br />
Setup<br />
If you use the mpi-selector tool, the necessary setup is done for you. If you do<br />
not use this tool, you can put your Open MPI installation directory in the PATH:<br />
add $mpi_home/bin to PATH<br />
$mpi_home is the directory path where the desired MPI was installed.<br />
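For example, for the GCC build of Open MPI in the default location (the path is an assumption; substitute your own install directory):

```shell
# Prepend the chosen MPI's bin directory to PATH for this shell session.
mpi_home=/usr/mpi/gcc/openmpi-1.5-qlc
export PATH="$mpi_home/bin:$PATH"
```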
Compiling Open MPI Applications<br />
As with <strong>QLogic</strong> MPI, <strong>QLogic</strong> recommends that you use the included wrapper<br />
scripts that invoke the underlying compiler (see Table 5-2).<br />
Table 5-2. Open MPI Wrapper Scripts<br />
Wrapper Script Name | Language<br />
mpicc | C<br />
mpiCC, mpicxx, or mpic++ | C++<br />
mpif77 | Fortran 77<br />
mpif90 | Fortran 90<br />
To compile your program in C, type:<br />
$ mpicc mpi_app_name.c -o mpi_app_name<br />
Running Open MPI Applications<br />
By default, Open MPI shipped with the InfiniPath software stack will run over PSM<br />
once it is installed.<br />
Here is an example of a simple mpirun command running with four processes:<br />
$ mpirun -np 4 -machinefile mpihosts mpi_app_name<br />
To specify the PSM transport explicitly, add --mca mtl psm to the above<br />
command line.<br />
To run over InfiniBand Verbs instead, use this mpirun command line:<br />
$ mpirun -np 4 -machinefile mpihosts --mca btl sm,openib,self<br />
--mca mtl ^psm mpi_app_name<br />
The following option enables shared memory (sm), the openib transport, and<br />
communication to self:<br />
--mca btl sm,openib,self<br />
The following option disables the PSM transport:<br />
--mca mtl ^psm<br />
In these commands, btl stands for byte transport layer and mtl for matching<br />
transport layer.<br />
PSM transport works in terms of MPI messages. OpenIB transport works in terms<br />
of byte streams.<br />
Alternatively, you can use Open MPI with a sockets transport running over IPoIB,<br />
for example:<br />
$ mpirun -np 4 -machinefile mpihosts --mca btl sm,tcp,self --mca<br />
btl_tcp_if_exclude eth0 --mca btl_tcp_if_include ib0 --mca mtl<br />
^psm mpi_app_name<br />
Note that eth0 and psm are excluded, while ib0 is included. These instructions<br />
may need to be adjusted for your interface names.<br />
Note that in Open MPI, machinefile is also known as the hostfile.<br />
Further Information on Open MPI<br />
For more information about Open MPI, see:<br />
http://www.open-mpi.org/<br />
http://www.open-mpi.org/faq<br />
MVAPICH<br />
Pre-compiled versions of MVAPICH 1.2 built with the GNU, PGI, PathScale, and<br />
Intel compilers, and that run over PSM, are available with the <strong>QLogic</strong> download.<br />
MVAPICH that runs over Verbs and is pre-compiled with the GNU compiler is also<br />
available.<br />
MVAPICH can be managed with the mpi-selector utility, as described in<br />
“Managing Open MPI, MVAPICH, and <strong>QLogic</strong> MPI with the mpi-selector Utility” on<br />
page 5-6.<br />
Installation<br />
To install MVAPICH, follow the instructions in the appropriate installation guide.<br />
Newer versions than the one supplied with this release can be installed after<br />
<strong>QLogic</strong> OFED 1.4.2 has already been installed; these may be downloaded from<br />
the MVAPICH web site. Note that versions that are released after the <strong>QLogic</strong><br />
OFED 1.4.2 release will not be supported.<br />
Setup<br />
To launch MPI jobs, the MVAPICH installation directory must be included in PATH<br />
and LD_LIBRARY_PATH.<br />
When using sh for launching MPI jobs, run the command:<br />
$ source /usr/mpi/&lt;compiler&gt;/&lt;mpi&gt;/bin/mpivars.sh<br />
When using csh for launching MPI jobs, run the command:<br />
$ source /usr/mpi/&lt;compiler&gt;/&lt;mpi&gt;/bin/mpivars.csh<br />
Compiling MVAPICH Applications<br />
As with <strong>QLogic</strong> MPI, <strong>QLogic</strong> recommends that you use the included wrapper<br />
scripts that invoke the underlying compiler (see Table 5-3).<br />
Table 5-3. MVAPICH Wrapper Scripts<br />
Wrapper Script Name | Language<br />
mpicc | C<br />
mpiCC, mpicxx | C++<br />
mpif77 | Fortran 77<br />
mpif90 | Fortran 90<br />
To compile your program in C, type:<br />
$ mpicc mpi_app_name.c -o mpi_app_name<br />
To check the default configuration for the installation, check the following file:<br />
/usr/mpi/&lt;compiler&gt;/&lt;mpi&gt;/etc/mvapich.conf<br />
Running MVAPICH Applications<br />
By default, the MVAPICH shipped with the InfiniPath software stack runs over<br />
PSM once it is installed.<br />
Here is an example of a simple mpirun command running with four processes:<br />
$ mpirun -np 4 -hostfile mpihosts mpi_app_name<br />
Password-less ssh is used unless the -rsh option is added to the command line<br />
above.<br />
Further Information on MVAPICH<br />
For more information about MVAPICH, see:<br />
http://mvapich.cse.ohio-state.edu/<br />
Managing Open MPI, MVAPICH, and <strong>QLogic</strong> MPI<br />
with the mpi-selector Utility<br />
When multiple MPI implementations have been installed on the cluster, you can<br />
use the mpi-selector to switch between them. The MPIs that can be managed<br />
with the mpi-selector are:<br />
• Open MPI<br />
• MVAPICH<br />
• MVAPICH2<br />
• <strong>QLogic</strong> MPI<br />
The mpi-selector is an OFED utility that is installed as a part of <strong>QLogic</strong> OFED<br />
1.4.2. Its basic functions include:<br />
• Listing available MPI implementations<br />
• Setting a default MPI to use (per user or site wide)<br />
• Unsetting a default MPI to use (per user or site wide)<br />
• Querying the current default MPI in use<br />
Following is an example for listing and selecting an MPI:<br />
$ mpi-selector --list<br />
mpi-1.2.3<br />
mpi-3.4.5<br />
$ mpi-selector --set mpi-3.4.5<br />
The new default takes effect in the next shell that is started. See the<br />
mpi-selector man page for more information.<br />
For <strong>QLogic</strong> MPI inter-operation with the mpi-selector utility, you must install all<br />
<strong>QLogic</strong> MPI RPMs using a prefixed installation. Once the $prefix for <strong>QLogic</strong><br />
MPI has been determined, install qlogic-mpi-register with the same<br />
$prefix. This registers <strong>QLogic</strong> MPI with the mpi-selector utility and shows<br />
<strong>QLogic</strong> MPI as an available MPI implementation with the four different compilers.<br />
See the <strong>QLogic</strong> Fabric <strong>Software</strong> Installation <strong>Guide</strong> for information on prefixed<br />
installations.<br />
The example shell scripts mpivars.sh and mpivars.csh, for registering with<br />
mpi-selector, are provided as part of the mpi-devel RPM in<br />
$prefix/share/mpich/mpi-selector-{intel,gnu,pathscale,pgi}<br />
directories.<br />
For all non-GNU compilers that are installed outside standard Linux search paths,<br />
set up the paths so that compiler binaries and runtime libraries can be resolved.<br />
For example, set LD_LIBRARY_PATH, both in your local environment and in an rc<br />
file (such as .mpirunrc, .bashrc, or .cshrc) that is invoked on remote nodes.<br />
See “Environment for Node Programs” on page 4-19 and “Compiler and Linker<br />
Variables” on page 4-10 for information on setting up the environment and<br />
“Specifying the Run-time Library Path” on page F-15 for information on setting the<br />
run-time library path. Also see “Run Time Errors with Different MPI<br />
Implementations” on page F-17 for information on run time errors that may occur if<br />
there are MPI version mismatches.<br />
NOTE:<br />
The Intel-compiled versions require that the Intel compiler be installed and<br />
that paths to the Intel compiler runtime libraries be resolvable from the user’s<br />
environment. The version used is Intel 10.1.012.<br />
HP-MPI and Platform MPI 7<br />
Platform Computing acquired HP-MPI from HP. Platform MPI 7 (formerly HP-MPI)<br />
is a high-performance, production-quality implementation of the Message Passing<br />
Interface (MPI), with full MPI-2 functionality. HP-MPI/Platform MPI 7 is distributed<br />
by over 30 commercial software vendors, so you may need to use it if you use<br />
certain HPC applications, even if you don't purchase the MPI separately.<br />
Installation<br />
Follow the instructions for downloading and installing Platform MPI 7 from the<br />
Platform Computing web site.<br />
Setup<br />
Edit two lines in the hpmpi.conf file as follows:<br />
Change<br />
MPI_ICMOD_PSM__PSM_MAIN = "^ib_ipath"<br />
to<br />
MPI_ICMOD_PSM__PSM_MAIN = "^"<br />
Change<br />
MPI_ICMOD_PSM__PSM_PATH = "^ib_ipath"<br />
to<br />
MPI_ICMOD_PSM__PSM_PATH = "^"<br />
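The two edits can also be scripted with sed. The sketch below takes the file to modify as an argument, so it can be rehearsed on a copy of hpmpi.conf first; it assumes the stock entries match the lines shown above:

```shell
# Rewrite both PSM module lines so the driver-module check is disabled.
enable_psm_in_hpmpi_conf() {
  sed -E -i 's/^(MPI_ICMOD_PSM__PSM_(MAIN|PATH)) = "\^ib_ipath"/\1 = "^"/' "$1"
}
```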
Compiling Platform MPI 7 Applications<br />
As with <strong>QLogic</strong> MPI, <strong>QLogic</strong> recommends that you use the included wrapper<br />
scripts that invoke the underlying compiler (see Table 5-4).<br />
Table 5-4. Platform MPI 7 Wrapper Scripts<br />
Wrapper Script Name | Language<br />
mpicc | C<br />
mpiCC | C++<br />
mpif77 | Fortran 77<br />
mpif90 | Fortran 90<br />
To compile your program in C using the default compiler, type:<br />
$ mpicc mpi_app_name.c -o mpi_app_name<br />
Running Platform MPI 7 Applications<br />
Here is an example of a simple mpirun command running with four processes,<br />
over PSM:<br />
$ mpirun -np 4 -hostfile mpihosts -PSM mpi_app_name<br />
To run over InfiniBand Verbs, type:<br />
$ mpirun -np 4 -hostfile mpihosts -IBV mpi_app_name<br />
To run over TCP (which could be IPoIB if the hostfile is set up for IPoIB<br />
interfaces), type:<br />
$ mpirun -np 4 -hostfile mpihosts -TCP mpi_app_name<br />
More Information on Platform MPI 7<br />
For more information on Platform MPI 7, see the Platform Computing web site.<br />
Platform (Scali) MPI 5.6<br />
Platform MPI 5.6 was formerly known as Scali MPI Connect. The version tested<br />
with this release is 5.6.4.<br />
Installation<br />
Follow the instructions for downloading and installing Platform MPI 5.6 from the<br />
Platform (Scali) web site.<br />
Setup<br />
To run over PSM by default, add the line networks=infinipath to the file<br />
/opt/scali/etc/ScaMPI.conf.<br />
If running over InfiniBand Verbs, Platform MPI needs to know which InfiniBand<br />
adapter to use. This is achieved by creating the file<br />
/opt/scali/etc/iba_params.conf using a line such as:<br />
hcadevice=qib0<br />
For a second InfiniPath adapter, qib1 would be used, and so on.<br />
Compiling Platform MPI 5.6 Applications<br />
As with <strong>QLogic</strong> MPI, <strong>QLogic</strong> recommends that you use the included wrapper<br />
scripts that invoke the underlying compiler (see Table 5-5). The scripts default to<br />
using gcc/g++/g77.<br />
Table 5-5. Platform MPI Wrapper Scripts<br />
Wrapper Script Name | Language<br />
mpicc | C<br />
mpic++ | C++<br />
mpif77 | Fortran 77<br />
mpif90 | Fortran 90<br />
To compile your program in C using the default compiler, type:<br />
$ mpicc mpi_app_name.c -o mpi_app_name<br />
To invoke another compiler, in this case PathScale, use the -cc1 option, for<br />
example:<br />
$ mpicc -cc1 pathcc mpi_app_name.c -o mpi_app_name<br />
Running Platform MPI 5.6 Applications<br />
Here is an example of a simple mpirun command running with four processes,<br />
over PSM:<br />
$ mpirun -np 4 -machinefile mpihosts mpi_app_name<br />
or, if you have not set /opt/scali/etc/ScaMPI.conf to use PSM by default,<br />
specify PSM explicitly by adding the -networks option:<br />
$ mpirun -np 4 -machinefile mpihosts -networks infinipath<br />
mpi_app_name<br />
To run Scali MPI over InfiniBand Verbs, type:<br />
$ mpirun -np 4 -machinefile mpihosts -networks ib,smp mpi_app_name<br />
This command indicates that ib is used for inter-node communications, and smp<br />
is used for intra-node communications.<br />
To run over TCP (or IPoIB), type:<br />
$ mpirun -np 4 -machinefile mpihosts -networks tcp,smp mpi_app_name<br />
Further Information on Platform MPI 5.6<br />
For more information on using Platform MPI 5.6, see:<br />
http://www.platform.com/cluster-computing/platform-mpi.<br />
Intel MPI<br />
Installation<br />
Intel MPI version 4.0 is the version tested with this release.<br />
Follow the instructions for download and installation of Intel MPI from the Intel web<br />
site.<br />
Setup<br />
Intel MPI can be run over the Tag Matching Interface (TMI).<br />
The setup for Intel MPI is described in the following steps:<br />
1. Make sure that the TMI psm provider is installed on every node and all<br />
nodes have the same version installed. In this release it is called tmi-1.0 and<br />
is supplied with the <strong>QLogic</strong> <strong>OFED+</strong> host software package. It can be<br />
installed either with the <strong>QLogic</strong> <strong>OFED+</strong> <strong>Host</strong> <strong>Software</strong> installation or using<br />
the rpm files after the <strong>QLogic</strong> <strong>OFED+</strong> <strong>Host</strong> <strong>Software</strong> tar file has been<br />
unpacked. For example:<br />
$ rpm -qa | grep tmi<br />
tmi-1.0-1<br />
2. Verify that there is a /etc/tmi.conf file. It should be installed by the<br />
tmi-1.0-1 RPM. The file tmi.conf contains a list of TMI psm providers. In<br />
particular it must contain an entry for the PSM provider in a form similar to:<br />
psm 1.0 libtmip_psm.so " " # Comments OK<br />
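A quick way to confirm this on each node is to grep the file for a psm provider entry. The helper below is illustrative; the path is a parameter so it can be pointed at any copy:

```shell
# Succeed if the given tmi.conf lists a psm provider entry of the form
# shown above: "psm <version> libtmip_psm.so ...".
has_psm_provider() {
  grep -Eq '^psm[[:space:]]+[0-9.]+[[:space:]]+libtmip_psm\.so' "${1:-/etc/tmi.conf}"
}
```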
Intel MPI can also be run over uDAPL, which uses InfiniBand Verbs. uDAPL is the<br />
user mode version of the Direct Access Provider Library (DAPL), and is provided<br />
as a part of the OFED packages. IPoIB must also be configured.<br />
The setup for Intel MPI is described in the following steps:<br />
1. Make sure that DAPL 1.2 or 2.0 is installed on every node and all nodes<br />
have the same version installed. In this release they are called<br />
compat-dapl. Both versions are supplied with the OpenFabrics RPMs and<br />
are included in the <strong>QLogic</strong> <strong>OFED+</strong> <strong>Host</strong> <strong>Software</strong> package. They can be<br />
installed either with the <strong>QLogic</strong> <strong>OFED+</strong> <strong>Host</strong> <strong>Software</strong> installation or using<br />
the rpm files after the <strong>QLogic</strong> <strong>OFED+</strong> <strong>Host</strong> <strong>Software</strong> tar file has been<br />
unpacked. For example:<br />
Using DAPL 1.2.<br />
$ rpm -qa | grep compat-dapl<br />
compat-dapl-1.2.12-1.x86_64.rpm<br />
compat-dapl-debuginfo-1.2.12-1.x86_64.rpm<br />
compat-dapl-devel-1.2.12-1.x86_64.rpm<br />
compat-dapl-devel-static-1.2.12-1.x86_64.rpm<br />
compat-dapl-utils-1.2.12-1.x86_64.rpm<br />
Using DAPL 2.0.<br />
$ rpm -qa | grep dapl<br />
dapl-devel-static-2.0.19-1<br />
compat-dapl-1.2.14-1<br />
dapl-2.0.19-1<br />
dapl-debuginfo-2.0.19-1<br />
compat-dapl-devel-static-1.2.14-1<br />
dapl-utils-2.0.19-1<br />
compat-dapl-devel-1.2.14-1<br />
dapl-devel-2.0.19-1<br />
2. Verify that there is a /etc/dat.conf file. It should be installed by the<br />
dapl-&lt;version&gt; RPM. The file dat.conf contains a list of interface adapters<br />
supported by uDAPL service providers. In particular, it must contain<br />
mapping entries for OpenIB-cma for dapl 1.2.x and ofa-v2-ib for<br />
dapl 2.0.x, in a form similar to this (each on one line):<br />
OpenIB-cma u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2<br />
"ib0 0" ""<br />
and<br />
ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0<br />
"ib0 0" ""<br />
3. On every node, type the following command (as a root user):<br />
# modprobe rdma_ucm<br />
To ensure that the module is loaded when the driver is loaded, add<br />
RDMA_UCM_LOAD=yes to the /etc/infiniband/openib.conf file.<br />
(Note that rdma_cm is also used, but it is loaded automatically.)<br />
4. Bring up an IPoIB interface on every node, for example, ib0. See the<br />
instructions for configuring IPoIB for more details.<br />
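The file checks in steps 2 and 3 can be spot-checked with a small script. The helper below is illustrative; it takes copies of dat.conf and openib.conf as arguments and assumes the entry formats shown above:

```shell
# Verify a dat.conf copy has the expected uDAPL mappings and that an
# openib.conf copy will load rdma_ucm at driver start.
check_udapl_setup() {
  local dat=$1 openib=$2
  grep -q '^OpenIB-cma ' "$dat" || { echo "missing OpenIB-cma entry"; return 1; }
  grep -q '^ofa-v2-ib'   "$dat" || { echo "missing ofa-v2-ib entry";  return 1; }
  grep -q '^RDMA_UCM_LOAD=yes' "$openib" || { echo "RDMA_UCM_LOAD not set"; return 1; }
  echo "uDAPL prerequisites look OK"
}
```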
Intel MPI has different bin directories for 32-bit (bin) and 64-bit (bin64); 64-bit is<br />
the most commonly used.<br />
To launch MPI jobs, the Intel installation directory must be included in PATH and<br />
LD_LIBRARY_PATH.<br />
When using sh for launching MPI jobs, run the following command:<br />
$ source &lt;installdir&gt;/bin64/mpivars.sh<br />
When using csh for launching MPI jobs, run the following command:<br />
$ source &lt;installdir&gt;/bin64/mpivars.csh<br />
Substitute bin if using 32-bit.<br />
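The bin/bin64 choice can be derived from the machine architecture. A sketch (the install-directory variable in the usage comment is a placeholder):

```shell
# Select the Intel MPI bin directory matching the current architecture.
intel_bindir() {
  case "$(uname -m)" in
    x86_64) echo bin64 ;;
    *)      echo bin   ;;
  esac
}
# Usage (installdir is a placeholder for your Intel MPI root):
# source "$installdir/$(intel_bindir)/mpivars.sh"
```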
Compiling Intel MPI Applications<br />
As with <strong>QLogic</strong> MPI, <strong>QLogic</strong> recommends that you use the included wrapper<br />
scripts that invoke the underlying compiler. The default underlying compiler is<br />
GCC, including gfortran. Note that there are more compiler drivers (wrapper<br />
scripts) with Intel MPI than are listed here (see Table 5-6); check the Intel<br />
documentation for more information.<br />
Table 5-6. Intel MPI Wrapper Scripts<br />
Wrapper Script Name | Language<br />
mpicc | C<br />
mpiCC | C++<br />
mpif77 | Fortran 77<br />
mpif90 | Fortran 90<br />
mpiicc | C (uses Intel C compiler)<br />
mpiicpc | C++ (uses Intel C++ compiler)<br />
mpiifort | Fortran 77/90 (uses Intel Fortran compiler)<br />
To compile your program in C using the default compiler, type:<br />
$ mpicc mpi_app_name.c -o mpi_app_name<br />
To use the Intel compiler wrappers (mpiicc, mpiicpc, mpiifort), the Intel<br />
compilers must be installed and resolvable from the user’s environment.<br />
Running Intel MPI Applications<br />
Here is an example of a simple mpirun command running with four processes:<br />
$ mpirun -np 4 -f mpihosts mpi_app_name<br />
For more information, follow the Intel MPI instructions for usage of mpirun,<br />
mpdboot, and mpiexec (mpirun is a wrapper script that invokes both mpdboot<br />
and mpiexec). Remember to use -r ssh with mpdboot if you use ssh.<br />
Pass the following option to mpirun to select TMI:<br />
-genv I_MPI_FABRICS tmi<br />
Pass the following option to mpirun to select uDAPL:<br />
uDAPL 1.2:<br />
-genv I_MPI_DEVICE rdma:OpenIB-cma<br />
uDAPL 2.0:<br />
-genv I_MPI_DEVICE rdma:ofa-v2-ib<br />
To help with debugging, you can add this option to the Intel mpirun command:<br />
TMI:<br />
-genv TMI_DEBUG 1<br />
uDAPL:<br />
-genv I_MPI_DEBUG 2<br />
Further Information on Intel MPI<br />
For more information on using Intel MPI, see: http://www.intel.com/<br />
Improving Performance of Other MPIs Over<br />
InfiniBand Verbs<br />
Performance of MPI applications when using an MPI implementation over<br />
InfiniBand Verbs can be improved by tuning the InfiniBand MTU size.<br />
NOTE:<br />
No manual tuning is necessary for PSM-based MPIs, since the PSM layer<br />
determines the largest possible InfiniBand MTU for each source/destination<br />
path.<br />
The maximum supported MTU size of InfiniBand adapter cards is 4K.<br />
Support for 4K InfiniBand MTU requires switch support for 4K MTU. The method<br />
to set the InfiniBand MTU size varies by MPI implementation:<br />
• Open MPI defaults to the lower of either the InfiniBand MTU size or switch<br />
MTU size.<br />
• MVAPICH defaults to an InfiniBand MTU size of 1024 bytes. This can be<br />
overridden by setting an environment variable:<br />
$ export VIADEV_DEFAULT_MTU=MTU4096<br />
Valid values are MTU256, MTU512, MTU1024, MTU2048, and MTU4096. This<br />
environment variable must be set for all processes in the MPI job. To do so,<br />
set it in ~/.bashrc or pass it with /usr/bin/env.<br />
• HP-MPI over InfiniBand Verbs automatically determines the InfiniBand MTU<br />
size.<br />
• Platform (Scali) MPI defaults to an InfiniBand MTU of 1KB. This can be<br />
changed by adding a line to /opt/scali/etc/iba_params.conf, for<br />
example:<br />
mtu=2048<br />
A value of 4096 is not allowed by the Scali software (as of Scali<br />
Connect 5.6.0); in this case, a default value of 1024 bytes is used. This<br />
problem has been reported to support at Platform Inc. The largest value that<br />
can currently be used is 2048 bytes.<br />
• Intel MPI over uDAPL (which uses InfiniBand Verbs) automatically<br />
determines the InfiniBand MTU size.<br />
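A launch script can validate the VIADEV_DEFAULT_MTU string against the MVAPICH value list given above before exporting it; a minimal sketch with an illustrative function name:

```shell
# Succeed only for MTU strings MVAPICH accepts.
valid_viadev_mtu() {
  case "$1" in
    MTU256|MTU512|MTU1024|MTU2048|MTU4096) return 0 ;;
    *) return 1 ;;
  esac
}
```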
6 Performance Scaled<br />
Messaging<br />
Introduction<br />
Performance Scaled Messaging (PSM) provides support for full Virtual Fabric<br />
(vFabric) integration, allowing users to specify InfiniBand Service Level (SL) and<br />
Partition Key (PKey), or to provide a configured Service ID (SID) to target a<br />
vFabric. Support for using InfiniBand path record queries to the <strong>QLogic</strong> Fabric<br />
Manager during connection setup is also available, enabling alternative switch<br />
topologies such as Mesh/Torus. Note that this relies on the Distributed SA cache<br />
from FastFabric.<br />
All PSM enabled MPIs can leverage these capabilities transparently, but only two<br />
MPIs (<strong>QLogic</strong> MPI and OpenMPI) are configured to support it natively. Native<br />
support here means that MPI specific mpirun switches are available to<br />
activate/deactivate these features. Other MPIs will require use of environment<br />
variables to leverage these capabilities. With MPI applications, the environment<br />
variables need to be propagated across all nodes/processes and not just the node<br />
from where the job is submitted/run. The mechanisms to do this are MPI specific,<br />
but for two common MPIs the following may be helpful:<br />
• OpenMPI: Use -x ENV_VAR=ENV_VAL in the mpirun command line.<br />
Example:<br />
mpirun -np 2 -machinefile machinefile -x<br />
PSM_ENV_VAR=PSM_ENV_VAL prog prog_args<br />
• MVAPICH2: Use mpirun_rsh to perform job launch. Do not use mpiexec<br />
or mpirun. Specify the environment variable and value on the mpirun_rsh<br />
command line before the program argument.<br />
Example:<br />
mpirun_rsh -np 2 -hostfile machinefile PSM_ENV_VAR=PSM_ENV_VAL<br />
prog prog_args<br />
Some of the features available require appropriate versions of associated<br />
software and firmware for correct operation. These requirements are listed in the<br />
relevant sections.<br />
Virtual Fabric Support<br />
Virtual Fabric (vFabric) in PSM is supported with the <strong>QLogic</strong> Fabric Manager. The<br />
latest version of the <strong>QLogic</strong> Fabric Manager contains a sample qlogic_fm.xml<br />
file with pre-configured vFabrics for PSM. Sixteen unique Service IDs have been<br />
allocated for PSM-enabled MPI vFabrics to ease their testing; however, any<br />
Service ID can be used. Refer to the <strong>QLogic</strong> Fabric Manager <strong>User</strong> <strong>Guide</strong> on how<br />
to configure vFabrics.<br />
There are two ways to use vFabric with PSM. The “legacy” method requires the<br />
user to specify the appropriate SL and Pkey for the vFabric in question. For<br />
complete integration with vFabrics, users can now specify a Service ID (SID) that<br />
identifies the vFabric to be used. PSM will automatically obtain the SL and Pkey to<br />
use for the vFabric from the <strong>QLogic</strong> Fabric Manager via path record queries.<br />
Using SL and PKeys<br />
SL and Pkeys can be specified natively for OpenMPI and <strong>QLogic</strong> MPI. For other<br />
MPIs use the following list of environment variables to specify the SL and Pkey.<br />
The environment variables need to be propagated across all processes for correct<br />
operation.<br />
NOTE:<br />
This is available with OpenMPI v1.3.4rc4 and above only!<br />
• OpenMPI: Use mca parameters (mtl_psm_ib_service_level and<br />
mtl_psm_ib_pkey) to specify the SL and Pkey on the mpirun command line.<br />
Example:<br />
mpirun -np 2 -machinefile machinefile -mca<br />
mtl_psm_ib_service_level SL -mca mtl_psm_ib_pkey Pkey prog<br />
prog_args<br />
• <strong>QLogic</strong> MPI: Requires use of the IPATH_SL environment variable to specify<br />
the SL and the -p switch to mpirun for the Pkey.<br />
Example:<br />
IPATH_SL=SL mpirun -np 2 -m machinefile -p Pkey prog prog_args<br />
• Other MPIs can use the following environment variables that are propagated<br />
across all processes. This process is MPI library specific but samples on<br />
how to do this for OpenMPI and MVAPICH2 are listed in the “Introduction”<br />
on page 6-1.<br />
IPATH_SL=SL # Service Level to use (0-15)<br />
PSM_PKEY=Pkey # Pkey to use<br />
Using Service ID<br />
Full vFabric integration with PSM is available, allowing the user to specify a SID.<br />
For correct operation, PSM requires the following components to be available and<br />
configured correctly.<br />
• <strong>QLogic</strong> host Fabric Manager Configuration – PSM MPI vFabrics need to be<br />
configured and enabled correctly in the qlogic_fm.xml file. 16 unique<br />
SIDs have been allocated in the sample file.<br />
• The <strong>OFED+</strong> library needs to be installed on all nodes. This is available as part<br />
of the FastFabric tools.<br />
• The <strong>QLogic</strong> Distributed SA needs to be installed, configured, and activated on<br />
all the nodes. This is also part of the FastFabric tools. Refer to the <strong>QLogic</strong><br />
FastFabric <strong>User</strong> <strong>Guide</strong> for how to configure and activate the Distributed SA. The<br />
SIDs configured in the <strong>QLogic</strong> Fabric Manager configuration file should also<br />
be provided to the Distributed SA for correct operation.<br />
Service ID can be specified natively for OpenMPI and <strong>QLogic</strong> MPI. For other MPIs<br />
use the following list of environment variables. The environment variables need to<br />
be propagated across all processes for correct operation.<br />
• OpenMPI: Use mca parameters (mtl_psm_ib_service_id and<br />
mtl_psm_path_query) to specify the service id on the mpirun command<br />
line. Example:<br />
mpirun -np 2 -machinefile machinefile -mca mtl_psm_path_query<br />
opp -mca mtl_psm_ib_service_id SID prog prog_args<br />
• <strong>QLogic</strong> MPI: Use the -P and -S switches on the mpirun command line to<br />
specify the path record query library (always opp, for OFED Plus Path, in this<br />
release) and the Service ID to use. Example:<br />
mpirun -np 2 -m machinefile -P opp -S SID prog prog_args<br />
• Other MPIs can use the following environment variables:<br />
PSM_PATH_REC=opp # Path record query mechanism; always specify opp<br />
PSM_IB_SERVICE_ID=SID # Service ID to use<br />
SL2VL Mapping from the Fabric Manager<br />
PSM is able to use the SL2VL table as programmed by the <strong>QLogic</strong> Fabric<br />
Manager. Prior releases required manual specification of the SL2VL mapping via<br />
an environment variable.<br />
Verifying SL2VL tables on QLogic 7300 Series Adapters
iba_saquery can be used to get the SL2VL mapping for any given port; however, QLogic 7300 series adapters also export the SL2VL mapping via sysfs files. These files are used by PSM to implement the SL2VL tables automatically. The SL2VL tables are per port and are available under /sys/class/infiniband/<hca name>/ports/<port #>/sl2vl. The directory contains 16 files, numbered 0-15, one for each SL. Reading an SL file returns the VL to which that SL is mapped.
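The table can be dumped with a short shell loop. The sketch below mocks the sysfs directory so it can be run on any machine; on a real system, point dir at /sys/class/infiniband/<hca name>/ports/<port #>/sl2vl instead:

```shell
# Read an SL2VL table in the sysfs layout described above. A mock
# directory stands in for the real sysfs path so the loop is runnable
# anywhere.
dir=$(mktemp -d)/sl2vl
mkdir -p "$dir"
for sl in $(seq 0 15); do echo 0 > "$dir/$sl"; done   # mock: every SL -> VL 0
for sl in $(seq 0 15); do
    printf 'SL %2d -> VL %s\n' "$sl" "$(cat "$dir/$sl")"
done
```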
7 Dispersive Routing
InfiniBand uses deterministic routing that is keyed from the Destination LID (DLID) of a port. The Fabric Manager programs the forwarding tables in a switch to determine the egress port a packet takes based on the DLID.
Deterministic routing can create hotspots even in full bisection bandwidth (FBB) fabrics for certain communication patterns, if the communicating node pairs map onto a common upstream link based on the forwarding tables. Since routing is based on DLIDs, the InfiniBand fabric provides the ability to assign multiple LIDs to a physical port using a feature called LID Mask Control (LMC). The total number of DLIDs assigned to a physical port is 2^LMC, with the LIDs assigned sequentially. The common InfiniBand fabric uses an LMC of 0, meaning each port has one LID assigned to it. In fabrics with a non-zero LMC, there are multiple potential paths through the fabric to reach the same physical port; that is, multiple DLID entries in a port forwarding table can map to different egress ports.
Dispersive routing, as implemented in PSM, attempts to avoid the congestion hotspots described above by "spraying" messages across these paths. A congested path will not bottleneck messages flowing down the alternate paths that are not congested. The current implementation of PSM supports fabrics with a maximum LMC of 3 (8 LIDs assigned per port). This can result in a maximum of 64 possible paths between a SLID, DLID pair ([SLID, DLID], [SLID, DLID+1], [SLID, DLID+2] ... [SLID, DLID+7], [SLID+1, DLID], [SLID+1, DLID+1] ... [SLID+7, DLID+7]). Keeping state associated with this many paths requires a large amount of memory, and empirical data shows little gain in performance beyond utilizing a small set of multiple paths. Therefore, PSM reduces the number of paths actually used in the above case to 8, where the following paths are the only ones considered for transmission: [SLID, DLID], [SLID+1, DLID+1], [SLID+2, DLID+2] ... [SLID+7, DLID+7]. This makes the resource requirements manageable while providing most of the benefits of dispersive routing (congestion avoidance by utilizing multiple paths).
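The reduced path set can be sketched as follows (illustrative shell, not PSM code; the SLID and DLID values are placeholders):

```shell
# Enumerate the diagonal path set PSM actually uses for LMC=3:
# [SLID+i, DLID+i] for i in 0..7, instead of all 64 combinations.
slid=16; dlid=64; lmc=3
n=$((1 << lmc))                 # 2^LMC = 8 LIDs per port
for i in $(seq 0 $((n - 1))); do
    echo "path $i: [SLID+$i, DLID+$i] = [$((slid + i)), $((dlid + i))]"
done
```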
Internally, PSM utilizes dispersive routing differently for small and large messages. Large messages are any messages greater than or equal to 64K. For large messages, the message is split into message fragments of 128K by default (each called a window). Each of these message windows is sprayed across a distinct path between ports. All packets belonging to a window utilize the same path; however, the windows themselves can take different paths through the fabric. PSM assembles the windows that make up an MPI message before delivering it to the application. This allows limited out-of-order semantics through the fabric to be maintained with little overhead. Small messages, on the other hand, always utilize a single path when communicating with a remote node; however, different processes executing on a node can utilize different paths for their communication between the nodes. For example, consider two nodes, A and B, each with 8 processors per node, a fabric configured for an LMC of 3 (so PSM constructs 8 paths through the fabric as described above), and a 16-process MPI application that spans these nodes (8 processes per node). Then:
• Each MPI process is automatically bound to a given CPU core, numbered between 0-7. PSM does this at startup to get improved cache hit rates and other benefits.
• Small messages sent from a process on core N will use path N.
NOTE:
Only path N will be used by this process for all communications to any process on the remote node.
• For a large message, each process will utilize all of the 8 paths and spray the windowed messages across them.
The above highlights the default path selection policy that is active in PSM when running on fabrics configured with a non-zero LMC. There are three other path selection policies that determine how to select the path (or path index, from the set of available paths) used by a process when communicating with a remote node. The above path policy is called adaptive. The three remaining path policies are static policies that assign a static path on job startup for both small and large message transfers.
• Static_Src: Only one path per process is used for all remote communications. The path index is based on the CPU number on which the process is running.
NOTE:
Multiple paths are still used in the fabric if multiple processes (each on a different CPU) are communicating.
• Static_Dest: The path selection is based on the CPU index of the destination process. Multiple paths can be used if data is transferred to different remote processes within a node. If multiple processes from node A send a message to a single process on node B, only one path is used across all processes.
• Static_Base: The only path used is the base path [SLID, DLID] between nodes, regardless of the LMC of the fabric or the number of paths available. This is similar to how PSM operated until the IFS 5.1 release.
NOTE:
A fabric configured with an LMC of 0 operates as the Static_Base policy even with the default adaptive policy enabled, since only a single path exists between any pair of ports.
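The four policies can be sketched as follows (illustrative shell, not the actual PSM implementation; for the adaptive policy, only the small-message case, which is keyed on the sender's core, is shown — large messages spray across all paths as described above):

```shell
select_path() {   # usage: select_path <policy> <src_cpu> <dst_cpu> <num_paths>
    policy=$1; src=$2; dst=$3; n=$4
    case $policy in
        adaptive|static_src) echo $((src % n)) ;;  # keyed on sender's CPU core
        static_dest)         echo $((dst % n)) ;;  # keyed on destination's CPU core
        static_base)         echo 0 ;;             # always the base [SLID, DLID] path
    esac
}
select_path static_dest 3 5 8    # prints 5
select_path static_base 3 5 8    # prints 0
```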
8 gPXE
gPXE Setup
gPXE is an open source (GPL) network bootloader. It provides a direct replacement for proprietary PXE ROMs. See http://etherboot.org/wiki/index.php for documentation and general information.
At least two machines and a switch are needed (or connect the two machines back-to-back and run the QLogic Fabric Manager on the server):
• A DHCP server
• A boot server or HTTP server (can be the same as the DHCP server)
• A node to be booted
Use a QLE7340 or QLE7342 adapter for the node.
The following software is included with the QLogic OFED+ installation software package:
• gPXE boot image
• Patch for the DHCP server
• Tool to install the gPXE boot image in the EPROM of the card
• Sample gPXE script
Everything that can be done with the proprietary PXE loader over Ethernet can be done with the gPXE loader over IB. The gPXE boot code is only a mechanism to load an initial boot image onto the system. It is up to the downloaded boot image to do the rest.
For example, the boot image could be:
• A stand-alone memory test program
• A diskless kernel image that mounts its file systems via NFS. Refer to http://www.faqs.org/docs/Linux-HOWTO/Diskless-HOWTO.html
• A Linux install image like kickstart, which then installs software to the local hard drive(s). Refer to http://www.faqs.org/docs/Linux-HOWTO/KickStart-HOWTO.html
• A second stage boot loader
• A live CD Linux image
• A gPXE script
Required Steps
1. Download a copy of the gPXE image, located at:
• The executable to flash the EXPROM on the TrueScale InfiniBand adapters is located at /usr/sbin/ipath_exprom
• The gPXE driver for QLE7200 series InfiniBand adapters (the EXPROM image) is located at /usr/share/infinipath/gPXE/iba7220.rom
• The gPXE driver for QLE7300 series InfiniBand adapters (the EXPROM image) is located at /usr/share/infinipath/gPXE/iba7322.rom
2. For dhcpd to correctly assign IP addresses to the InfiniBand adapter GUID, the dhcpd on the existing DHCP server may need to be patched. This patch is provided via the gPXE rpm installation.
3. Write the ROM image to the InfiniBand adapter. This only needs to be done once per InfiniBand adapter.
ipath_exprom -e -w iba7xxx.rom
In some cases, executing the above command results in a hang. If you experience a hang, type CTRL+C to quit, then execute one flag at a time:
ipath_exprom -e iba7xxx.rom
ipath_exprom -w iba7xxx.rom
4. Enable booting from the InfiniBand adapter (gPXE device) in the BIOS.
Preparing the DHCP Server in Linux
Installing DHCP
When the boot session starts, the gPXE firmware attempts to bring up an adapter network link. If it succeeds in bringing up a connected link, the gPXE firmware communicates with the DHCP server. The DHCP server assigns an IP address to the gPXE client and provides it with the location of the boot program.
gPXE requires that the DHCP server runs on a machine that supports IP over IB.
NOTE:
Prior to installing DHCP, make sure that QLogic OFED+ is already installed on your DHCP server.
1. Download and install the latest DHCP server from www.isc.org.
Standard DHCP fields holding the MAC address are not large enough to contain an IPoIB hardware address. To overcome this problem, DHCP over InfiniBand messages convey a client identifier field used to identify the DHCP session. This client identifier field can be used to associate an IP address with a client identifier value, such that the DHCP server will grant the same IP address to any client that conveys this client identifier.
2. Unpack the latest downloaded DHCP server.
tar zxf dhcp-release.tar.gz
3. Uncomment the line /* #define USE_SOCKETS */ in dhcp-release/includes/site.h
4. Change to the main directory.
cd dhcp-release
NOTE:
If there is an older version of DHCP installed, save it before continuing with the following steps.
5. Configure the source.
./configure
6. When the configuration of DHCP is finished, build the DHCP server.
make
7. When DHCP has successfully finished building, install DHCP.
make install
Configuring DHCP
1. From the client host, find the GUID of the Host Channel Adapter by using p1info, or look at the GUID label on the InfiniBand adapter.
2. Turn the GUID into a MAC address, and specify the port of the InfiniBand adapter that is going to be used at the end, using b0 for port 0 or b1 for port 1.
For example, for a GUID that reads 0x00117500005a6eec, the MAC address would read 00:11:75:00:00:5a:6e:ec:b0.
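This conversion is mechanical and can be sketched in shell (an illustrative helper written for this guide, not a QLogic tool):

```shell
# Insert a colon after every two hex digits of the GUID, then append the
# port suffix (b0 for port 0, b1 for port 1).
guid=0x00117500005a6eec
port=0
mac="$(echo "${guid#0x}" | sed 's/../&:/g')b${port}"
echo "$mac"    # 00:11:75:00:00:5a:6e:ec:b0
```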
3. Add the MAC address to the DHCP server.
The following is a sample /etc/dhcpd.conf file that specifies the Host Channel Adapter GUID for the hardware address:
#
# DHCP Server Configuration file.
# see /usr/share/doc/dhcp*/dhcpd.conf.sample
#
ddns-update-style none;
subnet 10.252.252.0 netmask 255.255.255.0 {
    option subnet-mask 255.255.255.0;
    range dynamic-bootp 10.252.252.100 10.252.252.109;
    host hl5-0 {
        hardware unknown-32 00:11:75:00:00:7e:c1:b0;
        option host-name "hl5";
    }
    host hl5-1 {
        hardware unknown-32 00:11:75:00:00:7e:c1:b1;
        option host-name "hl5";
    }
}
filename "http://10.252.252.1/images/uniboot/uniboot.php";
In this example, host hl5 has a dual port InfiniBand adapter. hl5-0 corresponds to port 0, and hl5-1 corresponds to port 1 on the adapter.
4. Restart the DHCP server.
Netbooting Over InfiniBand
The following procedures are an example of netbooting over InfiniBand, using an HTTP boot server.
Prerequisites
• The required steps from above have been executed.
• The BIOS has been configured to enable booting from the InfiniBand adapter. The gPXE InfiniBand device should be listed as the first boot device.
• An Apache server has been configured with PHP on your network, and is configured to serve pages out of /vault.
• It is understood in this example that users would have their own tools and files for diskless booting with an HTTP boot server.
Boot Server Setup
NOTE:
The dhcpd and Apache configuration files referenced in this example are included as examples, and are not part of the QLogic OFED+ installed software. Your site boot servers may be different; see their documentation for equivalent information.
Instructions on installing and configuring a DHCP server or a boot server are beyond the scope of this document.
Configure the boot server for your site.
NOTE:
gPXE supports several file transfer methods, such as TFTP, HTTP, and iSCSI. This example uses HTTP, since it generally scales better and is the preferred choice.
NOTE:
This step involves setting up an HTTP server, and needs to be done by a user who understands setup of the HTTP server being used.
1. Install Apache.
2. Create an images.conf file and a kernels.conf file and place them in the /etc/httpd/conf.d directory. These files set up the following aliases and tell Apache where to find them:
/images — http://10.252.252.1/images/
/kernels — http://10.252.252.1/kernels/
The following is an example of the images.conf file:
Alias /images /vault/images
<Directory /vault/images>
    AllowOverride All
    Options Indexes FollowSymLinks
    Order allow,deny
    Allow from all
</Directory>
The following is an example of the kernels.conf file:
Alias /kernels /boot
<Directory /boot>
    AllowOverride None
    Order allow,deny
    Allow from all
</Directory>
3. Make a uniboot directory:
mkdir -p /vault/images/uniboot
4. Create an initrd.img file.
Prerequisites
• "gPXE Setup" on page 8-1 has been completed.
• "Preparing the DHCP Server in Linux" on page 8-2 has been completed.
To add an InfiniBand driver into the initrd file, the InfiniBand modules need to be copied to the diskless image. The host machine needs to be pre-installed with the QLogic OFED+ Host Software that is appropriate for the kernel version the diskless image will run. The QLogic OFED+ Host Software is available for download from http://driverdownloads.qlogic.com/QLogicDriverDownloads_UI/default.aspx
NOTE:
The remainder of this section assumes that QLogic OFED+ has been installed on the host machine.
WARNING!!
The following procedure modifies critical files used in the boot procedure. It must be executed by users with expertise in the boot process. Improper application of this procedure may prevent the diskless machine from booting.
a. If the /vault/images/initrd.img file is already present on the server machine, back it up. For example:
cp -a /vault/images/initrd.img /vault/images/initrd.img.bak
b. The infinipath rpm will install the file /usr/share/infinipath/gPXE/gpxe-qib-modify-initrd with contents similar to the following example. You can either run the script to generate a new initrd image, or use it as an example and customize it as appropriate for your site.
# This assumes you will use the currently running version of linux, and
# that you are starting from a fully configured machine of the same type
# (hardware configuration), and BIOS settings.
#
# start with a known path, to get the system commands
PATH=/sbin:/usr/sbin:/bin:/usr/bin:$PATH
# start from a copy of the current initrd image
mkdir -p /var/tmp/initrd-ib
cd /var/tmp/initrd-ib
kern=$(uname -r)
if [ -e /boot/initrd-${kern}.img ]; then
initrd=/boot/initrd-${kern}.img
elif [ -e /boot/initrd ]; then
initrd=/boot/initrd
else
echo Unable to locate correct initrd, fix script and re-run
exit 1
fi
cp ${initrd} initrd-ib-${kern}.img
# Get full original listing
gunzip -dc initrd-ib-${kern}.img | cpio -it --quiet | grep -v '^\.$' | sort -o Orig-listing
# start building modified image
rm -rf new # for retries
mkdir new
cd new
# extract previous contents
gunzip -dc ../initrd-ib-${kern}.img | cpio --quiet -id
# add infiniband modules
mkdir -p lib/ib
find /lib/modules/${kern}/updates -type f | \
egrep '(iw_cm|ib_(mad|addr|core|sa|cm|uverbs|ucm|umad|ipoib|qib).ko|rdma_|ipoib_helper)' | \
xargs -I '{}' cp -a '{}' lib/ib
# Some distros have ipoib_helper, others don't require it
if [ -e lib/ib/ipoib_helper.ko ]; then
helper_cmd='/sbin/insmod /lib/ib/ipoib_helper.ko'
fi
# On some kernels, the qib driver will require the dca module
if modinfo -F depends ib_qib | grep -q dca; then
cp $(find /lib/modules/$(uname -r) -name dca.ko) lib/ib
dcacmd='/sbin/insmod /lib/ib/dca.ko'
else
dcacmd=
fi
# IB requires loading an IPv6 module. If you do not have it in your initrd, add it
if grep -q ipv6 ../Orig-listing; then
# already added, and presumably insmod'ed, along with any dependencies
v6cmd=
else
echo -e 'Adding IPv6 and related modules\n'
cp /lib/modules/${kern}/kernel/net/ipv6/ipv6.ko lib
IFS=' ' v6cmd='echo "Loading IPV6"
/sbin/insmod /lib/ipv6.ko'
# Some versions of IPv6 have dependencies, add them.
xfrm=$(modinfo -F depends ipv6)
if [ ${xfrm} ]; then
cp $(find /lib/modules/$(uname -r) -name ${xfrm}.ko) lib
IFS=' ' v6cmd='/sbin/insmod /lib/'${xfrm}'.ko
'"$v6cmd"
crypto=$(modinfo -F depends $xfrm)
if [ ${crypto} ]; then
cp $(find /lib/modules/$(uname -r) -name ${crypto}.ko) lib
IFS=' ' v6cmd='/sbin/insmod /lib/'${crypto}'.ko
'"$v6cmd"
fi
fi
fi
# we need insmod to load the modules; if not present, copy it
mkdir -p sbin
grep -q insmod ../Orig-listing || cp /sbin/insmod sbin
echo -e 'NOTE: you will need to config ib0 in the normal way in your booted root
filesystem, in order to use it for NFS, etc.\n'
# Now build the commands to load the additional modules. We add them just after
# the last existing insmod command, so all other dependences will be resolved
# You can change the location if desired or necessary.
# loading order is important. You can verify the order works ahead of time
# by running "/etc/init.d/openibd stop", and then running these commands
# manually by cut and paste
# This will work on SLES, although different than the standard mechanism
cat > ../init-cmds <<EOF
# Start of IB module block
$v6cmd
echo "loading IB modules"
/sbin/insmod /lib/ib/ib_addr.ko
/sbin/insmod /lib/ib/ib_core.ko
/sbin/insmod /lib/ib/ib_mad.ko
/sbin/insmod /lib/ib/ib_sa.ko
/sbin/insmod /lib/ib/ib_cm.ko
/sbin/insmod /lib/ib/ib_uverbs.ko
/sbin/insmod /lib/ib/ib_ucm.ko
/sbin/insmod /lib/ib/ib_umad.ko
/sbin/insmod /lib/ib/iw_cm.ko
/sbin/insmod /lib/ib/rdma_cm.ko
/sbin/insmod /lib/ib/rdma_ucm.ko
$dcacmd
/sbin/insmod /lib/ib/ib_qib.ko
$helper_cmd
/sbin/insmod /lib/ib/ib_ipoib.ko
echo "finished loading IB modules"
# End of IB module block
EOF
# first get line number where we append (after last insmod if any, otherwise
# at start)
line=$(egrep -n insmod init | sed -n '$s/:.*//p')
if [ ! "${line}" ]; then line=1; fi
sed -e "${line}r ../init-cmds" init > init.new
# show the difference, then rename
echo -e 'Differences between original and new init command script\n'
diff init init.new
mv init.new init
chmod 700 init
# now rebuild the initrd image
find . | cpio --quiet -H newc -o | gzip > ../initrd-${kern}.img
cd ..
# get the file list in the new image
gunzip -dc initrd-${kern}.img | cpio --quiet -it | grep -v '^\.$' | sort -o New-listing
# and show the differences.
echo -e '\nChanges in files in initrd image\n'
diff Orig-listing New-listing
# copy the new initrd to wherever you have configured the dhcp server to look
# for it (here we assume it's /images)
mkdir -p /images
cp initrd-${kern}.img /images
echo -e '\nCompleted initrd for IB'
ls -l /images/initrd-${kern}.img
c. Run the /usr/share/infinipath/gPXE/gpxe-qib-modify-initrd script to create the initrd.img file.
At this stage, the initrd.img file is ready and located where the DHCP server was configured to look for it.
5. Create a uniboot.php file and save it to /vault/images/uniboot.
NOTE:
The uniboot.php file generates a gPXE script that will attempt to boot from the /boot/vmlinuz-2.6.18-128.el5 kernel. If you want to boot from a different kernel, edit uniboot.php with the appropriate kernel string in the $kver variable.
The following is an example of a uniboot.php file:
<?php
header ( 'Content-type: text/plain' );
function strleft ( $s1, $s2 ) {
    return substr ( $s1, 0, strpos ( $s1, $s2 ) );
}
function baseURL() {
    $s = empty ( $_SERVER["HTTPS"] ) ? '' :
        ( ( $_SERVER["HTTPS"] == "on" ) ? "s" : "" );
    $protocol = strleft ( strtolower ( $_SERVER["SERVER_PROTOCOL"] ), "/" ).$s;
    $port = ( $_SERVER["SERVER_PORT"] == "80" ) ? "" : ( ":".$_SERVER["SERVER_PORT"] );
    return $protocol."://".$_SERVER['SERVER_NAME'].$port;
}
$baseurl = baseURL();
$selfurl = $baseurl.$_SERVER['REQUEST_URI'];
$dirurl = $baseurl.( dirname ( $_SERVER['SCRIPT_NAME'] ) );
$kver = "2.6.18-164.11.1.el5";
echo
This is the kernel that will boot. This file can be copied from any machine that has RHEL5.3 installed.
2. Start httpd.
Steps on the gPXE Client
1. Ensure that the Host Channel Adapter is listed as the first bootable device in the BIOS.
2. Reboot the test node(s) and enter the BIOS boot setup.
This is highly dependent on the BIOS for the system, but you should see a menu for boot options and a submenu for boot devices. Select gPXE IB as the first boot device.
When you power on the system or press the reset button, the system will execute the boot code on the Host Channel Adapter, which will query the DHCP server for the IP address and the boot image to download. Once the boot image is downloaded, the BIOS/Host Channel Adapter is finished and the boot image is ready.
3. Verify that the system boots from the kernel image on the boot server. The best way to do this is to boot into a different kernel from the one installed on the hard drive on the client, or to unplug the hard drive on the client and verify that, on boot up, a kernel and file system exist.
HTTP Boot Setup
gPXE supports booting diskless machines. To enable use of an IB driver, the (remote) kernel or initrd image must include and be configured to load that driver. This can be achieved either by compiling the Host Channel Adapter driver into the kernel, or by adding the device driver module into the initrd image and loading it.
1. Make a new directory.
mkdir /vault/images/uniboot
2. Change directories.
cd /vault/images/uniboot
3. Create an initrd.img file using the information and example in Step 4 of Boot Server Setup.
4. Create a uniboot.php file using the example in Step 5 of Boot Server Setup.
5. Create an images.conf file and a kernels.conf file using the examples in Step 2 of Boot Server Setup, and place them in the /etc/httpd/conf.d directory.
6. Edit the /etc/dhcpd.conf file to boot the clients using HTTP:
filename "http://172.26.32.9/images/uniboot/uniboot.php";
7. Restart the DHCP server.
8. Start HTTP if it is not already running:
/etc/init.d/httpd start
A mpirun Options Summary
This section summarizes the most commonly used options to mpirun. See the mpirun(1) man page for a complete listing.
Job Start Options
-mpd
This option is used after running mpdboot to start a daemon, rather than using the default ssh protocol to start jobs. See the mpdboot(1) man page for more information. None of the other mpirun options (with the exception of -h) are valid when using this option.
-ssh
This option uses the ssh program to start jobs, either directly or through distributed startup. This is the default.
Essential Options
-H, -hosts hostlist
When this option is used, the list of possible hosts to run on is taken from the specified hostlist, which has precedence over the -machinefile option. The hostlist can be comma-delimited, or quoted as a space-delimited list. The hostlist specification allows a compressed representation; for example, host-[01-02,04,06-08] is equivalent to:
host-01,host-02,host-04,host-06,host-07,host-08
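The expansion of the compressed form can be sketched as follows (an illustrative shell helper, not part of mpirun):

```shell
# Expand "host-[01-02,04,06-08]" by walking the comma-separated ranges;
# seq -w keeps the zero padding of the range endpoints.
prefix=host-
hosts=
for part in 01-02 04 06-08; do
    case $part in
        *-*) for i in $(seq -w "${part%-*}" "${part#*-}"); do
                 hosts="$hosts,$prefix$i"
             done ;;
        *)   hosts="$hosts,$prefix$part" ;;
    esac
done
echo "${hosts#,}"    # host-01,host-02,host-04,host-06,host-07,host-08
```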
If the -np count is unspecified, it is adjusted to the number of hosts in the hostlist. If the -ppn count is specified, each host receives that many processes.
-machinefile filename, -m filename
This option specifies the machines (mpihosts) file that contains the list of hosts to be used for this job. The default is $MPIHOSTS, then ./mpihosts, and finally ~/.mpihosts.
-nonmpi
This option runs a non-MPI program, and is required if the node program makes no MPI calls. This option allows non-QLogic MPI applications to use mpirun's parallel spawning mechanism.
-np np
This option specifies the number of processes to spawn. If this option is not set, then the environment variable MPI_NPROCS is checked. If MPI_NPROCS is not set, the default is to determine the number of processes based on the number of hosts in the machinefile (-m) or the list of hosts (-H).
-ppn processes-per-node
This option creates up to the specified number of processes per node.
By default, a limit is enforced that depends on how many InfiniPath contexts are supported by the node (depending on the hardware type and the number of InfiniPath cards present).
InfiniPath context (port) sharing is supported, beginning with the InfiniPath 2.0 release. This feature allows running up to four times as many processes per node as was previously possible, with a small additional overhead for each shared context.
Context sharing is enabled automatically if needed, and use of the full number of available contexts is assumed. To restrict the number of contexts, use the environment variable PSM_SHAREDCONTEXTS_MAX to divide the available number of contexts.
Context sharing behavior can be overridden by using the environment variable PSM_SHAREDCONTEXTS. Setting this variable to zero disables context sharing, and jobs that require more than the available number of contexts cannot be run. Setting this variable to one (the default) causes context sharing to be enabled if needed.
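For example, to disable context sharing for a job, set the variable in the environment that mpirun propagates to the ranks (the subsequent job launch line is omitted here; invoke mpirun as usual):

```shell
# Disable context sharing for this job; jobs needing more contexts than
# the node provides will then fail to start rather than share contexts.
export PSM_SHAREDCONTEXTS=0
echo "PSM_SHAREDCONTEXTS=$PSM_SHAREDCONTEXTS"
```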
-rcfile node-shell-script
This is the startup script for setting the environment on nodes. Before starting node programs, mpirun checks to see if a file called .mpirunrc exists in the user's home directory. If the file exists, it is sourced into the running remote shell. Use -rcfile node-shell-script or .mpirunrc to set paths and other environment variables such as LD_LIBRARY_PATH.
Default: $HOME/.mpirunrc
Spawn Options
-distributed [=on|off]
This option controls use of the distributed mpirun job spawning mechanism. The default is on. To change the default, put this option in the global mpirun.defaults file or a user-local file (see the environment variable PSC_MPIRUN_DEFAULTS_PATH for details). When the option appears more than once on the command line, the last setting controls the behavior.
Default: on
Quiescence Options
-disable-mpi-progress-check
This option disables the MPI communication progress check without disabling the ping reply check. If quiescence or a lack of ping reply is detected, the job and all compute processes are terminated.
-i seconds, -ping-interval seconds
This option specifies the number of seconds to wait between ping packets to mpirun (if -q > 0).
Default: 60
-q seconds, -quiescence-timeout seconds
This option specifies the wait time (in seconds) for quiescence (absence of MPI communication or lack of ping reply) on the nodes. It is useful for detecting deadlocks. A value of zero disables quiescence detection.
Default: 900
Verbosity Options
-job-info
This option prints brief job startup and shutdown timing information.
-no-syslog
When this option is specified, critical errors are not sent through syslog. By default, critical errors are sent to the console and through syslog.
-V, -verbose
This option prints diagnostic messages from mpirun itself. The verbose option is useful in troubleshooting. Verbosity will also list the IPATH_* and PSM_* environment variable settings that affect MPI operation.
Startup Options
-I, -open-timeout seconds
This option keeps trying for the specified number of seconds to open the InfiniPath device. If seconds is -1 (negative one), the node program waits indefinitely. Use this option to avoid having all queued jobs in a batch queue fail when a node fails for some reason, or is taken down for administrative purposes. The -t option is also normally set to -1.
-k, -kill-timeout seconds<br />
This option indicates the time to wait for other ranks after the first rank exits.<br />
Default: 60<br />
-listen-addr hostname|IPv4<br />
This option specifies the hostname (or IPv4 address) to listen on for incoming<br />
socket connections. It is useful when the mpirun front-end host is multihomed. By<br />
default, mpirun assumes that ranks can independently resolve the hostname<br />
obtained on the head node with gethostname(2). To change the default, put<br />
this option in the global mpirun.defaults file or a user-local file.<br />
-runscript<br />
This is the script used to run the node program.<br />
-t, -timeout seconds<br />
This option waits for specified time (in seconds) for each node to establish<br />
connection back to mpirun. If seconds is -1 (negative one), mpirun will wait<br />
indefinitely.<br />
Default: 60<br />
Stats Options<br />
-M [=stats_types], -print-stats [=stats_types]<br />
Statistics include minimum, maximum, and median values for message<br />
transmission protocols as well as more detailed information for expected and<br />
unexpected message reception. If the option is provided without an argument,<br />
stats_types is assumed to be mpi.<br />
The following stats_types can be specified:<br />
mpi: Shows an MPI-level summary (expected, unexpected messages)<br />
ipath: Shows a summary of InfiniPath interconnect communication<br />
p2p: Shows detailed per-MPI rank communication information<br />
counters: Shows low-level InfiniPath device counters<br />
devstats: Shows InfiniPath driver statistics<br />
all: Shows statistics for all stats_types<br />
One or more statistics types can be specified by separating them with a comma.<br />
For example, -print-stats=ipath,counters displays InfiniPath<br />
communication protocol as well as low-level device counter statistics. For details,<br />
see “MPI Stats” on page F-30.<br />
-statsfile file-prefix<br />
This option specifies an alternate file to receive the output from the<br />
-print-stats option.<br />
Default: stderr<br />
-statsmode absolute|diffs<br />
When printing process statistics with the -print-stats option, this option<br />
specifies whether the printed statistics are the absolute values of the <strong>QLogic</strong> adapter<br />
chip counters and registers, or the differences between those values at the<br />
start and end of the process.<br />
Default mode: diffs<br />
Tuning Options<br />
-L, -long-len length<br />
This option determines the length of the message used by the rendezvous<br />
protocol. The InfiniPath rendezvous messaging protocol uses two-way handshake<br />
(with MPI synchronous send semantics) and receive-side DMA.<br />
Default: 64000<br />
-N, -num-send-bufs buffer-count<br />
<strong>QLogic</strong> MPI uses the specified number as the number of packets that can be sent<br />
without having to wait for an acknowledgement from the receiver. Each packet<br />
contains approximately 2048 bytes of user data.<br />
Default: 512<br />
-s, -long-len-shmem length<br />
This option specifies the length of the message used by the rendezvous protocol<br />
for intra-node communications. The InfiniPath rendezvous messaging protocol<br />
uses two-way handshake (with MPI synchronous send semantics) and<br />
receive-side DMA.<br />
Default: 16000<br />
-W, -rndv-window-size length<br />
When sending a large message using the rendezvous protocol, <strong>QLogic</strong> MPI splits<br />
the message into a number of fragments at the source and recombines them at<br />
the destination. Each fragment is sent as a single rendezvous stage. This option<br />
specifies the maximum length of each fragment.<br />
Default: 262144 bytes<br />
Shell Options<br />
-shell shell-name<br />
This option specifies the name of the program to use to log into remote hosts.<br />
Default: ssh, unless $MPI_SHELL is defined.<br />
-shellx shell-name<br />
This option specifies the name of the program to use to log into remote hosts with X11<br />
forwarding. This option is useful when running with -debug or in xterm.<br />
Default: ssh, unless $MPI_SHELL_X is defined.<br />
Debug Options<br />
-debug<br />
This option starts all the processes under a debugger, and waits for the user to set<br />
breakpoints and run the program. gdb is used by default, but can be<br />
overridden using the -debugger argument. Other supported debuggers are<br />
strace and the <strong>QLogic</strong> debugger pathdb.<br />
-debug-no-pause<br />
This option is similar to -debug, except that it does not pause at the beginning.<br />
The gdb option is used by default.<br />
-debugger gdb|pathdb|strace<br />
This option uses the specified debugger instead of the default gdb.<br />
-display X-server<br />
This option uses the specified X server for invoking remote xterms. (-debug,<br />
-debug-no-pause, and -in-xterm options use this value.)<br />
Default: whatever is set in $DISPLAY<br />
-in-xterm<br />
This option runs each process in an xterm window. This is implied when -debug<br />
or -debug-no-pause is used.<br />
Default: write to stdout with no stdin<br />
-psc-debug-level mask<br />
This option controls the verbosity of messages printed by the MPI and InfiniPath<br />
protocol layer. The default is 1, which displays error messages. A value of 3 displays<br />
short messaging details such as source, destination, size, etc. A value of FFh prints<br />
detailed information in a messaging layer for each message. Use this option with care,<br />
since too much verbosity will negatively affect application performance.<br />
Default: 1<br />
-xterm xterm<br />
This option specifies the xterm to use.<br />
Default: xterm<br />
Format Options<br />
-l, -label-output<br />
This option labels each line of output on stdout and stderr with the rank of the<br />
MPI process that produces the output.<br />
-y, -labelstyle string<br />
This option specifies the label that is prefixed to error messages and statistics.<br />
Process rank is the default prefix. The label that is prefixed to each message can<br />
be specified as one of the following:<br />
%n <strong>Host</strong>name on which the node process executes<br />
%r Rank of the node process<br />
%p Process ID of the node process<br />
%L InfiniPath local identifier (LID) of the node<br />
%P InfiniPath port of the node process<br />
%l Local rank of the node process within a node<br />
%% Percent sign<br />
Other Options<br />
-h, -help<br />
This option prints a summary of mpirun options, then exits.<br />
-stdin filename<br />
This option specifies the filename that must be fed as stdin to the node program.<br />
Default: /dev/null<br />
-stdin-target 0..np-1 | -1<br />
This option specifies the process rank that must receive the file specified with the<br />
-stdin option. Negative one (-1) means all ranks.<br />
Default: -1<br />
-v, -version<br />
This option prints the mpirun version, then exits.<br />
-wdir path-to-working_dir<br />
This option sets the working directory for the node program.<br />
Default: -wdir current-working-dir<br />
B<br />
Benchmark Programs<br />
Several MPI performance measurement programs are installed from the<br />
mpi-benchmark RPM. This appendix describes these benchmarks and how to<br />
run them. These programs are based on code from the group of Dr. Dhabaleswar<br />
K. Panda at the Network-Based Computing Laboratory at the Ohio State<br />
University. For more information, see: http://mvapich.cse.ohio-state.edu/<br />
These programs allow you to measure the MPI latency and bandwidth between<br />
two or more nodes in your cluster. Both the executables and the source for those<br />
executables are shipped. The executables are installed by default under<br />
/usr/mpi/qlogic/bin (or under /usr/bin in a non-default installation).<br />
The remainder of this chapter assumes that <strong>QLogic</strong> MPI<br />
was installed in the default location of /usr/mpi/qlogic and that mpi-selector<br />
is used to choose the MPI to be used. The source is installed under<br />
/usr/mpi/qlogic/share/mpich/examples/performance.<br />
The following examples are intended to show only the syntax for invoking these<br />
programs and the meaning of the output. They are not representations of actual<br />
TrueScale performance characteristics.<br />
Benchmark 1: Measuring MPI Latency Between<br />
Two Nodes<br />
In the MPI community, latency for a message of given size is the time difference<br />
between a node program’s calling MPI_Send and the time that the corresponding<br />
MPI_Recv in the receiving node program returns. The term latency, alone without<br />
a qualifying message size, indicates the latency for a message of size zero. This<br />
latency represents the minimum overhead for sending messages, due to both<br />
software overhead and delays in the electronics of the fabric. To simplify the<br />
timing measurement, latencies are usually measured with a ping-pong method,<br />
timing a round-trip and dividing by two.<br />
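The round-trip halving can be sketched in a few lines of Python. This is purely illustrative (the function name and warmup handling are assumptions for this sketch, not part of the benchmark suite):<br />

```python
def one_way_latency_us(round_trips_us, warmup):
    """Estimate one-way latency from ping-pong round-trip timings.

    round_trips_us: per-exchange round-trip times in microseconds.
    warmup: leading exchanges to discard so caches are filled.
    """
    timed = round_trips_us[warmup:]
    avg_round_trip = sum(timed) / len(timed)
    # One-way latency is half the averaged round trip.
    return avg_round_trip / 2.0

# Two warmup exchanges, then ten timed exchanges of 2.12 us each
samples = [3.0, 2.4] + [2.12] * 10
print(round(one_way_latency_us(samples, warmup=2), 2))  # 1.06
```
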
The program osu_latency, from Ohio State University, measures the latency for<br />
a range of message sizes from 0 to 4 megabytes. It uses a ping-pong method,<br />
where the rank zero process initiates a series of sends and the rank one process<br />
echoes them back, using the blocking MPI send and receive calls for all<br />
operations. Half the time interval observed by the rank zero process for each<br />
exchange is a measure of the latency for messages of that size, as previously<br />
defined. The program uses a loop, executing many such exchanges for each<br />
message size, to get an average. The program defers the timing until the<br />
message has been sent and received a number of times, to be sure that all the<br />
caches in the pipeline have been filled.<br />
This benchmark always involves two node programs. It can be run with the<br />
command:<br />
$ mpirun -H host1,host2 osu_latency<br />
-H (or --hosts) allows the specification of the host list on the command line<br />
instead of using a host file (with the -m or -machinefile option). Since only two<br />
hosts are listed, this implies that two host programs will be started (as if -np 2<br />
were specified). The output of the program looks like:<br />
# OSU MPI Latency Test (Version 2.0)<br />
# Size Latency (us)<br />
0 1.06<br />
1 1.06<br />
2 1.06<br />
4 1.05<br />
8 1.05<br />
16 1.30<br />
32 1.33<br />
64 1.30<br />
128 1.36<br />
256 1.51<br />
512 1.84<br />
1024 2.47<br />
2048 3.79<br />
4096 4.99<br />
8192 7.28<br />
16384 11.75<br />
32768 20.57<br />
65536 58.28<br />
131072 98.59<br />
262144 164.68<br />
524288 299.08<br />
1048576 567.60<br />
2097152 1104.50<br />
4194304 2178.66<br />
The first column displays the message size in bytes. The second column displays<br />
the average (one-way) latency in microseconds. This example shows the syntax of<br />
the command and the format of the output, and is not meant to represent actual<br />
values that might be obtained on any particular TrueScale installation.<br />
Benchmark 2: Measuring MPI Bandwidth<br />
Between Two Nodes<br />
The osu_bw benchmark measures the maximum rate that you can pump data<br />
between two nodes. This benchmark also uses a ping-pong mechanism, similar to<br />
the osu_latency code, except in this case, the originator of the messages<br />
pumps a number of them (64 in the installed version) in succession using the<br />
non-blocking MPI_Isend function, while the receiving node consumes them as<br />
quickly as it can using the non-blocking MPI_Irecv function, and then returns a<br />
zero-length acknowledgement when all of the sent data has been received.<br />
You can run this program by typing:<br />
$ mpirun -H host1,host2 osu_bw<br />
Typical output might look like:<br />
# OSU MPI Bandwidth Test (Version 2.0)<br />
# Size Bandwidth (MB/s)<br />
1 3.549325<br />
2 7.110873<br />
4 14.253841<br />
8 28.537989<br />
16 42.613030<br />
32 81.144290<br />
64 177.331433<br />
128 348.122982<br />
256 643.742171<br />
512 1055.355552<br />
1024 1566.702234<br />
2048 1807.872057<br />
4096 1865.128035<br />
8192 1891.649180<br />
16384 1898.205188<br />
32768 1888.039542<br />
65536 1931.339589<br />
131072 1942.417733<br />
262144 1950.374843<br />
524288 1954.286981<br />
1048576 1956.301287<br />
2097152 1957.351171<br />
4194304 1957.810999<br />
The measured bandwidth increases with message size because the<br />
latency’s contribution to the measured time interval becomes relatively smaller.<br />
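That relationship can be sketched with a simple fixed-overhead model. The 1.06 µs latency and 1958 MB/s peak below are taken from the sample outputs above and are illustrative only, not QLogic-specified constants:<br />

```python
def modeled_bandwidth_mb_s(size_bytes, latency_us=1.06, peak_mb_s=1958.0):
    """Simple model: transfer time = fixed latency + size / peak rate.

    With 1 MB/s equal to one byte per microsecond, bandwidth in MB/s
    is simply bytes divided by microseconds.
    """
    time_us = latency_us + size_bytes / peak_mb_s
    return size_bytes / time_us

# Small messages are latency-dominated; large ones approach the peak.
for size in (1024, 65536, 4194304):
    print(size, round(modeled_bandwidth_mb_s(size), 1))
```
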
Benchmark 3: Messaging Rate Microbenchmarks<br />
mpi_multibw is the microbenchmark that highlights <strong>QLogic</strong>’s messaging rate<br />
results. This benchmark is a modified form of the OSU Network-Based Computing<br />
Lab’s osu_bw benchmark (as shown in the previous example). It has been<br />
enhanced with the following additional functionality:<br />
• The messaging rate and the bandwidth are reported.<br />
• N/2 is dynamically calculated at the end of the run.<br />
• You can run multiple processes per node and see aggregate bandwidth and<br />
messaging rates.<br />
The benchmark has been updated with code to dynamically determine what<br />
processes are on which host. Here is an example output when running<br />
mpi_multibw:<br />
$ mpirun -np 16 -ppn 8 -H host1,host2 ./mpi_multibw<br />
This will run on eight processes per node. Typical output might look like:<br />
# PathScale Modified OSU MPI Bandwidth Test<br />
(OSU Version 2.2, PathScale $<strong>Rev</strong>ision$)<br />
# Running on 8 procs per node (uni-directional traffic for each<br />
process pair)<br />
# Size Aggregate Bandwidth (MB/s) Messages/s<br />
1 26.890668 26890667.530474<br />
2 53.692685 26846342.327320<br />
4 107.662814 26915703.518342<br />
8 214.526573 26815821.579971<br />
16 88.356173 5522260.840754<br />
32 168.514373 5266074.141949<br />
64 503.086611 7860728.303972<br />
128 921.257051 7197320.710406<br />
256 1588.793989 6206226.519112<br />
512 1716.731626 3352991.457783<br />
1024 1872.073401 1828196.680564<br />
2048 1928.774223 941784.288727<br />
4096 1928.763048 470889.416123<br />
8192 1921.127830 234512.674597<br />
16384 1919.122008 117133.911629<br />
32768 1898.415975 57935.057817<br />
65536 1953.063214 29801.379615<br />
131072 1956.731895 14928.679615<br />
262144 1957.544289 7467.438845<br />
524288 1957.952782 3734.498562<br />
1048576 1958.235791 1867.519179<br />
2097152 1958.333161 933.806019<br />
4194304 1958.400649 466.919100<br />
Searching for N/2 bandwidth. Maximum Bandwidth of 1958.400649<br />
MB/s...<br />
Found N/2 bandwidth of 992.943275 MB/s at size 153 bytes<br />
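The reported N/2 size (the message size at which half the maximum bandwidth is first reached) can be estimated by linear interpolation between the two samples that bracket half the maximum. The helper below is a sketch of the idea with abbreviated sample values; mpi_multibw's exact method may differ:<br />

```python
def n_half_size(samples):
    """Interpolate the message size where bandwidth reaches half its max.

    samples: (size_bytes, bandwidth_mb_s) pairs in increasing size order.
    """
    half = max(bw for _, bw in samples) / 2.0
    for (s0, b0), (s1, b1) in zip(samples, samples[1:]):
        if b0 < half <= b1:
            # Linear interpolation between the bracketing samples.
            frac = (half - b0) / (b1 - b0)
            return s0 + frac * (s1 - s0)
    return None

# Bandwidth samples shaped like the output above (values abbreviated)
samples = [(64, 503.1), (128, 921.3), (256, 1588.8), (512, 1716.7),
           (1024, 1872.1), (2048, 1928.8)]
print(round(n_half_size(samples)))  # falls between 128 and 256 bytes
```
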
Benchmark 4: Measuring MPI Latency in <strong>Host</strong><br />
Rings<br />
The program mpi_latency measures latency in a ring of hosts. Its syntax differs<br />
from Benchmark 1 in that it takes command-line arguments that let you<br />
specify the message size and the number of messages over which to average. For<br />
example, running on the nodes host1, host2, host3, and host4, the command:<br />
$ mpirun -np 4 -H host[1-4] mpi_latency 100 0<br />
Might produce output like this:<br />
0 1.760125<br />
This output indicates that it took an average of 1.76 microseconds per hop to send<br />
a zero-length message from the first host, to the second, to the third, to the fourth,<br />
and then receive replies back in the other direction.<br />
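The per-hop arithmetic behind that number can be sketched as follows (illustrative only; mpi_latency's internal accounting may differ in detail):<br />

```python
def per_hop_latency_us(total_time_us, num_hosts, num_messages):
    """Average one-way latency per hop around a ring of hosts.

    Each message travels num_hosts - 1 hops out and the same number
    back, so divide the total time by 2 * (num_hosts - 1) hops per
    message and by the number of messages.
    """
    hops_per_message = 2 * (num_hosts - 1)
    return total_time_us / (num_messages * hops_per_message)

# 100 zero-length messages around 4 hosts, 1056 us in total
print(per_hop_latency_us(1056.0, 4, 100))  # 1.76
```
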
C<br />
VirtualNIC Interface<br />
Configuration and<br />
Administration<br />
The VirtualNIC (VNIC) Upper Layer Protocol (ULP) works in conjunction with<br />
firmware running on Virtual Input/Output (VIO) hardware such as the <strong>QLogic</strong><br />
Ethernet Virtual I/O Controller (EVIC) or the InfiniBand/Ethernet Bridge Module<br />
for IBM ® BladeCenter ® , providing virtual Ethernet connectivity.<br />
The VNIC driver, along with the <strong>QLogic</strong> EVIC’s two 10 Gigabit Ethernet ports, enables<br />
InfiniBand clusters to connect to Ethernet networks. This driver also works with the<br />
earlier version of the I/O controller, the VEx.<br />
The <strong>QLogic</strong> VNIC driver creates virtual Ethernet interfaces and tunnels the<br />
Ethernet data to/from the EVIC over InfiniBand using an InfiniBand reliable<br />
connection.<br />
The virtual Ethernet interface supports any Ethernet protocol. It operates like any<br />
other interface: ping, ssh, scp, netperf, etc.<br />
The VNIC interface must be configured before it can be used. Perform the steps in<br />
the following sub-sections to set up and configure the VNIC interface:<br />
Getting Information about Ethernet IOCs on the Fabric<br />
When ib_qlgc_vnic_query is executed without any options, it displays detailed<br />
information about all Virtual I/O controllers present on the fabric, including the<br />
EVIC/VEx Input/Output Controllers (IOCs).<br />
For writing the configuration file, you will need information about the EVIC/VEx<br />
IOCs present on the fabric, such as their IOCGUID, IOCSTRING, etc.<br />
NOTE:<br />
An EVIC has two IOCs, one for each Ethernet port. Each EVIC contains a<br />
unique set of IOCGUIDs (for example, IOC 1 maps to Ethernet Port 1 and IOC 2<br />
maps to Ethernet Port 2).<br />
1. Ensure you are logged in as the root user.<br />
2. Type ib_qlgc_vnic_query<br />
This displays detailed information about all the EVIC/VEx IOCs present on<br />
the fabric. For example:<br />
# ib_qlgc_vnic_query<br />
HCA No = 0, HCA = mlx4_0, Port = 1, Port GUID = 0x0002c903000010f9,<br />
State = Active<br />
IO Unit Info:<br />
port LID: 0009<br />
port GID: fe8000000000000000066a11de000070<br />
change ID: 0003<br />
max controllers: 0x02<br />
controller[ 1]<br />
GUID: 00066a01de000070<br />
vendor ID: 00066a<br />
device ID: 000030<br />
IO class : 2000<br />
ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1<br />
service entries: 2<br />
service[ 0]: 1000066a00000001 /<br />
InfiniNIC.InfiniConSys.Control:01<br />
service[ 1]: 1000066a00000101 /<br />
InfiniNIC.InfiniConSys.Data:01<br />
IO Unit Info:<br />
port LID: 000b<br />
port GID: fe8000000000000000066a21de000070<br />
change ID: 0003<br />
max controllers: 0x02<br />
controller[ 2]<br />
GUID: 00066a02de000070<br />
vendor ID: 00066a<br />
device ID: 000030<br />
IO class : 2000<br />
ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2<br />
service entries: 2<br />
service[ 0]: 1000066a00000002 /<br />
InfiniNIC.InfiniConSys.Control:02<br />
service[ 1]: 1000066a00000102 /<br />
InfiniNIC.InfiniConSys.Data:02<br />
HCA No = 0, HCA = mlx4_0, Port = 2, Port GUID = 0x0002c903000010fa,<br />
State = Active<br />
IO Unit Info:<br />
port LID: 0009<br />
port GID: fe8000000000000000066a11de000070<br />
change ID: 0003<br />
max controllers: 0x02<br />
controller[ 1]<br />
GUID: 00066a01de000070<br />
vendor ID: 00066a<br />
device ID: 000030<br />
IO class : 2000<br />
ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1<br />
service entries: 2<br />
service[ 0]: 1000066a00000001 /<br />
InfiniNIC.InfiniConSys.Control:01<br />
service[ 1]: 1000066a00000101 /<br />
InfiniNIC.InfiniConSys.Data:01<br />
IO Unit Info:<br />
port LID: 000b<br />
port GID: fe8000000000000000066a21de000070<br />
change ID: 0003<br />
max controllers: 0x02<br />
controller[ 2]<br />
GUID: 00066a02de000070<br />
vendor ID: 00066a<br />
device ID: 000030<br />
IO class : 2000<br />
ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2<br />
service entries: 2<br />
service[ 0]: 1000066a00000002 /<br />
InfiniNIC.InfiniConSys.Control:02<br />
service[ 1]: 1000066a00000102 /<br />
InfiniNIC.InfiniConSys.Data:02<br />
Editing the VirtualNIC Configuration file<br />
Look at the qlgc_vnic.cfg.sample file to see how VNIC configuration files<br />
are written. This file can be found with the OFED documentation, or in the<br />
qlgc_vnictools subdirectory of the <strong>QLogic</strong> <strong>OFED+</strong> <strong>Host</strong> <strong>Software</strong><br />
download. You can use this configuration file as the basis for creating a<br />
configuration file by replacing the destination global identifier (DGID),<br />
IOCGUID, and IOCSTRING values with those of the EVIC/VEx IOCs<br />
present on your fabric.<br />
<strong>QLogic</strong> recommends using the DGID of the EVIC/VEx IOC, as it ensures the<br />
quickest startup of the VNIC service. When DGID is specified, the IOCGUID<br />
must also be specified. For more details, see the qlgc_vnic.cfg sample<br />
file.<br />
1. Edit the VirtualNIC configuration file, /etc/infiniband/qlgc_vnic.cfg.<br />
For each IOC connection, add a CREATE block to the file using the following<br />
format:<br />
{CREATE; NAME="eioc2";<br />
PRIMARY={IOCGUID=0x66A0130000105; INSTANCE=0; PORT=1; }<br />
SECONDARY={IOCGUID=0x66A013000010C; INSTANCE=0; PORT=2;}<br />
}<br />
NOTE:<br />
The qlgc_vnic.cfg file is case and format sensitive.<br />
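Because the file is case and format sensitive, it can help to generate CREATE blocks programmatically rather than typing them by hand. The following Python helper is purely illustrative (not a QLogic-supplied tool); the GUID values shown are the sample values from above:<br />

```python
def create_block(name, primary_guid, secondary_guid):
    """Render a qlgc_vnic.cfg CREATE block with failover IOCs.

    Keeping the punctuation and case exact matters, since the VNIC
    configuration file is case and format sensitive.
    """
    return ('{CREATE; NAME="%s";\n' % name
            + 'PRIMARY={IOCGUID=%s; INSTANCE=0; PORT=1; }\n' % primary_guid
            + 'SECONDARY={IOCGUID=%s; INSTANCE=0; PORT=2;}\n' % secondary_guid
            + '}')

print(create_block("eioc2", "0x66A0130000105", "0x66A013000010C"))
```
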
NOTE:<br />
For the following sections, determine the necessary connection type. To<br />
always have a host connect to the same IOC on the same VIO card<br />
regardless of where that card is in the fabric, use format 1. To always have a<br />
host connect to the same IOC in the same chassis and/or slot, use format<br />
2.<br />
Format 1: Defining an IOC using the IOCGUID<br />
NOTE:<br />
Determine the needed connection type. If a host is to connect to the same<br />
IOC on the same card, no matter where that card is in the fabric, then use<br />
this format.<br />
Use the following format to cause the host to connect to a specific VIO hardware<br />
card, regardless of which chassis and/or slot the VIO hardware card resides:<br />
{CREATE;<br />
NAME="eioc1";<br />
IOCGUID=0x66A0137FFFFE7;<br />
}<br />
The following is an example of VIO hardware failover:<br />
{CREATE; NAME="eioc1";<br />
PRIMARY={IOCGUID=0x66a01de000003; INSTANCE=1; PORT=1; }<br />
SECONDARY={IOCGUID=0x66a02de000003; INSTANCE=1; PORT=1;}<br />
}<br />
NOTE:<br />
Do not create EIOC names with similar character strings (for example,<br />
eioc3 and eioc30). There is a limitation with certain Linux operating<br />
systems that cannot recognize the subtle differences. The result is that<br />
the user will be unable to ping across the network.<br />
Format 2: Defining an IOC using the IOCSTRING<br />
Defining the IOC using the IOCSTRING allows VIO hardware to be hot-swapped<br />
in and out of a specific slot. The host will attempt to connect to the specified IOC<br />
(1 or 2) on the VIO hardware that currently resides in the specified slot of the<br />
specified Chassis. Use the following format to allow the host to connect to a VIO<br />
hardware that resides in a specific slot of a specific chassis:<br />
{CREATE;<br />
NAME="eioc1";<br />
IOCSTRING="Chassis 0x00066A0005000001, Slot 1, IOC 1";<br />
RX_CSUM=TRUE;<br />
HEARTBEAT=100; }<br />
NOTE:<br />
The IOCSTRING field is a literal, case-sensitive string, whose syntax must<br />
be exactly in the format shown in the example above, including the<br />
placement of commas. To reduce the likelihood of syntax error, the user<br />
should execute ib_qlgc_vnic_query -es. Note that the chassis serial<br />
number must match the chassis 0x (Hex) value. The slot serial number is<br />
specific to the line card as well.<br />
Each CREATE block must specify a unique NAME. The NAME represents the Ethernet<br />
interface name that will be registered with the Linux Operating System.<br />
Format 3: Starting VNIC using DGID<br />
NOTE:<br />
It is not always necessary to use DGID. It is recommended to use DGID if<br />
the user has a large cluster.<br />
Following is an example of a DGID and IOCGUID VNIC configuration. This<br />
configuration allows for the quickest start up of VNIC service:<br />
{CREATE; NAME="eioc1";<br />
DGID=0xfe8000000000000000066a0258000001; IOCGUID=0x66a0130000001;<br />
}<br />
This example uses DGID, IOCGUID and IOCSTRING:<br />
{CREATE; NAME="eioc1";<br />
DGID=0xfe8000000000000000066a0258000001;<br />
IOCGUID=0x66a0130000001;<br />
IOCSTRING="Chassis 0x00066A00010003F2, Slot 1, IOC 1";<br />
}<br />
VirtualNIC Failover Definition<br />
The VirtualNIC configuration file allows the user to define virtual NIC failover as:<br />
1. Failover to a different adapter port on the same adapter.<br />
2. Failover to a port on a different adapter.<br />
3. Failover to a different Ethernet port on the same Ethernet gateway.<br />
4. Failover to a port on a different Ethernet gateway.<br />
5. A combination of scenarios 1 or 2 and 3 or 4.<br />
Failover to a different Adapter port on the same Adapter<br />
In this example, if adapter Port 1 fails (i.e., the InfiniBand connection between the<br />
adapter and the InfiniBand switch is lost), then traffic going over eioc1 will begin<br />
using adapter port 2. In this example the same Ethernet gateway port is being<br />
used for both the primary and the secondary connection.<br />
{CREATE; NAME="eioc1";<br />
PRIMARY={ DGID=fe8000000000000000066a11de00003c;<br />
IOCGUID=00066a01de00003c; INSTANCE=0; PORT=1; }<br />
SECONDARY={ DGID=fe8000000000000000066a11de00003c;<br />
IOCGUID=00066a01de00003c; INSTANCE=1; PORT=2; }<br />
}<br />
Failover to a different Ethernet port on the same<br />
Ethernet gateway<br />
In this example, if the Ethernet port associated with iocguid 0x66a01de00003c<br />
(i.e., Ethernet port 1 on gateway 03c) fails, then traffic going over eioc1 will begin<br />
using Ethernet port 2 on the same Ethernet gateway.<br />
NOTE:<br />
In this example the same adapter port is being used for both the primary and<br />
the secondary connection.<br />
{CREATE; NAME="eioc1";<br />
PRIMARY={ DGID=fe8000000000000000066a11de00003c;<br />
IOCGUID=00066a01de00003c; INSTANCE=0; PORT=1; }<br />
SECONDARY={ DGID=fe8000000000000000066a21de00003c;<br />
IOCGUID=00066a02de00003c; INSTANCE=0; PORT=1; }<br />
}<br />
Failover to a port on a different Ethernet gateway<br />
In this example, if the Ethernet port associated with iocguid 0x66a02de000048<br />
(i.e., Ethernet port 2 on gateway 48) fails, then traffic going over eioc1 will begin<br />
using Ethernet port 2 on gateway 03c. This type of failover allows traffic to<br />
continue even if the entire gateway card fails or is rebooted. This type of failover<br />
requires a second gateway card to be in the fabric.<br />
NOTE:<br />
In this example the same adapter port is being used for both the primary and<br />
the secondary connection.<br />
{CREATE; NAME="eioc1";<br />
PRIMARY={ DGID=fe8000000000000000066a21de000048;<br />
IOCGUID=00066a02de000048; INSTANCE=0; PORT=1; }<br />
SECONDARY={ DGID=fe8000000000000000066a21de00003c;<br />
IOCGUID=00066a02de00003c; INSTANCE=0; PORT=1; }<br />
}<br />
Combination method<br />
In this example, if either the Ethernet port, the Ethernet gateway or the adapter<br />
port fails, traffic using eioc1 will begin using Ethernet port 2 on gateway 03c via<br />
adapter port 1.<br />
{CREATE; NAME="eioc1";<br />
PRIMARY={ DGID=fe8000000000000000066a21de000048;<br />
IOCGUID=00066a02de000048; INSTANCE=0; PORT=2; }<br />
SECONDARY={ DGID=fe8000000000000000066a21de00003c;<br />
IOCGUID=00066a02de00003c; INSTANCE=0; PORT=1; }<br />
}<br />
Creating VirtualNIC Ethernet Interface Configuration Files<br />
For each Ethernet interface defined in the /etc/infiniband/qlgc_vnic.cfg<br />
file, create an interface configuration file:<br />
For SuSE and SLES OS:<br />
/etc/sysconfig/network/ifcfg-NAME<br />
For RHEL OS:<br />
/etc/sysconfig/network-scripts/ifcfg-NAME<br />
where NAME is the value of the NAME field specified in the CREATE block.<br />
Example of ifcfg-eiocx setup for RedHat systems:<br />
DEVICE=eioc1<br />
BOOTPROTO=static<br />
IPADDR=172.26.48.132<br />
BROADCAST=172.26.63.255<br />
NETMASK=255.255.240.0<br />
NETWORK=172.26.48.0<br />
ONBOOT=yes<br />
TYPE=Ethernet<br />
Example of ifcfg-eiocx setup for SuSE and SLES systems:<br />
BOOTPROTO='static'<br />
IPADDR='172.26.48.130'<br />
BROADCAST='172.26.63.255'<br />
NETMASK='255.255.240.0'<br />
NETWORK='172.26.48.0'<br />
STARTMODE='hotplug'<br />
TYPE='Ethernet'<br />
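The NETMASK, NETWORK, and BROADCAST fields must agree with each other; deriving them from the address in CIDR form avoids inconsistencies. A small illustrative helper (not a QLogic-supplied tool) using the /20 network from the examples above:<br />

```python
import ipaddress

def vnic_network_fields(ip_with_prefix):
    """Derive consistent ifcfg address fields from a CIDR address."""
    iface = ipaddress.ip_interface(ip_with_prefix)
    net = iface.network
    return {
        "IPADDR": str(iface.ip),
        "NETMASK": str(net.netmask),
        "NETWORK": str(net.network_address),
        "BROADCAST": str(net.broadcast_address),
    }

print(vnic_network_fields("172.26.48.132/20"))
```
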
After modifying the /etc/infiniband/qlgc_vnic.cfg file, restart the<br />
VirtualNIC driver with the following:<br />
/etc/init.d/qlgc_vnic restart<br />
VirtualNIC Multicast<br />
The primary goal of VNIC Multicast is to reduce the replication of multicast packet<br />
transmission from the EVIC to InfiniBand hosts. Figure C-1 demonstrates the<br />
transmission of multicast traffic without using IB_MULTICAST:<br />
[Figure: EVIC connected through an IB switch to IB hosts 1–3; multicast data sent to each host over an RC connection]<br />
Figure C-1. Without IB_Multicast<br />
Figure C-1 shows that each multicast packet is sent by the EVIC via a<br />
Reliable Connect (RC) Queue Pair to each and every host. When<br />
multicast applications on multiple hosts share the same multicast address,<br />
the result is a large amount of traffic, one copy per host, between the EVIC and the<br />
switch. This also creates additional workload for the EVIC.<br />
With multicast enabled at the host using IB_MULTICAST (located in the VNIC<br />
configuration file) and also enabled at the EVIC, multicast traffic is handled as<br />
follows:<br />
Figure C-2. With IB_Multicast
Figure C-2 shows that the EVIC sends each multicast packet once to the InfiniBand switch, posting it to a specific InfiniBand multicast group. The switch then forwards the packet to every host that has joined that multicast group. Since the EVIC posts the packet once instead of once per host, there is a substantial savings. Additionally, the packets are delivered over an unreliable datagram (UD) queue pair, which means less overhead per packet, per host.
For each IOC, the EVIC creates a unique multicast group. When the VNIC driver provides multicast addresses for a virtual port (i.e., viport) to the EVIC, the EVIC returns the multicast group for the IOC used by the viport. The VNIC driver then joins the specified multicast group. When a viport connection is taken down, the host leaves the multicast group. If an EVIC has two IOCs and the VNIC driver on a host is configured to create viports using both IOCs, each viport issues a join for the multicast group of the IOC it is using.
NOTE:<br />
When a viport is using two different IOCs for primary and secondary paths,<br />
then the host will join the multicast group for the IOC on the primary path as<br />
well as the multicast group for the IOC on the secondary path. The<br />
secondary path and the corresponding multicast group will not be used until<br />
a failover occurs.<br />
The create multicast group request issued by the EVIC, along with the join multicast group and leave multicast group requests issued by the host, are handled by the subnet manager. For details, refer to the Fabric Manager Users Guide.
On the EVIC, IB_MULTICAST can be enabled or disabled manually using the CLI.<br />
For details, refer to the Ethernet section of the Hardware CLI Reference <strong>Guide</strong>. A<br />
reboot of the EVIC is required after IB_MULTICAST has been enabled or<br />
disabled.<br />
On the host, IB_MULTICAST is enabled by default. The user must add<br />
IB_MULTICAST=FALSE to the VNIC configuration file to disable the feature. To view<br />
the current status of the feature, run the following:
cat /sys/class/infiniband_qlgc_vnic/interfaces/<interface name>/*_path/multicast_state
For each path, there is a primary_path and a secondary_path directory present<br />
and the contents of the multicast_state file contain one of the following:
feature not enabled - the feature is disabled at the host, at the EVIC, or at both ends
state=Joined & Attached MGID:<MGID> MLID:<MLID>
To disable IB_MULTICAST at the host, edit the /etc/infiniband/qlgc_vnic.cfg file and add IB_MULTICAST=FALSE for the viport inside both the PRIMARY and SECONDARY blocks.
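To check the state of every path at once, the sysfs layout described above can be walked with a short loop. This is a sketch: the show_multicast_state name is illustrative, the SYSROOT variable exists only so the loop is easy to exercise, and the code relies on the primary_path/secondary_path directory names given above.

```shell
# Sketch: print the multicast_state of each path of each VNIC interface.
SYSROOT="${SYSROOT:-/sys/class/infiniband_qlgc_vnic/interfaces}"
show_multicast_state() {
    for iface in "$SYSROOT"/*/; do
        [ -d "$iface" ] || continue
        for p in "$iface"primary_path "$iface"secondary_path; do
            [ -f "$p/multicast_state" ] || continue
            printf '%s %s: %s\n' "$(basename "$iface")" \
                "$(basename "$p")" "$(cat "$p/multicast_state")"
        done
    done
}
show_multicast_state
```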
Starting, Stopping and Restarting the VirtualNIC Driver<br />
Once you have created a configuration file, you can start the VNIC driver and<br />
create the VNIC interfaces specified in the configuration file.<br />
NOTE:<br />
Ensure you are logged in as a root user to start, stop, or restart the VNIC<br />
driver.<br />
To start the qlgc_vnic driver and the <strong>QLogic</strong> VNIC interfaces, use the following<br />
command:<br />
/etc/init.d/qlgc_vnic start<br />
To stop the qlgc_vnic driver and bring down the VNIC interfaces, use the following<br />
command:<br />
/etc/init.d/qlgc_vnic stop<br />
To restart the qlgc_vnic driver, use the following command:<br />
/etc/init.d/qlgc_vnic restart<br />
If you have not started the InfiniBand network stack (<strong>QLogic</strong> <strong>OFED+</strong> or OFED),<br />
then running the /etc/init.d/qlgc_vnic start command also starts the<br />
InfiniBand network stack, since the <strong>QLogic</strong> VNIC service requires the InfiniBand<br />
stack.<br />
If you start the InfiniBand network stack separately, then the correct starting order<br />
is:<br />
• Start the InfiniBand stack.<br />
• Start <strong>QLogic</strong> VNIC service.<br />
For example, if you use <strong>QLogic</strong> <strong>OFED+</strong> <strong>Host</strong> <strong>Software</strong>, the correct starting order<br />
is:<br />
/etc/init.d/openibd start<br />
/etc/init.d/qlgc_vnic start<br />
If you want to restart the <strong>QLogic</strong> VNIC interfaces, run the following command:<br />
/etc/init.d/qlgc_vnic restart<br />
You can get information about the <strong>QLogic</strong> VNIC interfaces by using the following<br />
script (as a root user):<br />
ib_qlgc_vnic_info<br />
This information is collected from the<br />
/sys/class/infiniband_qlgc_vnic/interfaces/ directory, where there is a<br />
separate directory corresponding to each VNIC interface.<br />
VNIC interfaces can be deleted by writing the name of the interface to the<br />
/sys/class/infiniband_qlgc_vnic/interfaces/delete_vnic file. For<br />
example, to delete interface veth0, run the following command (as a root user):<br />
echo -n veth0 > /sys/class/infiniband_qlgc_vnic/interfaces/delete_vnic
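A small guard around that write avoids silently misspelling an interface name. The delete_vnic_if function below is an illustrative sketch (not part of the driver); SYSROOT is parameterized only so the guard is easy to exercise.

```shell
# Sketch: delete a VNIC interface by name, first checking that its
# directory exists under sysfs. In the real tree, writing the name to
# delete_vnic causes the driver to remove the interface.
SYSROOT="${SYSROOT:-/sys/class/infiniband_qlgc_vnic/interfaces}"
delete_vnic_if() {
    name=$1
    if [ ! -d "$SYSROOT/$name" ]; then
        echo "no such VNIC interface: $name" >&2
        return 1
    fi
    echo -n "$name" > "$SYSROOT/delete_vnic"
}
```

For example, delete_vnic_if veth0 (run as a root user) has the same effect as the echo command above.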
Configuring Link Aggregation
To configure link aggregation for all ports of a VIO hardware card, do the following:<br />
1. Modify the chassis GUID and slot number.<br />
2. Edit the /etc/infiniband/qlgc_vnic.cfg file with information similar to the following:
{CREATE; NAME="eioc1";<br />
PRIMARY={IOCSTRING="Chassis 0x00066A0050000018, Slot<br />
2, IOC 1"; }<br />
SECONDARY={IOCSTRING="Chassis 0x00066A0050000018, Slot<br />
2, IOC 2"; }<br />
}<br />
3. Create an ifcfg-eioc1 file in the /etc/sysconfig/network-scripts directory.
4. Physically connect 2 or 3 ports of a VEx to an Ethernet switch.<br />
5. Using the management interface for the Ethernet switch, configure the<br />
switch to aggregate all 3 ports that connect to the VEx.<br />
6. Restart the qlgc_vnic module with the following command:<br />
/etc/init.d/qlgc_vnic restart<br />
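The CREATE block from step 2 can be generated from the chassis GUID and slot number rather than typed by hand. The emit_create_block helper below is an illustrative sketch, not a QLogic utility; append its output to /etc/infiniband/qlgc_vnic.cfg and restart the driver as in step 6.

```shell
# Sketch: print a qlgc_vnic.cfg CREATE block that uses both IOCs of the
# VIO hardware card in the given chassis and slot, as shown in step 2.
emit_create_block() {
    name=$1 chassis=$2 slot=$3
    cat <<EOF
{CREATE; NAME="$name";
PRIMARY={IOCSTRING="Chassis $chassis, Slot $slot, IOC 1"; }
SECONDARY={IOCSTRING="Chassis $chassis, Slot $slot, IOC 2"; }
}
EOF
}
emit_create_block eioc1 0x00066A0050000018 2
```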
Troubleshooting
Refer to Appendix G for information about troubleshooting.
VirtualNIC Configuration Variables<br />
NAME<br />
The device name for the interface.<br />
IOCGUID<br />
Defines the port controller GUID to be used to identify the specific EVIC leaf<br />
and port used by the VNIC. All of the IOC GUIDs detected on the fabric can
be found by typing ib_qlgc_vnic_query -e and noting the value of the<br />
reported IOC GUIDs. This can be used instead of IOCSTRING.<br />
IOCSTRING<br />
Defines the IOC Profile ID String of the IOC to be used. All of the IOC Profile<br />
ID Strings detected on the fabric can be found by typing<br />
ib_qlgc_vnic_query -s and noting the value of the reported string. This<br />
can be used instead of IOCGUID.<br />
DGID<br />
Defines the Destination Global Identifier (DGID) of the IOC to be used. This<br />
parameter should be used in all CREATE blocks. The ib_qlgc_vnic_query<br />
command will show all the DGIDs that the host can see. The DGID<br />
parameter is identified as dgid in the output of the ib_qlgc_vnic_query<br />
-e command.<br />
INSTANCE<br />
Defaults to 0. The range is 0-255. If a host connects to the same IOC more<br />
than once, each connection must be assigned a unique instance.<br />
RX_CSUM<br />
Defaults to TRUE. When true, RX_CSUM indicates that the receive<br />
checksum should be verified by the VIO hardware.<br />
HEARTBEAT<br />
Defaults to 100. Specifies the time in 1/100 of a second between the<br />
heartbeats that occur between a host and an EVIC.<br />
PORT<br />
Alternative specification for a local adapter port. The first port is 1.<br />
NOTE:<br />
If there is only one adapter in the system, then use PORT/HCA to<br />
specify the adapter port to be used. If there is more than one adapter in<br />
the system, then it is recommended to use PORTGUID to specify the
adapter port to be used.<br />
PORTGUID<br />
The PORTGUID of the InfiniBand port to be used. Using the PORTGUID parameter to configure the VNIC interface is advantageous on hosts with more than one Host Channel Adapter. PORTGUID is persistent for a given InfiniBand port, so VNIC configurations remain consistent and reliable, unaffected by restarts of the OFED InfiniBand stack on hosts with more than one adapter.
HCA<br />
An optional <strong>Host</strong> Channel Adapter specification for use with the PORT<br />
specification. The first adapter found by the system upon boot is HCA 0. If
there is only one adapter in the system, then the default value of this<br />
parameter (0) can be used, so "HCA" does not need to be specified in the<br />
qlgc_vnic.cfg file. If there is more than one adapter in the system, then<br />
<strong>QLogic</strong> recommends using the PORTGUID parameter to specify an adapter<br />
port (instead of using HCA/PORT) since there is no guarantee that a system<br />
booting up always finds the same adapter first.<br />
IB_MULTICAST<br />
Defaults to TRUE. The InfiniBand multicast parameter can be set to either<br />
TRUE or FALSE. When the parameter is set to true, it will take effect only if<br />
the corresponding CLI command (ethVirtVnic2Mcastset) is executed on<br />
the EVIC. This command is supported on EVIC FW 4.3 and above.<br />
D<br />
SRP Configuration<br />
SRP Configuration Overview<br />
SRP stands for SCSI RDMA Protocol. It allows the SCSI protocol to run over<br />
InfiniBand for Storage Area Network (SAN) usage. SRP interfaces directly to the<br />
Linux file system through the SRP Upper Layer Protocol (ULP). SRP storage can<br />
be treated as another device.<br />
In this release, two versions of SRP are available: <strong>QLogic</strong> SRP and OFED SRP.<br />
<strong>QLogic</strong> SRP is available as part of the <strong>QLogic</strong> OFED <strong>Host</strong> <strong>Software</strong>, <strong>QLogic</strong><br />
InfiniBand Fabric Suite, Rocks Roll, and Platform PCM downloads.<br />
SRP has been tested on targets from DataDirect Networks and Engenio (now<br />
LSI Logic®).
NOTE:<br />
Before using SRP, the SRP targets must already be set up by your system<br />
administrator.<br />
Important Concepts<br />
• An SRP Initiator Port is an adapter port through which the host communicates with an SRP target device (e.g., a Fibre Channel disk array) via an SRP target port.
• An SRP Target Port is an IOC of the VIO hardware. In the context of VIO hardware, an IOC can be thought of as an SRP target. An FVIC contains two IOCs: IOC1 maps to the first adapter on the FVIC, and IOC2 maps to the second adapter on the FVIC. An FCBM also contains two IOCs: IOC1 maps to port 1 of the FCBM adapter, and IOC2 maps to port 2 of the FCBM adapter.
• A Fibre Channel Target Device is a device containing storage resources that<br />
is located remotely from a Fibre Channel host. In the context of SRP/VIO<br />
hardware, this is typically an array of disks connected via Fibre Channel to<br />
the VIO hardware.<br />
• An SRP Initiator Extension is a 64-bit numeric value that is appended to the port GUID of the SRP initiator port, which allows an SRP initiator port to have multiple SRP maps associated with it. Maps are for FVIC only; InfiniBand-attached storage uses its own mechanism, as maps are not necessary.
• An SRP Initiator is the combination of an SRP initiator port and an SRP initiator extension.
• An SRP Target is identified by the combination of an SRP target IOC and an SRP target extension.
• An SRP Session defines a connection between an SRP initiator and an SRP target.
• An SRP Map associates an SRP session with a Fibre Channel Target Device. This mapping is configured on the VIO hardware. Maps are for FVIC only; InfiniBand-attached storage uses its own mechanism, as maps are not necessary.
NOTE:<br />
• If a device connected to a map is changed, the SRP driver must<br />
be restarted.<br />
• If the connected device is unreachable for a period of time, the<br />
Linux kernel may set the device offline. If this occurs the SRP<br />
driver must be restarted.<br />
• An SRP Adapter is a collection of SRP sessions. This collection is then presented to the Linux kernel as if those sessions were all from a single adapter. All sessions configured for an adapter must ultimately connect to the same target device.
NOTE:<br />
• The SRP driver must be stopped before OFED (i.e., openibd) is stopped or restarted, because SRP holds references on OFED modules and the Linux kernel will not allow those modules to be unloaded.
<strong>QLogic</strong> SRP Configuration<br />
The <strong>QLogic</strong> SRP is installed as part of the <strong>QLogic</strong> <strong>OFED+</strong> <strong>Host</strong> <strong>Software</strong> or the<br />
QLogic InfiniBand Fabric Suite. The following sections provide procedures to set up
and configure the <strong>QLogic</strong> SRP.<br />
Stopping, Starting and Restarting the SRP Driver<br />
To stop the qlgc_srp driver, use the following command:<br />
/etc/init.d/qlgc_srp stop<br />
To start the qlgc_srp driver, use the following command:<br />
/etc/init.d/qlgc_srp start<br />
To restart the qlgc_srp driver, use the following command:
/etc/init.d/qlgc_srp restart
Specifying a Session
In the SRP configuration file, a session command is a block of configuration<br />
commands, surrounded by begin and end statements. Sessions can be specified<br />
in several different ways, but all consist of specifying an SRP initiator and an SRP<br />
target port. For example:<br />
session<br />
begin<br />
card: 0<br />
port: 1<br />
targetIOCGuid: 0x00066AXXXXXXXXXX<br />
initiatorExtension: 2<br />
end<br />
The session command has two parts: the part that specifies the SRP initiator and the part that specifies the SRP target port. The SRP initiator itself contains two parts: the SRP initiator port and the SRP initiator extension. The SRP initiator extension is optional and defaults to a value of 1; however, if an SRP initiator extension is not specified, each port on the adapter can use only one SRP map per VIO device. In addition, a targetExtension can be specified (the default is 1).
The SRP Initiator Port may be specified in two different ways:<br />
1. By using the port GUID of the adapter port used for the connection, or<br />
2. By specifying the index of the adapter card being used (this is zero-based, so if there is only one adapter card in the system, use a value of 0) and the index of the port number (1 or 2) of the adapter card being used.
The SRP target port may be specified in two different ways:<br />
1. By the port GUID of the IOC, or<br />
2. By the IOC profile string that is created by the VIO device (i.e., a string containing the chassis GUID, the slot number, and the IOC number). FVIC creates the string in this manner; other devices have their own naming methods.
To specify the host InfiniBand port to use, the user can either specify the port<br />
GUID of the local InfiniBand port, or simply use the index numbers of the cards<br />
and the ports on the cards. Cards are numbered from 0 on up, based on the order<br />
they occur in the PCI bus. Ports are numbered in the same way, from first to last.<br />
To see which cards and ports are available for use, type the following command:<br />
ib_qlgc_srp_query<br />
The system returns output similar to the following:
st187:~/qlgc-srp-1_3_0_0_1 # ib_qlgc_srp_query<br />
<strong>QLogic</strong> Corporation. Virtual HBA (SRP) SCSI Query Application, version 1.3.0.0.1<br />
1 IB <strong>Host</strong> Channel Adapter present in system.<br />
HCA Card 0 : 0x0002c9020026041c<br />
Port 1 GUID : 0x0002c9020026041d<br />
Port 2 GUID : 0x0002c9020026041e<br />
SRP Targets :<br />
SRP IOC Profile : FVIC in Chassis 0x00066a000300012a, Slot 17, Ioc 1<br />
SRP IOC GUID : 0x00066a01dd000021<br />
SRP IU SIZE : 320<br />
SRP IU SG SIZE: 15<br />
SRP IO CLASS : 0xff00<br />
service 0 : name SRP.T10:0000000000000001 id 0x0000494353535250<br />
service 1 : name SRP.T10:0000000000000002 id 0x0000494353535250<br />
service 2 : name SRP.T10:0000000000000003 id 0x0000494353535250<br />
service 3 : name SRP.T10:0000000000000004 id 0x0000494353535250<br />
Target Path(s):<br />
HCA 0 Port 1 0x0002c9020026041d -> Target Port GID 0xfe8000000000000000066a11dd000021<br />
HCA 0 Port 2 0x0002c9020026041e -> Target Port GID 0xfe8000000000000000066a11dd000021<br />
SRP IOC Profile : FVIC in Chassis 0x00066a000300012a, Slot 17, Ioc 2<br />
SRP IOC GUID : 0x00066a02dd000021<br />
SRP IU SIZE : 320<br />
SRP IU SG SIZE: 15<br />
SRP IO CLASS : 0xff00<br />
service 0 : name SRP.T10:0000000000000001 id 0x0000494353535250<br />
service 1 : name SRP.T10:0000000000000002 id 0x0000494353535250<br />
service 2 : name SRP.T10:0000000000000003 id 0x0000494353535250<br />
service 3 : name SRP.T10:0000000000000004 id 0x0000494353535250<br />
Target Path(s):<br />
HCA 0 Port 1 0x0002c9020026041d -> Target Port GID 0xfe8000000000000000066a21dd000021<br />
HCA 0 Port 2 0x0002c9020026041e -> Target Port GID 0xfe8000000000000000066a21dd000021<br />
SRP IOC Profile : Chassis 0x00066A0050000135, Slot 5, IOC 1<br />
SRP IOC GUID : 0x00066a013800016c<br />
SRP IU SIZE : 320<br />
SRP IU SG SIZE: 15<br />
SRP IO CLASS : 0xff00<br />
service 0 : name SRP.T10:0000000000000001 id 0x0000494353535250<br />
service 1 : name SRP.T10:0000000000000002 id 0x0000494353535250<br />
service 2 : name SRP.T10:0000000000000003 id 0x0000494353535250<br />
service 3 : name SRP.T10:0000000000000004 id 0x0000494353535250<br />
Target Path(s):<br />
HCA 0 Port 1 0x0002c9020026041d -> Target Port GID 0xfe8000000000000000066a026000016c<br />
HCA 0 Port 2 0x0002c9020026041e -> Target Port GID 0xfe8000000000000000066a026000016c<br />
SRP IOC Profile : Chassis 0x00066A0050000135, Slot 5, IOC 2<br />
SRP IOC GUID : 0x00066a023800016c<br />
SRP IU SIZE : 320<br />
SRP IU SG SIZE: 15<br />
SRP IO CLASS : 0xff00<br />
service 0 : name SRP.T10:0000000000000001 id 0x0000494353535250<br />
service 1 : name SRP.T10:0000000000000002 id 0x0000494353535250<br />
service 2 : name SRP.T10:0000000000000003 id 0x0000494353535250<br />
service 3 : name SRP.T10:0000000000000004 id 0x0000494353535250<br />
Target Path(s):<br />
HCA 0 Port 1 0x0002c9020026041d -> Target Port GID 0xfe8000000000000000066a026000016c<br />
HCA 0 Port 2 0x0002c9020026041e -> Target Port GID 0xfe8000000000000000066a026000016c<br />
SRP IOC Profile : Chassis 0x00066A0050000135, Slot 8, IOC 1<br />
SRP IOC GUID : 0x00066a0138000174<br />
SRP IU SIZE : 320<br />
SRP IU SG SIZE: 15<br />
SRP IO CLASS : 0xff00<br />
service 0 : name SRP.T10:0000000000000001 id 0x0000494353535250<br />
Target Path(s):<br />
HCA 0 Port 1 0x0002c9020026041d -> Target Port GID 0xfe8000000000000000066a0260000174<br />
HCA 0 Port 2 0x0002c9020026041e -> Target Port GID 0xfe8000000000000000066a0260000174<br />
SRP IOC Profile : Chassis 0x00066A0050000135, Slot 8, IOC 2<br />
SRP IOC GUID : 0x00066a0238000174<br />
SRP IU SIZE : 320<br />
SRP IU SG SIZE: 15<br />
SRP IO CLASS : 0xff00<br />
service 0 : name SRP.T10:0000000000000001 id 0x0000494353535250<br />
Target Path(s):<br />
HCA 0 Port 1 0x0002c9020026041d -> Target Port GID 0xfe8000000000000000066a0260000174<br />
HCA 0 Port 2 0x0002c9020026041e -> Target Port GID 0xfe8000000000000000066a0260000174<br />
st187:~/qlgc-srp-1_3_0_0_1 #<br />
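When building qlgc_srp.cfg by hand, the targetIOCGuid values can be pulled mechanically from saved query output. The list_ioc_guids helper below is an illustrative sketch (not a QLogic utility) that depends only on the "SRP IOC GUID : 0x..." line format shown above.

```shell
# Sketch: list the IOC GUIDs found in saved ib_qlgc_srp_query output,
# for use as targetIOCGuid values in the SRP configuration file.
list_ioc_guids() {
    awk -F': *' '/SRP IOC GUID/ { print $2 }' "$1"
}
```

For example: ib_qlgc_srp_query > /tmp/srp_query.txt, then list_ioc_guids /tmp/srp_query.txt.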
Determining the values to use for the configuration
To build the configuration file, use the ib_qlgc_srp_build_cfg script as follows:
Enter ib_qlgc_srp_build_cfg. The system provides output similar to the following:
# qlgc_srp.cfg file generated by /usr/sbin/ib_qlgc_srp_build_cfg, version 1.3.0.0.17, on<br />
Mon Aug 25 13:42:16 EDT 2008<br />
#Found <strong>QLogic</strong> OFED SRP<br />
registerAdaptersInOrder: ON<br />
# =============================================================<br />
# IOC Name: BC2FC in Chassis 0x0000000000000000, Slot 6, Ioc 1<br />
# IOC GUID: 0x00066a01e0000149 SRP IU SIZE : 320<br />
# service 0 : name SRP.T10:0000000000000001 id 0x0000494353535250<br />
session<br />
begin<br />
card: 0<br />
port: 1<br />
#portGuid: 0x0002c9030000110d<br />
initiatorExtension: 1<br />
targetIOCGuid: 0x00066a01e0000149<br />
targetIOCProfileIdString: "FVIC in Chassis 0x0000000000000000, Slot 6, Ioc 1"<br />
targetPortGid: 0xfe8000000000000000066a01e0000149<br />
targetExtension: 0x0000000000000001<br />
SID: 0x0000494353535250<br />
IOClass: 0xff00<br />
end<br />
adapter<br />
begin<br />
adapterIODepth: 1000<br />
lunIODepth: 16<br />
adapterMaxIO: 128<br />
adapterMaxLUNs: 512<br />
adapterNoConnectTimeout: 60<br />
adapterDeviceRequestTimeout: 2<br />
# set to 1 if you want round robin load balancing<br />
roundrobinmode: 0<br />
# set to 1 if you do not want target connectivity verification<br />
noverify: 0<br />
description: "SRP Virtual HBA 0"<br />
end<br />
The ib_qlgc_srp_build_cfg command creates a configuration file based on<br />
discovered target devices. By default, the information is sent to stdout. In order<br />
to create a configuration file, output should be redirected to a disk file. Enter<br />
ib_qlgc_srp_build_cfg -h for a list and description of the option flags.<br />
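Because the redirected output replaces the whole file, it is worth backing up the previous configuration first. The regen_srp_cfg wrapper below is an illustrative sketch; the /etc/infiniband/qlgc_srp.cfg default path is an assumption about where your installation keeps the SRP configuration.

```shell
# Sketch: regenerate the SRP configuration from discovered targets,
# keeping a .bak copy of the previous file.
regen_srp_cfg() {
    cfg=${1:-/etc/infiniband/qlgc_srp.cfg}
    if [ -f "$cfg" ]; then
        cp "$cfg" "$cfg.bak"
    fi
    ib_qlgc_srp_build_cfg > "$cfg"
}
```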
NOTE:<br />
The default configuration generated by ib_qlgc_srp_build_cfg for OFED<br />
is similar to the one generated for the QuickSilver host stack with the<br />
following differences:<br />
• For OFED, the configuration automatically includes IOClass<br />
• For OFED, the configuration automatically includes SID<br />
• For OFED, the configuration provides information on targetPortGid<br />
instead of targetPortGuid<br />
• For OFED, the configuration automatically includes<br />
targetIOCProfileIdString.<br />
Specifying an SRP Initiator Port of a Session by Card and<br />
Port Indexes<br />
The following example specifies a session by card and port indexes. If the system<br />
contains only one adapter, use this method.<br />
session<br />
begin<br />
#Specifies the near side by card index<br />
card: 0 #Specifies first HCA<br />
port: 1 #Specifies first port<br />
targetIOCGuid: 0x00066A013800016C
end<br />
Specifying an SRP Initiator Port of a Session by Port GUID
The following example specifies a session by port GUID. If the system contains<br />
more than one adapter, use this method.<br />
session<br />
begin<br />
portGuid: 0x00066A00a00001a2 #Specifies port by its GUID<br />
targetIOCGuid: 0x00066A013800016C<br />
end<br />
NOTE:<br />
When using this method, if the port GUIDs are changed, they must also be<br />
changed in the configuration file.<br />
Specifying an SRP Target Port
The SRP target can be specified in two different ways. To connect to a particular SRP target no matter where it is in the fabric, use the first method (by IOCGUID). To connect to an SRP target in a certain chassis/slot, no matter which card is in it (for FVIC, this means the configuration does not need to change if cards are swapped in a slot), use the second method (by target IOC Profile String).
1. By IOCGUID. For example:<br />
targetIOCGuid: 0x00066A013800016c<br />
2. By target IOC Profile String. For example:<br />
targetIOCProfileIdString: "FVIC in Chassis<br />
0x00066A005000010E, Slot 1, IOC 1"<br />
NOTE:<br />
When specifying the targetIOCProfileIdString, the string is case<br />
and format sensitive. The easiest way to get the correct format is to cut<br />
and paste it from the output of the /usr/sbin/ib_qlgc_srp_query<br />
program.<br />
NOTE:<br />
For FVIC, specifying the SRP Target Port by IOCGUID ensures that the session will always be mapped to the specific port on this specific VIO hardware card, even if the card is moved to a different slot in the same chassis or even to a different chassis.
NOTE:<br />
For FVIC, specifying the SRP Target Port by Profile String ensures that the session will always be mapped to the VIO hardware card in the specific slot of a chassis, even if the VIO hardware card currently in that slot is replaced by a different VIO hardware card.
Specifying an SRP Target Port of a Session by IOCGUID
The following example specifies a target by IOC GUID:<br />
session<br />
begin<br />
card: 0<br />
port: 1<br />
targetIOCGuid: 0x00066A013800016c #IOC GUID of the InfiniFibre port<br />
end<br />
Specifying a SRP Target Port of a Session by Profile String<br />
The following example specifies a target by Profile String:
session
begin
card: 0
port: 1
# FVIC in Chassis 0x00066A005000010E,
# Slot number 1, port 1
targetIOCProfileIdString: "FVIC in Chassis 0x00066A005000010E, Slot 1, IOC 1"
end
Specifying an Adapter
An adapter is a collection of sessions. This collection is presented to the Linux kernel as if the collection was a single Fibre Channel adapter. The host system has no information regarding session connectivity; it only sees the end target Fibre Channel devices. The adapter section of the qlgc_srp configuration file contains multiple parameters. These parameters are listed in the adapter section of the ib_qlgc_srp_build_cfg script output shown in "Determining the values to use for the configuration" on page D-6. The following example specifies an adapter:
adapter<br />
begin<br />
description: "Oracle RAID Array"
end<br />
Restarting the SRP Module<br />
For changes to take effect, including changes to the SRP map on the VIO card, the SRP driver must be restarted. To restart the qlgc_srp driver, use the following command:
/etc/init.d/qlgc_srp restart<br />
Configuring an Adapter with Multiple Sessions<br />
Each adapter can have an unlimited number of sessions attached to it. Unless<br />
round robin is specified, SRP will only use one session at a time. However,<br />
there is still an advantage to configuring an adapter with multiple sessions. For<br />
example, if an adapter is configured with only one session and that session fails,<br />
all SCSI I/Os on that session will fail and access to SCSI target devices will be<br />
lost. While the qlgc_srp module will attempt to recover the broken session, this<br />
may take some time (e.g., if a cable was pulled, the FC port has failed, or an<br />
adapter has failed). However, if the host is using an adapter configured with<br />
multiple sessions and the current session fails, the host will automatically switch<br />
to an alternate session. The result is that the host can quickly recover and<br />
continue to access the SCSI target devices.<br />
WARNING!!<br />
When using two VIO hardware cards within one Adapter, the cards must<br />
have identical Fibre Channel configurations and maps. Data corruption can<br />
result from using different configurations and/or maps.<br />
When the qlgc_srp module encounters an adapter command, that adapter is<br />
assigned all previously defined sessions (that have not been assigned to other<br />
adapters). This makes it easy to configure a system for multiple SRP adapters.<br />
The following is an example configuration that uses multiple sessions and<br />
adapters:<br />
session<br />
begin<br />
card: 0<br />
port: 2<br />
targetIOCProfileIdString: "FVIC in Chassis<br />
0x00066A005000011D, Slot 1, IOC 1"<br />
initiatorExtension: 3<br />
end<br />
adapter<br />
begin<br />
description: "Test Device"<br />
end<br />
session<br />
begin<br />
card: 0<br />
port: 1<br />
targetIOCProfileIdString: "FVIC in Chassis<br />
0x00066A005000011D, Slot 2, IOC 1"<br />
initiatorExtension: 2<br />
end<br />
adapter<br />
begin<br />
description: "Test Device 1"<br />
end<br />
session<br />
begin<br />
card: 0<br />
port: 1<br />
targetIOCProfileIdString: "FVIC in Chassis<br />
0x00066A005000011D, Slot 1, IOC 2"<br />
initiatorExtension: 2<br />
end<br />
adapter<br />
begin<br />
description: "Test Device 1"<br />
end<br />
Configuring Fibre Channel Failover<br />
Fibre Channel failover is essentially failing over from one session in an adapter to<br />
another session in the same adapter.<br />
Following is a list of the different types of failover scenarios:
• Failing over from one SRP initiator port to another.<br />
• Failing over from a port on the VIO hardware card to another port on the VIO<br />
hardware card.<br />
• Failing over from a port on a VIO hardware card to a port on a different VIO<br />
hardware card within the same virtual I/O chassis.<br />
• Failing over from a port on a VIO hardware card to a port on a different VIO<br />
hardware card in a different virtual I/O chassis.<br />
Failover Configuration File 1: Failing over from one<br />
SRP Initiator port to another<br />
In this failover configuration file, the first session (using adapter Port 1) is used to reach the SRP Target Port. If a problem is detected in this session (e.g., the InfiniBand cable on port 1 of the adapter is pulled), then the second session (using adapter Port 2) is used.
# service 0: name SRP.T10:0000000000000001 id<br />
0x0000494353535250<br />
session<br />
begin<br />
card: 0<br />
port: 1<br />
#portGuid: 0x0002c903000010f1<br />
initiatorExtension: 1<br />
targetIOCGuid: 0x00066a01e0000149<br />
targetIOCProfileIdString: "BC2FC in Chassis<br />
0x0000000000000000, Slot 6, Ioc 1"<br />
targetPortGid: 0xfe8000000000000000066a01e0000149<br />
targetExtension: 0x0000000000000001<br />
SID: 0x0000494353535250<br />
IOClass: 0xff00<br />
end<br />
session<br />
begin<br />
card: 0<br />
port: 2<br />
#portGuid: 0x0002c903000010f2<br />
initiatorExtension: 1<br />
targetIOCGuid: 0x00066a01e0000149<br />
targetIOCProfileIdString: "BC2FC in Chassis<br />
0x0000000000000000, Slot 6, Ioc 1"<br />
targetPortGid: 0xfe8000000000000000066a01e0000149<br />
targetExtension: 0x0000000000000001<br />
SID: 0x0000494353535250<br />
IOClass: 0xff00<br />
end<br />
adapter<br />
begin<br />
adapterIODepth: 1000<br />
lunIODepth: 16<br />
adapterMaxIO: 128<br />
adapterMaxLUNs: 512<br />
adapterNoConnectTimeout: 60<br />
D-14 D000046-005 B
D–SRP Configuration<br />
<strong>QLogic</strong> SRP Configuration<br />
adapterDeviceRequestTimeout: 2<br />
# set to 1 if you want round robin load balancing<br />
roundrobinmode: 0<br />
# set to 1 if you do not want target connectivity verification
noverify: 0
description: "SRP Virtual HBA 0"<br />
end<br />
Failover Configuration File 2: Failing over from a port on the<br />
VIO hardware card to another port on the VIO hardware card<br />
session<br />
begin<br />
card: 0 (InfiniServ HCA card number)<br />
port: 1 (InfiniServ HCA port number)<br />
targetIOCProfileIdString: "FVIC in Chassis <Chassis GUID>, Slot <Slot #>, IOC <IOC #>"
initiatorExtension: 1<br />
end<br />
session<br />
begin<br />
card: 0 (InfiniServ HCA card number)<br />
port: 1 (InfiniServ HCA port number)<br />
targetIOCProfileIdString: "FVIC in Chassis <Chassis GUID>, Slot <Slot #>, IOC <IOC #>"
initiatorExtension: 2 (Here the extension should be different
if using the same IOC in this adapter for FVIC, so that
separate maps can be created for each session.)
end<br />
adapter<br />
begin<br />
description: "FC port Failover"
end<br />
On the VIO hardware side, the following needs to be ensured:<br />
• The target device is discovered and configured for each of the ports that is<br />
involved in the failover.<br />
• The SRP Initiator is discovered and configured once for each different<br />
initiatorExtension.<br />
• Each map should use a different Configured Device; e.g., Configured Device
1 has the target discovered over FC Port 1, and Configured Device 2
has the target discovered over FC Port 2.
Failover Configuration File 3: Failing over from a port on a<br />
VIO hardware card to a port on a different VIO hardware card<br />
within the same Virtual I/O chassis<br />
session<br />
begin<br />
card: 0 (InfiniServ HCA card number)<br />
port: 1 (InfiniServ HCA port number)<br />
targetIOCProfileIdString: "FVIC in Chassis <Chassis GUID>, Slot <Slot #>, IOC <IOC #>"
initiatorExtension: 1<br />
end<br />
session<br />
begin<br />
card: 0 (InfiniServ HCA card number)<br />
port: 1 (InfiniServ HCA port number)<br />
targetIOCProfileIdString: "FVIC in Chassis <Chassis GUID>, Slot <Slot #>, IOC <IOC #>" (Slot number differs to indicate a different
VIO card)
initiatorExtension: 1 (Here the initiator extension can be the<br />
same as in the previous definition, because the SRP map is<br />
being defined on a different FC gateway card)<br />
end<br />
adapter<br />
begin<br />
description: "FC Port Failover"
end<br />
On the VIO hardware side, the following needs to be ensured on each FVIC
involved in the failover:
• The target device is discovered and configured through the appropriate FC
port.
• The SRP Initiator is discovered and configured once for the proper
initiatorExtension.
• The SRP map created for the initiator connects to the same target.
Failover Configuration File 4: Failing over from a port on a<br />
VIO hardware card to a port on a different VIO hardware<br />
card in a different Virtual I/O chassis<br />
session<br />
begin<br />
card: 0 (InfiniServ HCA card number)<br />
port: 1 (InfiniServ HCA port number)<br />
targetIOCProfileIdString: "FVIC in Chassis <Chassis GUID>, Slot <Slot #>, IOC <IOC #>"
initiatorExtension: 1<br />
end<br />
session<br />
begin<br />
card: 0 (InfiniServ HCA card number)<br />
port: 1 (InfiniServ HCA port number)<br />
targetIOCProfileIdString: "FVIC in Chassis <Chassis GUID>, Slot <Slot #>, IOC <IOC #>" (Chassis GUID differs to indicate a card in a
different chassis)
initiatorExtension: 1 (Here the initiator extension can be the
same as in the previous definition, because the SRP map is
being defined on a different FC gateway card)
end<br />
adapter<br />
begin<br />
description: "FC Port Failover"
end<br />
On the VIO hardware side, the following needs to be ensured on each FVIC
involved in the failover:
• The target device is discovered and configured through the appropriate FC
port.
• The SRP Initiator is discovered and configured once for the proper
initiatorExtension.
• The SRP map created for the initiator connects to the same target.
Configuring Fibre Channel Load Balancing<br />
The following examples display typical scenarios for how to configure Fibre<br />
Channel load balancing.<br />
In the first example, traffic going to any Fibre Channel target device for which both
ports of the VIO hardware card have a valid map is split between the two ports
of the VIO hardware card. If one of the VIO hardware ports goes down, all of
the traffic will go over the remaining port that is up.
1 Adapter Port and 2 Ports on a Single VIO<br />
session<br />
begin<br />
card: 0<br />
port: 1<br />
targetIOCProfileIdString: "FVIC in Chassis<br />
0x00066A0050000123, Slot 1, IOC 1"<br />
initiatorExtension: 3<br />
end<br />
session<br />
begin<br />
card: 0<br />
port: 1<br />
targetIOCProfileIdString: "FVIC in Chassis<br />
0x00066A0050000123, Slot 1, IOC 2"<br />
initiatorExtension: 3<br />
end<br />
adapter<br />
begin<br />
description: "Test Device"<br />
roundrobinmode: 1<br />
end<br />
2 Adapter Ports and 2 Ports on a Single VIO Module<br />
In this example, traffic is load balanced between adapter Port 2/VIO hardware
Port 1 and adapter Port 1/VIO hardware Port 2. If one of the sessions goes down
(due to an InfiniBand cable failure or an FC cable failure), all traffic will begin using
the other session.
session<br />
begin<br />
card: 0<br />
port: 2<br />
targetIOCProfileIdString: "FVIC in Chassis<br />
0x00066A0050000123, Slot 1, IOC 1"<br />
initiatorExtension: 3<br />
end<br />
session<br />
begin<br />
card: 0<br />
port: 1<br />
targetIOCProfileIdString: "FVIC in Chassis<br />
0x00066A0050000123, Slot 1, IOC 2"<br />
initiatorExtension: 3<br />
end<br />
adapter<br />
begin<br />
description: "Test Device"<br />
roundrobinmode: 1<br />
end<br />
Using the roundrobinmode Parameter<br />
In this example, the two sessions use different VIO hardware cards as well as<br />
different adapter ports. Traffic will be load-balanced between the two sessions. If<br />
there is a failure in one of the sessions (e.g., one of the VIO hardware cards is<br />
rebooted) traffic will begin using the other session.<br />
session<br />
begin<br />
card: 0<br />
port: 2<br />
targetIOCProfileIdString: "FVIC in Chassis<br />
0x00066A005000011D, Slot 1, IOC 1"<br />
initiatorExtension: 2<br />
end<br />
session<br />
begin<br />
card: 0<br />
port: 1<br />
targetIOCProfileIdString: "FVIC in Chassis<br />
0x00066A005000011D, Slot 2, IOC 1"<br />
initiatorExtension: 2<br />
end<br />
adapter<br />
begin<br />
description: "Test Device"<br />
roundrobinmode: 1<br />
end<br />
Configuring SRP for Native InfiniBand Storage<br />
1. Review the output of ib_qlgc_srp_query:
QLogic Corporation. Virtual HBA (SRP) SCSI Query Application,
version 1.3.0.0.1
1 IB Host Channel Adapter present in system.
HCA Card 1 : 0x0002c9020026041c<br />
Port 1 GUID : 0x0002c9020026041d<br />
Port 2 GUID : 0x0002c9020026041e<br />
SRP Targets :<br />
SRP IOC Profile : Native IB Storage SRP Driver<br />
SRP IOC GUID : 0x00066a01dd000021<br />
SRP IU SIZE : 320<br />
SRP IU SG SIZE: 15<br />
SRP IO CLASS : 0xff00<br />
service 0 : name SRP.T10:0000000000000001 id<br />
0x0000494353535250<br />
service 1 : name SRP.T10:0000000000000002 id<br />
0x0000494353535250<br />
service 2 : name SRP.T10:0000000000000003 id<br />
0x0000494353535250<br />
service 3 : name SRP.T10:0000000000000004 id<br />
0x0000494353535250<br />
Target Path(s):<br />
HCA 0 Port 1 0x0002c9020026041d -> Target Port GID<br />
0xfe8000000000000000066a11dd000021<br />
HCA 0 Port 2 0x0002c9020026041e -> Target Port GID<br />
0xfe8000000000000000066a11dd000021<br />
2. Edit /etc/sysconfig/qlgc_srp.cfg to add this information.<br />
# service : name SRP.T10:0000000000000001 id<br />
0x0000494353535250<br />
session<br />
begin<br />
card: 0<br />
port: 1<br />
#portGuid: 0x0002c903000010f1<br />
initiatorExtension: 1<br />
targetIOCGuid: 0x00066a01e0000149<br />
targetIOCProfileIdString: "Native IB Storage SRP Driver"
targetPortGid: 0xfe8000000000000000066a01e0000149<br />
targetExtension: 0x0000000000000001<br />
SID: 0x0000494353535250<br />
IOClass: 0x0100<br />
end<br />
session<br />
begin<br />
card: 0<br />
port: 2<br />
#portGuid: 0x0002c903000010f2<br />
initiatorExtension: 1<br />
targetIOCGuid: 0x00066a01e0000149<br />
targetIOCProfileIdString: "Native IB Storage SRP Driver"
targetPortGid: 0xfe8000000000000000066a01e0000149<br />
targetExtension: 0x0000000000000001<br />
SID: 0x0000494353535250<br />
IOClass: 0x0100<br />
end<br />
adapter<br />
begin<br />
adapterIODepth: 1000<br />
lunIODepth: 16<br />
adapterMaxIO: 128<br />
adapterMaxLUNs: 512<br />
adapterNoConnectTimeout: 60<br />
adapterDeviceRequestTimeout: 2<br />
# set to 1 if you want round robin load balancing<br />
roundrobinmode: 0<br />
# set to 1 if you do not want target connectivity verification
noverify: 0
description: "SRP Virtual HBA 0"<br />
end<br />
3. Note the correlation between the output of ib_qlgc_srp_query and
qlgc_srp.cfg:
Target Path(s):<br />
HCA 0 Port 1 0x0002c9020026041d -> Target Port GID<br />
0xfe8000000000000000066a11dd000021<br />
HCA 0 Port 2 0x0002c9020026041e -> Target Port GID<br />
0xfe8000000000000000066a11dd000021<br />
qlgc_srp.cfg:<br />
session<br />
begin<br />
. . . .<br />
targetIOCGuid: 0x0002C90200400098<br />
targetExtension: 0x0002C90200400098<br />
end<br />
adapter<br />
begin<br />
description: "Native IB storage"<br />
end<br />
Notes
• There is a sample configuration in qlgc_srp.cfg.
- The correct targetExtension must be added to the session.
- It is important to use the IOC ID method, since most Profile ID strings
are not guaranteed to be unique.
• Other possible parameters:
- initiatorExtension may be used by the storage device to identify
the host.
Additional Details<br />
• All LUNs found are reported to the Linux SCSI mid-layer.<br />
• Linux may need the max_scsi_luns (2.4 kernels) or max_luns (2.6 kernels)<br />
parameter configured in scsi_mod.<br />
Troubleshooting<br />
For troubleshooting information, refer to “Troubleshooting SRP Issues” on<br />
page G-8.<br />
OFED SRP Configuration<br />
To use OFED SRP, follow these steps:<br />
1. Add the line SRP_LOAD=yes to the module list in
/etc/infiniband/openib.conf to have the SRP module loaded automatically.
2. Discover the SRP devices on your fabric by running this command (as a root<br />
user):<br />
ibsrpdm<br />
In the output, look for lines similar to these:<br />
GUID: 0002c90200402c04<br />
ID: LSI Storage Systems SRP Driver 200400a0b8114527<br />
service entries: 1<br />
service[ 0]: 200400a0b8114527 / SRP.T10:200400A0B8114527<br />
GUID: 0002c90200402c0c<br />
ID: LSI Storage Systems SRP Driver 200500a0b8114527<br />
service entries: 1<br />
service[ 0]: 200500a0b8114527 / SRP.T10:200500A0B8114527<br />
GUID: 21000001ff040bf6<br />
ID: Data Direct Networks SRP Target System<br />
service entries: 1<br />
service[ 0]: f60b04ff01000021 / SRP.T10:21000001ff040bf6<br />
Note that not all of the output is shown here; the key elements are those that
are matched in Step 3.
3. Choose the device you want to use, and run the command again with the -c<br />
option (as a root user):<br />
# ibsrpdm -c<br />
id_ext=200400A0B8114527,ioc_guid=0002c90200402c04,dgid=fe8000<br />
00000000000002c90200402c05,pkey=ffff,service_id=200400a0b8114<br />
527<br />
id_ext=200500A0B8114527,ioc_guid=0002c90200402c0c,dgid=fe8000<br />
00000000000002c90200402c0d,pkey=ffff,service_id=200500a0b8114<br />
527<br />
id_ext=21000001ff040bf6,ioc_guid=21000001ff040bf6,dgid=fe8000<br />
000000000021000001ff040bf6,pkey=ffff,service_id=f60b04ff01000<br />
021<br />
4. Find the result that corresponds to the target you want, and echo it into the<br />
add_target file:<br />
echo<br />
"id_ext=21000001ff040bf6,ioc_guid=21000001ff040bf6,dgid=fe800<br />
0000000000021000001ff040bf6,pkey=ffff,service_id=f60b04ff0100<br />
0021,initiator_ext=0000000000000001" ><br />
/sys/class/infiniband_srp/srp-ipath0-1/add_target<br />
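When several targets from the ibsrpdm -c output must be added, the string construction in this step can be factored into a small helper. The following is a sketch only: add_target_line is a hypothetical name, and the extension value shown is the example one from this step.

```shell
# Sketch: append an initiator extension to one comma-separated line of
# 'ibsrpdm -c' output, producing the string to echo into add_target.
add_target_line() {
    line="$1"   # one line from 'ibsrpdm -c'
    ext="$2"    # initiator extension, e.g. 0000000000000001
    printf '%s,initiator_ext=%s\n' "$line" "$ext"
}

# Example (prints the string that would be redirected into
# /sys/class/infiniband_srp/<device>/add_target):
add_target_line \
    "id_ext=21000001ff040bf6,ioc_guid=21000001ff040bf6,pkey=ffff" \
    "0000000000000001"
```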
5. Look for the newly created devices in the /proc/partitions file. The file<br />
will look similar to this example (the partition names may vary):<br />
# cat /proc/partitions<br />
major minor #blocks name<br />
8 64 142325760 sde<br />
8 65 142319834 sde1<br />
8 80 71162880 sdf<br />
8 81 71159917 sdf1<br />
8 96 20480 sdg<br />
8 97 20479 sdg1<br />
6. Create a mount point (as root) where you will mount the SRP device. For<br />
example:<br />
mkdir /mnt/targetname<br />
mount /dev/sde1 /mnt/targetname<br />
NOTE:<br />
Use sde1 rather than sde. See the mount(8) man page for more<br />
information on creating mount points.<br />
E<br />
Integration with a Batch<br />
Queuing System<br />
Most cluster systems use some kind of batch queuing system as an orderly way to<br />
provide users with access to the resources they need to meet their job’s<br />
performance requirements. One task of the cluster administrator is to allow users<br />
to submit MPI jobs through these batch queuing systems. Two methods are<br />
described in this document:<br />
• Use mpiexec within the Portable Batch System (PBS) environment.<br />
• Invoke a script, similar to mpirun, within the SLURM context to submit MPI<br />
jobs. A sample is provided in “Using SLURM for Batch Queuing” on<br />
page E-2.<br />
Using mpiexec with PBS<br />
mpiexec can be used as a replacement for mpirun within a PBS cluster<br />
environment. The PBS software performs job scheduling.<br />
For PBS-based batch systems, QLogic MPI processes can be spawned using the
mpiexec utility distributed and maintained by the Ohio Supercomputer Center
(OSC).
Starting with mpiexec version 0.84, MPI applications compiled and linked with
QLogic MPI can use mpiexec and PBS's Task Manager (TM) interface to spawn
and correctly terminate parallel jobs.
To download the latest version of mpiexec, go to:<br />
http://www.osc.edu/~pw/mpiexec/<br />
To build mpiexec for QLogic MPI and install it in /usr/local, type:
$ tar zxvf mpiexec-0.84.tgz<br />
$ cd mpiexec-0.84<br />
$ ./configure --enable-default-comm=mpich-psm && gmake all install<br />
NOTE:<br />
This level of support is specific to QLogic MPI, and not to other MPIs that
currently support InfiniPath.<br />
For more usage information, see the OSC mpiexec documentation.<br />
For more information on PBS, go to: http://www.pbsgridworks.com/<br />
Using SLURM for Batch Queuing<br />
The following is an example of some of the functions that a batch queuing
script might perform. The example is in the context of the Simple Linux Utility
for Resource Management (SLURM) developed at Lawrence Livermore National
Laboratory. These functions assume the use of the bash shell. The following
script is called batch_mpirun:
#! /bin/sh
# Very simple example batch script for QLogic MPI, using slurm
# (http://www.llnl.gov/linux/slurm/)
# Invoked as:
# batch_mpirun #cpus mpi_program_name mpi_program_args ...
#
np=$1 mpi_prog="$2" # assume arguments to script are correct
shift 2 # program args are now $@
eval `srun --allocate --ntasks=$np --no-shell`
mpihosts_file=`mktemp -p /tmp mpihosts_file.XXXXXX`
srun --jobid=${SLURM_JOBID} hostname -s | sort | uniq -c \
| awk '{printf "%s:%s\n", $2, $1}' > $mpihosts_file
mpirun -np $np -m $mpihosts_file "$mpi_prog" "$@"
exit_code=$?
scancel ${SLURM_JOBID}
rm -f $mpihosts_file
exit $exit_code
In the following sections, the setup and the various script functions are discussed<br />
in more detail.<br />
Allocating Resources<br />
When the mpirun command starts, it requires specification of the number of node<br />
programs it must spawn (via the -np option) and specification of an mpihosts<br />
file listing the nodes that the node programs run on. (See “Environment for Node<br />
Programs” on page 4-19 for more information.) Since performance is usually<br />
important, a user might require that his node program be the only application<br />
running on each node CPU. In a typical batch environment, the MPI user would<br />
still specify the number of node programs, but would depend on the batch system<br />
to allocate specific nodes when the required number of CPUs become available.<br />
Thus, batch_mpirun would take at least an argument specifying the number of<br />
node programs and an argument specifying the MPI program to be executed. For<br />
example:<br />
$ batch_mpirun -np n my_mpi_program<br />
After parsing the command line arguments, the next step of batch_mpirun is to<br />
request an allocation of n processors from the batch system. In SLURM, this uses<br />
the command:<br />
eval `srun --allocate --ntasks=$np --no-shell`
Make sure to use back quotes rather than normal single quotes. $np is the shell<br />
variable that your script has set from the parsing of its command line options. The<br />
--no-shell option to srun prevents SLURM from starting a subshell. The srun<br />
command is run with eval to set the SLURM_JOBID shell variable from the output<br />
of the srun command.<br />
With these specified arguments, the SLURM function srun blocks until there are<br />
$np processors available to commit to the caller. When the requested resources<br />
are available, this command opens a new shell and allocates the number of<br />
processors to the requestor.<br />
Generating the mpihosts File<br />
Once the batch system has allocated the required resources, your script must<br />
generate an mpihosts file that contains a list of nodes that are used. To do this,<br />
the script must determine which nodes the batch system has allocated, and how<br />
many processes can be started on each node. This is the part of the script<br />
batch_mpirun that performs these tasks, for example:<br />
mpihosts_file=`mktemp -p /tmp mpihosts_file.XXXXXX`
srun --jobid=${SLURM_JOBID} hostname -s | sort | uniq -c \
| awk '{printf "%s:%s\n", $2, $1}' > $mpihosts_file
The first command creates a temporary hosts file with a random name, and
assigns the generated name to the variable mpihosts_file.
The next instance of the SLURM srun command runs hostname -s once for<br />
each process slot that SLURM has allocated. If SLURM has allocated two slots on<br />
one node, hostname -s is output twice for that node.<br />
The sort | uniq -c component determines the number of times each unique<br />
line was printed. The awk command converts the result into the mpihosts file<br />
format used by mpirun. Each line consists of a node name, a colon, and the<br />
number of processes to start on that node.<br />
NOTE:<br />
This is one of two formats that the file can use. See “Console I/O in MPI<br />
Programs” on page 4-18 for more information.<br />
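The sort | uniq -c | awk transformation above can be tried on its own with canned hostnames (the node names below are invented for illustration):

```shell
# Simulate the output of 'srun ... hostname -s': one line per allocated
# process slot. The pipeline collapses repeats into mpirun's
# "nodename:processcount" mpihosts format.
printf '%s\n' node01 node01 node02 node01 node02 \
    | sort | uniq -c \
    | awk '{printf "%s:%s\n", $2, $1}'
# Output:
# node01:3
# node02:2
```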
Simple Process Management<br />
At this point, the script has enough information to be able to run an MPI program.<br />
The next step is to start the program when the batch system is ready, and notify<br />
the batch system when the job completes. This is done in the final part of<br />
batch_mpirun, for example:<br />
mpirun -np $np -m $mpihosts_file "$mpi_prog" "$@"
exit_code=$?
scancel ${SLURM_JOBID}<br />
rm -f $mpihosts_file<br />
exit $exit_code<br />
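The pattern in this fragment (run the job, save its exit status, clean up, then propagate the saved status) can be isolated for clarity. This is a sketch; run_and_cleanup is a hypothetical helper, not part of the InfiniPath software:

```shell
# Run a command, save its exit status, perform cleanup, then return
# the command's status rather than the cleanup's.
run_and_cleanup() {
    "$@"                          # run the job (e.g., mpirun ...)
    exit_code=$?                  # save the status before cleanup
    rm -f /tmp/mpihosts_demo.$$   # stand-in for the real cleanup steps
    return $exit_code             # report the job's status, not rm's
}
```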
Clean Termination of MPI Processes<br />
The InfiniPath software normally ensures clean termination of all MPI programs<br />
when a job ends, but in some rare circumstances an MPI process may remain<br />
alive, and potentially interfere with future MPI jobs. To avoid this problem, run a<br />
script before and after each batch job that kills all unwanted processes. QLogic
does not provide such a script, but it is useful to know how to find out which<br />
processes on a node are using the <strong>QLogic</strong> interconnect. The easiest way to do<br />
this is with the fuser command, which is normally installed in /sbin.<br />
Run these commands as a root user to ensure that all processes are reported.<br />
# /sbin/fuser -v /dev/ipath<br />
/dev/ipath: 22648m 22651m<br />
In this example, processes 22648 and 22651 are using the QLogic interconnect. It
is also possible to use this command (as a root user):<br />
# lsof /dev/ipath<br />
This command displays a list of processes using InfiniPath. Additionally, to get all<br />
processes, including stats programs, ipath_sma, diags, and others, run the<br />
program in this way:<br />
# /sbin/fuser -v /dev/ipath*<br />
lsof can also take the same form:<br />
# lsof /dev/ipath*<br />
The following command terminates all processes using the QLogic interconnect:
# /sbin/fuser -k /dev/ipath<br />
For more information, see the man pages for fuser(1) and lsof(8).<br />
Note that hard and explicit program termination, such as kill -9 on the mpirun
Process ID (PID), may result in QLogic MPI being unable to guarantee that the
/dev/shm shared memory file is properly removed. If many stale files
accumulate on each node, an error message can appear at startup:
node023:6.Error creating shared memory object in shm_open(/dev/shm<br />
may have stale shm files that need to be removed):<br />
If this occurs, administrators should clean up all stale files by using this command:<br />
# rm -rf /dev/shm/psm_shm.*<br />
See “Error Creating Shared Memory Object” on page F-23 for more information.<br />
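QLogic does not supply a cleanup script, but the steps described in this section can be collected into one. The following is a minimal sketch; cleanup_ipath is a hypothetical name, and the shared-memory directory is a parameter so the function can be exercised safely outside a compute node:

```shell
# Kill processes holding the QLogic interconnect open, then remove
# stale PSM shared-memory files (normally under /dev/shm).
cleanup_ipath() {
    shm_dir="${1:-/dev/shm}"
    # Requires root; skip quietly if the device is absent.
    if [ -e /dev/ipath ]; then
        /sbin/fuser -k /dev/ipath 2>/dev/null || true
    fi
    # Remove stale shared-memory segments left by killed MPI jobs.
    rm -rf "$shm_dir"/psm_shm.* 2>/dev/null || true
    return 0
}
```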
Lock Enough Memory on Nodes when Using<br />
SLURM<br />
This section is identical to information provided in “Lock Enough Memory on<br />
Nodes When Using a Batch Queuing System” on page F-22. It is repeated here<br />
for your convenience.<br />
QLogic MPI requires the ability to lock (pin) memory during data transfers on each
compute node. This is normally done via /etc/initscript, which is created or<br />
modified during the installation of the infinipath RPM (setting a limit of<br />
128 MB, with the command ulimit -l 131072).<br />
Some batch systems, such as SLURM, propagate the user’s environment from<br />
the node where you start the job to all the other nodes. For these batch systems,<br />
you may need to make the same change on the node from which you start your<br />
batch jobs.<br />
If this file is not present or the node has not been rebooted after the infinipath<br />
RPM has been installed, a failure message similar to one of the following will be<br />
generated.<br />
The following message displays during installation:<br />
$ mpirun -np 2 -m ~/tmp/sm mpi_latency 1000 1000000<br />
iqa-19:0.ipath_userinit: mmap of pio buffers at 100000 failed:<br />
Resource temporarily unavailable<br />
iqa-19:0.Driver initialization failure on /dev/ipath<br />
iqa-20:1.ipath_userinit: mmap of pio buffers at 100000 failed:<br />
Resource temporarily unavailable<br />
iqa-20:1.Driver initialization failure on /dev/ipath<br />
The following message displays after installation:<br />
$ mpirun -m ~/tmp/sm -np 2 mpi_latency 1000 1000000
node-00:1.ipath_update_tid_err: failed: Cannot allocate memory<br />
mpi_latency:<br />
/fs2/scratch/infinipath-build-1.3/mpi-1.3/mpich/psm/src<br />
mq_ips.c:691:<br />
mq_ipath_sendcts: Assertion ‘rc == 0’ failed. MPIRUN: Node program<br />
unexpectedly quit. Exiting.<br />
You can check the ulimit -l on all the nodes by running ipath_checkout. A<br />
warning similar to this displays if ulimit -l is less than 4096:<br />
!!!ERROR!!! Lockable memory less than 4096KB on x nodes<br />
To fix this error, install the infinipath RPM on the node, and reboot it to ensure<br />
that /etc/initscript is run.<br />
Alternately, you can create your own /etc/initscript and set the ulimit<br />
there.<br />
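As a quick local check, the limit comparison described in this section can be scripted. This is a sketch; check_memlock is a hypothetical helper, and the 131072 KB threshold is the ulimit -l 131072 value set by /etc/initscript:

```shell
# Warn if the lockable-memory limit (in KB, as reported by 'ulimit -l')
# is below the 128 MB that /etc/initscript is expected to set.
check_memlock() {
    limit="$1"
    if [ "$limit" = "unlimited" ]; then
        echo "OK"
    elif [ "$limit" -ge 131072 ]; then
        echo "OK"
    else
        echo "WARNING: lockable memory only ${limit}KB"
    fi
}

check_memlock "$(ulimit -l)"
```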
F<br />
Troubleshooting<br />
This appendix describes some of the tools you can use to diagnose and fix<br />
problems. The following topics are discussed:<br />
• Using LEDs to Check the State of the Adapter<br />
• BIOS Settings<br />
• Kernel and Initialization Issues<br />
• OpenFabrics and InfiniPath Issues<br />
• System Administration Troubleshooting<br />
• Performance Issues<br />
• <strong>QLogic</strong> MPI Troubleshooting<br />
Troubleshooting information for hardware installation is found in the QLogic
InfiniBand Adapter Hardware Installation Guide; troubleshooting information for
software installation is found in the QLogic Fabric Software Installation Guide.
Using LEDs to Check the State of the Adapter<br />
The LEDs function as link and data indicators once the InfiniPath software has<br />
been installed, the driver has been loaded, and the fabric is being actively<br />
managed by a subnet manager.<br />
Table F-1 describes the LED states. The green LED indicates the physical link<br />
signal; the amber LED indicates the link. The green LED normally illuminates first.<br />
The normal state is Green On, Amber On. The QLE7240 and QLE7280 have an<br />
additional state, as shown in Table F-1.<br />
Table F-1. LED Link and Data Indicators

Green OFF, Amber OFF:
The switch is not powered up. The software is neither installed nor started.
Loss of signal. Verify that the software is installed and configured with
ipath_control -i. If correct, check both cable connectors.

Green ON, Amber OFF:
Signal detected and the physical link is up. Ready to talk to an SM to bring
the link fully up. If this state persists, the SM may be missing or the link
may not be configured. Use ipath_control -i to verify the software state. If
all InfiniBand adapters are in this state, then the SM is not running. Check
the SM configuration, or install and run opensmd.

Green ON, Amber ON:
The link is configured, properly connected, and ready to receive data and
link packets.

Green BLINKING (quickly), Amber ON:
Indicates traffic.

Green BLINKING (see note a), Amber BLINKING:
Locates the adapter. This feature is controlled by ipath_control -b [On | Off].

Table Notes:
a. This feature is available only on the QLE7340, QLE7342, QLE7240, and
QLE7280 adapters.
BIOS Settings<br />
This section covers issues related to BIOS settings. The most important setting is
Advanced Configuration and Power Interface (ACPI). This setting must be<br />
enabled. If ACPI has been disabled, it may result in initialization problems, as<br />
described in “InfiniPath Interrupts Not Working” on page F-3.<br />
You can check and adjust the BIOS settings using the BIOS Setup utility. Check<br />
the hardware documentation that came with your system for more information.<br />
Kernel and Initialization Issues<br />
Issues that may prevent the system from coming up properly are described in the<br />
following sections.<br />
Driver Load Fails Due to Unsupported Kernel<br />
If you try to load the InfiniPath driver on a kernel that InfiniPath software does not<br />
support, the load fails. Error messages similar to this display:<br />
modprobe: error inserting<br />
’/lib/modules/2.6.3-1.1659-smp/updates/kernel/drivers/infiniband/h<br />
w/qib/ib_qib.ko’: -1 Invalid module format<br />
To correct this problem, install one of the appropriate supported Linux kernel<br />
versions, then reload the driver.<br />
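The mismatch can also be spotted before loading by comparing the running kernel with the kernel the module was built for. This is a sketch; check_kernel_match is a hypothetical helper, and the comparison is factored out so it can be exercised without the module installed:

```shell
# Compare the running kernel release with the module's build kernel.
check_kernel_match() {
    running="$1"
    built="$2"
    if [ "$running" = "$built" ]; then
        echo "match"
    else
        echo "mismatch: running $running, module built for $built"
    fi
}

# On a real node (requires the ib_qib module file to be present):
#   check_kernel_match "$(uname -r)" \
#       "$(/sbin/modinfo -F vermagic ib_qib | awk '{print $1}')"
```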
Rebuild or Reinstall Drivers if Different Kernel Installed<br />
If you upgrade the kernel, then you must reboot and then rebuild or reinstall the<br />
InfiniPath kernel modules (drivers). QLogic recommends using the InfiniBand
Fabric Suite Software Installation TUI to perform this rebuild or reinstall. Refer to
the QLogic Fabric Software Installation Guide for more information.
InfiniPath Interrupts Not Working<br />
The InfiniPath driver cannot configure the InfiniPath link to a usable state unless<br />
interrupts are working. Check for this problem with the command:<br />
$ grep ib_qib /proc/interrupts<br />
Normal output is similar to this:<br />
CPU0 CPU1<br />
185: 364263 0 IO-APIC-level ib_qib<br />
NOTE:<br />
The output you see may vary depending on board type, distribution, or<br />
update level.<br />
If there is no output at all, the driver initialization failed. For more information on<br />
driver problems, see “Driver Load Fails Due to Unsupported Kernel” on page F-3<br />
or “InfiniPath ib_qib Initialization Failure” on page F-5.<br />
If the output is similar to one of these lines, then interrupts are not being delivered<br />
to the driver.<br />
66: 0 0 PCI-MSI ib_qib<br />
185: 0 0 IO-APIC-level ib_qib<br />
The following message appears when the driver has initialized successfully, but no<br />
interrupts are seen within 5 seconds.<br />
ib_qib 0000:82:00.0: No interrupts detected.<br />
A zero count in all CPU columns means that no InfiniPath interrupts have been<br />
delivered to the processor.<br />
The possible causes of this problem are:<br />
• Booting the Linux kernel with ACPI disabled on either the boot command<br />
line or in the BIOS configuration<br />
• Other InfiniPath initialization failures<br />
To check if the kernel was booted with the noacpi or pci=noacpi option, use<br />
this command:<br />
$ grep -i acpi /proc/cmdline<br />
If output is displayed, fix the kernel boot command line so that ACPI is enabled.<br />
This command line can be set in various ways, depending on your distribution. If<br />
no output is displayed, check that ACPI is enabled in your BIOS settings.<br />
To track down other initialization failures, see “InfiniPath ib_qib Initialization<br />
Failure” on page F-5.<br />
The program ipath_checkout can also help flag these kinds of problems. See<br />
“ipath_checkout” on page I-10 for more information.<br />
OpenFabrics Load Errors if ib_qib Driver Load Fails<br />
When the ib_qib driver fails to load, the other OpenFabrics drivers/modules will<br />
load and be shown by lsmod, but commands like ibstatus, ibv_devinfo,<br />
and ipath_control -i will fail as follows:<br />
# ibstatus<br />
Fatal error: device '*': sys files not found<br />
(/sys/class/infiniband/*/ports)<br />
# ibv_devinfo<br />
libibverbs: Fatal: couldn't read uverbs ABI version.<br />
No IB devices found<br />
# ipath_control -i<br />
InfiniPath driver not loaded <br />
No InfiniPath info available<br />
InfiniPath ib_qib Initialization Failure<br />
There may be cases where ib_qib was not properly initialized. Symptoms of this<br />
may show up in error messages from an MPI job or another program. Here is a<br />
sample command and error message:<br />
$ mpirun -np 2 -m ~/tmp/mbu13 osu_latency<br />
&lt;node_name&gt;:ipath_userinit: assign_port command failed: Network is<br />
down<br />
&lt;node_name&gt;:can't open /dev/ipath, network down<br />
This will be followed by messages of this type after 60 seconds:<br />
MPIRUN: 1 rank has not yet exited 60 seconds<br />
after rank 0 (node &lt;node_name&gt;) exited without reaching<br />
MPI_Finalize().<br />
MPIRUN:Waiting at most another 60 seconds for<br />
the remaining ranks to do a clean shutdown before terminating 1<br />
node processes.<br />
If this error appears, check to see if the InfiniPath driver is loaded by typing:<br />
$ lsmod | grep ib_qib<br />
If no output is displayed, the driver did not load for some reason. In this case, try<br />
the following commands (as root):<br />
# modprobe -v ib_qib<br />
# lsmod | grep ib_qib<br />
# dmesg | grep -i ib_qib | tail -25<br />
The output will indicate whether the driver has loaded. Printing out messages<br />
using dmesg may help to locate any problems with ib_qib.<br />
If the driver loaded, but MPI or other programs are not working, check to see if<br />
problems were detected during the driver and <strong>QLogic</strong> hardware initialization with<br />
the command:<br />
$ dmesg | grep -i ib_qib<br />
This command may generate more than one screen of output.<br />
Also, check the link status with the command:<br />
$ cat /sys/class/infiniband/ipath*/device/status_str<br />
These commands are normally executed by the ipathbug-helper script, but<br />
running them separately may help locate the problem.<br />
See also “status_str” on page I-19 and “ipath_checkout” on page I-10.<br />
MPI Job Failures Due to Initialization Problems<br />
If one or more nodes do not have the interconnect in a usable state, messages<br />
similar to the following appear when the MPI program is started:<br />
userinit: userinit ioctl failed: Network is down [1]: device init<br />
failed<br />
userinit: userinit ioctl failed: Fatal Error in keypriv.c(520):<br />
device init failed<br />
These messages may indicate that a cable is not connected, the switch is down,<br />
the subnet manager (SM) is not running, or that a hardware error occurred.<br />
OpenFabrics and InfiniPath Issues<br />
The following sections cover issues related to OpenFabrics (including Subnet<br />
Managers) and InfiniPath.<br />
Stop InfiniPath Services Before Stopping/Restarting<br />
InfiniPath<br />
The following InfiniPath services must be stopped before<br />
stopping/starting/restarting InfiniPath:<br />
• <strong>QLogic</strong> Fabric Manager<br />
• OpenSM<br />
• <strong>QLogic</strong> MPI<br />
• VNIC<br />
• SRP<br />
Here is a sample command and the corresponding error messages:<br />
# /etc/init.d/openibd stop<br />
Unloading infiniband modules: sdp cm umad uverbs ipoib sa ipath mad<br />
coreFATAL:Module ib_umad is in use.<br />
Unloading infinipath modules FATAL: Module ib_qib is in use.<br />
[FAILED]<br />
Manual Shutdown or Restart May Hang if NFS in Use<br />
If you are using NFS over IPoIB and use the manual /etc/init.d/openibd<br />
stop (or restart) command, the shutdown process may silently hang on the<br />
fuser command contained within the script. This is because fuser cannot<br />
traverse down the tree from the mount point once the mount point has<br />
disappeared. To remedy this problem, the fuser process itself needs to be killed.<br />
Run the following command, substituting the process ID (PID) of the hung fuser<br />
process, either as a root user or as the user who is running the fuser process:<br />
# kill -9 &lt;fuser_PID&gt;<br />
The shutdown will continue.<br />
This problem is not seen if the system is rebooted or if the filesystem has already<br />
been unmounted before stopping InfiniPath.<br />
Load and Configure IPoIB Before Loading SDP<br />
SDP generates Connection Refused errors if it is loaded before IPoIB has been<br />
loaded and configured. To solve the problem, load and configure IPoIB first.<br />
Set $IBPATH for OpenFabrics Scripts<br />
The environment variable $IBPATH must be set to /usr/bin. If this has not been<br />
set, or if you have it set to a location other than the installed location, you may see<br />
error messages similar to the following when running some OpenFabrics scripts:<br />
/usr/bin/ibhosts: line 30: /usr/local/bin/ibnetdiscover: No such<br />
file or directory<br />
For the OpenFabrics commands supplied with this InfiniPath release, set the<br />
variable (if it has not been set already) to /usr/bin, as follows:<br />
$ export IBPATH=/usr/bin<br />
SDP Module Not Loading<br />
If the settings for debug level and the zero copy threshold from InfiniPath<br />
release 2.0 are present in the release 2.2 /etc/modprobe.conf file (RHEL) or<br />
/etc/modprobe.conf.local (SLES) file, the SDP module may not load:<br />
options ib_sdp sdp_debug_level=4<br />
sdp_zcopy_thrsh_src_default=10000000<br />
To solve the problem, remove this line.<br />
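The stale line can be removed with a single sed command. The sketch below is a safe demonstration on a throwaway copy in /tmp; on a real system, point it at /etc/modprobe.conf (RHEL) or /etc/modprobe.conf.local (SLES) after making a backup:<br />

```shell
# Build a sample file containing the stale ib_sdp options line.
cat > /tmp/modprobe_demo.conf <<'EOF'
alias ib0 ib_ipoib
options ib_sdp sdp_debug_level=4 sdp_zcopy_thrsh_src_default=10000000
EOF

# Delete every line that sets ib_sdp module options.
sed -i '/^options ib_sdp/d' /tmp/modprobe_demo.conf

# Confirm the line is gone.
grep 'options ib_sdp' /tmp/modprobe_demo.conf || echo "no ib_sdp options remain"
```

After editing the real file, reload the SDP module so that it starts without the obsolete parameters.<br />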
ibsrpdm Command Hangs when Two <strong>Host</strong> Channel<br />
Adapters are Installed but Only Unit 1 is Connected<br />
to the Switch<br />
If multiple infiniband adapters (unit 0 and unit 1) are installed and only unit 1 is<br />
connected to the switch, the ibsrpdm command (to set up an SRP target) can<br />
hang. If unit 0 is connected and unit 1 is disconnected, the problem does not<br />
occur.<br />
When only unit 1 is connected to the switch, use the -d option with ibsrpdm. Then,<br />
using the output from the ibsrpdm command, echo the new target information into<br />
/sys/class/infiniband_srp/srp-ipath1-1/add_target.<br />
For example:<br />
# ibsrpdm -d /dev/infiniband/umad1 -c<br />
# echo \<br />
id_ext=21000001ff040bf6,ioc_guid=21000001ff040bf6,dgid=fe800000000<br />
0000021000001ff040bf6,pkey=ffff,service_id=f60b04ff01000021 ><br />
/sys/class/infiniband_srp/srp-ipath1-1/add_target<br />
Outdated ipath_ether Configuration Setup Generates Error<br />
Ethernet emulation (ipath_ether) has been removed in this release, and, as a<br />
result, an error may be seen if the user still has an alias set previously by<br />
modprobe.conf (for example, alias eth2 ipath_ether).<br />
When ifconfig or ifup are run, the error will look similar to this (assuming<br />
ipath_ether was used for eth2):<br />
eth2: error fetching interface information: Device not found<br />
To prevent the error message, remove the following files (assuming<br />
ipath_ether was used for eth2):<br />
/etc/sysconfig/network-scripts/ifcfg-eth2 (for RHEL)<br />
/etc/sysconfig/network/ifcfg-eth2 (for SLES)<br />
<strong>QLogic</strong> recommends using the IP over InfiniBand protocol (IPoIB-CM), included in<br />
the standard OpenFabrics software releases, as a replacement for<br />
ipath_ether.<br />
System Administration Troubleshooting<br />
The following sections provide details on locating problems related to system<br />
administration.<br />
Broken Intermediate Link<br />
Sometimes message traffic passes through the fabric while other traffic appears<br />
to be blocked. In this case, MPI jobs fail to run.<br />
In large cluster configurations, switches may be attached to other switches to<br />
supply the necessary inter-node connectivity. Problems with these inter-switch (or<br />
intermediate) links are sometimes more difficult to diagnose than failure of the<br />
final link between a switch and a node. The failure of an intermediate link may<br />
allow some traffic to pass through the fabric while other traffic is blocked or<br />
degraded.<br />
If you notice this behavior in a multi-layer fabric, check that all switch cable<br />
connections are correct. Statistics for managed switches are available on a<br />
per-port basis, and may help with debugging. See your switch vendor for more<br />
information.<br />
<strong>QLogic</strong> recommends using FastFabric to help diagnose this problem. If<br />
FastFabric is not installed in the fabric, there are two diagnostic tools, ibhosts<br />
and ibtracert, that may also be helpful. The tool ibhosts lists all the<br />
InfiniBand nodes that the subnet manager recognizes. To check the InfiniBand<br />
path between two nodes, use the ibtracert command.<br />
Performance Issues<br />
The following sections discuss known performance issues.<br />
Large Message Receive Side Bandwidth Varies with<br />
Socket Affinity on Opteron Systems<br />
On Opteron systems, when using the QLE7240 or QLE7280 in DDR mode, there<br />
is a receive side bandwidth bottleneck for CPUs that are not adjacent to the PCI<br />
Express root complex. This may cause performance to vary. The bottleneck is<br />
most obvious when using SendDMA with large messages on the farthest sockets.<br />
The best case for SendDMA is when both sender and receiver are on the closest<br />
sockets. Overall performance for PIO (and smaller messages) is better than with<br />
SendDMA.<br />
Erratic Performance<br />
Sometimes erratic performance is seen on applications that use interrupts. An<br />
example is inconsistent SDP latency when running a program such as netperf.<br />
This may be seen on AMD-based systems using the QLE7240 or QLE7280<br />
adapters. If this happens, check to see if the program irqbalance is running.<br />
This program is a Linux daemon that distributes interrupts across processors.<br />
However, it may interfere with prior interrupt request (IRQ) affinity settings,<br />
introducing timing anomalies. After stopping this process (as a root user), bind<br />
IRQ to a CPU for more consistent performance. First, stop irqbalance:<br />
# /sbin/chkconfig irqbalance off<br />
# /etc/init.d/irqbalance stop<br />
Next, find the IRQ number and bind it to a CPU. The IRQ number can be found in<br />
one of two ways, depending on the system used. Both methods are described in<br />
the following paragraphs.<br />
NOTE:<br />
Take care when cutting and pasting commands from PDF documents, as<br />
quotes are special characters and may not be translated correctly.<br />
Method 1<br />
Check to see if the IRQ number is found in /proc/irq/xxx, where xxx is the<br />
IRQ number in /sys/class/infiniband/ipath*/device/irq. Do this as a<br />
root user. For example:<br />
# my_irq=`cat /sys/class/infiniband/ipath*/device/irq`<br />
# ls /proc/irq<br />
If $my_irq can be found under /proc/irq/, then type:<br />
# echo 01 > /proc/irq/$my_irq/smp_affinity<br />
Method 2<br />
If the command from Method 1, ls /proc/irq, cannot find $my_irq, then use the<br />
following commands instead:<br />
# my_irq=`cat /proc/interrupts|grep ib_qib|awk \<br />
'{print $1}'|sed -e 's/://'`<br />
# echo 01 > /proc/irq/$my_irq/smp_affinity<br />
This method is not the first choice because, on some systems, there may be two<br />
rows of ib_qib output, and you will not know which one of the two numbers to<br />
choose. However, if you cannot find $my_irq listed under /proc/irq<br />
(Method 1), this type of system most likely has only one line for ib_qib listed in<br />
/proc/interrupts, so you can use Method 2.<br />
Here is an example:<br />
# cat /sys/class/infiniband/ipath*/device/irq<br />
98<br />
# ls /proc/irq<br />
0 10 11 13 15 233 4 50 7 8 90<br />
1 106 12 14 2 3 5 58 66 74 9<br />
(Note that you cannot find 98.)<br />
# cat /proc/interrupts|grep ib_qib|awk \<br />
'{print $1}'|sed -e 's/://'<br />
106<br />
# echo 01 > /proc/irq/106/smp_affinity<br />
Using the echo command immediately changes the processor affinity of an IRQ.<br />
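The extraction pipeline in Method 2 can be tried safely on a captured sample line (the line below is illustrative output, not from a live system) to confirm that it isolates the IRQ number:<br />

```shell
# A sample row of the kind "grep ib_qib /proc/interrupts" would return.
sample='106:     364263          0   IO-APIC-level  ib_qib'

# Method 2's extraction: take the first field and strip the trailing colon.
printf '%s\n' "$sample" | awk '{print $1}' | sed -e 's/://'
# -> 106
```

Running the pipeline against a saved sample first is a low-risk way to verify the quoting survived a cut-and-paste before touching smp_affinity as root.<br />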
NOTE:<br />
• The contents of the smp_affinity file may not reflect the expected<br />
values, even though the affinity change has taken place.<br />
• If the driver is reloaded, the affinity assignment will revert to the default,<br />
so you will need to reset it to the desired value.<br />
You can look at the stats in /proc/interrupts while the adapter is active to<br />
observe which CPU is fielding ib_qib interrupts.<br />
Performance Warning if ib_qib Shares Interrupts with eth0<br />
When ib_qib shares interrupts with eth0, performance of the OFED ULPs, such<br />
as IPoIB, may be affected. A warning message appears in syslog, and also on<br />
the console or tty session where /etc/init.d/openibd start is run (if<br />
messages are set up to be displayed). Messages are in this form:<br />
Nov 5 14:25:43 infinipath: Shared interrupt will<br />
affect performance: vector 169: devices eth0, ib_qib<br />
Check /proc/interrupts: "169" is in the first column, and "devices" are shown<br />
in the last column.<br />
You can also contact your system vendor to see if the BIOS settings can be<br />
changed to avoid the problem.<br />
<strong>QLogic</strong> MPI Troubleshooting<br />
Problems specific to compiling and running MPI programs are described in the<br />
following sections.<br />
Mixed Releases of MPI RPMs<br />
Make sure that all of the MPI RPMs are from the same release. When using<br />
mpirun, an error message will occur if different components of the MPI RPMs are<br />
from different releases. In the following example, the mpirun in use does not<br />
match the version of the installed MPI libraries:<br />
$ mpirun -np 2 -m ~/tmp/x2 osu_latency<br />
MPI_runscript-xqa-14.0: ssh -x> Cannot detect InfiniPath<br />
interconnect.<br />
MPI_runscript-xqa-14.0: ssh -x> Seek help on loading InfiniPath<br />
interconnect driver.<br />
MPI_runscript-xqa-15.1: ssh -x> Cannot detect InfiniPath<br />
interconnect.<br />
MPI_runscript-xqa-15.1: ssh -x> Seek help on loading InfiniPath<br />
interconnect driver.<br />
MPIRUN: Node program(s) exitted during connection setup<br />
$ mpirun -v<br />
MPIRUN:Infinipath Release2.3: Built on Wed Nov 6 17:28:58 PDT 2008<br />
by mee<br />
The following example shows the error that occurs when mpirun from the 2.3<br />
release is being used with earlier libraries.<br />
$ mpirun-ipath-ssh -np 2 -ppn 1 -m ~/tmp/idev osu_latency<br />
MPIRUN: mpirun from the 2.3 software distribution requires all<br />
node processes to be running 2.3 software. At least node<br />
&lt;node_name&gt; uses non-2.3 MPI libraries<br />
The following string means that either an incompatible non-<strong>QLogic</strong> mpirun binary<br />
has been found or that the binary is from an InfiniPath release prior to 2.3.<br />
Found incompatible non-InfiniPath or pre-2.3<br />
InfiniPath mpirun-ipath-ssh (exec=/usr/bin/mpirun-ipath-ssh)<br />
Missing mpirun Executable<br />
When the mpirun executable is missing, the following error appears:<br />
Please install mpirun on &lt;node_name&gt; or provide a path to<br />
mpirun-ipath-ssh<br />
(not found in $MPICH_ROOT/bin, $PATH<br />
or path/to/mpirun-ipath-ssh/on/the/head/node) or run with<br />
mpirun -distributed=off<br />
This error string means that an mpirun executable (mpirun-ipath-ssh) was<br />
not found on the computation nodes. Make sure that the mpi-frontend-* RPM<br />
is installed on all nodes that will use mpirun.<br />
Resolving <strong>Host</strong>name with Multi-Homed Head Node<br />
By default, mpirun assumes that ranks can independently resolve the hostname<br />
obtained on the head node with gethostname. However, the hostname of a<br />
multi-homed head node may not resolve on the compute nodes. To address this<br />
problem, the following new option has been added to mpirun:<br />
-listen-addr &lt;hostname_or_IPv4_address&gt;<br />
This address will be forwarded to the ranks. To change the default, put this option<br />
in the global mpirun.defaults file or in a user-local file.<br />
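For example, a line such as the following could be added to the global file /opt/infinipath/etc/mpirun.defaults (the address shown is a documentation placeholder; substitute an address or hostname that every compute node can resolve, and confirm the exact defaults-file syntax against your mpirun documentation):<br />

```
-listen-addr 192.0.2.10
```
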
If the address on the frontend cannot be resolved, then a warning is sent to the<br />
console and to syslog. If you use the following command line, you may see<br />
messages similar to this:<br />
% mpirun-ipath-ssh -np 2 -listen-addr foo -m ~/tmp/hostfile-idev<br />
osu_bcast<br />
MPIRUN.&lt;node&gt;: Warning: Couldn't resolve listen address 'foo'<br />
on head node<br />
(Unknown host), using it anyway...<br />
MPIRUN.&lt;node&gt;: No node programs have connected within 60<br />
seconds.<br />
This message occurs if none of the ranks can connect back to the head node.<br />
The following message may appear if some ranks cannot connect back:<br />
MPIRUN.&lt;node&gt;: Not all node programs have connected within<br />
60 seconds.<br />
MPIRUN.&lt;node&gt;: No connection received from 1 node process on<br />
node &lt;node_name&gt;<br />
Cross-Compilation Issues<br />
The GNU 4.x environment is supported in the PathScale Compiler Suite 3.x<br />
release.<br />
However, the 2.x <strong>QLogic</strong> PathScale compilers are not currently supported on<br />
SLES 10 systems that use the GNU 4.x compilers and compiler environment<br />
(header files and libraries).<br />
<strong>QLogic</strong> recommends installing the PathScale 3.1 release.<br />
Compiler/Linker Mismatch<br />
If the compiler and linker do not match in C and C++ programs, the following error<br />
message appears:<br />
$ export MPICH_CC=gcc<br />
$ mpicc mpiworld.c<br />
/usr/bin/ld: cannot find -lmpichabiglue_gcc3<br />
collect2: ld returned 1 exit status<br />
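In the example above, the mismatch was introduced by overriding MPICH_CC. A minimal way out, assuming the installed MPI libraries were built with the wrapper's default compiler, is to drop the override so that the compile and link steps agree again:<br />

```shell
# Drop the compiler override so that mpicc falls back to its default
# compiler, whose ABI glue library is the one the link step can find.
unset MPICH_CC

# Then retry the build:
#   mpicc mpiworld.c
```

Alternatively, keep the override but make sure the matching ABI glue library for that compiler is installed.<br />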
Compiler Cannot Find Include, Module, or Library Files<br />
RPMs can be installed in any location by using the --prefix option. This can<br />
introduce errors when compiling, if the compiler cannot find the include files (and<br />
module files for Fortran 90 and Fortran 95) from mpi-devel*, and the libraries<br />
from mpi-libs*, in the new locations. Compiler errors similar to the following<br />
appear:<br />
$ mpicc myprogram.c<br />
/usr/bin/ld: cannot find -lmpich<br />
collect2: ld returned 1 exit status<br />
NOTE:<br />
As noted in the <strong>QLogic</strong> Fabric <strong>Software</strong> Installation <strong>Guide</strong>, all development<br />
files now reside in specific *-Devel subdirectories.<br />
On development nodes, programs must be compiled with the appropriate options<br />
so that the include files and the libraries can be found in the new locations. In<br />
addition, when running programs on compute nodes, you need to ensure that the<br />
run-time library path is the same as the path that was used to compile the<br />
program.<br />
The following examples show what compiler options to use for include files and<br />
libraries on the development nodes, and how to specify the new library path on<br />
the compute nodes for the runtime linker. The affected RPMs are:<br />
• mpi-devel* (on the development nodes)<br />
• mpi-libs* (on the development or compute nodes)<br />
For the examples in “Compiling on Development Nodes” on page F-15, it is<br />
assumed that the new locations are:<br />
/path/to/devel (for mpi-devel-*)<br />
/path/to/libs (for mpi-libs-*)<br />
Compiling on Development Nodes<br />
If the mpi-devel-* RPM is installed with the --prefix /path/to/devel<br />
option, then mpicc, etc. must be passed -I/path/to/devel/include for<br />
the compiler to find the MPI include files, as in this example:<br />
$ mpicc myprogram.c -I/path/to/devel/include<br />
If you are using Fortran 90 or Fortran 95, a similar option is needed for the<br />
compiler to find the module files:<br />
$ mpif90 myprogramf90.f90 -I/path/to/devel/include<br />
If the mpi-lib-* RPM is installed on these development nodes with the<br />
--prefix /path/to/libs option, then the compiler needs the<br />
-L/path/to/libs option so it can find the libraries. Here is the example for<br />
mpicc:<br />
$ mpicc myprogram.c -L/path/to/libs/lib (for 32 bit)<br />
$ mpicc myprogram.c -L/path/to/libs/lib64 (for 64 bit)<br />
To find both the include files and the libraries with these non-standard locations,<br />
type:<br />
$ mpicc myprogram.c -I/path/to/devel/include -L/path/to/libs/lib<br />
Specifying the Run-time Library Path<br />
There are several ways to specify the run-time library path so that when the<br />
programs are run, the appropriate libraries are found in the new location. There<br />
are three different ways to do this:<br />
• Use the -Wl,-rpath, option when compiling on the development node.<br />
• Update the /etc/ld.so.conf file on the compute nodes to include the path.<br />
• Export the path in the .mpirunrc file.<br />
These methods are explained in more detail in the following paragraphs.<br />
An additional linker option, -Wl,-rpath, supplies the run-time library path when<br />
compiling on the development node. The compiler options now look like this:<br />
$ mpicc myprogram.c -I/path/to/devel/include -L/path/to/libs/lib<br />
-Wl,-rpath,/path/to/libs/lib<br />
The above compiler command ensures that the program will run using this path on<br />
any machine.<br />
For the second option, change the file /etc/ld.so.conf on the compute<br />
nodes rather than using the -Wl,-rpath, option when compiling on the<br />
development node. It is assumed that the mpi-lib-* RPM is installed on the<br />
compute nodes with the same --prefix /path/to/libs option as on the<br />
development nodes. Then, on the compute nodes, add the following lines to the<br />
file /etc/ld.so.conf:<br />
/path/to/libs/lib<br />
/path/to/libs/lib64<br />
To make sure that the changes take effect, run (as a root user):<br />
# /sbin/ldconfig<br />
The libraries can now be found by the runtime linker on the compute nodes. The<br />
advantage to this method is that it works for all InfiniPath programs, without<br />
having to remember to change the compile/link lines.<br />
Instead of either of the two previous mechanisms, you can also put the following<br />
line in the ~/.mpirunrc file:<br />
export LD_LIBRARY_PATH=/path/to/libs/{lib,lib64}<br />
See “Environment for Node Programs” on page 4-19 for more information on<br />
using the -rcfile option with mpirun.<br />
The choice between these options is left to the cluster administrator and the MPI<br />
developer. See the documentation for your compiler for more information on the<br />
compiler options.<br />
Problem with Shell Special Characters and Wrapper Scripts<br />
Be careful when dealing with shell special characters, especially when using the<br />
mpicc, etc. wrapper scripts. These characters must be escaped to avoid the shell<br />
interpreting them.<br />
For example, when compiling code using the -D compiler flag, mpicc (and other<br />
wrapper scripts) will fail if the defined variable contains a space, even when<br />
surrounded by double quotes. In the following example, the result of the -show<br />
option reveals what happens to the variable:<br />
$ mpicc -show -DMYDEFINE="some value" test.c<br />
gcc -c -DMYDEFINE=some value test.c<br />
gcc -Wl,--export-dynamic,--allow-shlib-undefined test.o -lmpich<br />
The shell strips off the double quotes before handing the arguments to the mpicc<br />
script, thus causing the problem. The workaround is to escape the double quotes<br />
and white space by using backslashes, so that the shell does not process them.<br />
(Also note the single quote (') around the -D, since the scripts do an eval rather<br />
than directly invoking the underlying compiler.) Use this command instead:<br />
$ mpicc -show -DMYDEFINE=\"some\ value\" test.c<br />
gcc -c '-DMYDEFINE="some value"' test.c<br />
gcc -Wl,--export-dynamic,--allow-shlib-undefined test.o -lmpich<br />
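The double evaluation can be reproduced without mpicc at all. The sketch below stands in for the wrapper's eval step (show_args and wrapper are illustrative helpers, not part of the MPI software):<br />

```shell
# Print each argument on its own line, bracketed, so word splits are visible.
show_args() { for a in "$@"; do printf '[%s]\n' "$a"; done; }

# Stand-in for an mpicc-style wrapper script: it re-evaluates its arguments.
wrapper() { eval show_args "$@"; }

wrapper -DMYDEFINE="some value"      # shell strips the quotes; eval splits on the space
wrapper -DMYDEFINE=\"some\ value\"   # escaped form survives both passes as one argument
```

The first call prints two bracketed arguments, [-DMYDEFINE=some] and [value]; the escaped second call prints the single argument [-DMYDEFINE=some value], which is what the underlying compiler should receive.<br />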
Run Time Errors with Different MPI Implementations<br />
It is now possible to run different implementations of MPI, such as HP-MPI, over<br />
InfiniPath. Many of these implementations share command (such as mpirun) and<br />
library names, so it is important to distinguish which MPI version is in use. This is<br />
done primarily through careful programming practices.<br />
Examples are provided in the following paragraphs.<br />
In the following command, the HP-MPI version of mpirun is invoked by the full<br />
path name. However, the program mpi_nxnlatbw was compiled with the <strong>QLogic</strong><br />
version of mpicc. The mismatch produces errors similar to this:<br />
$ /opt/hpmpi/bin/mpirun -hostlist "bbb-01,bbb-02,bbb-03,bbb-04"<br />
-np 4 /usr/bin/mpi_nxnlatbw<br />
bbb-02: Not running from mpirun.<br />
MPI Application rank 1 exited before MPI_Init() with status 1<br />
bbb-03: Not running from mpirun.<br />
MPI Application rank 2 exited before MPI_Init() with status 1<br />
bbb-01: Not running from mpirun.<br />
bbb-04: Not running from mpirun.<br />
MPI Application rank 3 exited before MPI_Init() with status 1<br />
MPI Application rank 0 exited before MPI_Init() with status 1<br />
In the next case, mpi_nxnlatbw.c is compiled with the HP-MPI version of<br />
mpicc, and given the name hpmpi-mpi_nxnlatbw, so that it is easy to see<br />
which version was used. However, it is run with the <strong>QLogic</strong> mpirun, which<br />
produces errors similar to this:<br />
$ /opt/hpmpi/bin/mpicc \<br />
/usr/share/mpich/examples/performance/mpi_nxnlatbw.c -o<br />
hpmpi-mpi_nxnlatbw<br />
$ mpirun -m ~/host-bbb -np 4 ./hpmpi-mpi_nxnlatbw<br />
./hpmpi-mpi_nxnlatbw: error while loading shared libraries:<br />
libmpio.so.1: cannot open shared object file: No such file or<br />
directory<br />
./hpmpi-mpi_nxnlatbw: error while loading shared libraries:<br />
libmpio.so.1: cannot open shared object file: No such file or<br />
directory<br />
./hpmpi-mpi_nxnlatbw: error while loading shared libraries:<br />
libmpio.so.1: cannot open shared object file: No such file or<br />
directory<br />
./hpmpi-mpi_nxnlatbw: error while loading shared libraries:<br />
libmpio.so.1: cannot open shared object file: No such file or<br />
directory<br />
MPIRUN: Node program(s) exitted during connection setup<br />
The following two commands will work properly.<br />
<strong>QLogic</strong> mpirun and executable used together:<br />
$ mpirun -m ~/host-bbb -np 4 /usr/bin/mpi_nxnlatbw<br />
The HP-MPI mpirun and executable used together:<br />
$ /opt/hpmpi/bin/mpirun -hostlist \<br />
"bbb-01,bbb-02,bbb-03,bbb-04" -np 4 ./hpmpi-mpi_nxnlatbw<br />
Hints<br />
• Use the rpm command to find out which RPM is installed in the standard<br />
installed layout. For example:<br />
# rpm -qf /usr/bin/mpirun<br />
mpi-frontend-2.3-5314.919_sles10_qlc<br />
• Check all rcfiles and /opt/infinipath/etc/mpirun.defaults to<br />
make sure that the paths for binaries and libraries ($PATH and<br />
$LD_LIBRARY _PATH) are consistent.<br />
• When compiling, use descriptive names for the object files.<br />
See “Compiler Cannot Find Include, Module, or Library Files” on page F-14,<br />
“Compiling on Development Nodes” on page F-15, and “Specifying the Run-time<br />
Library Path” on page F-15 for additional information.<br />
Process Limitation with ssh<br />
MPI jobs that use more than eight processes per node may encounter an ssh<br />
throttling mechanism that limits the number of concurrent per-node connections<br />
to 10. If you have this problem, a message similar to this appears when using<br />
mpirun:<br />
$ mpirun -m tmp -np 11 ~/mpi/mpiworld/mpiworld<br />
ssh_exchange_identification: Connection closed by remote host<br />
MPIRUN: Node program(s) exitted during connection setup<br />
If you encounter a message like this, you or your system administrator should<br />
increase the value of MaxStartups in your sshd configuration.<br />
NOTE:<br />
This limitation applies only if -distributed=off is specified. By default,<br />
with -distributed=on, you will not normally have this problem.<br />
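Raising the limit is a change to the sshd configuration file on the nodes that accept the connections. A sketch of the change (the value 100 is illustrative, not a recommendation from this guide):<br />

```
# /etc/ssh/sshd_config (excerpt)
MaxStartups 100
```

After editing the file, restart the sshd service so that the new limit takes effect.<br />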
Number of Processes Exceeds ulimit for Number of Open<br />
Files<br />
When users scale up the number of processes beyond the number of open files<br />
allowed by ulimit, mpirun will print an error message. The ulimit for the<br />
number of open files is typically 1024 on both Red Hat and SLES systems. The<br />
message will look similar to:<br />
MPIRUN.up001: Warning: ulimit for the number of open files is only<br />
1024, but this mpirun request requires at least &lt;n&gt;<br />
open files (sockets). The shell ulimit for open files needs to be<br />
increased.<br />
This is due to the limit:<br />
descriptors 1024<br />
The ulimit can be increased; <strong>QLogic</strong> recommends an increase of<br />
approximately 20 percent over the number of CPUs. For example, in the case of<br />
2048 CPUs, ulimit can be increased to 2500:<br />
ulimit -n 2500<br />
The ulimit needs to be increased only on the host where mpirun was started,<br />
unless the mode of operation allows mpirun from any node.<br />
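The 20 percent headroom can be computed directly in the shell. A small sketch, assuming (as in the example above) one process per CPU and 2048 CPUs:<br />

```shell
cpus=2048

# Roughly 20 percent above the number of CPUs (process count).
needed=$(( cpus * 120 / 100 ))
echo "$needed"
# -> 2457, which the text above rounds up to 2500
```

You can compare the result against the current soft limit, shown by ulimit -n, before deciding whether to raise it.<br />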
Using MPI.mod Files<br />
MPI.mod (or mpi.mod) is the Fortran 90/Fortran 95 MPI module file. This file<br />
contains the Fortran 90/Fortran 95 interface to the platform-specific MPI<br />
library. The module file is invoked by 'USE MPI' or 'use mpi' in your application. If<br />
the application has an argument list that does not match what mpi.mod expects,<br />
errors such as this can occur:<br />
$ mpif90 -O3 -OPT:fast_math -c communicate.F<br />
call mpi_recv(nrecv,1,mpi_integer,rpart(nswap),0,<br />
^<br />
pathf95-389 pathf90: ERROR BORDERS, File = communicate.F, Line =<br />
407, Column = 18<br />
No specific match can be found for the generic subprogram call<br />
"MPI_RECV".<br />
If it is necessary to use a non-standard argument list, create your own MPI<br />
module file and compile the application with it, rather than using the standard MPI<br />
module file that is shipped in the mpi-devel-* RPM.<br />
The default search path for the module file is:<br />
/usr/include<br />
To include your own MPI.mod rather than the standard version, use<br />
-I/your/search/directory, which causes /your/search/directory to<br />
be checked before /usr/include. For example:<br />
$ mpif90 -I/your/search/directory myprogram.f90<br />
Usage for Fortran 95 will be similar to the example for Fortran 90.<br />
Extending MPI Modules<br />
MPI implementations provide procedures that accept an argument having any<br />
data type, any precision, and any rank. However, it is not practical for an MPI<br />
module to enumerate every possible combination of type, kind, and rank.<br />
Therefore, the strict type checking required by Fortran 90 may generate errors.<br />
For example, if the MPI module tells the compiler that mpi_bcast can operate on<br />
an integer but does not also say that it can operate on a character string, you may<br />
see a message similar to the following:<br />
pathf95: ERROR INPUT, File = input.F, Line = 32, Column = 14<br />
No specific match can be found for the generic subprogram call<br />
"MPI_BCAST".<br />
If you know that an argument can accept a data type that the MPI module does<br />
not explicitly allow, you can extend the interface for yourself. For example, the<br />
following program shows how to extend the interface for mpi_bcast so that it<br />
accepts a character type as its first argument, without losing the ability to accept<br />
an integer type as well:<br />
module additional_bcast<br />
use mpi<br />
implicit none<br />
interface mpi_bcast<br />
module procedure additional_mpi_bcast_for_character<br />
end interface mpi_bcast<br />
contains<br />
subroutine additional_mpi_bcast_for_character(buffer, count, &<br />
datatype, root, comm, ierror)<br />
character*(*) buffer<br />
integer count, datatype, root, comm, ierror<br />
! Call the Fortran 77 style implicit interface to "mpi_bcast"<br />
external mpi_bcast<br />
call mpi_bcast(buffer, count, datatype, root, comm, ierror)<br />
end subroutine additional_mpi_bcast_for_character<br />
end module additional_bcast<br />
program myprogram<br />
use mpi<br />
use additional_bcast<br />
implicit none<br />
character*4 c<br />
integer master, ierr, i<br />
call mpi_init(ierr)<br />
master = 0<br />
! Explicit integer version obtained from module "mpi"<br />
call mpi_bcast(i, 1, MPI_INTEGER, master, MPI_COMM_WORLD, ierr)<br />
! Explicit character version obtained from module "additional_bcast"<br />
call mpi_bcast(c, 4, MPI_CHARACTER, master, MPI_COMM_WORLD, ierr)<br />
call mpi_finalize(ierr)<br />
end program myprogram<br />
This is equally applicable if the module mpi provides only a lower-rank interface<br />
and you want to add a higher-rank interface, for example, when the module<br />
explicitly provides for 1-D and 2-D integer arrays, but you need to pass a 3-D<br />
integer array. Add a higher-rank interface only under the following conditions:<br />
• The module mpi provides an explicit Fortran 90 style interface for<br />
mpi_bcast. If the module mpi does not have this interface, the program<br />
uses an implicit Fortran 77 style interface, which does not perform any type<br />
checking. Adding an interface will cause type-checking error messages<br />
where there previously were none.<br />
• The underlying function accepts any data type. This is appropriate for the first<br />
argument of mpi_bcast because the function operates on the underlying<br />
bits, without attempting to interpret them as integer or character data.<br />
Lock Enough Memory on Nodes When Using a Batch<br />
Queuing System<br />
<strong>QLogic</strong> MPI requires the ability to lock (pin) memory during data transfers on each<br />
compute node. This is normally done via /etc/initscript, which is created or<br />
modified during the installation of the infinipath RPM (setting a limit of<br />
128 MB, with the command ulimit -l 131072).<br />
Some batch systems, such as SLURM, propagate the user’s environment from<br />
the node where you start the job to all the other nodes. For these batch systems,<br />
you may need to make the same change on the node from which you start your<br />
batch jobs.<br />
If this file is not present or the node has not been rebooted after the infinipath<br />
RPM has been installed, a failure message similar to one of the following will be<br />
generated.<br />
The following message displays during installation:<br />
$ mpirun -np 2 -m ~/tmp/sm mpi_latency 1000 1000000<br />
iqa-19:0.ipath_userinit: mmap of pio buffers at 100000 failed:<br />
Resource temporarily unavailable<br />
iqa-19:0.Driver initialization failure on /dev/ipath<br />
iqa-20:1.ipath_userinit: mmap of pio buffers at 100000 failed:<br />
Resource temporarily unavailable<br />
iqa-20:1.Driver initialization failure on /dev/ipath<br />
The following message displays after installation:<br />
$ mpirun -m ~/tmp/sm -np 2 mpi_latency 1000 1000000<br />
node-00:1.ipath_update_tid_err: failed: Cannot allocate memory<br />
mpi_latency:<br />
/fs2/scratch/infinipath-build-1.3/mpi-1.3/mpich/psm/src<br />
mq_ips.c:691:<br />
mq_ipath_sendcts: Assertion ‘rc == 0’ failed. MPIRUN: Node program<br />
unexpectedly quit. Exiting.<br />
You can check the ulimit -l on all the nodes by running ipath_checkout. A<br />
warning similar to this displays if ulimit -l is less than 4096:<br />
!!!ERROR!!! Lockable memory less than 4096KB on x nodes<br />
To fix this error, install the infinipath RPM on the node, and reboot it to ensure<br />
that /etc/initscript is run.<br />
Alternately, you can create your own /etc/initscript and set the ulimit<br />
there.<br />
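A minimal /etc/initscript can be sketched as follows, written to a scratch file here for safety. Per initscript(5), init passes the command to run as the fourth argument, and the standard idiom is to set the limit and then exec that command:<br />

```shell
# Sketch of an /etc/initscript raising the locked-memory limit to 128 MB.
f=$(mktemp)
cat > "$f" <<'EOF'
ulimit -l 131072
eval exec "$4"
EOF
cat "$f"   # inspect before installing as /etc/initscript (then reboot)
```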
Error Creating Shared Memory Object<br />
<strong>QLogic</strong> MPI (and PSM) use Linux’s shared memory mapped files to share<br />
memory within a node. When an MPI job is started, a shared memory file is<br />
created on each node for all MPI ranks sharing memory on that one node. During<br />
job execution, the shared memory file remains in /dev/shm. At program exit, the<br />
file is removed automatically by the operating system when the <strong>QLogic</strong> MPI<br />
(InfiniPath) library properly exits. Also, as an additional backup in the sequence of<br />
commands invoked by mpirun during every MPI job launch, the file is explicitly<br />
removed at program termination.<br />
However, under circumstances such as hard and explicit program termination (i.e.<br />
kill -9 on the mpirun process PID), <strong>QLogic</strong> MPI cannot guarantee that the<br />
/dev/shm file is properly removed. If many stale files accumulate on each node,<br />
an error message like the following can appear at startup:<br />
node023:6.Error creating shared memory object in shm_open(/dev/shm<br />
may have stale shm files that need to be removed):<br />
If this occurs, administrators should clean up all stale files by running this<br />
command (as a root user):<br />
# rm -rf /dev/shm/psm_shm.*<br />
You can also selectively identify stale files by using a combination of the fuser,<br />
ps, and rm commands (all files start with the psm_shm prefix). Once identified,<br />
you can issue rm commands on the stale files that you own.<br />
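One non-interactive way to approximate that cleanup is to treat psm_shm files untouched for a day as stale. The sketch below runs against a scratch directory so it is harmless; the one-day threshold is an assumption, not a QLogic rule:<br />

```shell
# Demonstrated on a scratch directory; point shmdir at /dev/shm for real use.
shmdir=$(mktemp -d)
touch -d '2 days ago' "$shmdir/psm_shm.stale"   # simulate a leftover segment
touch "$shmdir/psm_shm.live"                    # recent file; must survive
# Delete only your own psm_shm files not modified in the last 24 hours.
find "$shmdir" -maxdepth 1 -name 'psm_shm.*' -user "$(id -un)" \
     -mmin +1440 -print -delete
```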
NOTE:<br />
It is important that /dev/shm be writable by all users, or else error<br />
messages like the ones in this section can be expected. Also, non-<strong>QLogic</strong><br />
MPIs that use PSM may be more prone to stale shared memory files when<br />
processes are abnormally terminated.<br />
gdb Gets SIG32 Signal Under mpirun -debug with the<br />
PSM Receive Progress Thread Enabled<br />
When you run mpirun -debug and the PSM receive progress thread is enabled,<br />
gdb (the GNU debugger) reports the following error:<br />
(gdb) run<br />
Starting program: /usr/bin/osu_bcast < /dev/null [Thread debugging<br />
using libthread_db enabled] [New Thread 46912501386816 (LWP<br />
13100)] [New Thread 1084229984 (LWP 13103)] [New Thread 1094719840<br />
(LWP 13104)]<br />
Program received signal SIG32, Real-time event 32.<br />
[Switching to Thread 1084229984 (LWP 22106)] 0x00000033807c0930 in<br />
poll () from /lib64/libc.so.6<br />
This signal is generated when the main thread cancels the progress thread. To fix<br />
this problem, disable the receive progress thread when debugging an MPI<br />
program. Add the following line to $HOME/.mpirunrc:<br />
export PSM_RCVTHREAD=0<br />
NOTE:<br />
Remove the above line from $HOME/.mpirunrc after you debug an MPI<br />
program. If this line is not removed, the PSM receive progress thread will be<br />
permanently disabled. To check if the receive progress thread is enabled,<br />
look for output similar to the following when using the mpirun -verbose<br />
flag:<br />
idev-17:0.env PSM_RCVTHREAD Recv thread flags (0 disables thread) => 0x1<br />
The value 0x1 indicates that the receive thread is currently enabled. A value<br />
of 0x0 indicates that the receive thread is disabled.<br />
General Error Messages<br />
The following message may be generated by ipath_checkout or mpirun:<br />
PSM found 0 available contexts on InfiniPath device<br />
The most likely cause is that the cluster has processes using all the available<br />
PSM contexts.<br />
Error Messages Generated by mpirun<br />
The following sections describe the mpirun error messages. These messages<br />
are in one of these categories:<br />
• Messages from the <strong>QLogic</strong> MPI (InfiniPath) library<br />
• MPI messages<br />
• Messages relating to the InfiniPath driver and InfiniBand links<br />
Messages generated by mpirun follow this format:<br />
program_name: message<br />
function_name: message<br />
Messages can also have different prefixes, such as ipath_ or psm_, which<br />
indicate where in the software the errors are occurring.<br />
Messages from the <strong>QLogic</strong> MPI (InfiniPath) Library<br />
Messages from the <strong>QLogic</strong> MPI (InfiniPath) library appear in the mpirun output.<br />
The following example contains rank values received during connection setup that<br />
were higher than the number of ranks (as indicated in the mpirun startup code):<br />
sender rank rank is out of range (notification)<br />
sender rank rank is out of range (ack)<br />
The following are error messages that indicate internal problems and must be<br />
reported to Technical Support.<br />
unknown frame type type<br />
[n] Src lid error: sender: x, exp send: y<br />
Frame receive from unknown sender. exp. sender = x, came from y<br />
Failed to allocate memory for eager buffer addresses: str<br />
The following error messages usually indicate a hardware or connectivity<br />
problem:<br />
Failed to get IB Unit LID for any unit<br />
Failed to get our IB LID<br />
Failed to get number of Infinipath units<br />
In these cases, try to reboot. If that does not work, call Technical Support.<br />
The following messages indicate a mismatch between the <strong>QLogic</strong> interconnect<br />
hardware in use and the version for which the software was compiled:<br />
Number of buffer avail registers is wrong; have n, expected m<br />
build mismatch, tidmap has n bits, ts_map m<br />
These messages indicate a mismatch between the InfiniPath software and<br />
hardware versions. Consult Technical Support after verifying that current drivers<br />
and libraries are installed.<br />
The following examples are all informative messages about driver initialization<br />
problems. They are not necessarily fatal themselves, but may indicate problems<br />
that interfere with the application. In the actual printed output, all of the messages<br />
are prefixed with the name of the function that produced them.<br />
assign_port command failed: str<br />
Failed to get LID for unit u: str<br />
Failed to get number of units: str<br />
GETPORT ioctl failed: str<br />
can't allocate memory for ipath_ctrl: str<br />
can't stat infinipath device to determine type: str<br />
file descriptor is not for a real device, failing<br />
get info ioctl failed: str<br />
ipath_get_num_units called before init<br />
ipath_get_unit_lid called before init<br />
mmap of egr bufs from h failed: str<br />
mmap of pio buffers at %llx failed: str<br />
mmap of pioavail registers (%llx) failed: str<br />
mmap of rcvhdr q failed: str<br />
mmap of user registers at %llx failed: str<br />
userinit command failed: str<br />
Failed to set close on exec for device: str<br />
NOTE:<br />
These messages should never occur. If they do, notify Technical Support.<br />
The following message indicates that a node program may not be processing<br />
incoming packets, perhaps due to a very high system load:<br />
eager array full after overflow, flushing (head h, tail t)<br />
The following error messages should rarely occur; they indicate internal software<br />
problems:<br />
ExpSend opcode h tid=j, rhf_error k: str<br />
Asked to set timeout w/delay l, gives time in past (t2 < t1)<br />
Error in sending packet: str<br />
In this case, str can give additional information about why the failure occurred.<br />
The following message usually indicates a node failure or malfunctioning link in<br />
the fabric:<br />
Couldn’t connect to (LID=::). Time<br />
elapsed 00:00:30. Still trying...<br />
IP is the MPI rank’s IP address, and are the rank’s lid,<br />
port, and subport.<br />
If messages similar to the following display, it may mean that the program is trying<br />
to receive to an invalid (unallocated) memory address, perhaps due to a logic<br />
error in the program, usually related to malloc/free:<br />
ipath_update_tid_err: Failed TID update for rendezvous, allocation<br />
problem<br />
kernel: infinipath: get_user_pages (0x41 pages starting at<br />
0x2aaaaeb50000<br />
kernel: infinipath: Failed to lock addr 0002aaaaeb50000, 65 pages:<br />
errno 12<br />
TID is short for Token ID, and is part of the <strong>QLogic</strong> hardware. This error indicates<br />
a failure of the program, not the hardware or driver.<br />
MPI Messages<br />
Some MPI error messages are issued from the parts of the code inherited from<br />
the MPICH implementation. See the MPICH documentation for message<br />
descriptions. This section discusses the error messages specific to the <strong>QLogic</strong><br />
MPI implementation.<br />
These messages appear in the mpirun output. Most are followed by an abort,<br />
and possibly a backtrace. Each is preceded by the name of the function where the<br />
exception occurred.<br />
The following message is always followed by an abort. The processlabel is<br />
usually in the form of the host name followed by process rank:<br />
processlabel Fatal Error in filename line_no: error_string<br />
At the time of publication, the possible error_strings are:<br />
Illegal label format character.<br />
Memory allocation failed.<br />
Error creating shared memory object.<br />
Error setting size of shared memory object.<br />
Error mmapping shared memory.<br />
Error opening shared memory object.<br />
Error attaching to shared memory.<br />
Node table has inconsistent len! Hdr claims %d not %d<br />
Timeout waiting %d seconds to receive peer node table from mpirun<br />
The following indicates an unknown host:<br />
$ mpirun -np 2 -m ~/tmp/q mpi_latency 100 100<br />
MPIRUN: Cannot obtain IP address of : Unknown host<br />
The following indicates that there is no route to a valid host:<br />
$ mpirun -np 2 -m ~/tmp/q mpi_latency 100 100<br />
ssh: connect to host port 22: No route to host<br />
MPIRUN: Some node programs ended prematurely without connecting to<br />
mpirun.<br />
MPIRUN: No connection received from 1 node process on node<br />
<br />
The following indicates that there is no route to any host:<br />
$ mpirun -np 2 -m ~/tmp/q mpi_latency 100 100<br />
ssh: connect to host port 22: No route to host<br />
ssh: connect to host port 22: No route to host<br />
MPIRUN: All node programs ended prematurely without connecting to<br />
mpirun.<br />
The following indicates that node jobs have started, but one host could not<br />
connect back to mpirun:<br />
$ mpirun -np 2 -m ~/tmp/q mpi_latency 100 100<br />
9139.psc_skt_connect: Error connecting to socket: No route to host<br />
. Cannot connect to spawner on host %s port %d<br />
within 60 seconds.<br />
MPIRUN: Some node programs ended prematurely without connecting to<br />
mpirun.<br />
MPIRUN: No connection received from 1 node process on node<br />
<br />
The following indicates that node jobs have started, but both hosts could not<br />
connect back to mpirun:<br />
$ mpirun -np 2 -m ~/tmp/q mpi_latency 100 100<br />
9158.psc_skt_connect: Error connecting to socket: No route to host<br />
. Cannot connect to spawner on host %s port %d<br />
within 60 seconds.<br />
6083.psc_skt_connect: Error connecting to socket: No route to host<br />
. Cannot connect to spawner on host %s port %d<br />
within 60 seconds.<br />
MPIRUN: All node programs ended prematurely without connecting to<br />
mpirun.<br />
$ mpirun -np 2 -m ~/tmp/q mpi_latency 1000000 1000000<br />
MPIRUN: node program unexpectedly quit: Exiting.<br />
The following indicates that one program on one node died:<br />
$ mpirun -np 2 -m ~/tmp/q mpi_latency 100000 1000000<br />
MPIRUN: node program unexpectedly quit: Exiting.<br />
The quiescence detected message is printed when an MPI job is not making<br />
progress. The default timeout is 900 seconds. After this length of time, all the<br />
node processes are terminated. This timeout can be extended or disabled with the<br />
-quiescence-timeout option in mpirun.<br />
$ mpirun -np 2 -m ~/tmp/q -q 60 mpi_latency 1000000 1000000<br />
MPIRUN: MPI progress Quiescence Detected after 9000 seconds.<br />
MPIRUN: 2 out of 2 ranks showed no MPI send or receive progress.<br />
MPIRUN: Per-rank details are the following:<br />
MPIRUN: Rank 0 ( ) caused MPI progress Quiescence.<br />
MPIRUN: Rank 1 ( ) caused MPI progress Quiescence.<br />
MPIRUN: both MPI progress and Ping Quiescence Detected after 120<br />
seconds.<br />
Occasionally, a stray process will continue to exist out of its context. mpirun<br />
checks for stray processes; they are killed after detection. The following code is<br />
an example of the type of message that displays in this case:<br />
$ mpirun -np 2 -ppn 1 -m ~/tmp/mfast mpi_latency 500000 2000<br />
iqa-38: Received 1 out-of-context eager message(s) from stray<br />
process PID=29745<br />
running on host 192.168.9.218<br />
iqa-35: PSM pid 10513 on host IP 192.168.9.221 has detected that I<br />
am a stray process, exiting.<br />
2000 5.222116<br />
iqa-38:1.ips_ptl_report_strays: Process PID=29745 on host<br />
IP=192.168.9.218 sent<br />
1 stray message(s) and was told so 1 time(s) (first stray message<br />
at 0.7s (13%),last at 0.7s (13%) into application run)<br />
The following message should never occur. If it does, notify Technical Support:<br />
Internal Error: NULL function/argument found:func_ptr(arg_ptr)<br />
Driver and Link Error Messages Reported by MPI Programs<br />
The following driver and link error messages are reported by MPI programs.<br />
When the InfiniBand link fails during a job, a message is reported once per<br />
occurrence. The message will be similar to:<br />
ipath_check_unit_status: IB Link is down<br />
This message occurs when a cable is disconnected, a switch is rebooted, or when<br />
there are other problems with the link. The job continues retrying until the<br />
quiescence interval expires. See the mpirun -q option for information on<br />
quiescence.<br />
If a hardware problem occurs, an error similar to this displays:<br />
infinipath: [error strings ] Hardware error<br />
In this case, the MPI program terminates. The error string may provide additional<br />
information about the problem. To further determine the source of the problem,<br />
examine syslog on the node reporting the problem.<br />
MPI Stats<br />
Using the -print-stats option to mpirun provides a listing to stderr of<br />
various MPI statistics. Here is example output for the -print-stats option<br />
when used with an eight-rank run of the HPCC benchmark, using the following<br />
command:<br />
$ mpirun -np 8 -ppn 1 -m machinefile -M ./hpcc<br />
STATS: MPI Statistics Summary (max,min @ rank)<br />
STATS: Eager count sent (max=171.94K @ 0, min=170.10K @ 3, med=170.20K @ 5)<br />
STATS: Eager bytes sent (max=492.56M @ 5, min=491.35M @ 0, med=491.87M @ 1)<br />
STATS: Rendezvous count sent (max= 5735 @ 0, min= 5729 @ 3, med= 5731 @ 7)<br />
STATS: Rendezvous bytes sent (max= 1.21G @ 4, min= 1.20G @ 2, med= 1.21G @ 0)<br />
STATS: Expected count received(max=173.18K @ 4, min=169.46K @ 1, med=172.71K @ 7)<br />
STATS: Expected bytes received(max= 1.70G @ 1, min= 1.69G @ 2, med= 1.70G @ 7)<br />
STATS: Unexpect count received(max= 6758 @ 0, min= 2996 @ 4, med= 3407 @ 2)<br />
STATS: Unexpect bytes received(max= 1.48M @ 0, min=226.79K @ 5, med=899.08K @ 2)<br />
By default, -M assumes -M=mpi and that the user wants only MPI-level statistics.<br />
The man page shows various other low-level categories of statistics that are<br />
provided. Here is another example:<br />
$ mpirun -np 8 -ppn 1 -m machinefile -M=mpi,ipath hpcc<br />
STATS: MPI Statistics Summary (max,min @ rank)<br />
STATS: Eager count sent (max=171.94K @ 0, min=170.10K @ 3, med=170.22K @ 1)<br />
STATS: Eager bytes sent (max=492.56M @ 5, min=491.35M @ 0, med=491.87M @ 1)<br />
STATS: Rendezvous count sent (max= 5735 @ 0, min= 5729 @ 3, med= 5731 @ 7)<br />
STATS: Rendezvous bytes sent (max= 1.21G @ 4, min= 1.20G @ 2, med= 1.21G @ 0)<br />
STATS: Expected count received(max=173.18K @ 4, min=169.46K @ 1, med=172.71K @ 7)<br />
STATS: Expected bytes received(max= 1.70G @ 1, min= 1.69G @ 2, med= 1.70G @ 7)<br />
STATS: Unexpect count received(max= 6758 @ 0, min= 2996 @ 4, med= 3407 @ 2)<br />
STATS: Unexpect bytes received(max= 1.48M @ 0, min=226.79K @ 5, med=899.08K @ 2)<br />
STATS: InfiniPath low-level protocol stats<br />
STATS: pio busy count (max=190.01K @ 0, min=155.60K @ 1, med=160.76K @ 5)<br />
STATS: scb unavail exp count (max= 9217 @ 0, min= 7437 @ 7, med= 7727 @ 4)<br />
STATS: tid update count (max=292.82K @ 6, min=290.59K @ 2, med=292.55K @ 4)<br />
STATS: interrupt thread count (max= 941 @ 0, min= 335 @ 7, med= 439 @ 2)<br />
STATS: interrupt thread success(max= 0.00 @ 3, min= 0.00 @ 1, med= 0.00 @ 0)<br />
Statistics other than MPI-level statistics are fairly low level; most users will not<br />
understand them. Contact <strong>QLogic</strong> Technical Support for more information.<br />
Message statistics are available for transmitted and received messages. In all<br />
cases, the MPI rank number responsible for a minimum or maximum value is<br />
reported with the relevant value. For application runs of at least three ranks, a<br />
median is also available.<br />
Since transmitted messages employ either an Eager or a Rendezvous protocol,<br />
results are available relative to both message count and aggregated bytes.<br />
Message count represents the number of messages transmitted by each protocol<br />
on a per-rank basis. Aggregated message bytes indicate the total amount of data<br />
that was moved on each rank by a particular protocol.<br />
On the receive side, messages are split into expected or unexpected messages.<br />
Unexpected messages cause the MPI implementation to buffer the transmitted<br />
data until the receiver can produce a matching MPI receive buffer. Expected<br />
messages refer to the inverse case, which is the common case in most MPI<br />
applications. An additional metric, Unexpected count %, representing the<br />
proportion of unexpected messages in relation to the total number of messages<br />
received, is also shown because of the notable effect unexpected messages have<br />
on performance.<br />
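As a quick check of that metric, the proportion can be recomputed from the sample run above. The values below are the per-category maxima from the sample (which come from different ranks, so this is only an order-of-magnitude check), and the shell's integer arithmetic truncates the result:<br />

```shell
# Unexpected count % = unexpected / (expected + unexpected) * 100
unexpected=6758      # max unexpected count received (rank 0, sample above)
expected=173180      # max expected count received, 173.18K (rank 4)
pct=$(( unexpected * 100 / (expected + unexpected) ))
echo "${pct}%"
```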
For more detailed information, use MPI profilers such as mpiP. For more<br />
information on mpiP, see: http://mpip.sourceforge.net/<br />
For information about the HPCC benchmark, see: http://icl.cs.utk.edu/hpcc/<br />
G ULP Troubleshooting<br />
Troubleshooting VirtualNIC and VIO Hardware<br />
Issues<br />
To verify that an InfiniBand host can access an Ethernet system through the EVIC,<br />
issue a ping command to the Ethernet system from the InfiniBand host. Make<br />
certain that the route to the Ethernet system is using the VIO hardware by using<br />
the Linux route command on the InfiniBand host, then verify that the route to the<br />
subnet is using one of the virtual Ethernet interfaces (i.e., an EIOC).<br />
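The routing-table check can be scripted. The sketch below parses a captured route -n line (the subnet and interface name are examples, not values from this fabric); on a live host, the same awk filter can be applied to the output of /sbin/route -n directly:<br />

```shell
# Flag routes whose outgoing interface is a virtual Ethernet (eioc*) device.
route_line='192.0.2.0   0.0.0.0   255.255.255.0   U   0 0 0   eioc1'
echo "$route_line" | awk '$NF ~ /^eioc/ { print "EIOC route: " $1 " via " $NF }'
```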
NOTE:<br />
If the ping command fails, check the following:<br />
• The logical connection between the InfiniBand host and the EVIC (see<br />
“Checking the logical connection between the InfiniBand <strong>Host</strong> and the<br />
VIO hardware”).<br />
• The interface definitions on the host (see “Checking the interface<br />
definitions on the host”).<br />
• The physical connection between the VIO hardware and the Ethernet<br />
network (see “Verify the physical connection between the VIO hardware<br />
and the Ethernet network”).<br />
Checking the logical connection between the InfiniBand <strong>Host</strong><br />
and the VIO hardware<br />
To determine if the logical connection between the InfiniBand host and the VIO<br />
hardware is correct, check the following:<br />
• The correct VirtualNIC driver is running.<br />
• The /etc/infiniband/qlgc_vnic.cfg file contains the desired<br />
information.<br />
• The host can communicate with the I/O Controllers (IOCs) of the VIO<br />
hardware.<br />
Verify that the proper VirtualNIC driver is running<br />
Check that a VirtualNIC driver is running by issuing an lsmod command on the<br />
InfiniBand host. Make sure that qlgc_vnic appears in the list of modules.<br />
Following is an example:<br />
st186:~ # lsmod<br />
Module                  Size  Used by<br />
cpufreq_ondemand       25232  1<br />
cpufreq_userspace      23552  0<br />
cpufreq_powersave      18432  0<br />
powernow_k8            30720  2<br />
freq_table             22400  1 powernow_k8<br />
qlgc_srp               93876  0<br />
qlgc_vnic             116300  0<br />
Verifying that the qlgc_vnic.cfg file contains the correct<br />
information<br />
Use the following scenarios to verify that the qlgc_vnic.cfg file contains a<br />
definition for the applicable virtual interface:<br />
Issue the command ib_qlgc_vnic_query to get the list of IOCs the host<br />
can see.<br />
If the list is empty, there may be a syntax error in the qlgc_vnic.cfg file (e.g., a<br />
missing semicolon). Look in /var/log/messages at the time qlgc_vnic was<br />
last started to see if any error messages were put in the log at that time.<br />
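That log check can be narrowed with grep. The two syslog lines below are fabricated placeholders so the filter is self-contained; on a live host the input would be /var/log/messages itself:<br />

```shell
# Filter driver messages out of a syslog stream (sample lines are invented).
printf '%s\n' \
  'Nov  1 10:00:01 st186 kernel: qlgc_vnic: example driver message' \
  'Nov  1 10:00:02 st186 sshd[123]: example unrelated message' \
| grep qlgc_vnic
# On a real node: grep qlgc_vnic /var/log/messages | tail -n 20
```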
If the qlgc_vnic.cfg file has been edited since the last time the VirtualNIC<br />
driver was started, the driver must be restarted. To restart the driver so that it<br />
uses the current qlgc_vnic.cfg file, issue /etc/init.d/qlgc_vnic restart.<br />
Verifying that the host can communicate with the I/O<br />
Controllers (IOCs) of the VIO hardware<br />
To display the Ethernet VIO cards that the host can see and communicate with,<br />
issue the command ib_qlgc_vnic_query. The system returns information<br />
similar to the following:<br />
IO Unit Info:<br />
port LID: 0003<br />
port GID: fe8000000000000000066a0258000001<br />
change ID: 0009<br />
max controllers: 0x03<br />
controller[ 1]<br />
GUID: 00066a0130000001<br />
vendor ID: 00066a<br />
device ID: 000030<br />
IO class : 2000<br />
ID: Chassis 0x00066A00010003F2, Slot 1, IOC 1<br />
service entries: 2<br />
service[ 0]: 1000066a00000001 /<br />
InfiniNIC.InfiniConSys.Control:01<br />
service[ 1]: 1000066a00000101 /<br />
InfiniNIC.InfiniConSys.Data:01<br />
controller[ 2]<br />
GUID: 00066a0230000001<br />
vendor ID: 00066a<br />
device ID: 000030<br />
IO class : 2000<br />
ID: Chassis 0x00066A00010003F2, Slot 1, IOC 2<br />
service entries: 2<br />
service[ 0]: 1000066a00000002 /<br />
InfiniNIC.InfiniConSys.Control:02<br />
service[ 1]: 1000066a00000102 /<br />
InfiniNIC.InfiniConSys.Data:02<br />
controller[ 3]<br />
GUID: 00066a0330000001<br />
vendor ID: 00066a<br />
device ID: 000030<br />
IO class : 2000<br />
ID: Chassis 0x00066A00010003F2, Slot 1, IOC 3<br />
service entries: 2<br />
service[ 0]: 1000066a00000003 /<br />
InfiniNIC.InfiniConSys.Control:03<br />
service[ 1]: 1000066a00000103 /<br />
InfiniNIC.InfiniConSys.Data:03<br />
When ib_qlgc_vnic_query is run with the -e option, it reports the IOCGUID<br />
information. With the -s option, it reports the IOCSTRING information for the<br />
Virtual I/O hardware IOCs present on the fabric. Following is an example:<br />
# ib_qlgc_vnic_query -e<br />
ioc_guid=00066a0130000001,dgid=fe8000000000000000066a02580000<br />
01,pkey=ffff<br />
ioc_guid=00066a0230000001,dgid=fe8000000000000000066a02580000<br />
01,pkey=ffff<br />
ioc_guid=00066a0330000001,dgid=fe8000000000000000066a02580000<br />
01,pkey=ffff<br />
# ib_qlgc_vnic_query -s<br />
"Chassis 0x00066A00010003F2, Slot 1, IOC 1"<br />
"Chassis 0x00066A00010003F2, Slot 1, IOC 2"<br />
"Chassis 0x00066A00010003F2, Slot 1, IOC 3"<br />
# ib_qlgc_vnic_query -es<br />
ioc_guid=00066a0130000001,dgid=fe8000000000000000066a02580000<br />
01,pkey=ffff,"Chassis 0x00066A00010003F2, Slot 1, IOC 1"<br />
ioc_guid=00066a0230000001,dgid=fe8000000000000000066a02580000<br />
01,pkey=ffff,"Chassis 0x00066A00010003F2, Slot 1, IOC 2"<br />
ioc_guid=00066a0330000001,dgid=fe8000000000000000066a02580000<br />
01,pkey=ffff,"Chassis 0x00066A00010003F2, Slot 1, IOC 3"<br />
If the host cannot see the applicable IOCs, there are two things to check. First,<br />
verify that the adapter port specified in the eioc definition of the<br />
/etc/infiniband/qlgc_vnic.cfg file is active. This is done by running the<br />
ibv_devinfo command on the host and checking the value of state. If the<br />
state is not PORT_ACTIVE, the adapter port is not logically connected to the<br />
fabric. It is possible that one of the adapter ports is not physically connected to an<br />
InfiniBand switch. For example:<br />
st139:~ # ibv_devinfo<br />
hca_id: mlx4_0<br />
fw_ver: 2.2.000<br />
node_guid: 0002:c903:0000:0f80<br />
sys_image_guid: 0002:c903:0000:0f83<br />
vendor_id: 0x02c9<br />
vendor_part_id: 25418<br />
hw_ver: 0xA0<br />
board_id: MT_04A0110002<br />
phys_port_cnt: 2<br />
port: 1<br />
state: PORT_ACTIVE (4)<br />
max_mtu: 2048 (4)<br />
active_mtu: 2048 (4)<br />
sm_lid: 1<br />
port_lid: 8<br />
port_lmc: 0x00<br />
port: 2<br />
state: PORT_ACTIVE (4)<br />
max_mtu: 2048 (4)<br />
active_mtu: 2048 (4)<br />
sm_lid: 1<br />
port_lid: 9<br />
port_lmc: 0x00<br />
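The per-port state lines can also be checked mechanically. The following is an illustrative sketch, not a QLogic-supplied tool; the sample variable stands in for real ibv_devinfo output:<br />

```shell
# Print "port N: STATE" for each port in ibv_devinfo-style output, so a
# port that is not PORT_ACTIVE is easy to spot. On a live host, replace
# the sample variable with: output=$(ibv_devinfo)
output='port: 1
state: PORT_ACTIVE (4)
port: 2
state: PORT_DOWN (1)'
echo "$output" | awk '$1 == "port:"  { port = $2 }
                      $1 == "state:" { printf "port %s: %s\n", port, $2 }'
```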
Second, verify that the adapter port specified in the eioc definition is the correct<br />
port. The host may see the IOCs, but not over the adapter port named in the<br />
definition of the IOC. For example, the host may see the IOCs over adapter Port 1,<br />
while the eioc definition in the /etc/infiniband/qlgc_vnic.cfg file specifies<br />
PORT=2.<br />
Another reason why the host might not be able to see the necessary IOCs is that<br />
the subnet manager has gone down. Issue an iba_saquery command to make<br />
certain that the response shows all of the nodes in the fabric. If an error is<br />
returned and the adapter is physically connected to the fabric, then the subnet<br />
manager has gone down, and this situation needs to be corrected.<br />
Checking the interface definitions on the host<br />
If it is not possible to ping from an InfiniBand host to the Ethernet host, and the<br />
ViPort State of the interface is VIPORT_CONNECTED, then issue an<br />
ifconfig command. The interfaces defined in the configuration files in the<br />
/etc/sysconfig/network directory (for SLES hosts) or the<br />
/etc/sysconfig/network-scripts directory (for Red Hat hosts) should be<br />
displayed in the list of interfaces in the ifconfig output. For example, the<br />
ifconfig output should show an interface for each eioc configuration file in the<br />
following list:<br />
# ls /etc/sysconfig/network-scripts<br />
ifcfg-eioc1<br />
ifcfg-eioc2<br />
ifcfg-eioc3<br />
ifcfg-eioc4<br />
ifcfg-eioc5<br />
ifcfg-eioc6<br />
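This cross-check of configuration files against live interfaces can be scripted. In the sketch below the two lists are hard-coded sample data; on a real host, configured would come from an ls of the ifcfg directory and active from the ifconfig output:<br />

```shell
# Report each configured eioc interface that is missing from the active list.
configured='eioc1 eioc2 eioc3'
active='eioc1 eioc3'
for eioc in $configured; do
    case " $active " in
        *" $eioc "*) ;;                        # present: nothing to report
        *) echo "missing interface: $eioc" ;;
    esac
done
```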
Interface does not show up in output of 'ifconfig'<br />
If an interface is not displayed in the output of an ifconfig command, there is<br />
most likely a problem in the definition of that interface in the<br />
/etc/sysconfig/network-scripts/ifcfg-eiocx (for Red Hat systems)<br />
or /etc/sysconfig/network/ifcfg-eiocx (for SuSE systems) file, where<br />
eiocx is the name of the virtual interface (e.g., eioc1).<br />
NOTE:<br />
For the remainder of this section, ifcfg directory refers to<br />
/etc/sysconfig/network-scripts/ on RedHat systems, and<br />
/etc/sysconfig/network on SuSE systems.<br />
Issue an ifup command. If the interface is then displayed when issuing an<br />
ifconfig command, there may be a problem with the way the interface startup<br />
is defined in the ifcfg directory's ifcfg-eiocx file that is preventing the<br />
interface from coming up automatically.<br />
If the interface does not come up, check the interface definitions in the ifcfg<br />
directory. Make certain that there are no misspellings in the ifcfg-eiocx file.<br />
Example of ifcfg-eiocx setup for RedHat systems:<br />
DEVICE=eioc1<br />
BOOTPROTO=static<br />
IPADDR=172.26.48.132<br />
BROADCAST=172.26.63.255<br />
NETMASK=255.255.240.0<br />
NETWORK=172.26.48.0<br />
ONBOOT=yes<br />
TYPE=Ethernet<br />
Example of ifcfg-eiocx setup for SuSE and SLES systems:<br />
BOOTPROTO='static'<br />
IPADDR='172.26.48.130'<br />
BROADCAST='172.26.63.255'<br />
NETMASK='255.255.240.0'<br />
NETWORK='172.26.48.0'<br />
STARTMODE='hotplug'<br />
TYPE='Ethernet'<br />
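A common source of trouble in these files is a BROADCAST value that does not match the IPADDR/NETMASK pair. The helper below is a hypothetical convenience, not part of the QLogic tools; it computes the broadcast address implied by an address and mask so the result can be compared against the file:<br />

```shell
# Compute the broadcast address implied by an IPv4 address and netmask.
broadcast_for() {
    # Split both dotted quads into eight positional parameters.
    old_ifs=$IFS; IFS=.
    set -- $1 $2
    IFS=$old_ifs
    echo "$(( ($1 & $5) | (255 - $5) )).$(( ($2 & $6) | (255 - $6) )).$(( ($3 & $7) | (255 - $7) )).$(( ($4 & $8) | (255 - $8) ))"
}

broadcast_for 172.26.48.130 255.255.240.0   # prints 172.26.63.255
```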
Verify the physical connection between the VIO hardware and<br />
the Ethernet network<br />
If the interface is displayed in an ifconfig and a ping between the InfiniBand<br />
host and the Ethernet host is still unsuccessful, verify that the VIO hardware<br />
Ethernet ports are physically connected to the correct Ethernet network. Verify<br />
that the Ethernet port corresponding to the IOCGUID for the interface to be used<br />
is connected to the expected Ethernet network.<br />
There are up to 6 IOC GUIDs on each VIO hardware module (6 for the IB/Ethernet<br />
Bridge Module, 2 for the EVIC), one for each Ethernet port. If a VIO hardware<br />
module can be seen from a host, the ib_qlgc_vnic_query -s command displays<br />
information similar to:<br />
EVIC in Chassis 0x00066a000300012a, Slot 19, Ioc 1<br />
EVIC in Chassis 0x00066a000300012a, Slot 19, Ioc 2<br />
EVIC in Chassis 0x00066a000300012a, Slot 8, Ioc 1<br />
EVIC in Chassis 0x00066a000300012a, Slot 8, Ioc 2<br />
EVIC in Chassis 0x00066a00da000100, Slot 2, Ioc 1<br />
EVIC in Chassis 0x00066a00da000100, Slot 2, Ioc 2<br />
Troubleshooting SRP Issues<br />
ib_qlgc_srp_stats showing session in disconnected state<br />
Problem:<br />
If the session is part of a multi-session adapter, ib_qlgc_srp_stats will show it<br />
to be in the disconnected state. For example:<br />
SCSI Host # : 17 | Mode : ROUNDROBIN<br />
Trgt Adapter Depth : 1000 | Verify Target : Yes<br />
Rqst Adapter Depth : 1000 | Rqst LUN Depth : 16<br />
Tot Adapter Depth : 1000 | Tot LUN Depth : 16<br />
Act Adapter Depth : 998 | Act LUN Depth : 16<br />
Max LUN Scan : 512 | Max IO : 131072 (128 KB)<br />
Max Sectors : 256 | Max SG Depth : 33<br />
Session Count : 2 | No Connect T/O : 60 Second(s)<br />
Register In Order : ON | Dev Reqst T/O : 2 Second(s)<br />
Description : SRP Virtual HBA 1<br />
Session : Session 1 | State : Disconnected<br />
Source GID : 0xfe8000000000000000066a000100d051<br />
Destination GID : 0xfe8000000000000000066a0260000165<br />
SRP IOC Profile : Chassis 0x00066A0001000481, Slot 1, IOC 1<br />
SRP Target IOClass : 0xFF00 | SRP Target SID : 0x0000494353535250<br />
SRP IPI Guid : 0x00066a000100d051 | SRP IPI Extnsn : 0x0000000000000001<br />
SRP TPI Guid : 0x00066a0138000165 | SRP TPI Extnsn : 0x0000000000000001<br />
Source LID : 0x000b | Dest LID : 0x0004<br />
Completed Sends : 0x00000000000002c0 | Send Errors : 0x0000000000000000<br />
Completed Receives : 0x00000000000002c0 | Receive Errors : 0x0000000000000000<br />
Connect Attempts : 0x0000000000000000 | Test Attempts : 0x0000000000000000<br />
Total SWUs : 0x00000000000003e8 | Available SWUs : 0x00000000000003e8<br />
Busy SWUs : 0x0000000000000000 | SRP Req Limit : 0x00000000000003e8<br />
SRP Max ITIU : 0x0000000000000140 | SRP Max TIIU : 0x0000000000000140<br />
Host Busys : 0x0000000000000000 | SRP Max SG Used : 0x000000000000000f<br />
Session : Session 2 | State : Disconnected<br />
Source GID : 0xfe8000000000000000066a000100d052<br />
Destination GID : 0xfe8000000000000000066a0260000165<br />
SRP IOC Profile : Chassis 0x00066A0001000481, Slot 1, IOC 2<br />
SRP Target IOClass : 0xFF00 | SRP Target SID : 0x0000494353535250<br />
SRP IPI Guid : 0x00066a000100d052 | SRP IPI Extnsn : 0x0000000000000001<br />
SRP TPI Guid : 0x00066a0238000165 | SRP TPI Extnsn : 0x0000000000000001<br />
Source LID : 0x000c | Dest LID : 0x0004<br />
Completed Sends : 0x00000000000001c8 | Send Errors : 0x0000000000000000<br />
Completed Receives : 0x00000000000001c8 | Receive Errors : 0x0000000000000000<br />
Connect Attempts : 0x0000000000000000 | Test Attempts : 0x0000000000000000<br />
Total SWUs : 0x00000000000003e8 | Available SWUs : 0x00000000000003e8<br />
Busy SWUs : 0x0000000000000000 | SRP Req Limit : 0x00000000000003e8<br />
SRP Max ITIU : 0x0000000000000140 | SRP Max TIIU : 0x0000000000000140<br />
Host Busys : 0x0000000000000000 | SRP Max SG Used : 0x000000000000000f<br />
Solution:<br />
An interswitch cable may have been disconnected, the VIO hardware may be<br />
offline, or the Chassis/Slot may not contain a VIO hardware card. Instead of<br />
relying on this output alone, use the ib_qlgc_srp_query command to verify that<br />
the desired adapter port is in the active state.<br />
NOTE:<br />
It is normal to see the "Can not find a path" message when the system<br />
first boots up. Sometimes SRP comes up before the subnet manager has<br />
brought the port state of the adapter port to active. If the adapter port is not<br />
active, SRP will not be able to find the VIO hardware card. Use the<br />
appropriate OFED command to show the port state.<br />
Session in 'Connection Rejected' state<br />
Problem:<br />
The session is in the 'Connection Rejected' state according to<br />
/var/log/messages. If the session is part of a multi-session adapter,<br />
ib_qlgc_srp_stats shows it in the "Connection Rejected" state.<br />
A host displays:<br />
"Connection Failed for Session X: IBT Code = 0x0"<br />
"Connection Failed for Session X: SRP Code = 0x1003"<br />
"Connection Rejected"<br />
Following is an example:<br />
SCSI Host # : 17 | Mode : ROUNDROBIN<br />
Trgt Adapter Depth : 1000 | Verify Target : Yes<br />
Rqst Adapter Depth : 1000 | Rqst LUN Depth : 16<br />
Tot Adapter Depth : 1000 | Tot LUN Depth : 16<br />
Act Adapter Depth : 998 | Act LUN Depth : 16<br />
Max LUN Scan : 512 | Max IO : 131072 (128 KB)<br />
Max Sectors : 256 | Max SG Depth : 33<br />
Session Count : 2 | No Connect T/O : 60 Second(s)<br />
Register In Order : ON | Dev Reqst T/O : 2 Second(s)<br />
Description : SRP Virtual HBA 1<br />
Session : Session 1 | State : Disconnected<br />
Source GID : 0xfe8000000000000000066a000100d051<br />
Destination GID : 0xfe8000000000000000066a0260000165<br />
SRP IOC Profile : Chassis 0x00066A0001000481, Slot 1, IOC 1<br />
SRP Target IOClass : 0xFF00 | SRP Target SID : 0x0000494353535250<br />
SRP IPI Guid : 0x00066a000100d051 | SRP IPI Extnsn : 0x0000000000000001<br />
SRP TPI Guid : 0x00066a0138000165 | SRP TPI Extnsn : 0x0000000000000001<br />
Source LID : 0x000b | Dest LID : 0x0004<br />
Completed Sends : 0x00000000000002c0 | Send Errors : 0x0000000000000000<br />
Completed Receives : 0x00000000000002c0 | Receive Errors : 0x0000000000000000<br />
Connect Attempts : 0x0000000000000000 | Test Attempts : 0x0000000000000000<br />
Total SWUs : 0x00000000000003e8 | Available SWUs : 0x00000000000003e8<br />
Busy SWUs : 0x0000000000000000 | SRP Req Limit : 0x00000000000003e8<br />
SRP Max ITIU : 0x0000000000000140 | SRP Max TIIU : 0x0000000000000140<br />
Host Busys : 0x0000000000000000 | SRP Max SG Used : 0x000000000000000f<br />
Session : Session 2 | State : Disconnected<br />
Source GID : 0xfe8000000000000000066a000100d052<br />
Destination GID : 0xfe8000000000000000066a0260000165<br />
SRP IOC Profile : Chassis 0x00066A0001000481, Slot 1, IOC 2<br />
SRP Target IOClass : 0xFF00 | SRP Target SID : 0x0000494353535250<br />
SRP IPI Guid : 0x00066a000100d052 | SRP IPI Extnsn : 0x0000000000000001<br />
SRP TPI Guid : 0x00066a0238000165 | SRP TPI Extnsn : 0x0000000000000001<br />
Source LID : 0x000c | Dest LID : 0x0004<br />
Completed Sends : 0x00000000000001c8 | Send Errors : 0x0000000000000000<br />
Completed Receives : 0x00000000000001c8 | Receive Errors : 0x0000000000000000<br />
Connect Attempts : 0x0000000000000000 | Test Attempts : 0x0000000000000000<br />
Total SWUs : 0x00000000000003e8 | Available SWUs : 0x00000000000003e8<br />
Busy SWUs : 0x0000000000000000 | SRP Req Limit : 0x00000000000003e8<br />
SRP Max ITIU : 0x0000000000000140 | SRP Max TIIU : 0x0000000000000140<br />
Host Busys : 0x0000000000000000 | SRP Max SG Used : 0x000000000000000f<br />
AND<br />
The VIO hardware displays "Initiator Not Configured within IOU: initiator<br />
port identifier is invalid/not allowed to use this FCIOU".<br />
Solution 1:<br />
The host initiator has not been configured as an SRP initiator on the VIO<br />
hardware SRP Initiator Discovery screen. Via Chassis Viewer, bring up the SRP<br />
Initiator Discovery screen and either:<br />
• Click 'Add New' to add a wildcarded entry with the initiator extension to<br />
match what is in the session entry in the qlgc_srp.cfg file, or<br />
• Click the Start button to discover the adapter port GUID, and then click<br />
'Configure' on the row containing the adapter port GUID and give the entry<br />
a name.<br />
Solution 2:<br />
Check the SRP map on the VIO hardware specified in the failing Session block of<br />
the qlgc_srp.cfg file. Make certain there is a map defined for the row specified<br />
by either the initiatorExtension in the failing Session block of the<br />
qlgc_srp.cfg file or the adapter port GUID specified in the failing Session block<br />
of the qlgc_srp.cfg file. Additionally, make certain that the map in that row is in<br />
the column of the IOC specified in the failing Session block of the qlgc_srp.cfg<br />
file.<br />
Attempts to read or write to disk are unsuccessful<br />
Problem:<br />
Attempts to read or write to the disk are unsuccessful when SRP comes up. About<br />
every five seconds the VIO hardware displays<br />
"fcIOStart Failed",<br />
"CMDisconnect called for Port: xxxxxxxx Initiator: Target: ",<br />
and<br />
"Target Port Deleted for Port: xxxxxxxx Initiator: Target: "<br />
The host log shows a session transitioning between Connected and Down. The<br />
host log also displays "Test Unit Ready has FAILED", "Abort Task has<br />
SUCCEEDED", and "Clear Task has FAILED".<br />
Solution:<br />
This indicates a problem in the path between the VIO hardware and the target<br />
storage device. After an SRP host has connected to the VIO hardware<br />
successfully, the host sends a "Test Unit Ready" command to the storage<br />
device. After five seconds, if that command is not responded to, the SRP host<br />
brings down the session and retries in five seconds. Verify that the status of the<br />
connection between the appropriate VIO hardware port and the target device is<br />
UP on the FCP Device Discovery screen.<br />
Problem:<br />
Attempts to read or write to the disk are unsuccessful, when they were previously<br />
successful. The host displays 'Sense Data indicates recovery is<br />
necessary on Session' and the "Test Unit Ready has FAILED", "Abort<br />
Task has SUCCEEDED", "Clear Task has FAILED" messages.<br />
Solution:<br />
If there is a problem with communication between the VIO hardware and the<br />
storage device (e.g., the cable between the storage device and the Fibre Channel<br />
switch was pulled) the VIO hardware log will display a "Connection Lost to<br />
NPort Id" message. The next time the host tries to do an input/output (I/O),<br />
the 'Sense Data indicates recovery is necessary' message appears, and SRP<br />
recycles the session. As part of trying to move the session from 'Connected' to<br />
'Active', SRP issues the 'Test Unit Ready' command.<br />
Verify that the status of the connection between the appropriate VIO hardware<br />
port and the target device is UP on the FCP Device Discovery screen.<br />
Additionally, there may occasionally be messages in the log such as:<br />
Connection Failed for Session X: IBT Code = 0x0<br />
Connection Failed for Session X: SRP Code = 0x0<br />
These messages may indicate a problem in the path between the VIO hardware<br />
and the target storage device.<br />
Four sessions in a round-robin configuration are active<br />
Problem:<br />
Four sessions in a round-robin configuration are active according to<br />
ib_qlgc_srp_stats. However, only one disk can be seen, although five should<br />
be seen.<br />
Solution 1:<br />
Make certain that Max LUN Scan is reporting the same value that<br />
adapterMaxLUNs is set to in qlgc_srp.cfg.<br />
Solution 2:<br />
Make certain that all sessions have a map to the same disk defined. The fact that<br />
the session is active means that the session can see a disk. However, if one of the<br />
sessions is using a map with the 'wrong' disk, then the round-robin method could<br />
lead to a disk or disks not being seen.<br />
Which port does a port GUID refer to<br />
Solution:<br />
A QLogic Host Channel Adapter Port GUID is of the form 00066appa0iiiiii,<br />
where pp gives the port number (0 relative)<br />
and iiiiii gives the individual ID number of the adapter,<br />
so 00066a00a0iiiiii is the port GUID of the first port of the adapter<br />
and 00066a01a0iiiiii is the port GUID of the second port of the adapter.<br />
Similarly, a VFx Port GUID is of the form 00066app38iiiiii,<br />
where pp gives the IOC number (1 or 2)<br />
and iiiiii gives the individual ID number of the VIO hardware,<br />
so 00066a0138iiiiii is the port GUID of IOC 1 of VIO hardware iiiiii<br />
and 00066a0238iiiiii is the port GUID of IOC 2 of VIO hardware iiiiii.<br />
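Given these layouts, a small helper can split a port GUID into its fields. The function below is an illustrative sketch based on the format described above, not a QLogic utility, and the GUID in the example is a made-up value:<br />

```shell
# Decode a QLogic HCA port GUID (00066appa0iiiiii): extract the pp field
# (0-relative port number) and the adapter ID.
decode_port_guid() {
    guid=${1#0x}                                  # drop any 0x prefix
    pp=$(printf '%s' "$guid" | cut -c7-8)         # characters 7-8: port field
    id=$(printf '%s' "$guid" | cut -c11-16)       # characters 11-16: adapter ID
    printf 'port=%d adapter_id=%s\n' $(( 0x$pp + 1 )) "$id"
}

decode_port_guid 00066a01a0123456   # prints: port=2 adapter_id=123456
```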
NOTE:<br />
After a virtual adapter has been successfully added (meaning at least one<br />
session of the adapter has gone to the Active state), the SRP module<br />
indicates what type of session was created in the mode variable shown by<br />
ib_qlgc_srp_stats, depending on whether "roundrobinmode: 1" is set in<br />
the qlgc_srp.cfg file. In this case, "X" is the virtual adapter number,<br />
with number 0 being the first one created.<br />
If no sessions were successfully brought to the Active state, then the<br />
roundrobin_X or failover_X file will not be created.<br />
In a round robin configuration, if everything is configured correctly, all sessions will<br />
be Active.<br />
In a failover configuration, if everything is configured correctly, one session will be<br />
Active and the rest will be Connected. The transition of a session from Connected<br />
to Active will not be attempted until that session needs to become Active, due to<br />
the failure of the previously Active session.<br />
How does the user find a <strong>Host</strong> Channel Adapter port GUID<br />
Solution:<br />
A Host Channel Adapter Port GUID is displayed by entering the following at any<br />
host prompt:<br />
ibv_devinfo -i 1 for port 1<br />
ibv_devinfo -i 2 for port 2<br />
The system displays information similar to the following:<br />
st106:~ # ibv_devinfo -i 1<br />
hca_id: mthca0<br />
fw_ver: 5.1.9301<br />
node_guid: 0006:6a00:9800:6c9f<br />
sys_image_guid: 0006:6a00:9800:6c9f<br />
vendor_id: 0x066a<br />
vendor_part_id: 25218<br />
hw_ver: 0xA0<br />
board_id: SS_0000000005<br />
phys_port_cnt: 2<br />
port: 1<br />
state: PORT_ACTIVE (4)<br />
max_mtu: 2048 (4)<br />
active_mtu: 2048 (4)<br />
sm_lid: 71<br />
port_lid: 60<br />
port_lmc: 0x00<br />
st106:~ # ibv_devinfo -i 2<br />
hca_id: mthca0<br />
fw_ver: 5.1.9301<br />
node_guid: 0006:6a00:9800:6c9f<br />
sys_image_guid: 0006:6a00:9800:6c9f<br />
vendor_id: 0x066a<br />
vendor_part_id: 25218<br />
hw_ver: 0xA0<br />
board_id: SS_0000000005<br />
phys_port_cnt: 2<br />
port: 2<br />
state: PORT_ACTIVE (4)<br />
max_mtu: 2048 (4)<br />
active_mtu: 2048 (4)<br />
sm_lid: 71<br />
port_lid: 64<br />
port_lmc: 0x00<br />
Need to determine the SRP driver version.<br />
Solution:<br />
To determine the SRP driver version number, enter the command modinfo -d<br />
qlgc-srp, which returns information similar to the following:<br />
st159:~ # modinfo -d qlgc-srp<br />
QLogic Corp. Virtual HBA (SRP) SCSI Driver, version 1.0.0.0.3<br />
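If only the version number itself is needed (for comparing across nodes, for example), it can be stripped out of the description string. The sample string is hard-coded in this sketch; on a host it would come from modinfo -d qlgc-srp:<br />

```shell
# Isolate the trailing version number from the driver description line.
desc='QLogic Corp. Virtual HBA (SRP) SCSI Driver, version 1.0.0.0.3'
version=${desc##*version }   # strip everything up to and including "version "
echo "$version"              # prints 1.0.0.0.3
```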
H<br />
Write Combining<br />
Introduction<br />
Write combining improves write bandwidth to the QLogic chip by writing multiple<br />
words in a single bus transaction (typically 64 bytes). Write combining applies only<br />
to x86_64 systems.<br />
The x86 Page Attribute Table (PAT) mechanism that allocates Write Combining<br />
(WC) mappings for the PIO buffers has been added and is now the default.<br />
If PAT is unavailable or PAT initialization fails, the code will generate a message in<br />
the log and fall back to the Memory Type Range Registers (MTRR) mechanism.<br />
If write combining is not working properly, lower than expected bandwidth may<br />
occur.<br />
The following sections provide instructions for checking write combining and for<br />
using PAT and MTRR.<br />
Verify Write Combining is Working<br />
To see if write combining is working correctly and to check the bandwidth, run the<br />
following command:<br />
$ ipath_pkt_test -B<br />
With write combining enabled, the QLE7140 and QLE7240 report in the range<br />
of 1150–1500 MBps. The QLE7280 reports in the range of 1950–3000 MBps.<br />
You can also use ipath_checkout (use option 5) to check bandwidth.<br />
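Those ranges can be turned into a rough pass/fail check. The 1000 MBps threshold in this sketch is an assumption chosen to sit under the slowest quoted range, not a QLogic-specified limit:<br />

```shell
# Flag a measured PIO bandwidth that suggests write combining is disabled.
check_wc_bandwidth() {
    if [ "$1" -lt 1000 ]; then
        echo "WARNING: $1 MBps is low; write combining may be disabled"
    else
        echo "OK: $1 MBps"
    fi
}

check_wc_bandwidth 273    # prints the warning
check_wc_bandwidth 1350   # prints OK: 1350 MBps
```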
Although the PAT mechanism should work correctly by default, increased latency<br />
and low bandwidth may indicate a problem. If so, the interconnect operates, but in<br />
a degraded performance mode, with latency increasing to several microseconds,<br />
and bandwidth decreasing to as little as 200 MBps.<br />
Upon driver startup, you may see these errors:<br />
ib_qib 0000:04:01.0: infinqib0: Performance problem: bandwidth to<br />
PIO buffers is only 273 MiB/sec<br />
...<br />
If you do not see any of these messages on your console, but suspect this<br />
problem, check the /var/log/messages file. Some systems suppress driver<br />
load messages but still output them to the log file.<br />
Methods for enabling and disabling the two write combining mechanisms are<br />
described in the following sections. There are no conflicts between the two<br />
methods.<br />
PAT and Write Combining<br />
This is the default mechanism for allocating Write Combining (WC) mappings for<br />
the PIO buffers. It is set as a parameter in /etc/modprobe.conf (on Red Hat<br />
systems) or /etc/modprobe.conf.local (on SLES systems). The default is:<br />
options ib_qib wc_pat=1<br />
If PAT is unavailable or PAT initialization fails, the code generates a message in<br />
the log and falls back to the Memory Type Range Registers (MTRR) mechanism.<br />
To use MTRR, disable PAT by setting this module parameter to 0 (as a root user):<br />
options ib_qib wc_pat=0<br />
Then, revert to using MTRR-only behavior by following one of the two suggestions<br />
in “MTRR Mapping and Write Combining” on page H-2.<br />
The driver must be restarted after the changes have been made.<br />
NOTE:<br />
There will be no WC entry in /proc/mtrr when using PAT.<br />
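Editing the parameter line can be scripted so repeated runs do not leave duplicate entries. The sketch below operates on a temporary file for safety; on a real system the target would be /etc/modprobe.conf (Red Hat) or /etc/modprobe.conf.local (SLES), edited as root:<br />

```shell
# Set the ib_qib wc_pat parameter, replacing any existing setting.
CONF=$(mktemp)
set_wc_pat() {
    if grep -q '^options ib_qib wc_pat=' "$CONF"; then
        sed -i "s/^options ib_qib wc_pat=.*/options ib_qib wc_pat=$1/" "$CONF"
    else
        echo "options ib_qib wc_pat=$1" >> "$CONF"
    fi
}

set_wc_pat 1
set_wc_pat 0
cat "$CONF"   # shows exactly one line: options ib_qib wc_pat=0
```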
MTRR Mapping and Write Combining<br />
Two suggestions for properly enabling MTRR mapping for write combining are<br />
described in the following sections.<br />
See “Performance Issues” on page F-9 for more details on a related performance<br />
issue.<br />
Edit BIOS Settings to Fix MTRR Issues<br />
You can edit the BIOS setting for MTRR mapping. The BIOS setting looks similar<br />
to:<br />
MTRR Mapping<br />
[Discrete]<br />
For systems with very large amounts of memory (32GB or more), it may also be<br />
necessary to adjust the BIOS setting for the PCI hole granularity to 2GB. This<br />
setting allows the memory to be mapped with fewer MTRRs, so that there will be<br />
one or more unused MTRRs for the InfiniPath driver.<br />
Some BIOSes do not have the MTRR mapping option, or the option may have a<br />
different name, depending on the chipset, vendor, BIOS, or other factors. For<br />
example, it is sometimes referred to as 32 bit memory hole. This setting must be<br />
enabled.<br />
If there is no setting for MTRR mapping or 32 bit memory hole, and you have<br />
problems with degraded performance, contact your system or motherboard<br />
vendor and ask how to enable write combining.<br />
Use the ipath_mtrr Script to Fix MTRR Issues<br />
QLogic also provides a script, ipath_mtrr, which sets the MTRR registers,<br />
enabling maximum performance from the InfiniPath driver. This Python script is<br />
available as a part of the InfiniPath software download, and is contained in the<br />
infinipath* RPM. It is installed in /bin.<br />
To diagnose the machine, run it with no arguments (as a root user):<br />
# ipath_mtrr<br />
The test results will list any problems, if they exist, and provide suggestions on<br />
what to do.<br />
To fix the MTRR registers, use:<br />
# ipath_mtrr -w<br />
Restart the driver after fixing the registers.<br />
This script needs to be run after each system reboot. It can be set to run<br />
automatically upon restart by adding this line in /etc/sysconfig/infinipath:<br />
IPATH_MTRR_ACTIVE=1<br />
See the ipath_mtrr(8) man page for more information on other options.<br />
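Adding that line can likewise be made idempotent so that repeated runs do not duplicate it. This sketch uses a temporary file for demonstration; on a real host the target is /etc/sysconfig/infinipath, edited as root:<br />

```shell
# Append IPATH_MTRR_ACTIVE=1 only if it is not already present.
FILE=$(mktemp)
enable_mtrr_at_boot() {
    grep -q '^IPATH_MTRR_ACTIVE=1' "$1" 2>/dev/null ||
        echo 'IPATH_MTRR_ACTIVE=1' >> "$1"
}

enable_mtrr_at_boot "$FILE"
enable_mtrr_at_boot "$FILE"   # second call adds nothing
grep -c '^IPATH_MTRR_ACTIVE=1' "$FILE"   # prints 1
```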
I<br />
Useful Programs and Files<br />
The most useful programs and files for debugging, and commands for common<br />
tasks, are presented in the following sections. Many of these programs and files<br />
have been discussed elsewhere in the documentation. This information is<br />
summarized and repeated here for your convenience.<br />
Check Cluster Homogeneity with ipath_checkout<br />
Many problems can be attributed to the lack of homogeneity in the cluster<br />
environment. Use the following items as a checklist for verifying homogeneity. A<br />
difference in any one of these items in your cluster may cause problems:<br />
• Kernels<br />
• Distributions<br />
• Versions of the QLogic boards<br />
• Runtime and build environments<br />
• .o files from different compilers<br />
• Libraries<br />
• Processor/link speeds<br />
• PIO bandwidth<br />
• MTUs<br />
With the exception of finding any differences between the runtime and build<br />
environments, ipath_checkout will pick up information on all the above items.<br />
Other programs useful for verifying homogeneity are listed in Table I-1. More<br />
details on ipath_checkout are in “ipath_checkout” on page I-10.<br />
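One way to make differences jump out is to collect one value per node (kernel version, MTU, driver version, and so on) and compare everything against the first node. The node names and values in this sketch are made-up sample data; on a real cluster the list would come from ssh or pdsh:<br />

```shell
# Report any node whose collected value differs from the first node's.
check_homogeneous() {
    ref=''
    while read -r node value; do
        [ -z "$ref" ] && ref=$value                 # first node sets the baseline
        if [ "$value" != "$ref" ]; then
            echo "MISMATCH: $node has $value (expected $ref)"
        fi
    done
    return 0
}

printf '%s\n' \
    'node01 2.6.18-194.el5' \
    'node02 2.6.18-194.el5' \
    'node03 2.6.18-238.el5' | check_homogeneous
```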
Restarting InfiniPath<br />
When the driver status appears abnormal on any node, you can try restarting (as<br />
a root user). Type:<br />
# /etc/init.d/openibd restart<br />
These two commands perform the same function as restart:<br />
# /etc/init.d/openibd stop<br />
# /etc/init.d/openibd start<br />
Also check the /var/log/messages file for any abnormal activity.<br />
Summary and Descriptions of Useful Programs<br />
Useful programs are summarized in Table I-1. Names in blue text are linked to a<br />
corresponding section that provides further details. Check the man pages for<br />
more information on the programs.<br />
Table I-1. Useful Programs<br />
chkconfig: Checks the configuration state and enables/disables services,<br />
including drivers. Can be useful for checking homogeneity.<br />
dmesg: Prints out bootup messages. Useful for checking for initialization<br />
problems.<br />
ibhosts (a): Checks that all hosts in the fabric are up and visible to the<br />
subnet manager and to each other.<br />
ibstatus (a): Checks the status of InfiniBand devices when OpenFabrics is<br />
installed.<br />
ibtracert (a): Determines the path that InfiniBand packets travel between<br />
two nodes.<br />
ibv_devinfo (a): Lists information about InfiniBand devices in use. Use when<br />
OpenFabrics is enabled.<br />
ident (b): Identifies RCS keyword strings in files. Can check for dates,<br />
release versions, and other identifying information.<br />
ipathbug-helper (c): A shell script that gathers status and history<br />
information for use in analyzing InfiniPath problems.<br />
ipath_checkout (c): A bash shell script that performs sanity testing on a<br />
cluster using QLogic hardware and InfiniPath software. When the program runs<br />
without errors, the node is properly configured.<br />
ipath_control (c): A shell script that manipulates various parameters for the<br />
InfiniPath driver. This script gathers the same information contained in<br />
boardversion, status_str, and version.<br />
ipath_mtrr (c): A Python script that sets the MTRR registers.<br />
ipath_pkt_test (c): Tests the InfiniBand link and bandwidth between two<br />
QLogic InfiniBand adapters, or, using an InfiniBand loopback connector, tests<br />
within a single QLogic InfiniBand adapter.<br />
Table I-1. Useful Programs (Continued)<br />
ipathstats (c): Displays driver statistics and hardware counters, including<br />
performance and "error" (including status) counters.<br />
lsmod: Shows status of modules in the Linux kernel. Use to check whether<br />
drivers are loaded.<br />
modprobe: Adds or removes modules from the Linux kernel.<br />
mpi_stress: An MPI stress test program designed to load up an MPI<br />
interconnect with point-to-point messages while optionally checking for data<br />
integrity.<br />
mpirun (d): A front end program that starts an MPI job on an InfiniPath<br />
cluster. Use to check the origin of the drivers.<br />
ps: Displays information on current active processes. Use to check whether<br />
all necessary processes have been started.<br />
rpm: Package manager to install, query, verify, update, or erase software<br />
packages. Use to check the contents of a package.<br />
strings (e): Prints the strings of printable characters in a file. Useful for<br />
determining contents of non-text files such as date and version.<br />
Table Notes<br />
a. These programs are contained in the OpenFabrics openib-diags RPM.<br />
b. These programs are contained within the rcs RPM for your distribution.<br />
c. These programs are contained in the infinipath RPM. To use these programs,<br />
install the infinipath RPM on the nodes where you install the mpi-frontend RPM.<br />
d. These programs are contained in the QLogic mpi-frontend RPM.<br />
e. These programs are contained within the binutils RPM for your distribution.<br />
dmesg<br />
dmesg prints out bootup messages. It is useful for checking for initialization<br />
problems. You can check to see if problems were detected during the driver and<br />
<strong>QLogic</strong> hardware initialization with the command:<br />
$ dmesg | egrep -i 'infinipath|qib'<br />
This command may generate more than one screen of output.<br />
iba_opp_query<br />
This command retrieves path records from the Distributed SA and is somewhat<br />
similar to iba_saquery. It is intended for testing the Distributed SA<br />
(qlogic_sa) and for verifying connectivity between nodes in the fabric. For<br />
information on configuring and using the Distributed SA, refer to “<strong>QLogic</strong><br />
Distributed Subnet Administration” on page 3-13.<br />
iba_opp_query does not access the SM when doing queries; it only accesses<br />
the local Distributed SA database. For that reason, the kinds of queries that can<br />
be done are much more limited than with iba_saquery. In particular, it can only<br />
find paths that start on the machine where the command is run (in other words,<br />
the source LID or source GID must be on the local node). In addition, queries<br />
must supply either a source and destination LID, or a source and destination GID;<br />
the two cannot be mixed. You will also usually need to provide either a SID that<br />
was specified in the Distributed SA configuration file, or a pkey that matches such<br />
a SID.<br />
Usage<br />
iba_opp_query [-v level] [-hca hca] [-p port] [-s LID] [-d LID] [-S GID] [-D GID] [-k pkey] [-i sid] [-H]<br />
Options<br />
-v/--verbose level — Debug level. Should be a number between 1 and 7. Default is 5.<br />
-s/--slid LID — Source LID. Can be in decimal, hex (0x##), or octal (0##).<br />
-d/--dlid LID — Destination LID. Can be in decimal, hex (0x##), or octal (0##).<br />
-S/--sgid GID — Source GID. Can be in GID ("0x########:0x########") or inet6 format ("##:##:##:##:##:##:##:##").<br />
-D/--dgid GID — Destination GID. Can be in GID ("0x########:0x########") or inet6 format ("##:##:##:##:##:##:##:##").<br />
-k/--pkey pkey — Partition Key.<br />
-i/--sid sid — Service ID.<br />
-h/--hca hca — The <strong>Host</strong> Channel Adapter to use. (Defaults to the first <strong>Host</strong> Channel Adapter.) The <strong>Host</strong> Channel Adapter can be identified by name ("mthca0", "qib1", et cetera) or by number (1, 2, 3, et cetera).<br />
-p/--port port — The port to use. (Defaults to the first port.)<br />
-H/--help — Provides this help text.<br />
All arguments are optional, but ill-formed queries can be expected to fail. You<br />
must provide at least a pair of LIDs, or a pair of GIDs.<br />
Sample output:<br />
# iba_opp_query --slid 0x31 --dlid 0x75 --sid 0x107<br />
Query Parameters:<br />
resv1       0x0000000000000107<br />
dgid        ::<br />
sgid        ::<br />
dlid        0x75<br />
slid        0x31<br />
hop         0x0<br />
flow        0x0<br />
tclass      0x0<br />
num_path    0x0<br />
pkey        0x0<br />
qos_class   0x0<br />
sl          0x0<br />
mtu         0x0<br />
rate        0x0<br />
pkt_life    0x0<br />
preference  0x0<br />
resv2       0x0<br />
resv3       0x0<br />
Using HCA qib0<br />
Result:<br />
resv1       0x0000000000000107<br />
dgid        fe80::11:7500:79:e54a<br />
sgid        fe80::11:7500:79:e416<br />
dlid        0x75<br />
slid        0x31<br />
hop         0x0<br />
flow        0x0<br />
tclass      0x0<br />
num_path    0x0<br />
pkey        0xffff<br />
qos_class   0x0<br />
sl          0x1<br />
mtu         0x4<br />
rate        0x6<br />
pkt_life    0x10<br />
preference  0x0<br />
resv2       0x0<br />
resv3       0x0<br />
Explanation of Sample Output:<br />
This is a simple query, specifying the source and destination LIDs and the<br />
desired SID. The first half of the output shows the full “query” that will be<br />
sent to the Distributed SA. Unused fields are set to zero or are blank.<br />
In the center, the line “Using HCA qib0” tells us that, because we did not<br />
specify which <strong>Host</strong> Channel Adapter to query against, the tool chose one for<br />
us. (Normally, the user will never have to specify which <strong>Host</strong> Channel<br />
Adapter to use. This is only relevant in the case where a single node is<br />
connected to multiple physical IB fabrics.)<br />
Finally, the bottom half of the output shows the result of the query. Note that<br />
if the query had failed (because the destination does not exist, or because<br />
the SID is not found in the Distributed SA), you would receive an error instead:<br />
# iba_opp_query --slid 0x31 --dlid 0x75 --sid 0x108<br />
Query Parameters:<br />
resv1       0x0000000000000108<br />
dgid        ::<br />
sgid        ::<br />
dlid        0x75<br />
slid        0x31<br />
hop         0x0<br />
flow        0x0<br />
tclass      0x0<br />
num_path    0x0<br />
pkey        0x0<br />
qos_class   0x0<br />
sl          0x0<br />
mtu         0x0<br />
rate        0x0<br />
pkt_life    0x0<br />
preference  0x0<br />
resv2       0x0<br />
resv3       0x0<br />
Using HCA qib0<br />
******<br />
Error: Get Path returned 22 for query: Invalid argument<br />
******<br />
Examples:<br />
Query by LID and SID:<br />
iba_opp_query -s 0x31 -d 0x75 -i 0x107<br />
iba_opp_query --slid 0x31 --dlid 0x75 --sid 0x107<br />
Queries using octal or decimal numbers:<br />
iba_opp_query --slid 061 --dlid 0165 --sid 0407 (using octal numbers)<br />
iba_opp_query --slid 49 --dlid 117 --sid 263 (using decimal numbers)<br />
Note that these queries are the same as the first two; only the base of the<br />
numbers has changed.<br />
Query by LID and PKEY:<br />
iba_opp_query --slid 0x31 --dlid 0x75 --pkey 0x8002<br />
Query by GID:<br />
iba_opp_query -S fe80::11:7500:79:e416 -D fe80::11:7500:79:e54a --sid 0x107<br />
iba_opp_query -S 0xfe80000000000000:0x001175000079e416 -D 0xfe80000000000000:0x001175000079e54a --sid 0x107<br />
As before, these queries are identical to the first two queries; they simply<br />
use GIDs instead of LIDs to specify the ports involved.<br />
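The hex, octal, and decimal forms in these examples denote the same values, which can be cross-checked with the shell's printf (this check is an illustration only, not one of the QLogic tools):<br />

```shell
# printf %d accepts C-style numeric constants, so hex (0x..), octal (0..),
# and decimal spellings of the same LID/SID print the same value.
printf '%d %d %d\n' 0x31 061 49     # slid forms -> 49 49 49
printf '%d %d %d\n' 0x75 0165 117   # dlid forms -> 117 117 117
printf '%d %d %d\n' 0x107 0407 263  # sid forms  -> 263 263 263
```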
ibhosts<br />
This tool determines if all the hosts in your InfiniBand fabric are up and visible to<br />
the subnet manager and to each other. It is installed from the openib-diag<br />
RPM. Running ibhosts (as a root user) produces output similar to this when run<br />
from a node on the InfiniBand fabric:<br />
# ibhosts<br />
Ca : 0x0008f10001280000 ports 2 "Voltaire InfiniBand Fiber-Channel Router"<br />
Ca : 0x0011750000ff9869 ports 1 "idev-11"<br />
Ca : 0x0011750000ff9878 ports 1 "idev-05"<br />
Ca : 0x0011750000ff985c ports 1 "idev-06"<br />
Ca : 0x0011750000ff9873 ports 1 "idev-04"<br />
ibstatus<br />
This program displays basic information on the status of InfiniBand devices that<br />
are currently in use when OpenFabrics RPMs are installed. It is installed from the<br />
openib-diag RPM.<br />
Following is a sample output for the SDR adapters:<br />
$ ibstatus<br />
Infiniband device 'qib0' port 1 status:<br />
        default gid:    fe80:0000:0000:0000:0011:7500:0005:602f<br />
        base lid:       0x35<br />
        sm lid:         0x2<br />
        state:          4: ACTIVE<br />
        phys state:     5: LinkUp<br />
        rate:           10 Gb/sec (4X)<br />
Following is a sample output for the DDR adapters; note the difference in rate:<br />
$ ibstatus<br />
Infiniband device 'qib0' port 1 status:<br />
        default gid:    fe80:0000:0000:0000:0011:7500:00ff:9608<br />
        base lid:       0xb<br />
        sm lid:         0x1<br />
        state:          4: ACTIVE<br />
        phys state:     5: LinkUp<br />
        rate:           20 Gb/sec (4X DDR)<br />
ibtracert<br />
The tool ibtracert determines the path that InfiniBand packets travel between<br />
two nodes. It is installed from the openib-diag RPM. The InfiniBand LIDs of the<br />
two nodes in this example are determined by using the ipath_control -i<br />
command on each node. The ibtracert tool produces output similar to the<br />
following when run (as a root user) from a node on the InfiniBand fabric:<br />
# ibtracert 0xb9 0x9a<br />
From ca {0x0011750000ff9886} portnum 1 lid 0xb9-0xb9 "iqa-37"<br />
[1] -> switch port {0x0002c9010a19bea0}[1] lid 0x14-0x14<br />
"MT47396 Infiniscale-III"<br />
[24] -> switch port {0x00066a0007000333}[8] lid 0xc-0xc<br />
"SilverStorm 9120 GUID=0x00066a000200016c Leaf 6, Chip A"<br />
[6] -> switch port {0x0002c90000000000}[15] lid 0x9-0x9<br />
"MT47396 Infiniscale-III"<br />
[7] -> ca port {0x0011750000ff9878}[1] lid 0x9a-0x9a "idev-05"<br />
To ca {0x0011750000ff9878} portnum 1 lid 0x9a-0x9a "idev-05"<br />
ibv_devinfo<br />
This program displays information about InfiniBand devices, including various<br />
kinds of identification and status data. It is installed from the openib-diag RPM.<br />
Use this program when OpenFabrics is enabled. ibv_devinfo queries RDMA<br />
devices. Use the -v option to see more information. For example:<br />
$ ibv_devinfo<br />
hca_id: qib0<br />
        fw_ver:             0.0.0<br />
        node_guid:          0011:7500:00ff:89a6<br />
        sys_image_guid:     0011:7500:00ff:89a6<br />
        vendor_id:          0x1175<br />
        vendor_part_id:     29216<br />
        hw_ver:             0x2<br />
        board_id:           InfiniPath_QLE7280<br />
        phys_port_cnt:      1<br />
        port:   1<br />
                state:          PORT_ACTIVE (4)<br />
                max_mtu:        4096 (5)<br />
                active_mtu:     4096 (5)<br />
                sm_lid:         1<br />
                port_lid:       31<br />
                port_lmc:       0x00<br />
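The numbers in parentheses after max_mtu and active_mtu are the standard IBTA MTU enum codes. A small helper (an illustration only, not part of the QLogic tools) can decode them:<br />

```shell
# Map the IBTA MTU enum code, as printed by ibv_devinfo, to its size in
# bytes (1=256 through 5=4096 per the InfiniBand specification).
ib_mtu_bytes() {
  case "$1" in
    1) echo 256 ;;
    2) echo 512 ;;
    3) echo 1024 ;;
    4) echo 2048 ;;
    5) echo 4096 ;;
    *) echo unknown ;;
  esac
}
ib_mtu_bytes 5   # prints 4096, matching "active_mtu: 4096 (5)" above
```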
ident<br />
The ident strings are available in ib_qib.ko. Running ident provides driver<br />
information. For <strong>QLogic</strong> RPMs, it will look like the following example:<br />
$ ident /lib/modules/$(uname -r)/updates/kernel/drivers/infiniband/hw/ib_qib.ko<br />
/lib/modules/$(uname -r)/updates/kernel/drivers/infiniband/hw/ib_qib.ko:<br />
$Id: <strong>QLogic</strong> OFED Release 1.5 $<br />
$Date: 2010-02-17-18:51 $<br />
If the /lib/modules/$(uname -r)/updates directory is not present,<br />
then the driver in use is the one that comes with the core kernel.<br />
In this case, either the kernel-ib RPM is not installed or it is<br />
not configured for the current running kernel.<br />
If the updates directory is present, but empty except for the subdirectory<br />
kernel, then an OFED install is probably being used, and the ident string will<br />
be empty. For example:<br />
$ cd /lib/modules/$(uname -r)/updates<br />
$ ls<br />
kernel<br />
$ cd kernel/drivers/infiniband/hw/qib/<br />
lib/modules/2.6.18-8.el5/updates/kernel/drivers/infiniband/hw/qib<br />
$ ident ib_qib.ko<br />
ib_qib.ko:<br />
ident warning: no id keywords in ib_qib.ko<br />
NOTE:<br />
ident is in the optional rcs RPM, and is not always installed.<br />
ipathbug-helper<br />
The tool ipathbug-helper is useful for verifying homogeneity. It is installed<br />
from the infinipath RPM. Before contacting <strong>QLogic</strong> Technical Support, run this<br />
script on the head node of your cluster and the compute nodes that you suspect<br />
are having problems. Looking at the output often helps you find the problem. Run<br />
ipathbug-helper on several nodes and examine the output for differences.<br />
It is best to run ipathbug-helper with root privilege, since some of the queries<br />
it makes require this level of privilege. There is also a --verbose parameter that<br />
increases the amount of gathered information.<br />
If you cannot see the problem, send the stdout output to your reseller, along with<br />
information on the version of the InfiniPath software you are using.<br />
ipath_checkout<br />
The ipath_checkout tool is a bash script that verifies that the installation is<br />
correct and that all the nodes of the network are functioning and mutually<br />
connected by the InfiniPath fabric. It is installed from the infinipath RPM. It<br />
must be run on a front end node, and requires specification of a nodefile. For<br />
example:<br />
$ ipath_checkout [options] nodefile<br />
The nodefile lists the hostnames of the nodes of the cluster, one hostname per<br />
line. The format of nodefile is as follows:<br />
hostname1<br />
hostname2<br />
...<br />
NOTE:<br />
• The hostnames in the nodefile are Ethernet hostnames, not IPv4 addresses.<br />
• To create a nodefile, use the ibhosts program. It will generate a list of available nodes that are already connected to the switch.<br />
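One way to do this (a sketch based on the output format shown in the ibhosts section, not a command from the guide) is to extract the quoted node description from each Ca line; note that some descriptions, such as those of routers, may not be usable hostnames and should be filtered out:<br />

```shell
# Pull the quoted description at the end of each "Ca" line of ibhosts
# output; for hosts this is typically the hostname.
ibhosts_to_nodefile() {
  sed -n 's/^Ca .*"\(.*\)"$/\1/p'
}

# Example, using lines in the format shown for ibhosts:
printf '%s\n' \
  'Ca : 0x0011750000ff9869 ports 1 "idev-11"' \
  'Ca : 0x0011750000ff9878 ports 1 "idev-05"' | ibhosts_to_nodefile
```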
ipath_checkout performs the following seven tests on the cluster:<br />
1. Executes the ping command to all nodes to verify that they all are<br />
reachable from the front end.<br />
2. Executes the ssh command to each node to verify correct configuration of<br />
ssh.<br />
3. Gathers and analyzes system configuration from the nodes.<br />
4. Gathers and analyzes RPMs installed on the nodes.<br />
5. Verifies InfiniPath hardware and software status and configuration, including<br />
tests for link speed, PIO bandwidth (incorrect MTRR settings), and MTU<br />
size.<br />
6. Verifies the ability to mpirun jobs on the nodes.<br />
7. Runs a bandwidth and latency test on every pair of nodes and analyzes the<br />
results.<br />
The options available with ipath_checkout are shown in Table I-2.<br />
Table I-2. ipath_checkout Options<br />
-h, --help: Displays help messages describing how a command is used.<br />
-v, --verbose; -vv, --vverbose; -vvv, --vvverbose: These options specify three successively higher levels of detail in reporting test results. There are four levels of detail in all, including the case where none of these options is given.<br />
-c, --continue: When this option is not specified, the test terminates when any test fails. When specified, the tests continue after a failure, with failing nodes excluded from subsequent tests.<br />
Table I-2. ipath_checkout Options (Continued)<br />
-k, --keep: Keeps intermediate files that were created while performing tests and compiling reports. Results are saved in a directory created by mktemp and named infinipath_XXXXXX, or in the directory name given to --workdir.<br />
--workdir=DIR: Uses DIR to hold intermediate files created while running tests. DIR must not already exist.<br />
--run=LIST: Runs only the tests in LIST. See the seven tests listed previously. For example, --run=123 will run only tests 1, 2, and 3.<br />
--skip=LIST: Skips the tests in LIST. See the seven tests listed previously. For example, --skip=2457 will skip tests 2, 4, 5, and 7.<br />
-d, --debug: Turns on the -x and -v flags in bash(1).<br />
In most cases of failure, the script suggests recommended actions. Also refer to<br />
the ipath_checkout man page.<br />
ipath_control<br />
The ipath_control tool is a shell script that manipulates various parameters<br />
for the InfiniPath driver. It is installed from the infinipath RPM. Many of the<br />
parameters are used only when diagnosing problems, and may require special<br />
system configurations. Using these options may require restarting the driver or<br />
utility programs to recover from incorrect parameters.<br />
Most of the functionality is accessed via the /sys filesystem. This shell script<br />
gathers the same information contained in these files:<br />
/sys/class/infiniband/qib0/device/boardversion<br />
/sys/class/infiniband/qib0/ports/1/linkcontrol/status_str<br />
/sys/class/infiniband/qib0/device/driver/version<br />
These files are also documented in Table I-4 and Table I-5.<br />
Other than the -i option, this script must be run with root permissions. See the<br />
man pages for ipath_control for more details.<br />
Here is sample usage and output:<br />
% ipath_control -i<br />
$Id: <strong>QLogic</strong> OFED Release 1.5 $ $Date: 2010-03-01-23:28 $<br />
0: Version: ChipABI 2.0, InfiniPath_QLE7342, InfiniPath1 6.1, SW<br />
Compat 2<br />
0: Serial: RIB0941C00005 LocalBus: PCIe,5000MHz,x8<br />
0,1: Status: 0xe1 Initted Present IB_link_up IB_configured<br />
0,1: LID=0x1 GUID=0011:7500:0079:e574<br />
0,1: HRTBT:Auto LINK:40 Gb/sec (4X QDR)<br />
0,2: Status: 0x21 Initted Present [IB link not Active]<br />
0,2: LID=0xffff GUID=0011:7500:0079:e575<br />
The -i option combined with the -v option is very useful for looking at the<br />
InfiniBand width/rate and PCIe lanes/rate. For example:<br />
% ipath_control -iv<br />
$Id: <strong>QLogic</strong> OFED Release 1.5 $ $Date: 2010-03-01-23:28 $<br />
0: Version: ChipABI 2.0, InfiniPath_QLE7342, InfiniPath1 6.1, SW<br />
Compat 2<br />
0: Serial: RIB0941C00005 LocalBus: PCIe,5000MHz,x8<br />
0,1: Status: 0xe1 Initted Present IB_link_up IB_configured<br />
0,1: LID=0x1 GUID=0011:7500:0079:e574<br />
0,1: HRTBT:Auto LINK:40 Gb/sec (4X QDR)<br />
0,2: Status: 0x21 Initted Present [IB link not Active]<br />
0,2: LID=0xffff GUID=0011:7500:0079:e575<br />
0,2: HRTBT:Auto LINK:10 Gb/sec (4X)<br />
NOTE:<br />
On the first line, Release version refers to the current software release.<br />
The second line contains chip architecture version information.<br />
Another useful option blinks the LED on the InfiniPath adapter (QLE7240 and<br />
QLE7280 adapters). This is useful for finding an adapter within a cluster. Run the<br />
following as a root user:<br />
# ipath_control -b [On|Off]<br />
ipath_mtrr<br />
NOTE:<br />
Use ipath_mtrr if you are not using the default PAT mechanism to enable<br />
write combining.<br />
MTRR is used by the InfiniPath driver to enable write combining to the <strong>QLogic</strong><br />
on-chip transmit buffers. This option improves write bandwidth to the <strong>QLogic</strong> chip<br />
by writing multiple words in a single bus transaction (typically 64 bytes). This<br />
option applies only to x86_64 systems. It can often be set in the BIOS.<br />
However, some BIOSes do not have the MTRR mapping option, or it may have a<br />
different name, depending on the chipset, vendor, BIOS, or other factors. For<br />
example, it is sometimes referred to as 32 bit memory hole. This setting must be<br />
enabled.<br />
If there is no setting for MTRR mapping or 32 bit memory hole, contact your<br />
system or motherboard vendor and ask how to enable write combining.<br />
You can check and adjust these BIOS settings using the BIOS Setup utility. For<br />
specific instructions, follow the hardware documentation that came with your<br />
system.<br />
<strong>QLogic</strong> also provides a script, ipath_mtrr, which sets the MTRR registers,<br />
enabling maximum performance from the InfiniPath driver. This Python script is<br />
available as a part of the InfiniPath software download, and is contained in the<br />
infinipath* RPM. It is installed in /bin.<br />
To diagnose the machine, run it with no arguments (as a root user):<br />
# ipath_mtrr<br />
The test results will list any problems, if they exist, and provide suggestions on<br />
what to do.<br />
To fix the MTRR registers, use:<br />
# ipath_mtrr -w<br />
Restart the driver after fixing the registers.<br />
This script needs to be run after each system reboot. It can be set to run<br />
automatically upon restart by adding this line in /etc/sysconfig/infinipath:<br />
IPATH_MTRR_ACTIVE=1<br />
See the ipath_mtrr(8) man page for more information on other options.<br />
ipath_pkt_test<br />
This program is installed from the infinipath RPM. Use ipath_pkt_test to<br />
do one of the following:<br />
• Test the InfiniBand link and bandwidth between two InfiniPath InfiniBand<br />
adapters.<br />
• Using an InfiniBand loopback connector, test the link and bandwidth within a<br />
single InfiniPath InfiniBand adapter.<br />
The ipath_pkt_test program runs in either ping-pong mode (send a packet,<br />
wait for a reply, repeat) or in stream mode (send packets as quickly as possible,<br />
receive responses as they come back).<br />
Upon completion, the sending side prints statistics on the packet bandwidth,<br />
showing both the payload bandwidth and the total bandwidth (including InfiniBand<br />
and InfiniPath headers). See the man page for more information.<br />
ipathstats<br />
The ipathstats program is useful for diagnosing InfiniPath problems,<br />
particularly those that are performance related. It is installed from the<br />
infinipath RPM. It displays both driver statistics and hardware counters,<br />
including both performance and "error" (including status) counters.<br />
Running ipathstats -c 10, for example, displays the number of packets and<br />
32-bit words of data being transferred on a node in each 10-second interval. This<br />
output may show differences in traffic patterns on different nodes, or at different<br />
stages of execution. See the man page for more information.<br />
lsmod<br />
When you need to find which InfiniPath and OpenFabrics modules are running,<br />
type the following command:<br />
# lsmod | egrep 'ipath_|ib_|rdma_|findex'<br />
modprobe<br />
Use this program to load/unload the drivers. You can check to see if the driver has<br />
loaded by using this command:<br />
# modprobe -v ib_qib<br />
The -v option typically only prints messages if there are problems.<br />
The configuration file that modprobe uses is /etc/modprobe.conf<br />
(/etc/modprobe.conf.local on SLES). In this file, various options and<br />
naming aliases can be set.<br />
mpirun<br />
mpirun determines whether the program is being run against a <strong>QLogic</strong> or<br />
non-<strong>QLogic</strong> driver. It is installed from the mpi-frontend RPM. Sample<br />
commands and results are shown in the following paragraphs.<br />
<strong>QLogic</strong>-built:<br />
$ mpirun -np 2 -m /tmp/id1 -d0x101 mpi_latency 1 0<br />
asus-01:0.ipath_setaffinity: Set CPU affinity to 1, port 0:2:0<br />
(1 active chips)<br />
asus-01:0.ipath_userinit: Driver is <strong>QLogic</strong>-built<br />
Non-<strong>QLogic</strong>-built:<br />
$ mpirun -np 2 -m /tmp/id1 -d0x101 mpi_latency 1 0<br />
asus-01:0.ipath_setaffinity: Set CPU affinity to 1, port 0:2:0<br />
(1 active chips)<br />
asus-01:0.ipath_userinit: Driver is not <strong>QLogic</strong>-built<br />
mpi_stress<br />
This is an MPI stress test program designed to load up an MPI interconnect with<br />
point-to-point messages while optionally checking for data integrity. By default, it<br />
runs with all-to-all traffic patterns, optionally including oneself and one’s local<br />
shared memory (shm) peers. It can also be set up with multi-dimensional grid<br />
traffic patterns; this can be parameterized to run rings, open 2D grids, closed<br />
2D grids, cubic lattices, hypercubes, and so on.<br />
Optionally, the message data can be randomized and checked using CRC<br />
checksums (strong but slow) or XOR checksums (weak but fast). The<br />
communication kernel is built out of non-blocking point-to-point calls to load up the<br />
interconnect. The program is not designed to exhaustively test out different MPI<br />
primitives. Performance metrics are displayed, but should be carefully interpreted<br />
in terms of the features enabled.<br />
This is an MPI application and should be run under mpirun or its equivalent.<br />
The following example runs 16 processes and a specified hosts file using the<br />
default options (all-to-all connectivity, 64 to 4MB messages in powers of two, one<br />
iteration, no data integrity checking):<br />
$ mpirun -np 16 -m hosts mpi_stress<br />
There are a number of options for mpi_stress; this one may be particularly useful:<br />
-P<br />
This option poisons receive buffers at initialization and after each receive,<br />
pre-initializing them with random data so that any parts that are not correctly<br />
updated with received data can be observed later.<br />
See the mpi_stress(1) man page for more information.<br />
rpm<br />
To check the contents of an installed RPM, use these commands:<br />
$ rpm -qa infinipath\* mpi-\*<br />
$ rpm -q --info infinipath # (etc)<br />
The -q option queries a package; -qa queries all installed packages. To query a<br />
package that has not yet been installed, use the -qpl option.<br />
strings<br />
Use the strings command to determine the content of, and extract text from, a<br />
binary file. For example, the command:<br />
$ strings -a /usr/lib/libinfinipath.so.4.0 | grep Date:<br />
produces this output:<br />
$Date: 2009-02-26 12:05 Release2.3 InfiniPath $<br />
NOTE:<br />
The strings command is part of binutils (a development RPM), and<br />
may not be available on all machines.<br />
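The idea behind strings can be sketched portably (this helper is an illustration only, not a replacement for the binutils tool): keep runs of printable characters from a binary file and search them.<br />

```shell
# Write a small "binary" file containing NUL bytes around an ident-style
# string, then recover the printable text, much as strings -a | grep would.
printf 'abc\0\0$Date: 2009-02-26 12:05 Release2.3 InfiniPath $\0junk' > /tmp/blob.bin
tr -c '[:print:]' '\n' < /tmp/blob.bin | grep 'Date:'
```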
Common Tasks and Commands<br />
Table I-3 lists some common commands that help with administration and<br />
troubleshooting. Note that mpirun in nonmpi mode can perform a number of<br />
checks.<br />
Table I-3. Common Tasks and Commands Summary<br />
Check the system state:<br />
ipath_checkout [options] hostsfile<br />
ipathbug-helper -m hostsfile \<br />
> ipath-info-allhosts<br />
mpirun -m hostsfile -ppn 1 \<br />
-np numhosts -nonmpi ipath_control -i<br />
Also see the file:<br />
/sys/class/infiniband/ipath*/device/status_str<br />
where * is the unit number. This file provides information about the link state, possible cable/switch problems, and hardware errors.<br />
Verify hosts via an Ethernet ping:<br />
ipath_checkout --run=1 hostsfile<br />
Verify ssh:<br />
ipath_checkout --run=2 hostsfile<br />
Show uname -a for all hosts:<br />
mpirun -m hostsfile -ppn 1 \<br />
-np numhosts -nonmpi uname -a<br />
Table I-3. Common Tasks and Commands Summary (Continued)<br />
Reboot hosts (as a root user):<br />
mpirun -m hostsfile -ppn 1 \<br />
-np numhosts -nonmpi reboot<br />
Run a command on all hosts:<br />
mpirun -m hostsfile -ppn 1 \<br />
-np numhosts -nonmpi &lt;command&gt;<br />
Examples:<br />
mpirun -m hostsfile -ppn 1 \<br />
-np numhosts -nonmpi hostname<br />
mpirun -m hostsfile -ppn 1 \<br />
-np numhosts -nonmpi date<br />
Copy a file to all hosts, using bash:<br />
$ for i in $( cat hostsfile )<br />
do<br />
scp &lt;file&gt; $i:&lt;file&gt;<br />
done<br />
Summarize the fabric components:<br />
ipathbug-helper -m hostsfile \<br />
> ipath-info-allhosts<br />
Show the status of host InfiniBand ports:<br />
ipathbug-helper -m hostsfile \<br />
> ipath-info-allhosts<br />
mpirun -m hostsfile -ppn 1 \<br />
-np numhosts -nonmpi ipath_control -i<br />
Verify that the hosts see each other:<br />
ipath_checkout --run=5 hostsfile<br />
Check MPI performance:<br />
ipath_checkout --run=7 hostsfile<br />
Generate all hosts problem report information:<br />
ipathbug-helper -m hostsfile \<br />
> ipath-info-allhosts<br />
Table Notes<br />
The " \ " indicates commands that are broken across multiple lines.<br />
Summary and Descriptions of Useful Files<br />
Useful files are summarized in Table I-4. Names in blue text are linked to a<br />
corresponding section that provides further details.<br />
Table I-4. Useful Files<br />
boardversion: File that shows the version of the chip architecture.<br />
status_str: File that verifies that the InfiniPath software is loaded and functioning.<br />
/var/log/messages: Logfile where various programs write messages. Tracks activity on your system.<br />
version: File that provides version information of installed software/drivers.<br />
boardversion<br />
It is useful to keep track of the current version of the chip architecture. You can<br />
check the version by looking in this file:<br />
/sys/class/infiniband/qib0/device/boardversion<br />
Example contents are:<br />
ChipABI 2.0,InfiniPath_QLE7280,InfiniPath1 5.2,PCI 2,SW Compat 2<br />
This information is useful for reporting problems to Technical Support.<br />
status_str<br />
NOTE:<br />
This file returns information on the form factor of the adapter that is installed. The<br />
PCIe half-height, short form factor is referred to as the QLE7140, QLE7240,<br />
QLE7280, QLE7340, or QLE7342.<br />
Check the file status_str to verify that the InfiniPath software is loaded and<br />
functioning. The file is located here:<br />
/sys/class/infiniband/qib/device/status_str<br />
Table I-5 shows the possible contents of the file, with brief explanations of the<br />
entries.<br />
Table I-5. status_str File Contents<br />
Initted: The driver has loaded and successfully initialized the IBA6110 or IBA7220 ASIC.<br />
Table I-5. status_str File Contents (Continued)<br />
Present: The IBA6110 or IBA7220 ASIC has been detected (but not initialized unless Initted is also present).<br />
IB_link_up: The InfiniBand link has been configured and is in the active state; packets can be sent and received.<br />
IB_configured: The InfiniBand link has been configured. It may or may not be up and usable.<br />
NOIBcable: Unable to detect link present. This problem can be caused by one of the following problems with the QLE7140, QLE7240, or QLE7280 adapters:<br />
• No cable is plugged into the adapter.<br />
• The adapter is connected to something other than another InfiniBand device, or the connector is not fully seated.<br />
• The switch where the adapter is connected is down.<br />
Fatal_Hardware_Error: Check the system log (default is /var/log/messages) for more information, then call Technical Support.<br />
This same directory contains other files with information related to status. These<br />
files are summarized in Table I-6.<br />
Table I-6. Status—Other Files<br />
lid: The InfiniBand LID. The address on the InfiniBand fabric, conceptually similar to an IP address for TCP/IP. Local refers to it being unique only within a single InfiniBand fabric.<br />
mlid: The Multicast Local ID (MLID), for InfiniBand multicast. Used for InfiniPath ether broadcasts, since InfiniBand has no concept of broadcast.<br />
guid: The GUID for the InfiniPath chip; it is equivalent to a MAC address.<br />
nguid: The number of GUIDs that are used. If nguids == 2 and two chips are discovered, the first chip is assigned the requested GUID (from eeprom, or ipath_sma), and the second chip is assigned GUID+1.<br />
serial: The serial number of the QLE7140, QLE7240, or QLE7280 adapter.<br />
I-20 D000046-005 B
I–Useful Programs and Files<br />
Summary of Configuration Files<br />
Table I-6. Status—Other Files (Continued)<br />
File Name<br />
unit<br />
status<br />
Contents<br />
A unique number for each card or chip in a system.<br />
The numeric version of the status_str file, described in Table I-5.<br />
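The nguid entry above implies a simple consecutive-assignment rule. The Python sketch below illustrates that arithmetic with a hypothetical helper; it is not part of the InfiniPath software, and the sample GUID value is made up for the example.

```python
# Sketch of the GUID assignment rule described for the nguid file:
# with nguids == 2 and two chips discovered, the first chip receives the
# requested GUID and the second receives GUID+1. Hypothetical helper.

def assign_guids(base_guid: int, nguids: int, chips_found: int) -> list:
    """Assign consecutive GUIDs, starting at base_guid, to discovered chips."""
    count = min(nguids, chips_found)
    return [base_guid + i for i in range(count)]

# Two chips with nguids == 2: the second chip gets GUID+1.
print([hex(g) for g in assign_guids(0x0011750000FF9A68, 2, 2)])
```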
version<br />
You can check the version of the installed InfiniPath software by looking in:<br />
/sys/class/infiniband/qib0/device/driver/version<br />
QLogic-built drivers have contents similar to:
$Id: QLogic OFED Release 1.4.2$ $Date: Fri Feb 27 16:14:31 PST 2009 $
Non-QLogic-built drivers (in this case kernel.org) have contents similar to:
$Id: QLogic kernel.org driver $
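The release number can be pulled out of a QLogic-built version string with a short script. This Python sketch assumes the "Release x.y.z" wording shown in the examples above; kernel.org-built drivers, which lack that wording, simply yield no match.

```python
import re

# Extract the release number from a QLogic-built driver version string
# (e.g. "$Id: QLogic OFED Release 1.4.2$ ..."). Sketch only; the regex
# assumes the "Release <x.y.z>" wording shown in the manual's examples.

def parse_release(version_text: str):
    match = re.search(r"Release\s+([\d.]+)", version_text)
    return match.group(1) if match else None

sample = "$Id: QLogic OFED Release 1.4.2$ $Date: Fri Feb 27 16:14:31 PST 2009 $"
print(parse_release(sample))  # a kernel.org-built string yields None instead
```

In practice, version_text would be read from /sys/class/infiniband/qib0/device/driver/version.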
Summary of Configuration Files

Table I-7 contains descriptions of the configuration and configuration template files used by the InfiniPath and OpenFabrics software.

Table I-7. Configuration Files

/etc/infiniband/qlgc_vnic.cfg
    VirtualNIC configuration file. Create this file after running ib_qlgc_vnic_query to get the information you need. This file was named /etc/infiniband/qlogic_vnic.cfg or /etc/sysconfig/ics_inic.cfg in previous releases. See the sample file qlgc_vnic.cfg.sample (described later) to see how it should be set up.

/etc/modprobe.conf
    Specifies options for modules when added or removed by the modprobe command. Also used for creating aliases. The PAT write-combining option is set here. For Red Hat systems.
/etc/modprobe.conf.local
    Specifies options for modules when added or removed by the modprobe command. Also used for creating aliases. The PAT write-combining option is set here. For SLES systems.

/etc/infiniband/openib.conf
    The primary configuration file for InfiniPath, OFED modules, and other modules and associated daemons. Automatically loads additional modules or changes the IPoIB transport type.

/etc/sysconfig/infinipath
    Contains settings, including the one that sets the ipath_mtrr script to run on reboot.

/etc/sysconfig/network/ifcfg-<NAME>
    Network configuration file for network interfaces. When used for VNIC configuration, <NAME> is in the form eiocX, where X is the device number. There will be one interface configuration file for each interface defined in /etc/infiniband/qlgc_vnic.cfg. For SLES systems.

/etc/sysconfig/network-scripts/ifcfg-<NAME>
    Network configuration file for network interfaces. When used for VNIC configuration, <NAME> is in the form eiocX, where X is the device number. There will be one interface configuration file for each interface defined in /etc/infiniband/qlgc_vnic.cfg. For Red Hat systems.
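As the table notes, the ifcfg-* directory differs between SLES and Red Hat. The Python sketch below is a hypothetical helper (not part of the QLogic software) that builds the expected per-interface path for a VNIC device number.

```python
# Build the expected ifcfg file path for a VNIC interface eioc<N>,
# accounting for the distribution-specific directory noted in Table I-7.
# Hypothetical helper for illustration only.

IFCFG_DIRS = {
    "sles": "/etc/sysconfig/network",
    "redhat": "/etc/sysconfig/network-scripts",
}

def vnic_ifcfg_path(distro: str, device_number: int) -> str:
    """Return the expected ifcfg path for VNIC interface eioc<device_number>."""
    return "%s/ifcfg-eioc%d" % (IFCFG_DIRS[distro], device_number)

print(vnic_ifcfg_path("redhat", 2))
```

One such file would exist for each interface defined in /etc/infiniband/qlgc_vnic.cfg.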
Sample and Template Files

qlgc_vnic.cfg.sample
    Sample VNIC config file. It can be found with the OFED documentation, or in the qlgc_vnictools subdirectory of the QLogic OFED Host Software download. It is also installed in /etc/infiniband.

/usr/share/doc/initscripts-*/sysconfig.txt
    File that explains many of the entries in the configuration files. For Red Hat systems.
J Recommended Reading
Reference material for further reading is provided in this appendix.<br />
References for MPI<br />
The MPI Standard specification documents are located at:<br />
http://www.mpi-forum.org/docs<br />
The MPICH implementation of MPI and its documentation are located at:<br />
http://www-unix.mcs.anl.gov/mpi/mpich/<br />
The ROMIO distribution and its documentation are located at:<br />
http://www.mcs.anl.gov/romio<br />
Books for Learning MPI Programming<br />
Gropp, William, Ewing Lusk, and Anthony Skjellum, Using MPI, Second Edition,<br />
1999, MIT Press, ISBN 0-262-57134-X<br />
Gropp, William, Ewing Lusk, and Anthony Skjellum, Using MPI-2, Second Edition,<br />
1999, MIT Press, ISBN 0-262-57133-1<br />
Pacheco, Parallel Programming with MPI, 1997, Morgan Kaufmann Publishers, ISBN 1-55860
Reference and Source for SLURM

The open-source resource manager designed for Linux clusters is located at:
http://www.llnl.gov/linux/slurm/

InfiniBand

The InfiniBand specification can be found at the InfiniBand Trade Association site:
http://www.infinibandta.org/

OpenFabrics

Information about the Open InfiniBand Alliance is located at:
http://www.openfabrics.org
Clusters<br />
Gropp, William, Ewing Lusk, and Thomas Sterling, Beowulf Cluster Computing with Linux, Second Edition, 2003, MIT Press, ISBN 0-262-69292-9

Networking

The Internet Frequently Asked Questions (FAQ) archives contain an extensive Request for Comments (RFC) section. Numerous documents on networking and configuration can be found at:
http://www.faqs.org/rfcs/index.html

Rocks

Extensive documentation on installing Rocks and custom Rolls can be found at:
http://www.rocksclusters.org/
Other <strong>Software</strong> Packages<br />
Environment Modules is a popular package to maintain multiple concurrent<br />
versions of software packages and is available from:<br />
http://modules.sourceforge.net/<br />
Corporate Headquarters QLogic Corporation 26650 Aliso Viejo Parkway Aliso Viejo, CA 92656 949.389.6000 www.qlogic.com
International Offices UK | Ireland | Germany | India | Japan | China | Hong Kong | Singapore | Taiwan
© 2010 QLogic Corporation. Specifications are subject to change without notice. All rights reserved worldwide. QLogic and the QLogic logo are registered trademarks of QLogic Corporation. All other brand and product names are trademarks or registered trademarks of their respective owners. Information supplied by QLogic Corporation is believed to be accurate and reliable. QLogic Corporation assumes no responsibility for any errors in this brochure. QLogic Corporation reserves the right, without notice, to make changes in product design or specifications.