## **INTEL® SCALABLE SYSTEM FRAMEWORK**

A CONFIGURABLE DESIGN PHILOSOPHY EXTENSIBLE TO A WIDE RANGE OF WORKLOADS



- Small Clusters Through Supercomputers
- Compute and Data-Centric Computing
- Standards-Based Programmability
- On-Premise and Cloud-Based

- Intel<sup>®</sup> Xeon<sup>®</sup> Processors
- Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processors
- Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessors
- Intel<sup>®</sup> Solutions for Lustre\*
- Intel<sup>®</sup> SSDs
- Intel<sup>®</sup> Optane<sup>™</sup> Technology
- 3D XPoint<sup>™</sup> Technology
- Intel<sup>®</sup> Omni-Path Architecture
- Intel<sup>®</sup> True Scale Fabric
- Intel<sup>®</sup> Ethernet
- Intel<sup>®</sup> Silicon Photonics
- HPC System Software Stack
- Intel<sup>®</sup> Software Tools
- Intel<sup>®</sup> Cluster Ready Program
- Intel<sup>®</sup> Visualization Toolkit

## INTEL<sup>®</sup> HPC DEVELOPER CONFERENCE





# INTEL 100G OMNI-PATH FABRIC - ITOC2016

**YANG YANGUO** 

May 2016

## Agenda

- **Quick Overview: HPC Fabrics**
- What is Intel<sup>®</sup> 100Gb Omni-Path Architecture (OPA)?
- Why Intel 100Gb OPA?
- Summary



# **QUICK OVERVIEW: HPC FABRICS**

## What is Different Between Networks and Fabrics?

**Network:** Universal interconnect designed to allow any-and-all systems to communicate

**HPC Fabric:** Optimized interconnect that allows many nodes to perform as a single system



#### Key NETWORK (Ethernet) Attributes:

- Flexibility for any application
- Designed for universal communication
- Extensible configuration
- Multi-vendor components

#### **Key FABRIC Attributes:**

- Targeted for specific applications
- Optimized for performance and efficiency
- Engineered topologies
- Single-vendor solutions



## Fabric: InfiniBand\* and OPA

InfiniBand/OPA is a multi-lane, high-speed serial interconnect (Copper or Fiber)

- Typically presented as a 4x solution
- Speeds: 40Gb/s (Mellanox & Intel QDR), 56Gb/s (Mellanox FDR), 100Gb/s (EDR & Intel OPA)
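Because these links are typically four lanes wide, the nominal per-lane rate follows from dividing each quoted speed by four. The following is a minimal sketch of that arithmetic using only the speeds listed above; the dictionary and function names are illustrative and not part of any Intel tool.

```python
# Nominal link speeds from the slide, in Gb/s, for a 4x (four-lane) link.
LINK_SPEEDS_GBPS = {"QDR": 40, "FDR": 56, "EDR / Intel OPA": 100}
LANES = 4  # "typically presented as a 4x solution"


def per_lane_gbps(link_gbps: float, lanes: int = LANES) -> float:
    """Return the nominal per-lane data rate of a multi-lane link."""
    return link_gbps / lanes


for name, speed in LINK_SPEEDS_GBPS.items():
    print(f"{name}: {speed} Gb/s link = {LANES} lanes x {per_lane_gbps(speed):g} Gb/s")
```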

High bandwidth, low latency HPC interconnect for commodity servers

- Ethernet switch latency is typically measured in µs, but InfiniBand/OPA is in <u>nanoseconds</u>
- Lower CPU load
- Lower cost than Ethernet
  - 100GbE is priced at several \$1,000s per switch port
  - 100Gb OPA is ~\$1k per switch port (target for Intel<sup>®</sup> OPA list pricing)



## Major HPC Fabric Components

#### Host Channel Adapter (HCA) / Intel® OPA Card (Host Fabric Interface, HFI)

Terminates a fabric link and executes transport-level functions

#### Switch

Routes packets from one link to another of the same Subnet

#### Cables

- Copper cables are typical; longer connections use optical/fiber cables
- Connectors are QSFP/QSFP28

#### Subnet Manager

 Discovers and configures attached devices and manages the fabric







Server nodes with PCIe cards

## **HPC Fabric Configurations**

## Fat Tree [most popular]:

Network supports Full Bisectional Bandwidth (FBB) between a pair of nodes



## Oversubscribed Fat Tree [next most popular]:

Constant Bisectional Bandwidth (CBB) can be less than FBB between a pair of nodes.



Node BW > Core BW
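One way to quantify "Node BW > Core BW" is the oversubscription ratio of an edge switch: node-facing bandwidth divided by core-facing bandwidth. The sketch below assumes all ports run at the same link rate, so the ratio reduces to a port count ratio. The port splits shown are hypothetical examples, not configurations taken from the slides.

```python
from fractions import Fraction


def oversubscription(node_ports: int, uplink_ports: int) -> Fraction:
    """Ratio of node-facing to core-facing bandwidth on one edge switch.

    Assumes every port runs at the same link rate, so the bandwidth
    ratio reduces to a port-count ratio.
    """
    if uplink_ports <= 0:
        raise ValueError("an edge switch needs at least one uplink")
    return Fraction(node_ports, uplink_ports)


# Hypothetical splits of a 48-port edge switch (illustrative only).
for nodes, uplinks in [(24, 24), (32, 16), (36, 12)]:
    ratio = oversubscription(nodes, uplinks)
    kind = "full bisectional" if ratio == 1 else "oversubscribed"
    print(f"{nodes} down / {uplinks} up -> {ratio.numerator}:{ratio.denominator} ({kind})")
```

A ratio of 1:1 corresponds to the fat tree with full bisectional bandwidth; anything larger is an oversubscribed tree.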

## The Intel® Fabric Product Roadmap Vision



Forecast and Estimations, in Planning & Targets

- Establish in HPC with the first-generation Intel<sup>®</sup> Omni-Path Architecture
- Expand to broader market segments in successive generations

Potential future options, subject to change without notice. All timeframes, features, products and dates are preliminary forecasts and subject to change without further notification.



# INTEL® OMNI-PATH FABRIC: 100G OPA

Intel Confidential - CNDA Required

# **INTEL® 100G OMNI-PATH** Evolutionary approach, revolutionary features, end-to-end products



A complete end-to-end product line



# Fabric Solutions Powered by Intel® Omni-Path Architecture

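|                                                                          | PCIe x8 Adapter                                                                  | PCIe x16 Adapter                                                                 |
|--------------------------------------------------------------------------|----------------------------------------------------------------------------------|----------------------------------------------------------------------------------|
| Intel Part #                                                             | 100HFA018LS<br>100HFA018FS                                                       | 100HFA016LS<br>100HFA016FS                                                       |
| Description                                                              | Single-port PCIe x8 Adapter, Low Profile and Std Height                          | Single-port PCIe x16 Adapter, Low Profile and Std Height                         |
| Availability<sup>1</sup>                                                 | Q2'16                                                                            | Q2'16                                                                            |
| Speed                                                                    | 58 Gbps                                                                          | 100 Gbps                                                                         |
| Ports, Media                                                             | Single port, QSFP28                                                              | Single port, QSFP28                                                              |
| Form Factor                                                              | Low profile PCIe<br>Std Height PCIe                                              | Low profile PCIe<br>Std Height PCIe                                              |
| Features                                                                 | Passive thermal – QSFP heatsink, supports up to Class 4 max optical transceivers | Passive thermal – QSFP heatsink, supports up to Class 4 max optical transceivers |
| Sandy Bridge                                                             | ✗                                                                                | ✗                                                                                |
| Ivy Bridge                                                               | ✗                                                                                | ✗                                                                                |
| Intel<sup>®</sup> Xeon<sup>®</sup> processor E5-2600 v3 (Haswell-EP)     | ✓                                                                                | ✓                                                                                |
| Intel<sup>®</sup> Xeon<sup>®</sup> processor E5-2600 v4 (Broadwell-EP)   | ✓                                                                                | ✓                                                                                |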

|                           | Edge Switches                                                         |                                                                       | Director Switches                                               |                                                                |                                         |                                                        |                                            |  |  |
|---------------------------|-----------------------------------------------------------------------|-----------------------------------------------------------------------|-----------------------------------------------------------------|----------------------------------------------------------------|-----------------------------------------|--------------------------------------------------------|--------------------------------------------|--|--|
| Intel Part #              | 100SWE48UF2 / R2<br>100SWE48QF2 / R2                                  | 100SWE24UF2 / R2<br>100SWE24QF2 / R2                                  | 100SWD24B1N<br>100SWD24B1D<br>100SWD24B1A                       | 100SWD06B1N<br>100SWD06B1D<br>100SWD06B1A                      | 100SWDLF32Q                             | 100SWDSPINE                                            | 100SWDMGTSH                                |  |  |
| Description               | 48 Port Edge Switch<br>("Q" = mgmt card)                              | 24 Port Edge Switch<br>("Q" = mgmt card)                              | 24-slot Director Class<br>Switch, Base Config                   | 6-slot Director Class<br>Switch, Base Config                   | Director Class Switch<br>Leaf Module    | Director Class Switch<br>Spine Module                  | Director Class Switch<br>Management Module |  |  |
| Availability <sup>1</sup> | Q2'16                                                                 | Q2'16                                                                 | Q2'16                                                           | Q2'16                                                          | Q2'16                                   | Q2'16                                                  | Q2'16                                      |  |  |
| Speed                     | 100 Gbps                                                              | 100 Gbps                                                              | 100 Gbps                                                        | 100 Gbps                                                       | 100 Gbps                                | 100 Gbps                                               | 100 Gbps                                   |  |  |
| Max External<br>Ports     | 48                                                                    | 24                                                                    | 768                                                             | 192                                                            | 32                                      | N/A                                                    | N/A                                        |  |  |
| Media                     | QSFP28                                                                | QSFP28                                                                | 10/100/1000 Base-T<br>USB Gen2                                  | 10/100/1000 Base-T<br>USB Gen2                                 | QSFP28                                  | Internal high speed connections                        | 10/100/1000 Base-T<br>USB Gen2             |  |  |
| Form Factor               | 1U                                                                    | 1U                                                                    | 20U                                                             | 7U                                                             | Half-width module<br>2 modules per leaf | Full width module,<br>2 boards/module                  | Half-width module                          |  |  |
| Features                  | Forward / reverse<br>airflow and mgmt<br>card options,<br>up to 2 PSU | Forward / reverse<br>airflow and mgmt<br>card options,<br>up to 2 PSU | Up to 2 mgmt<br>modules, up to 12<br>PSUs, AC and DC<br>options | Up to 2 mgmt<br>modules, up to 6<br>PSUs, AC and DC<br>options | Hot swappable                           | 96 internal mid-plane<br>connections,<br>hot swappable | N+1 redundancy,<br>hot swappable           |  |  |

| Cable Type           | Length | Intel Part #                        |
|----------------------|--------|-------------------------------------|
| Passive Copper Cable | 0.5 m  | 100CQQF3005<br>100CQQH3005 (30 AWG) |
| Passive Copper Cable | 1.0 m  | 100CQQF3010<br>100CQQH3010 (30 AWG) |
| Passive Copper Cable | 1.5 m  | 100CQQH2615 (26 AWG)                |
| Passive Copper Cable | 2.0 m  | 100CQQH2620 (26 AWG)                |
| Passive Copper Cable | 3.0 m  | 100CQQH2630 (26 AWG)                |
| Active Optical Cable | 3.0 m  | 100FRRF0030                         |
| Active Optical Cable | 5.0 m  | 100FRRF0050                         |
| Active Optical Cable | 10 m   | 100FRRF0100                         |
| Active Optical Cable | 15 m   | 100FRRF0150                         |
| Active Optical Cable | 20 m   | 100FRRF0200                         |
| Active Optical Cable | 30 m   | 100FRRF0300                         |
| Active Optical Cable | 50 m   | 100FRRF0500                         |
| Active Optical Cable | 100 m  | 100FRRF1000                         |

<sup>1</sup> Production Readiness / General Availability dates


# Intel<sup>®</sup> Omni-Path Edge Switch

## 100 Series 24/48 Port: Features<sup>1</sup>

### **Compact Space (1U)**

- 1.7"H x 17.3"W x 16.8"L

### Switching Capacity

- 4.8/9.6 Tb/s switching capability
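The quoted capacities are consistent with counting every 100 Gb/s port in both directions. That reading is an assumption; the slide does not state how the figure is derived. A minimal sketch of the arithmetic, with an illustrative function name:

```python
LINK_RATE_GBPS = 100  # Intel OPA link rate per port, per direction


def switching_capacity_tbps(ports: int, link_gbps: float = LINK_RATE_GBPS) -> float:
    """Aggregate capacity in Tb/s, counting both directions of every port."""
    return ports * link_gbps * 2 / 1000


for ports in (24, 48):
    print(f"{ports} ports: {switching_capacity_tbps(ports)} Tb/s")
```

The same function gives 38.4 and 153.6 Tb/s for 192 and 768 ports, which matches the figures quoted for the 6-slot and 24-slot director systems.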

### Line Speed

- 100Gb/s Link Rate

### **Standards-based Hardware Connections**

- QSFP28

### Redundancy

- N+N redundant power supplies (optional)
- N+1 cooling fans (speed control, customer-changeable forward/reverse airflow)

### Management Module (optional)

### No externally pluggable FRUs

<sup>1</sup>Specifications contained in public Product Briefs.

| Model    | Copper: Typical | Copper: Maximum | Optical (3W QSFP): Typical | Optical (3W QSFP): Maximum |
|----------|-----------------|-----------------|----------------------------|----------------------------|
| 24-Ports | 146W            | 179W            | 231W                       | 264W                       |
| 48-Ports | 186W            | 238W            | 356W                       | 408W                       |

24-port Edge Switch





#### 48-port Edge Switch

This presentation discusses devices that have not been authorized as required by the rules of the Federal Communications Commission, including all Intel<sup>®</sup> Omni-Path Architecture devices. These devices are not, and may not be, offered for sale or lease, or sold or leased, until authorization is obtained.





# Intel® OPA Director Class Systems 100 Series

## 6-Slot/24-Slot Systems<sup>1</sup>

### **Highly Integrated**

- 7U/20U plus 1U Shelf

### **Switching Capacity**

- 38.4/153.6 Tb/s switching capability

### **Common Features**

- Intel<sup>®</sup> Omni-Path Fabric Switch Silicon 100 Series (100Gb/s)
- Standards-based Hardware Connections QSFP28
- Up to Full bisectional bandwidth Fat Tree internal topology
- Common Management Card w/Edge Switches
- 32-Port QSFP28-based Leaf Modules
- Air-cooled, front to back (cable side) air cooling
- Hot-Swappable Modules
  - Leaf, Spine, Management, Fan, Power Supply
- Module Redundancy
  - Management (N+1), Fan (N+1, Speed Controlled), PSU (DC, AC/DC)
- System Power: 180-240 VAC

| Model   | Copper: Typical | Copper: Maximum | Optical (3W QSFP): Typical | Optical (3W QSFP): Maximum |
|---------|-----------------|-----------------|----------------------------|----------------------------|
| 6-Slot  | 1.6kW           | 2.3kW           | 2.4kW                      | 3.0kW                      |
| 24-Slot | 6.8kW           | 8.9kW           | 9.5kW                      | 11.6kW                     |
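System power is easier to compare across chassis sizes when it is normalized per external port. The sketch below divides the typical copper power figures from the table by the maximum external port counts given earlier in the deck (192 and 768). The data structure and function name are illustrative.

```python
# Typical copper power (watts) from the table above and maximum
# external ports from the product overview table.
DIRECTORS = {
    "6-Slot": {"typical_w": 1600, "ports": 192},
    "24-Slot": {"typical_w": 6800, "ports": 768},
}


def watts_per_port(total_w: float, ports: int) -> float:
    """Average power per external port for a fully populated chassis."""
    return total_w / ports


for model, spec in DIRECTORS.items():
    per_port = watts_per_port(spec["typical_w"], spec["ports"])
    print(f"{model}: {per_port:.2f} W per port")
```

For the 24-slot system this works out to roughly 8.85 W per port, the same value that appears in the product comparison matrix.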

#### 6-Slot Director Switch





#### 24-Slot Director Switch




## Intel<sup>®</sup> Omni-Path Host Fabric Interface 100 Series Single Port<sup>1</sup>

#### Low Profile PCIe Card

- 2.71"x 6.6" max. Spec compliant.
- Standard and low profile brackets

#### Wolf River (WFR-B) HFI ASIC

#### PCIe Gen3

### Single 100 Gb/s Intel® OPA port

- QSFP28 Form Factor
- Supports multiple optical transceivers
- Single Link status LED (Green)

| Model   | Copper: Typical | Copper: Maximum | Optical (3W QSFP): Typical | Optical (3W QSFP): Maximum |
|---------|-----------------|-----------------|----------------------------|----------------------------|
| x16 HFI | 7.4W            | 11.7W           | 10.6W                      | 14.9W                      |
| x8 HFI  | 6.3W            | 8.3W            | 9.5W                       | 11.5W                      |

### Thermal

- Passive thermal QSFP Port Heatsink
- Standard 55C, 200lfm environment

<sup>1</sup>Specifications contained in public Product Briefs



#### x16 HFI (100Gb Throughput)

#### x8 HFI (~58Gb Throughput, PCIe Limited)
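The "PCIe limited" note can be checked against the raw PCIe Gen3 rate: 8 GT/s per lane with 128b/130b line encoding. The sketch below computes that raw rate for x8 and x16 slots and compares it with the throughput quoted on the slide. The slide does not break down the remaining protocol overhead, so the efficiency figure printed here is simply the ratio of the two numbers, not a measured value.

```python
GEN3_GT_PER_LANE = 8.0           # PCIe Gen3 signalling rate per lane, GT/s
ENCODING_EFFICIENCY = 128 / 130  # PCIe Gen3 128b/130b line encoding
OPA_LINK_GBPS = 100.0            # Intel OPA link rate
X8_QUOTED_GBPS = 58.0            # x8 HFI throughput quoted on the slide


def pcie_gen3_raw_gbps(lanes: int) -> float:
    """Raw PCIe Gen3 bandwidth in Gb/s after line encoding, per direction."""
    return lanes * GEN3_GT_PER_LANE * ENCODING_EFFICIENCY


for lanes in (8, 16):
    raw = pcie_gen3_raw_gbps(lanes)
    limiter = "PCIe" if raw < OPA_LINK_GBPS else "OPA link"
    print(f"x{lanes}: {raw:.2f} Gb/s raw, limited by {limiter}")

# Share of the raw x8 rate that the quoted ~58 Gb/s represents.
print(f"x8 quoted/raw: {X8_QUOTED_GBPS / pcie_gen3_raw_gbps(8):.1%}")
```

An x8 slot tops out near 63 Gb/s before packet overhead, which is why the x8 card cannot reach the 100 Gb/s link rate, while an x16 slot has headroom above it.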






Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. \*Other names and brands may be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without any notice. Copyright © 2015, Intel Corporation.

## Intel<sup>®</sup> Omni-Path Architecture Fabric Cabling Topology



## Host Layer Optimization: Optimize HPC Code Path and Generational Compatibility






# PERFORMANCE

## Intel<sup>®</sup> OPA MPI Performance Measurements

| Metric                                         | Intel® Xeon® CPU E5-2697 v3 with Intel® Omni-Path Fabric<sup>1</sup> | Intel® Xeon® CPU E5-2699 v4 with Intel® Omni-Path Fabric<sup>4</sup> |
|------------------------------------------------|----------------------------------------------------------------------|----------------------------------------------------------------------|
| **LATENCY**                                    |                                                                      |                                                                      |
| OSU Latency Test (8B)                          |                                                                      |                                                                      |
| Latency (one-way, b2b nodes) <sup>2</sup>      | 790 ns                                                               |                                                                      |
| Latency (one-way, 1 switch) <sup>2</sup>       | 900 ns                                                               |                                                                      |
| **MESSAGING RATES (rank = rank pairs)**        |                                                                      |                                                                      |
| OSU Message Bandwidth Test (8B, streaming)     |                                                                      |                                                                      |
| Message Rate (1 rank, uni-dir) <sup>3</sup>    | 5.3 M msg/s                                                          |                                                                      |
| Message Rate (1 rank, bi-dir) <sup>3</sup>     | 6.3 M msg/s                                                          |                                                                      |
| Message Rate (max ranks, uni-dir) <sup>3</sup> | 108 M msg/s                                                          | 143 M msg/s                                                          |
| Message Rate (max ranks, bi-dir) <sup>3</sup>  | 132 M msg/s                                                          | 172 M msg/s                                                          |
| **BANDWIDTH (rank = rank pairs)**              |                                                                      |                                                                      |
| OSU Message Bandwidth Test (512 KB, streaming) |                                                                      |                                                                      |
| BW (1 rank, 1 port, uni-dir) <sup>3</sup>      | 12.3 GB/s                                                            |                                                                      |
| BW (1 rank, 1 port, bi-dir) <sup>3</sup>       | 24.5 GB/s                                                            |                                                                      |

All tests performed by Intel with OSU OMB 4.4.1.
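Two derived quantities help put these measurements in context: the latency added by one switch hop, and how close the measured bandwidth comes to the 100 Gb/s link rate. The sketch below computes both from the E5-2697 v3 figures in the table. It assumes 1 GB/s means 10^9 bytes per second, which the slide does not state.

```python
LINK_RATE_GBPS = 100.0        # Intel OPA link rate

# Measured values from the table above (E5-2697 v3 column).
B2B_LATENCY_NS = 790          # one-way, back-to-back nodes
ONE_SWITCH_LATENCY_NS = 900   # one-way, through one switch
UNI_BW_GBYTES = 12.3          # 1 rank, 1 port, uni-directional

# Latency added by inserting a single switch between the two nodes.
switch_hop_ns = ONE_SWITCH_LATENCY_NS - B2B_LATENCY_NS

# Convert GB/s to Gb/s (assumes 1 GB = 10**9 bytes) and compare to link rate.
uni_bw_gbits = UNI_BW_GBYTES * 8
link_utilisation = uni_bw_gbits / LINK_RATE_GBPS

print(f"Switch hop adds {switch_hop_ns} ns")
print(f"Uni-directional BW: {uni_bw_gbits:.1f} Gb/s ({link_utilisation:.1%} of link rate)")
```

The 110 ns difference falls within the 100-110 ns edge switch latency range listed in the product comparison matrix.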

1 Intel<sup>®</sup> Xeon<sup>®</sup> processor E5-2697 v3 with Intel<sup>®</sup> Turbo-Mode enabled. 8x8GB DDR4 RAM, 2133 MHz. RHEL7.0.

2 osu\_latency 1-8B msg. w/ and w/out switch. Open MPI 1.10.0-hfi packaged with IFS 10.0.0.0.625.

3 osu\_mbw\_mr modified for bi-directional bandwidth measurement. w/switch Open MPI 1.10.0-hfi packaged with IFS 10.0.0.0625. IOU Non-Posted Prefetch disabled in BIOS. snp\_holdoff\_cnt=9 in BIOS.

4 Intel® Xeon® processor E5-2699v4 with Intel® Turbo-Mode enabled. 8x8GB DDR4 RAM, 2133 MHz. RHEL7.0. IFS 10.0.0.991.35. Open MPI 1.8.5-hfi. B0 Intel® OPA hardware and beta level software.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit <a href="http://www.intel.com/performance">http://www.intel.com/performance</a>.



## Intel<sup>®</sup> OPA MPI Performance Improvements



Tests performed by Intel on Intel® Xeon® Processor E5-2697v3 dual-socket servers with 2133 MHz DDR4 memory. Turbo mode enabled and hyper-threading disabled. Ohio State Micro Benchmarks v. 4.4.1.

- Intel OPA: Open MPI 1.10.0 with PSM2. Intel Corporation Device 24f0 – Series 100 HFI ASIC. OPA Switch: Series 100 Edge Switch – 48 port. IOU Non-posted Prefetch disabled in BIOS.
- EDR: Open MPI 1.8-mellanox released with hpcx-v1.3.336-icc-MLNX\_OFED\_LINUX-3.0-1.0.1-redhat6.6-x86\_64.tbz. MXM\_TLS=self,rc tuning. Mellanox EDR ConnectX-4 Single Port Rev 3 MCX455A HCA. Mellanox SB7700 - 36 Port EDR Infiniband switch.
- Intel® True Scale: Open MPI. QLG-QLE-7342(A), 288 port True Scale switch.

1. osu\_latency 8 B message.
2. osu\_bw 1 MB message.
3. osu\_mbw\_mr, 8 B message (uni-directional), 28 MPI rank pairs.




#### Intel® Xeon® Processor E5-2600 v4 Product Family

## ANSYS\*

#### Fluent\* 17 Computational Fluid Dynamics

"Thanks to Intel® **OPA** and the latest Intel® **Xeon**® **E5-2600 v4** product family, ANSYS Fluent\* is able to achieve performance levels <u>beyond our expectations</u>. Its unrivaled performance enables our customers to simulate higher-fidelity models without having to expand their cluster nodes ."<sup>1</sup>

#### Dr. Wim Slagter - Director of HPC and cloud marketing, ANSYS

- Intel<sup>®</sup> Omni-Path Architecture (Intel<sup>®</sup> OPA) is a powerful low latency communications interface specifically designed for High Performance Computing.
- Cluster users will get better utilization of cluster nodes through better scaling.
- Cluster performance means better time-to-solution on CFD simulations.
- Coupled with Intel<sup>®</sup> MPI, and utilizing standard Fluent runtime options to access TMI, Fluent is ready and proven for out-of-the-box performance on Intel OPA-ready clusters.

Up to 55% performance advantage with Intel<sup>®</sup> OPA compared to FDR fabric on a 32 node cluster



#### www.ansys.com

#### **Technical Computing**



<sup>1</sup> Testing conducted on ISV\* software on 2S Intel<sup>®</sup> Xeon<sup>®</sup> Processor E5-2697 v4 comparing Intel<sup>®</sup> OPA to FDR InfiniBand\* fabric. Testing done by Intel. For complete testing configuration details, <u>go here</u>.

## Real Application Performance\* - Intel® OPA vs EDR/MXM-FCA

\*SPEC MPI2007 Intel internal measurements marked estimates until published



Tests performed by Intel on Intel® Xeon® Processor E5-2697v3 dual-socket servers with 2133 MHz DDR4 memory. 16 nodes/448 MPI ranks. Turbo mode and hyper-threading disabled. Intel® OPA: I




# **COST BENEFITS**

# Intel<sup>®</sup> Omni-Path Fabric's **48 Radix Chip**

It's more than just a 33% increase in port count over a 36 Radix chip
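The reason radix matters beyond raw port count is that fat-tree size grows as a power of the radix. The sketch below uses the standard folded-Clos sizing rule, in which a full bisectional bandwidth fat tree with `tiers` switch levels built from identical `radix`-port switches supports at most 2 × (radix / 2)^tiers end nodes. This rule is general networking background and is not taken from the slide, whose figure did not survive extraction.

```python
def max_nodes(radix: int, tiers: int) -> int:
    """Maximum end nodes in a full-bisection fat tree of identical switches.

    Standard folded-Clos sizing: each switch splits its ports evenly
    between the level below and the level above, except the top level,
    which points every port downward.
    """
    if radix % 2 != 0 or tiers < 1:
        raise ValueError("radix must be even and tiers must be >= 1")
    return 2 * (radix // 2) ** tiers


for tiers in (1, 2, 3):
    n48, n36 = max_nodes(48, tiers), max_nodes(36, tiers)
    print(f"{tiers} tier(s): radix 48 -> {n48:>6}, radix 36 -> {n36:>6}, "
          f"ratio {n48 / n36:.2f}x")
```

The port-count advantage of 1.33x at a single switch grows to about 1.78x at two tiers and 2.37x at three tiers under this model.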



1- Latency numbers based on Mellanox CS7500 Director Switch and Mellanox SB7700/SB7790 Edge switches. See www.Mellanox.com for more product information.




## Are You Leaving **Performance** on the Table?



<sup>1</sup> Configuration assumes a 750-node cluster, and number of switch chips required is based on a full bisectional bandwidth (FBB) Fat-Tree configuration. Intel<sup>®</sup> OPA uses one fully-populated 768-port director switch, and Mellanox EDR solution uses a combination of 648-port director switches and 36-port edge switches. Mellanox component pricing from www.kernelsoftware.com, with prices as of November 3, 2015. Compute node pricing based on Dell PowerEdge R730 server from www.dell.com, with prices as of May 26, 2015. Intel<sup>®</sup> OPA pricing based on estimated reseller pricing based on projected Intel MSRP pricing at time of launch. \* Other names and brands may be claimed as property of others.



## **CPU-Fabric Integration**

with the Intel<sup>®</sup> Omni-Path Architecture




## Intel<sup>®</sup> OPA HFI Option Comparison

|                            | PCIe Card x8<br>(Chippewa Forest) | PCIe Card x16<br>(Chippewa Forest)             | Knights Landing-F                        | Skylake-F<br>(single -F CPU populated)   | Skylake-F<br>(two -F CPUs populated)     | Notes                                                                                                                              |
|----------------------------|-----------------------------------|------------------------------------------------|------------------------------------------|------------------------------------------|------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------|
| Ports per node             | 1                                 | 1                                              | 2                                        | 1                                        | 2                                        | Assumes single CHF card … multiple cards in a single …                                                                             |
| Peak bandwidth             | 7.25 GB/s                         | 12.5 GB/s                                      | 25 GB/s                                  | 12.5 GB/s                                | 25 GB/s                                  | Total platform bandwidth                                                                                                           |
| Latency                    | 1 us                              | 1 us                                           | 1 us                                     | 1 us                                     | 1 us                                     | No measurable difference in MPI latency expected since both use a PCIe interface                                                   |
| CPU TDP adder              | n/a                               | n/a                                            | 15W                                      | 0W, 10W, or 15W                          | 0W, 10W, or 15W                          | TDP adder per socket, dependent on SKL-F SKU                                                                                       |
| Power                      | 6.3W typ<br>8.3W max              | 7.4W typ<br>11.7W max                          | n/a                                      | n/a                                      | n/a                                      | Estimated power numbers with passive Cu cables                                                                                     |
| PCIe slot required?        | Yes                               | Yes                                            | No                                       | No                                       | No                                       | Custom mezz card mechanically attached to board or chassis. Requires power and sideband cables                                     |
| PCIe slot option           |                                   | Low profile x16 PCIe slot, or custom mezz card | PCIe carrier card with x4 PCIe connector | PCIe carrier card with x4 PCIe connector | PCIe carrier card with x4 PCIe connector | SKL-F (dual -F CPU) can … carrier card, similar to KNL PCIe carrier card. Carrier card requires a P… power, but not necessari…     |
| PCIe lanes used (on board) | 8                                 | 16                                             | 32<br>[4 lanes available]                | 0                                        | 0                                        | SKL-F includes dedicated … Assumes PCIe carrier card uses a x4 PCIe slot only routed for power and no…                            |

# **TECHNOLOGY COMPARISONS**

## **Product Comparison Matrix**

| Feature                                                                                                                                | Intel <sup>®</sup> Omni-Path                         | EDR                                                                           | Notes                                                                                            |
|----------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------|-------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------|
| Switch Specifications                                                                                                                  |                                                      |                                                                               |                                                                                                  |
| Link Speed (QSFP28)                                                                                                                    | 100Gb/s                                              | 100Gb/s                                                                       | Same Speed                                                                                       |
| Port Count: Director -<br>Edge -                                                                                                       | 192, <b>768 (66% more per 1U)</b><br>48, 24          | 216, 324, <b>648</b><br>36                                                    | + 18.5% Ports<br>+ 33% Ports                                                                     |
| Latency: Director -<br>Edge -                                                                                                          | 300-330ns (Includes PIP)<br>100-110ns (Includes PIP) | <500ns <sup>1</sup> (Should be 3 x 90ns?)<br>90ns <sup>1</sup> (FEC Disabled) | Up to 32% Advantage<br>FEC increases power up to 50% per port                                    |
| Redundant Power/Cooling                                                                                                                | Yes (Director AC and/or AC-DC Power)                 | Yes                                                                           |                                                                                                  |
| Packet Rate Per Port: Switch<br>Host                                                                                                   | 195M msg/sec<br>160M msg/sec (CPU Dependent)         | 150/195M msg/sec - Switch-IB/Switch-IB 2<br>150M msg/sec                      | Mellanox claims are not for MPI Messages.<br>Most HPC applications use MPI as transport          |
| Power Per Port <b>(<i>Typical Copper</i>)<sup>2</sup></b> :<br>– 24/18-Slot Director<br>– 48/36-Port Edge (M)<br>– 48/36-Port Edge (U) | ~8.85 Watts<br>3.87 W<br>3.48 W                      | 14.1 Watts<br>3.78 W<br>3.78 W                                                | <b>37.2% Lower Power</b><br>EDR Power for FEC and Mgmt Card missing<br>EDR Power for FEC missing |
| Director Leaf Module: Size/Qty                                                                                                         | 32 / <b>(24-Slot),</b> (6-Slot)                      | <b>36</b> / (18-Slot), (6-Slot)                                               | +33% modules in single large director                                                            |
| Largest 2 Tier Fabric (Edge/Director)                                                                                                  | 18,432                                               | 11,664                                                                        | ~1.6x (QSFP28)                                                                                   |
| Host Adapter Specifications                                                                                                            |                                                      |                                                                               |                                                                                                  |
| Host Adapter Model                                                                                                                     | Intel® OPA 100 Series (HFI)                          | HCA (ConnectX-4)                                                              |                                                                                                  |
| Protocol                                                                                                                               | Intel® OPA                                           | InfiniBand                                                                    |                                                                                                  |
| Speed Support (Host)                                                                                                                   | x16 = 100Gb/s – x8 = 58Gb/s                          | All Prior IB Speeds <sup>1</sup>                                              | CX4 includes a rate locked FDR version <sup>1</sup>                                              |
| Power Per Port <b>(Typical Copper)</b> <sup>2</sup> :<br>– 1-Port x16 HFI<br>– 1-Port x8 HFI                                           | 7.4 W Copper<br>6.3 W Copper                         | 13.9 W Copper                                                                 | 46.7% Lower Power                                                                                |

<sup>1</sup> Mellanox datasheets, December 19, 2015. <sup>2</sup> Power ratings assume fully loaded systems.
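Several of the ratios in the Notes column can be re-derived from the table's own figures. The sketch below is a rough check, not Intel's published method. It assumes (the slide does not say this) that the "Largest 2 Tier Fabric" number comes from edge switches that use half their ports for hosts and run one uplink to each director, so the port count of a single director caps the number of edge switches. All function names are illustrative.

```python
def two_tier_max_hosts(director_ports: int, edge_radix: int) -> int:
    """Estimate the host count of a two-tier edge/director fabric.

    Assumption (not from the slide): each edge switch uses half of its
    ports for hosts and sends one uplink to each director, so a director
    with `director_ports` ports can serve at most that many edge switches.
    """
    hosts_per_edge = edge_radix // 2
    max_edge_switches = director_ports
    return max_edge_switches * hosts_per_edge


def percent_more(a: float, b: float) -> float:
    """How much larger a is than b, as a percentage of b."""
    return (a - b) / b * 100


def percent_lower(a: float, b: float) -> float:
    """How much smaller a is than b, as a percentage of b."""
    return (b - a) / b * 100


opa_hosts = two_tier_max_hosts(director_ports=768, edge_radix=48)
edr_hosts = two_tier_max_hosts(director_ports=648, edge_radix=36)

print(f"Largest 2-tier fabric: OPA {opa_hosts}, EDR {edr_hosts}, "
      f"ratio {opa_hosts / edr_hosts:.2f}x")
print(f"Director ports: +{percent_more(768, 648):.1f}%")
print(f"Edge ports: +{percent_more(48, 36):.1f}%")
print(f"Director power per port: {percent_lower(8.85, 14.1):.1f}% lower")
print(f"x16 HFI power per port: {percent_lower(7.4, 13.9):.1f}% lower")
```

Under these assumptions the model reproduces the 18,432 and 11,664 host counts and the 18.5%, 33% and 37.2% figures. The adapter power difference comes out at 46.8% when rounded to one decimal, against the 46.7% quoted in the table.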



## Intel® Omni-Path High Level Feature Comparison Matrix

| Features                                                                                     | Intel <sup>®</sup> OPA        | EDR                          | Notes                                                                                                                                      |
|----------------------------------------------------------------------------------------------|-------------------------------|------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------|
| Link Speed                                                                                   | 100Gb/s 100Gb/s               |                              | Same Link Speed                                                                                                                            |
| Switch Latency – Edge/DCS                                                                    | 100-110ns/300-330ns           | 90ns/~500ns                  | Intel® OPA includes "Load-Free" error detection <ul> <li>Application Latency Most important</li> </ul>                                     |
| MPI Latency (OSU pt2pt)                                                                      | Less Than 1µs                 | ~1µs                         | <ul><li>Similar 1 Hop Latency</li><li>Intel's OPA HFI improves with each CPU generation</li></ul>                                          |
| Link Enhancements<br>– Error Detection/Correction | Packet Integrity Protection (PIP) | | Intel OPA is a HW detection solution that adds <u>no latency or BW penalty</u> |
| Link Enhancements<br>– Data Prioritization across VLs | Traffic Flow Optimization (TFO) | | Over and above VL prioritization. Allows high-priority traffic to preempt in-flight low-priority traffic (~15% performance improvement) |
| Link Enhancements<br>– Graceful Degradation | Dynamic Lane Scaling<br>(DLS) | No | Non-disruptive lane failure. Supports asymmetrical traffic patterns. Avoids total shutdown. |
| RDMA Support                                                                                 | Yes                           | Yes                          | RDMA underpins verbs. Intel® OPA supports verbs. TID<br>RDMA brings Send/Receive HW assists for RDMA for<br>larger messages                |
| Built for MPI Semantics                                                                      | Yes – PSM (10% of code)       | No - Verbs                   | Purpose designed for HPC                                                                                                                   |
| Switch Radix                                                                                 | 48 Ports                      | 36 Ports                     | Higher Radix means less switches, power, space etc.                                                                                        |
| Fabric Router                                                                                | No                            | Future                       | Limited need to connect to older fabric technologies except for storage – Still not available                                              |

EDR source: publicly available data. Intel® OPA features: based on specifications. Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration.

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. \*Other names and brands may be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without any notice. Copyright © 2015, Intel Corporation. Potential future options, subject to change without notice. All timeframes, features, products and dates are preliminary forecasts and subject to change without further notification.
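The "Switch Radix" row in the feature matrix states that a higher radix means fewer switches. A minimal sketch of that relationship follows. It assumes a two-tier layout in which each edge switch gives half its ports to hosts, and the 1,024-node cluster size is a hypothetical example rather than a figure from the slides.

```python
import math


def edge_switches_needed(nodes: int, radix: int) -> int:
    """Edge switches required to attach `nodes` hosts.

    Assumption: half of each edge switch's ports face hosts and the
    other half are uplinks (a non-blocking two-tier layout).
    """
    hosts_per_edge = radix // 2
    return math.ceil(nodes / hosts_per_edge)


nodes = 1024  # hypothetical cluster size, for illustration only
for radix in (48, 36):
    count = edge_switches_needed(nodes, radix)
    print(f"{radix}-port edge switches for {nodes} nodes: {count}")
```

For this example the 48-port radix needs 43 edge switches where the 36-port radix needs 57, before counting the director or spine tier.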



# Switch Latency



## **Understanding Switch Latency Comparisons**



Tests performed by Intel on Intel<sup>®</sup> Xeon<sup>®</sup> Processor E5-2697v3 dual-socket servers with 2133 MHz DDR4 memory. Turbo mode enabled and hyper-threading disabled. Ohio State Micro Benchmarks v. 4.4.1. Intel OPA: Open MPI 1.10.0 with PSM2. Intel Corporation Device 24f0 – Series 100 HFI ASIC. OPA Switch: Series 100 Edge Switch – 48 port. IOU Non-posted Prefetch disabled in BIOS. EDR: Open MPI 1.8-mellanox released with hpcx-v1.3.336-icc-MLNX\_OFED\_LINUX-3.0-1.0.1-redhat6.6-x86\_64.tbz. MXM\_TLS=self,rc tuning. Mellanox EDR ConnectX-4 Single Port Rev 3 MCX455A HCA. Mellanox SB7700 - 36 Port EDR InfiniBand switch 1. osu\_latency 8 B message.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit <u>http://www.intel.com/performance.</u>



# **RDMA Support**



## Intel<sup>®</sup> Omni-Path Architecture (Intel<sup>®</sup> OPA) RDMA Support

Intel<sup>®</sup> OPA has always supported RDMA Functions for MPI-Based applications via PSM

- 16 Send DMA (SDMA) engines and Automatic Header Generation provide HW-assists for offloading large message processing from the CPU
- Intel® OPA supports RDMA for Verbs I/O
  - RDMA is the underlying protocol for Verbs
  - Storage runs over verbs
- Additional performance enhancements are coming
  - 8K MTU supported to further reduce CPU interrupts for I/O
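The effect of a larger MTU on per-packet work can be estimated with simple arithmetic. The sketch below counts the packets needed to carry one message at several MTU sizes. It ignores header overhead and assumes that per-packet processing cost, interrupts included, scales with the packet count, which is a simplification. The 2K and 4K MTUs and the 1 MiB message size are illustrative choices for comparison, not values from the slides.

```python
import math


def packets_for_message(message_bytes: int, mtu_bytes: int) -> int:
    """Packets needed to carry a message, ignoring header overhead."""
    return math.ceil(message_bytes / mtu_bytes)


message_bytes = 1024 * 1024  # 1 MiB, an illustrative I/O transfer size
for mtu in (2048, 4096, 8192):
    packets = packets_for_message(message_bytes, mtu)
    print(f"MTU {mtu:>4} B: {packets} packets")
```

Under these assumptions, doubling the MTU halves the packet count, so an 8K MTU carries the same transfer in a quarter of the packets that a 2K MTU would need.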



# **Power Usage**



## Intel® OPA vs. EDR: End-to-End Power Comparison



<sup>1</sup>Assumes that all switch ports are utilized. All power measurements are typical. All Mellanox power from 12/23/15 documents located at <u>www.mellanox.com</u>. Mellanox Switch 7790 power from datasheet. Host Adapter power from ConnectX<sup>®</sup>-4 VPI Single and Dual Port QSFP28 Adapter Card User Manual page 45. CS7500 Director power from 648-Port EDR InfiniBand Switch-IB<sup>™</sup> Switch Platform Hardware User Manual page 75.

Intel<sup>®</sup> Solutions Summit 2016





## Proven Technology Required for Today's Bids: Intel<sup>®</sup> OPA is the Future of High Performance Fabrics



**Highly Leverages** existing Intel, Aries and Intel<sup>®</sup> True Scale technologies



Innovative Features for high fabric performance, resiliency, and QoS



#### Leading Edge Integration with Intel<sup>®</sup> Xeon<sup>®</sup> processor and Intel<sup>®</sup> Xeon Phi<sup>™</sup> processor



**Robust Ecosystem** of trusted computing partners and providers



Open Source software and supports standards like the OpenFabrics Alliance\*

\*Other names and brands may be claimed as property of others.




