Performance Tuning Lab

Instructors:

Teaching Assistant: Ohad Barzilay

Bibliography

Getting the best possible performance out of an application is an important skill for every developer. Unfortunately, many applications on the market are not optimized and are less competitive as a result. In the open source community, code with major performance errors (such as using an O(N**2) algorithm where an O(N) algorithm can do the job) is less common, because many developers typically review the code. Still, there is a lot of room for improvement.

To maximize the performance of the code, the programmer can take advantage of new hardware capabilities. With the recent introduction of processors supporting thread-level parallelism, multiprocessing has become mainstream on the desktop. In the last several years, multi-core CPUs have been introduced by all the major manufacturers (IBM G5, Intel in both IA32 and IA64, AMD). Breaking the code into threads to maximize parallelism is becoming general practice in most programming domains, and not just in HPC (high performance computing) as in the past.

Processors that support thread-level parallelism contain multiple processor cores in one physical processor package. The cores may share the last-level cache and the buses. Given that processor resources are generally under-utilized, thread-level parallelism can improve overall application performance by running multiple threads in parallel to achieve higher utilization and increased throughput. Since logical processors share some of the underlying physical processor resources, interactions between threads and the level of effective threading can affect the performance of the application positively (but in some instances negatively). Taking the necessary steps allows an application to get the maximum benefit from thread-level parallelism.

Recently, several extensions with an emphasis on SIMD (single instruction, multiple data) were added to the CPU architecture. Commercial compilers added the ability to take advantage of those extensions by providing mechanisms for vector processing. By modifying the code in a way that enables vector processing of the algorithms, the programmer can gain a significant speedup in the performance-critical parts of the code.
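As a rough illustration of code written in a way that enables vector processing, the sketch below shows a simple loop that compilers can usually vectorize: unit-stride access, no dependence between iterations, and pointers declared as non-overlapping. The function name saxpy is only an illustrative choice, and the C99 restrict qualifier is an assumption about the compiler in use (some compilers spell it __restrict). Whether SIMD instructions are actually generated depends on the compiler and its options.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative helper (not part of the course material):
 * computes y[i] = a * x[i] + y[i] for i in [0, n).
 *
 * `restrict` promises the compiler that x and y do not overlap.
 * Together with the unit-stride access and the absence of any
 * dependence between iterations, this gives the compiler what it
 * needs to consider processing several elements per instruction. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    size_t i;

    assert(n == 0 || (x != NULL && y != NULL));

    for (i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Most compilers can report which loops were vectorized, which is a useful way to check that a change of this kind had the intended effect.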

In many cases, a better understanding of the micro-architectural aspects of a specific CPU, and fixing micro-architecture-related issues, can have a big impact on overall performance. Monitoring other events in addition to clock ticks and instructions retired can often reveal potential causes of what appear to be pipeline stalls. Some of these events map directly to common coding pitfalls. On Intel's Core 2 Duo processor, for example, such events include RAT_STALLS.FLAGS, RAT_STALLS.PARTIAL_COUNT, LOAD_BLOCK.STORE_OVERLAP, LOAD_BLOCK.UNTIL_RETIRE and EXT_SNOOP.ALL_AGENTS.HITM. Each of these events indicates that the source code contains certain sequences of instructions that are potentially unfriendly to the micro-architecture in one way or another. LOAD_BLOCK.STORE_OVERLAP, for example, indicates that store-to-load forwarding restrictions are not being observed.

Analyzing memory behavior and cache utilization can, in some cases, reveal huge opportunities for performance improvement by reorganizing the code to improve locality. Determining when coding pitfalls are causing performance hits is difficult without the help of an analyzer. To find out whether something in the implementation can be changed to speed things up, the students will profile the application for all the events we want to monitor, using the Intel VTune Performance Analyzer.
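A small sketch of what reorganizing code for better locality can mean: the two functions below compute the same sum over a matrix stored in row-major order, but the first walks memory sequentially while the second jumps by a full row on every access. The function names and the row-major layout are assumptions made for this example only. On large matrices the second version typically causes many more cache misses, which is the kind of difference a profiler can expose.

```c
#include <assert.h>
#include <stddef.h>

/* Sums a rows x cols matrix stored in row-major order.
 * The inner loop touches consecutive addresses (stride 1),
 * so every cache line that is fetched is fully used. */
double sum_row_order(const double *m, size_t rows, size_t cols)
{
    double total = 0.0;
    size_t r, c;

    assert(m != NULL);

    for (r = 0; r < rows; ++r) {
        for (c = 0; c < cols; ++c) {
            total += m[r * cols + c];
        }
    }
    return total;
}

/* Same result, but the inner loop strides by `cols` elements,
 * so consecutive accesses usually land in different cache lines. */
double sum_column_order(const double *m, size_t rows, size_t cols)
{
    double total = 0.0;
    size_t r, c;

    assert(m != NULL);

    for (c = 0; c < cols; ++c) {
        for (r = 0; r < rows; ++r) {
            total += m[r * cols + c];
        }
    }
    return total;
}
```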

In the performance tuning lab, the students will select one open source application and build it. They will learn how to analyze the performance of the application using tools such as the VTune(tm) Performance Analyzer, which helps developers identify performance bottlenecks in their code. The tool gives visibility into the inner workings of the application as it executes, which in turn guides the performance improvements. After identifying the significant performance areas in the code, the students will use OpenMP or the Windows threading APIs to thread their code. To use OpenMP, the students will use Microsoft Visual Studio 2005, which added OpenMP support, or the Intel(r) C++ Compiler for Windows* or Intel(r) Fortran Compiler for Windows*. After threading the application, the students will learn how to debug threading issues with the Intel(r) Thread Checker. This tool automatically locates software threading issues such as race conditions, stalls and deadlocks. The Intel Thread Checker monitors the application's execution to detect intermittent errors that are hard or nearly impossible to find otherwise.
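To give an idea of what threading a hot loop with OpenMP looks like, here is a minimal sketch. The function name dot_product is illustrative and not taken from any particular application. The reduction clause gives every thread a private partial sum and combines the partial sums at the end, which avoids a race condition on the shared accumulator. The loop index is a signed int because older OpenMP versions require a signed integer loop variable.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative example: dot product of two arrays of length n.
 * When compiled with OpenMP enabled, the iterations are divided
 * among the threads of the team; without OpenMP the pragma is
 * ignored and the loop runs sequentially with the same result. */
double dot_product(const double *a, const double *b, int n)
{
    double sum = 0.0;
    int i;

    assert(n >= 0);
    assert(n == 0 || (a != NULL && b != NULL));

    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Note that a parallel floating-point reduction may add the terms in a different order than the sequential loop, so results can differ in the last bits for general inputs.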

To analyze the performance of the threaded application, the students will use the Thread Profiler. This tool shows thread workload imbalances and helps tune the performance of applications threaded with the Windows API or OpenMP. The Thread Profiler organizes the analysis into the following categories: time spent in parallel code, in sequential code, waiting at barriers for other threads, waiting to enter critical sections or to access locks in critical sections, and holding locks; thread workload imbalance; parallel overhead; and sequential overhead.

It has a graphical display of the key parallel programming performance issues: the time threads spend in parallel regions, in sequential regions, waiting for locks, and in overhead. It helps you focus on the critical path and identify where lack of parallelism reduces performance.
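Workload imbalance, one of the issues mentioned above, often arises when loop iterations differ in cost. The sketch below is a hypothetical example: testing larger numbers for primality by trial division tends to cost more, so splitting the range into equal contiguous blocks can leave some threads with more work than others. The schedule(dynamic, 64) clause hands out chunks of 64 iterations on demand instead. The chunk size is an arbitrary assumption that would need tuning against a real measurement.

```c
#include <assert.h>

/* Trial-division primality test; the cost grows with n. */
static int is_prime(int n)
{
    int d;

    if (n < 2) {
        return 0;
    }
    for (d = 2; d <= n / d; ++d) {
        if (n % d == 0) {
            return 0;
        }
    }
    return 1;
}

/* Counts the primes in [0, limit).
 * With dynamic scheduling, a thread that finishes its chunk early
 * picks up the next one, which evens out the per-thread workload
 * at the price of some scheduling overhead. */
int count_primes_below(int limit)
{
    int count = 0;
    int i;

    assert(limit >= 0);

    #pragma omp parallel for schedule(dynamic, 64) reduction(+:count)
    for (i = 0; i < limit; ++i) {
        count += is_prime(i);
    }
    return count;
}
```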

After improving the performance of the code by threading, by removing micro-architectural bottlenecks and/or by programming with SIMD instructions (SSE, SSE2 and/or the Prescott New Instructions), the students will submit their code back to the open source tree and publish a paper describing what they did and the parallelism techniques they used. For the code to be accepted back by the open source community, it needs to meet the high programming standard of the original program.

 

The Project Milestones

 

 Every team will follow these milestones:

 

Selecting an open source program.

Tools training.

Develop a benchmark and set the baseline. The benchmark should be fully automated and repeatable, should represent the functionality of the application, and should take a significant amount of time (about one minute) to enable progress monitoring.

Analyze the performance and identify the hot spots in the time breakdown.

Understand the reasons for the observed performance (hardware events).

Define the optimization strategy (SIMD, threading ...). Carry out the optimization work and benchmark the results.

Document the work: publish a paper, document the code changes and try to submit the code back to the open source community.
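The benchmark milestone above asks for an automated, repeatable measurement. One possible shape for such a harness is sketched below: it runs a workload several times and keeps the best wall-clock time, which reduces the influence of other activity on the machine. The names best_wall_seconds and sample_workload are hypothetical, and the workload is only a stand-in for a real run of the selected application. The sketch assumes a C11 library that provides timespec_get.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef void (*workload_fn)(void);

/* Keeps the compiler from discarding the stand-in computation. */
static volatile unsigned long sink;

/* Hypothetical stand-in for the real benchmark workload. */
void sample_workload(void)
{
    unsigned long acc = 0;
    unsigned long i;

    for (i = 0; i < 1000000UL; ++i) {
        acc += i * i;   /* unsigned arithmetic wraps, which is fine here */
    }
    sink = acc;
}

/* Runs `work` `repeats` times and returns the shortest
 * wall-clock duration in seconds. */
double best_wall_seconds(workload_fn work, int repeats)
{
    double best = -1.0;
    int r;

    assert(work != NULL && repeats > 0);

    for (r = 0; r < repeats; ++r) {
        struct timespec start, stop;
        double elapsed;

        timespec_get(&start, TIME_UTC);
        work();
        timespec_get(&stop, TIME_UTC);

        elapsed = (double)(stop.tv_sec - start.tv_sec)
                + (double)(stop.tv_nsec - start.tv_nsec) / 1e9;
        if (best < 0.0 || elapsed < best) {
            best = elapsed;
        }
    }
    return best;
}
```

Wall-clock time is used rather than CPU time because, for a threaded program, CPU time is typically summed over all threads and would hide the speedup gained from parallelism.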

 

 The Project Estimated Schedule

# | Milestone             | Description                                                                         | Week (1-12) | Meeting Format
1 | Introduction          | The course goals and rules                                                          | 2           | Frontal Lecture
2 | Application approval  | Choosing the open source application. Introduction to the optimization tools        | 4           | Frontal Lecture
3 | Initial Benchmark     | Compile and run the initial benchmark. Learn the source code. Suggest optimizations | 5           | Individual meeting
4 | Optimization sessions | Use Intel tools to optimize the source code                                         | 5-10        | By email & scheduled individual meeting
5 | Deployment            | Documentation and Web publishing                                                    | ~11         |
6 | Presentation          | Present the published results                                                       | ~12         | Individual meeting

 

 

 

 

Bibliography

 

The projects will be carried out in teams of three students.

Also see the Technion link for sample projects.