Workshop In Algorithms And Data Structures
2009, Semester B

Cache Performance Of Algorithms




Course
0368-3500-22 Workshop In Computer Science
Meeting Time And Place (first few classes only!)
Wednesday, 14:00-16:00. Kaplun Building, Room 324.
Lecturer
Professor Haim Kaplan
Lecturer Email
haimk@cs.tau.ac.il
Teaching Assistant
Sagi Hed
Teaching Assistant Email
sagihed@post.tau.ac.il


Announcements

07/06/09 9:00  Updated project submission specifications.
08/03/09 12:00  Updated workshop deadline.
05/03/09 09:30  Updated references and specification document deadline.
02/03/09 19:00  Workshop web page is up.


Overview


Each group will first select a classical algorithmic problem (for example, string sorting, minimum spanning trees, shortest path, maximum flow etc.) to work on. This problem has classic algorithms that solve it, some of which you have learned and others you may learn by yourself.


In recent years there has been research on designing algorithms for problems with better I/O performance, where I/O sometimes means an access to main memory (cache miss) or access to disk (page fault). Often these algorithms pay some penalty in terms of running time and instruction counts for being I/O efficient.


Your goal is to pick a known I/O efficient algorithm for your problem, or design your own, and to implement the algorithm/algorithms you have chosen as efficiently as you can. You would also implement the standard textbook algorithm for the problem for comparison. It is important that your algorithm would deal with the I/O performance at the algorithmic level (rather simply optimizing the code, which the compiler already does to some extent.) Often you may want and need to simplify the theoretical implementation of the algorithm to make it more suited for implementation in practice. You do not have to stick precisely to a theoretical description given in a reference.

Next, you should carefully pick interesting inputs to run your algorithms on and compare the performance of your plain non-cache efficient algorithm, and the cache efficient algorithm/s (at least in theory) on these inputs. Typically you would want to compare the number of cache misses (of various caches), the instruction count, and the running time.


Specification Document

By a chosen deadline (see below), each group will submit a specification document explaining its choices and future plan for the workshop (see below for details).
There will be a meeting held with each group to review the document and the plan, and to make sure everything is on track.


Technical Information
  1. Submission is in pairs or triplets.
  2. Only the first few classes will be held, giving background and technical information concerning the workshop.
    After that, work is independent. You are welcome to contact the teaching assistant (sagihed@post.tau.ac.il) if you have questions or run into difficulties.

References

Caches And How They Work
  1. http://lwn.net/Articles/252125/

Cache-Aware and Cache-Oblivious Algorithms
  1. http://www.cc.gatech.edu/~bader/COURSES/UNM/ece637-Fall2003/ (many references inside)
  2. http://www.brics.dk/~gerth/Papers/isaac02.ps.gz
  3. http://www.daimi.au.dk/~large/ioS08/
  4. http://www.dunkel.dk/thesis/
  5. http://portal.acm.org/citation.cfm?id=1227161.1227164&coll=ACM&dl=ACM&CFID=24915435&CFTOKEN=71845061
  6. http://portal.acm.org/citation.cfm?id=1283383.1283481&coll=ACM&dl=ACM&CFID=24915435&CFTOKEN=71845061
  7. http://portal.acm.org/citation.cfm?id=1248377.1248394&coll=ACM&dl=ACM&CFID=24915435&CFTOKEN=71845061
Libraries
  1. Library for data structures handling inputs larger than main memory.
    http://stxxl.sourceforge.net/


How To Measure Results

Several tools are available to you. You may use any tool you see fit, even if it is not specified below, as long as its results are reasonable.
  1. getrusage (http://rabbit.eng.miami.edu/info/functions/time.html#getrusage)
    This method enables you to obtain a clean measure of the cpu time spent on your code (excluding time spent by the operating system while your code was running). This method also enables you to obtain a page fault count for your program (how many times a page had to be loaded from disk into memory). This count is relevant if you're comparing algorithm performance when input size does not fit into main memory, and you want to let the operating system manage the disk read/writes on its own. The relevant page fault count is the "major page fault" count (minor page fault do not load from disk), available using the ru_majflt field in the result of getrusage.
    This library is available on any linux in the CS computer network.
    See the documentation above for exact usage details.
  2. valgrind (http://valgrind.org/docs/manual/cl-manual.html)
    This tool simulates a cache, and provides several measures including a total of instructions executed, level 1 cache misses and level 2 cache misses.
    It also enables you to decide on any cache configuration you'd like, such as the size of the cache, the size of a cache block, and the cache associativity.
    Because the tool performs a simulation, its major drawback is that it slows down the execution considerably. Examine this drawback with different input sizes on your implementation, before choosing to use valgrind.

    The tool is available on nova in the CS computer network.
    Usage:
    1. run:
      valgrind --tool=callgrind --simulate-cache=yes
      --<cache level>=<cache configuration> <program> [arg1] [arg2] ...
      In the standard output, under "D1 misses" is the total level 1 cache misses.
      Under "L2 misses" is the total level 2 cache misses.
    2. An output file callgrind.out.<process_id> is generated in the current directory. Read it by running:
      callgrind_annotate callgrind.out.<process_id>
      Under "PROGRAM_TOTALS", "Ir" is the number of instructions executed.
    3. To set the cache configuration use the parameter -
      --<cache level>=<cache size bytes>,<cache associativity: possible cache blocks per memory block>,<cache block size bytes>
      where "cache level" is either D1 for level 1 cache or L2 for level 2 cache.
      e.g. --D1=16384,256,64 means a 16,384B Level 1 cache, with blocks of size 64B and fully associative.
  3. cputrack (http://docs.sun.com/app/docs/doc/816-0210/6m6nb7m6s?a=view)
    This tool enables you to obtain a real measure of cache misses and instructions executed on the machine you are running.
    Usage:

    cputrack -T=99999 -c <event>
    <program> [arg1] [arg2] ...
    <event> can be Instr_cnt for the total instruction count, DC_miss for level 1 cache misses or L2_dmiss_ld for level 2 cache misses.
    This
    tool is available on aries in the CS computer network.

Keep in mind that another option is to compare the performance of the solution when the input is larger than main memory, thus dominated by reading and writing from /to the disk. In such a case you may manage the reading and writing of input blocks on your own and count the number of read/writes yourself.



Deadlines
  1. The specification document should be submitted by - 17 / 5 / 2009.
  2. The final workshop project should be submitted by - 12 / 9 / 2009.
    Note: the final project deadline is the final deadline for all submitted works from this semester. This deadline is dictated by the university and therefore we have no control over it!! Therefore note that no extensions to this date will be possible.
What to submit?
  1. The specification document should be in html, word, pdf or rtf:
  2. The final project submission:

How to submit?
All submissions are via email to the teaching assistant (sagihed@post.tau.ac.il).


Good Luck!