Cheap network storage system

By: Eddie Aronovich, School of Computer Science, Tel-Aviv University

July 2012

Abstract

This document describes the setup we are using for a cheap (less than $10K for 120 TB) and fast (500 MB/sec read, almost 350 MB/sec write) storage system. Assuming the reader is familiar with the Backblaze JBOD, we focus on configuring it for performance.

Motivation

Our research requires very large storage systems, where reliability is only moderately important but price is a major issue. We looked at many vendors and solutions, but the common issues were:

The solution that we implemented

We decided to go with Backblaze as a low-cost storage system. During the procurement (a long process in itself, at least in our institution) Backblaze v2.0 appeared, which is very similar to our plan, but we had already started with hardware that was a bit different. We bought two cases and, due to the problems we encountered, one runs with Sil cards and the other with FastTrak cards. Our experience and measurements indicate that the Sil cards are faster and have better driver support, but we had some heating problems with the disks connected to the Sil cards. The heating problem vanished (without the patch mentioned later).

Hardware deviation

The hardware components that we used, compared to the original Backblaze configuration, are described below.

Building the case

The case was assembled and wired as recommended by Backblaze.
[Build photos: mounting the fans; mounting the electricity; the motherboard; port multipliers; now only the disks are missing; working systems]

Challenges

After we got all the hardware and assembled it, we installed Ubuntu Linux, but almost nothing worked and we ran into a whole list of problems. We tried to debug the system, but without success. It looked bad: we had invested almost $10K and nothing worked. But that gave us a hell of a motivation to make it work!
We also tried Openfiler and FreeNAS. Openfiler recognized all the disks, but when we created RAID arrays some disks disappeared, and it was not able to detect them even after a reboot. FreeNAS could not work with the port-multiplier cards that we had. We tried to change the kernel to a newer one, but did not succeed.

Solutions

In the end we used Debian (we switched from Ubuntu to Debian for no specific reason) with mdadm as the RAID mechanism. The disks were partitioned into 5 RAID arrays of 8 disks each. Stripe n was built from the nth disk in each chunk, so that each stripe uses the full bandwidth of the port multipliers. Since the port multipliers and the multiplier cards support only SATA II, each disk in a stripe is limited to SATA II speed.
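As an illustration of this layout, a minimal sketch of creating one such array with mdadm is shown below. The device names and the RAID level are assumptions for the example only (one disk taken from each backplane, assuming the disks on each port multiplier enumerate consecutively); the exact devices and level we used are not listed here.

  # Hypothetical device names: one disk from each port-multiplier backplane,
  # so that no single SATA II link carries more than one member of the same array.
  # RAID 5 is assumed here purely for illustration.
  mdadm --create /dev/md0 --level=5 --raid-devices=8 \
        /dev/sdb /dev/sdg /dev/sdl /dev/sdq /dev/sdv /dev/sdaa /dev/sdaf /dev/sdak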
We also had to make a number of changes to the Linux kernel and to the mdadm configuration to make it all work.

Some measurements

A summary of the IOzone filesystem benchmark can be seen here.
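For reference, a typical IOzone invocation for producing such a summary might look like the sketch below; the test selection, file size, record size, and target path are illustrative assumptions, not the exact parameters behind the summary above.

  # run the write (0) and read (1) tests with an 8 GiB file and 1 MiB records,
  # and dump an Excel-compatible summary of the results
  iozone -i 0 -i 1 -s 8g -r 1m -f /mnt/md0/iozone.tmp -R -b iozone-summary.xls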

Working on a single RAID array

The measurements below were performed on one of the RAID arrays (stripes) while the rest were idle.

Reading using zcav after increasing the stripe cache size:

  # echo "32768" > /sys/block/md3/md/stripe_cache_size
  # zcav -b 1024 /dev/md3
  #block offset (GiB), MiB/s, time
  0.00 308.82 3.316
  1.00 497.39 2.059
  2.00 506.10 2.023
  3.00 495.45 2.067
  4.00 505.45 2.026
  5.00 501.43 2.042
  6.00 496.08 2.064
  7.00 504.34 2.030
  8.00 499.66 2.049
  9.00 494.67 2.070
  10.00 503.68 2.033
  11.00 498.78 2.053
  12.00 494.46 2.071

Writing to a single RAID array while the other arrays are idle:

  # date ; dd if=/dev/zero of=/dev/md0 bs=64M
  Tue Jun 12 17:09:40 IDT 2012
  30802+0 records in
  30802+0 records out
  2067087228928 bytes (2.1 TB) copied, 6006.19 s, 344 MB/s

Reading from a network station to that RAID array:

  > time dd if=aaa-2012-07-01 of=/dev/null bs=64M
  100+0 records in
  100+0 records out
  6710886400 bytes (6.7 GB) copied, 77.9076 s, 86.1 MB/s
  0.000u 5.008s 1:18.02 6.4% 0+0k 13107944+8io 1pf+0w
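Note that the stripe_cache_size setting above does not survive a reboot. One simple way to keep it, offered here as a general suggestion rather than something taken from the measurements, is to re-apply it at boot time, for example from /etc/rc.local:

  # re-apply the md stripe cache tuning for every array that exposes it
  for f in /sys/block/md*/md/stripe_cache_size ; do
      echo 32768 > "$f"
  done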

So we have cheap storage that can read locally at ~500 MB/sec and write locally at 344 MB/sec.

Writing from a regular network node to a single RAID array:

  > time dd if=/dev/zero of=aaa-2012-07-03 bs=64M count=100
  100+0 records in
  100+0 records out
  6710886400 bytes (6.7 GB) copied, 97.0069 s, 69.2 MB/s
  0.000u 9.196s 1:37.06 9.4% 0+0k 0+13107208io 0pf+0w

  > time dd if=/dev/zero of=aaa bs=64M count=100
  100+0 records in
  100+0 records out
  6710886400 bytes (6.7 GB) copied, 77.2204 s, 86.9 MB/s
  0.000u 9.336s 1:17.29 12.0% 0+0k 0+13107208io 0pf+0w

Writing speed from the same station to a Network Appliance system (that was not idle):

  > time dd if=/dev/zero of=aaa bs=64M count=50
  50+0 records in
  50+0 records out
  3355443200 bytes (3.4 GB) copied, 59.4391 s, 56.5 MB/s
  0.000u 4.252s 0:59.51 7.1% 0+0k 0+6553608io 0pf+0w

Using hdparm:

  # hdparm -tT /dev/md3
  /dev/md3:
   Timing cached reads:   11042 MB in  2.00 seconds = 5523.24 MB/sec
   Timing buffered disk reads:  1110 MB in  3.03 seconds = 366.81 MB/sec
  #

What's next?

If you can support a JBOD based on SSDs, please contact me!

Thanks!

First, I would like to acknowledge Prof. Ronitt Rubinfeld for her generous support of this project.

The team that made this project happen includes: