Sunday, August 06, 2006

Sun ZFS and Thumper (x4500)

I was one of the beta testers for Sun's new x4500 high density storage server, and it turned out pretty well. I was able to hire Dave Fisk as a consultant to help me do the detailed evaluation using his in-depth tools, and it turned into a fascinating investigation of the detailed behavior of the ZFS file system.

ZFS is simple to use, has lots of extremely useful features, and the price is right (bundled with Solaris 10 6/06 or OpenSolaris). However, it's doing lots of clever things under the hood, and it behaves like nothing else. It's far more complicated to predict its performance than that of any other file system we've looked at. It even baffled Dave at first, and he had to change his tools to support ZFS, but he's got it pretty well figured out now.

For a start, it's a write-anywhere file system layout (WAFL), which is similar in some ways to a NetApp filer. This means that random writes are batched up and sorted by file, file system, and so on, and every few seconds a big burst of sequential writes commits the data to disk as a transaction. Since sequential writes to disk are always much more efficient than random writes, this means that ZFS gets much more performance per disk than UFS, VxFS and similar file systems for random writes.
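To make the batching idea concrete, here is a minimal toy sketch in Python. It is not how ZFS is implemented; the class name `WriteBatcher`, its fields, and the one-block-per-write layout are all assumptions made up for illustration. It only shows the general pattern: random writes are buffered in memory, then sorted and laid out in consecutive blocks when the batch is committed.

```python
from dataclasses import dataclass, field


@dataclass
class WriteBatcher:
    """Toy model (hypothetical, not ZFS code): buffer random writes,
    then commit them as one sorted, sequential burst."""
    next_free_block: int = 0
    pending: dict = field(default_factory=dict)    # (file, offset) -> data
    disk_log: list = field(default_factory=list)   # (block, file, offset, data)

    def write(self, file, offset, data):
        # A later write to the same location replaces the earlier one
        # before anything reaches the disk.
        self.pending[(file, offset)] = data

    def commit(self):
        # Sort by file, then offset, and place the batch in consecutive
        # blocks so the disk sees one sequential run of writes.
        batch = sorted(self.pending.items())
        for (file, offset), data in batch:
            self.disk_log.append((self.next_free_block, file, offset, data))
            self.next_free_block += 1
        self.pending.clear()
        return len(batch)


batcher = WriteBatcher()
batcher.write("b.dat", 7, "x")
batcher.write("a.dat", 3, "y")
batcher.write("b.dat", 7, "z")   # overwrites the first write in memory
written = batcher.commit()
print(written, batcher.disk_log)
```

In this toy model, three scattered application writes turn into two block writes at consecutive addresses, which is the effect that lets a disk spend its time transferring data rather than seeking.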

The combination of the x4500 and ZFS works well, since ZFS knows that the firmware on the 48 SATA drives in the x4500 supports a write cache that can safely be enabled and flushed on demand. This greatly improves performance and fixes an issue that I have been complaining about for years. Finally, there is a safe way to use the write caches that exist in every modern drive.

It's actually easier to list the things that ZFS on the x4500 doesn't have.

  • No extra cost - it's bundled in a free OS
  • No volume manager - it's built in
  • No space management - file systems use a common pool
  • No long wait for newfs to finish - we created a 3TB file system in a second
  • No fsck - its transactional commit means it's consistent on disk
  • No rsync - snapshots can be differenced and replicated remotely
  • No silent data corruption - all data is checksummed as it is read
  • No bad archives - all the data in the file system is scrubbed regularly
  • No penalty for software RAID - RAID-Z has a clever optimization
  • No downtime - mirroring, RAID-Z and hot spares
  • No immediate maintenance - double parity disks if you need them
  • No hardware failures in our testing - we didn't get to try out some of these features!

and finally, on the downside

  • No way to know how much performance headroom you have
  • No way to get at the disks without taking the top off the x4500
  • No clustering support - I guess they couldn't put everything on the wish list...
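The "no silent data corruption" item in the first list can be illustrated with a small Python sketch. This is a toy model under stated assumptions, not ZFS code: the `ChecksummedStore` class is hypothetical, and SHA-256 is chosen here only because it is in the Python standard library. The point is the pattern of storing a checksum apart from the data and verifying it on every read.

```python
import hashlib


class ChecksummedStore:
    """Toy block store (hypothetical): verify a checksum on every read."""

    def __init__(self):
        self.blocks = {}      # block id -> bytes
        self.checksums = {}   # block id -> hex digest, kept apart from the data

    def write(self, block_id, data):
        self.blocks[block_id] = data
        self.checksums[block_id] = hashlib.sha256(data).hexdigest()

    def read(self, block_id):
        data = self.blocks[block_id]
        if hashlib.sha256(data).hexdigest() != self.checksums[block_id]:
            # Fail loudly rather than hand back silently corrupted data.
            raise IOError(f"checksum mismatch on block {block_id}")
        return data


store = ChecksummedStore()
store.write(1, b"hello")
print(store.read(1))

store.blocks[1] = b"hellp"   # simulate a bit flip on the disk
try:
    store.read(1)
except IOError as err:
    print("detected:", err)
```

With redundancy such as a mirror, a detected mismatch can also be repaired from a good copy; this sketch covers only detection.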

The performance is actually very good, and in normal use it's going to be fine, but when we tried to drive ZFS to its limit, we found that the results were less consistent and less predictable than with more conventional file systems. Some of the issues we ran into are present in the Solaris 10 6/06 release, but when the x4500 ships it will have an update to ZFS that includes performance fixes to speed things up in general and reduce the impact of the worst case issues, so it should be more consistent.

We've put ZFS on some of our internal file servers, to see how it goes in light usage. However, it always takes a while to build up confidence in a large body of new code, especially if it's storage related. If we can add this one to the list:

  • No nasty bugs or surprises?

Then ZFS looks like a good way to take a lot of cost out of the storage tier.

I'm interested to hear how other people are getting on with ZFS, especially in mission-critical production use.

Comments:

  1. Here is my experience. Cannot even finish the test with ZFS - the box is just stuck. :(

  2. Adrian,

    As a Sun employee, I'm really convinced that letting you go was one of the many big mistakes Sun made... In fact, whenever I try to convince customers that Solaris 10 is their best way to go, I use your "The system will usually have a disk bottleneck" phrase from your "Sun Performance and Tuning" Second Edition book. Then I point them to DTrace, particularly to the iotop script, which lets the customer easily identify which process is consuming the most I/O.

    I just saw your blog for first time today. It looks great, good work.

    BTW, the "Overture: Fractional" song from Fractal is quite hard for me to understand (I couldn't find the melody before the 55-second mark, when the drums faded and the other instruments gained more volume).

  3. Many thanks for the kind words Matias. I left Sun in 2004 because the whole group I was in was shut down. At the time I was learning a lot, having fun and working with a great team. When our parent organization was shut down everyone agreed we were needed and doing a good job, but no-one could find the headcount to keep our team intact. I took the opportunity to take the package and move upstream from the commodity supplier layer to the services layer. I see eBay, Skype, PayPal and others as forming the API system calls for the internet as an operating system. I think Sun makes great products, but we are mostly buying generic hardware (AMD or Intel) running generic OS (Solaris, Windows or Linux), whichever is the best deal and feature set at the time, and I don't often have to think hard or long about it. e.g. I don't spend very much time thinking about which grocery store to buy milk at, but I do care which restaurant I go to. Most of the variety and innovation has moved upstream from where Sun is. I spent ten times as much time working on the Skype API as I did on Thumper/ZFS in the last month....

    Also, Fractal are trying to make new kinds of music, but you may prefer some of their more recent and tuneful songs on myspace.com/fractalcontinuum

  4. Adrian

    You say ZFS uses the on-disk write cache for caching write data.

    How does the file system recover from a sudden power loss? Is there some kind of backup to the on-disk write cache?

    Or does ZFS rely on some kind of checksum to at least know that something was not written?

  5. ZFS writes are large transactional batches of data. ZFS then issues a write flush to the disks, and when that completes it knows that the transaction is safe. This feature is only enabled if ZFS knows that the disk firmware correctly implements the write flush, so there is some benefit from using Sun-supplied disks that have been tested to make sure this optimization is safe.
