The SSD Anthology: Understanding SSDs and New Drives from OCZ
by Anand Lal Shimpi on March 18, 2009 12:00 AM EST
Posted in: Storage
Bringing You Up to Speed: The History Lesson
Everyone remembers their first bike, right? Mine was red. It had training wheels. I never really learned how to ride it; it's not that I didn't go outdoors, I was just too afraid to take those training wheels off, I guess. That was a long time ago, but I remember my first bike.
I also remember my first SSD. It was a 1.8” PATA drive made by Samsung for the MacBook Air. It was lent to me by a vendor so I could compare its performance to the stock 1.8” mechanical HDD in the Air.
The benchmarks for that drive didn't really impress. Most application tests got a little slower and transfer speeds weren't really any better. Application launch times and battery life both improved, the former by a significant amount. But the drive was expensive: $1000 from Apple, and that's if you bought it with the MacBook Air. Buying it from a vendor would set you back even more. It benchmarked faster than the hard drive, but the numbers didn't justify the cost. I pulled the drive out and sent it back after I was done with the review.
The next time I turned on my MacBook Air I thought it was broken. It took an eternity to boot and everything took forever to launch. Even though the benchmarks showed the SSD shaving off a few seconds of application launch time here and there, in the real world, it was noticeable. The rule of thumb is that it takes about a 10% difference in performance for a user to notice. The application tests didn’t show a 10% difference in performance, but the application launch tests, those were showing 50% gains. It still wasn’t worth $1000, but it was worth a lot more than I originally thought.
It was the MacBook Air experience that made me understand one important point about SSDs: you don't think they're fast until one is taken away from you.
My second SSD was a 60GB SuperTalent drive. I built an HTPC using it. It was my boot drive, and I chose it because it drew less power and was silent; it helped keep my HTPC cool and I wouldn't have to worry about drive crunching while watching a movie. My movies were stored elsewhere, so the space didn't really matter. The experience was good, not great, since I wasn't really hitting the drive for data, but it was problem-free.
SuperTalent was the first manufacturer to sell an SSD in a 3.5” enclosure, so when they announced their 120GB drive I told them I'd like to do a review of their SSD in a desktop. They shipped it to me, and I wrongly assumed that it was the same as the 60GB drive in my HTPC, just with twice the flash.
This drive did have twice the flash, but it was MLC (Multi-Level Cell) flash. While the 60GB drive I had was an SLC drive that used Samsung's controller, the MLC drive used a little-known controller from a company called JMicron. Samsung had an MLC controller at the time, but it was more expensive than what SuperTalent was shooting for. This drive was supposed to be affordable, and JMicron delivered an affordable controller.
After running a few tests, the drive went in my Mac Pro as my boot/application drive. I remembered the lesson I learned from my first SSD. I wasn’t going to be able to fairly evaluate this drive until I really used it, then took it away. Little did I know what I was getting myself into.
The first thing I noticed about the drive was how fast everything launched. This experience was actually the source of my SSD proof-of-value test; take a freshly booted machine and without waiting for drive accesses to stop, launch every single application you want to have up and running at the same time. Do this on any system with a HDD and you’ll be impatiently waiting. I did it on the SuperTalent SSD and, wow, everything just popped up. It was like my system wasn’t even doing anything. Not even breaking a sweat.
I got so excited that I remember hopping on AIM to tell someone about how fast the SSD was. I had other apps running in the background, and when I went to send that first IM, my machine paused. It was just for a fraction of a second, before the message I'd typed appeared in my conversation window. My system just paused.
Maybe it was a fluke.
I kept using the drive, and it kept happening. The pause wasn’t just in my IM client, it would happen in other applications or even when switching between apps. Maybe there was a strange OS X incompatibility with this SSD? That’d be unfortunate, but also rather unbelievable. So I did some digging.
Others had complained about this problem. SuperTalent wasn’t the only one to ship an affordable drive based on this controller; other manufacturers did as well. G.Skill, OCZ, Patriot and SiliconPower all had drives shipping with the same controller, and every other drive I tested exhibited the same problem.
I was in the midst of figuring out what was happening with these drives when Intel contacted me about reviewing the X25-M, its first SSD. Up to this point Intel had casually mentioned that their SSD was going to be different from the competition, and prior to my JMicron experience I didn't really believe them. After all, how hard could it be? Drive controller logic is nowhere near as complicated as building a Nehalem; surely someone other than Intel could do a good-enough job.
After my SuperTalent/JMicron experience, I realized that there was room for improvement.
Drive vendors were mum on the issue of pausing or stuttering with their drives. Lots of finger pointing resulted. It was surely Microsoft's fault, or maybe Intel's. But none of the Samsung-based drives had these problems.
Then the blame shifted to cache. The JMicron controller used in these drives didn't support any external DRAM; Intel's and Samsung's controllers did. It was the lack of cache that caused the problems, they said. But Intel's drive doesn't use its external DRAM for user data.
Fingers were pointed everywhere, but no one took responsibility for the fault. To their credit, OCZ really stepped up and took care of their customers that were unhappy with their drives. Despite how completely irate they were at my article, they seemed to do the right thing after it was published. I can’t say the same for some of the other vendors.
The issue ended up being random write performance. These “affordable” MLC drives based on the JMicron controller were all tuned for maximum throughput. The sequential write speed of these drives could easily match and surpass that of the fastest hard drives.
If a company that had never made a hard drive before could come out with a product that on its first revision could outperform WD’s VelociRaptor and be more reliable thanks to zero moving parts...well, you get the picture. Optimize for sequential reads and writes!
The problem is that modern-day OSes tend to read and write data very randomly, albeit in specific areas of the disk. And the data being accessed is rarely large; it's usually very small, on the order of a few KB in size. It's these sorts of accesses that no one seemed to think about; after all, these vendors and controller manufacturers were used to making USB sticks and CF cards, not hard drives.
Sequential Read Performance
JMicron JMF602B MLC | 134.7 MB/s
Western Digital VelociRaptor 300GB | 118 MB/s
The chart above shows how much faster these affordable MLC SSDs were than the fastest 3.5” hard drive in sequential reads, but now look at random write performance:
Random Write Latency | Random Write Bandwidth
JMicron JMF602B MLC | 532.2 ms | 0.02 MB/s
Western Digital VelociRaptor 300GB | 7.2 ms | 1.63 MB/s
While WD's VelociRaptor averaged less than 8ms to write 4KB, these JMicron drives took around 70x as long! Let me ask you this: what do you notice more, things moving very fast or things moving very slow?
The traditional hard drive benchmarks showed that these SSDs were incredible. The real world usage and real world tests disagreed. Storage Review was one of the first sites to popularize real world testing of hard drives nearly a decade ago. It seems that we’d all forgotten the lessons they taught us.
Random write performance is quite possibly the most important performance metric for SSDs these days. It’s what separates the drives that are worth buying from those that aren’t. All SSDs at this point are luxury items, their cost per GB is much higher than that of conventional hard drives. And when you’re buying a luxury anything, you don’t want to buy a lame one.
Cost Per GB from Newegg.com
Intel X25-E 32GB | $12.88
Intel X25-M 80GB | $4.29
OCZ Solid 60GB | $2.33
OCZ Apex 60GB | $2.98
OCZ Vertex 120GB | $3.49
Samsung SLC 32GB | $8.71
Western Digital Caviar SE16 640GB | $0.12
Western Digital VelociRaptor 300GB | $0.77
Comments (250)
Kary - Thursday, March 19, 2009
Why use TRIM at all?!?!? If you have extra blocks on the drive (NOT PAGES, FULL BLOCKS) then there is no need for the TRIM command.
1) The currently in-use block is half full
2) More than half a block needs to be written
3) An extra block is mapped into the system
4) The original, half-full block is mapped out of the system... it can be erased during idle time.
You could even bind multiple contiguous blocks this way (I assume that it is possible to simultaneously erase any of the internal groupings from pages on up... they probably share address lines... e.g. erase 0000200 -> just erase block #200... erase 00002*0 -> erase blocks 200 to 290... btw, I did the addressing in base ten instead of binary just to simplify for some :)
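A rough sketch of the remap-and-erase-later scheme described in this comment could look like the toy Python below. The class names, block counts, and the simple logical-to-physical map are illustrative assumptions, not how any shipping controller is actually implemented:

```python
# A toy model of the "map in a spare block, erase the old one later" idea.
PAGES_PER_BLOCK = 128          # e.g. a 512 KiB block of 4 KiB pages
SPARE_BLOCKS = 16              # over-provisioned blocks hidden from the host

class Block:
    def __init__(self):
        self.pages = [None] * PAGES_PER_BLOCK   # None = erased page
        self.needs_erase = False

class SpareBlockFTL:
    def __init__(self, visible_blocks):
        total = visible_blocks + SPARE_BLOCKS
        self.blocks = [Block() for _ in range(total)]
        self.free = list(range(visible_blocks, total))      # pre-erased spares
        self.map = {i: i for i in range(visible_blocks)}     # logical -> physical

    def rewrite_block(self, logical, new_pages):
        """Service a write that replaces a block's contents: fill a spare
        physical block, remap, and retire the old block for a later erase."""
        old_phys = self.map[logical]
        new_phys = self.free.pop()                  # grab a pre-erased spare
        self.blocks[new_phys].pages = list(new_pages)
        self.map[logical] = new_phys                # host now sees the new block
        self.blocks[old_phys].needs_erase = True    # no erase on the write path

    def idle_erase(self):
        """Background housekeeping: erase retired blocks, refill the spare pool."""
        for phys, blk in enumerate(self.blocks):
            if blk.needs_erase:
                blk.pages = [None] * PAGES_PER_BLOCK    # simulate the block erase
                blk.needs_erase = False
                self.free.append(phys)
```

The appeal is that the erase never sits on the write path; the catch, as the replies below note, is that without TRIM the drive still can't tell which of the old pages hold live data and which hold data the OS has already deleted.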
korbendallas - Wednesday, March 18, 2009
Actually I think that the TRIM command is merely used for marking blocks as free. The OS doesn't know how the data is placed on the SSD, so it can't make an informed decision on when to forcefully erase pages. In the same way, the SSD doesn't know anything about which files are in which blocks, so you can't defrag files internally in the drive. So while you can't defrag files, you CAN now defrag free space, and you can improve the wear leveling because deleted data can be ignored.
So let's say you have 10 pages where 50% of the blocks were marked deleted using the TRIM command. That means you can move the data into 5 other pages and erase the 10 pages. The more deleted blocks there are in a page, the better a candidate it is for this procedure. And there isn't really a problem with doing this while the drive is idle, since you're just doing something now that you would have to do anyway when a write command comes.
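A minimal sketch of that consolidation pass, in the same toy Python style as above (the 50% threshold, the sizes, and the write/erase callbacks are assumptions for illustration; "sector" here means the 4 KiB unit the comment calls a block, living inside a larger erase block):

```python
SECTORS_PER_ERASE_BLOCK = 128       # e.g. 512 KiB erase block / 4 KiB sectors

def consolidate_trimmed(erase_blocks, write_block, erase_block):
    """erase_blocks: list of erase blocks, each a list holding live sector
    data or None where the OS has TRIMed (deleted) the sector."""
    survivors, reclaimed = [], []
    for idx, blk in enumerate(erase_blocks):
        live = [s for s in blk if s is not None]
        # Mostly-dead blocks are the best candidates: little data to copy,
        # a whole erase block reclaimed.
        if len(live) <= SECTORS_PER_ERASE_BLOCK // 2:
            survivors.extend(live)
            reclaimed.append(idx)
    # Pack the surviving sectors into as few fresh erase blocks as possible...
    for i in range(0, len(survivors), SECTORS_PER_ERASE_BLOCK):
        write_block(survivors[i:i + SECTORS_PER_ERASE_BLOCK])
    # ...then erase the emptied blocks while the drive is idle.
    for idx in reclaimed:
        erase_block(idx)
```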
GourdFreeMan - Wednesday, March 18, 2009
This is basically what I am arguing both for and against in the fourth paragraph of my original post, though I assumed it would be the OS'es responsibility, not the drive's.
Do SSDs track dirty pages, or only dirty blocks? I don't think there is enough RAM on the controller to do the former...
korbendallas - Wednesday, March 18, 2009
Well, let's take a look at how much storage we actually need. A block can be erased, contain data, or be marked as trimmed or deallocated. That's three different states, or two bits of information. Since each block is 4kB, a 64GB drive would have 16,777,216 blocks. So that's 4MB of information.
So yeah, saving the block information is totally feasible.
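That figure checks out; a quick back-of-the-envelope in Python, using the two-bits-per-4KB-unit assumption from the comment:

```python
drive_bytes   = 64 * 2**30      # 64 GiB drive
unit_bytes    = 4 * 2**10       # 4 KiB allocation units
bits_per_unit = 2               # erased / holds data / trimmed

units = drive_bytes // unit_bytes              # 16,777,216 units
table_bytes = units * bits_per_unit // 8       # 4,194,304 bytes
print(units, table_bytes // 2**20)             # -> 16777216 4 (MiB)
```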
GourdFreeMan - Thursday, March 19, 2009
Actually the drive only needs to know if the page is in use or not, so you can cut that number in half. It can determine a partially full block that is a candidate for defragmentation by looking at whether neighboring pages are in use. By your calculation that would then be 2 MiB.
That assumes the controller only needs to support drives of up to 64 GiB capacity, that pages are 4 KiB in size, and that the controller doesn't need to use RAM for any other purpose.
Most consumer SSD lines go up to 256 GiB in capacity, which would bring the total RAM needed up to 8 MiB using your assumption of a 4 KiB page size.
However, both hard drives and SSDs use 512 byte sectors. This does not necessarily mean that internal pages are therefore 512 bytes in size, but lacking any other data about internal page sizes, let's run the numbers on that assumption. To support a 256 GiB drive with 512 byte pages, you would need 64 MiB of RAM -- more than any SSD line but Intel's has -- dedicated solely to this purpose.
As I said before there are ways of getting around this RAM limitation (e.g. storing page allocation data per block, keeping only part of the page allocation table in RAM, etc.), so I don't think the technical challenge here is insurmountable. There still remains the issue of wear, however...
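The revised figures in this exchange fall out of the same arithmetic, at one bit per page (a sketch of the estimate only, not a statement about how much RAM any particular controller actually dedicates to this):

```python
def tracking_ram_bytes(capacity_bytes, page_bytes, bits_per_page=1):
    # One allocation bit per page, rounded down to whole bytes.
    return capacity_bytes // page_bytes * bits_per_page // 8

GiB, KiB, MiB = 2**30, 2**10, 2**20
print(tracking_ram_bytes(64 * GiB, 4 * KiB) / MiB)    # 2.0  MiB
print(tracking_ram_bytes(256 * GiB, 4 * KiB) / MiB)   # 8.0  MiB
print(tracking_ram_bytes(256 * GiB, 512) / MiB)       # 64.0 MiB
```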
GourdFreeMan - Wednesday, March 18, 2009
Substitute "allocated" for "dirty" in my above post. I muddled the terminology, and there is no edit function to repair my mistake.Also, I suppose the SSD could store some per block data about page allocation appended to the blocks themselves at a small latency penalty to get around the RAM issue, but I am not sure if existing SSDs do such a thing.
My concerns about added wear in my original post still stand, and doing periodic internal defragmentation is going to necessitate some unpredictable sporadic periods of poor response by the drive as well if this feature is to be offered by the drive and not the OS.
Basilisk - Wednesday, March 18, 2009
I think your concerns parallel mine, albeit we have different conclusions.
Parag. 1: I think you misunderstand the ERASE concept: as I read it, after an ERASE parts of the block are re-written and parts are left erased -- those latter parts NEED NOT be re-erased before they are written later. If the TRIM function can be accomplished at an idle moment, access time will be "saved"; if the TRIM can erase (release) multiple clusters in one block [unlikely?], that will reduce both wear & time.
Parag. 2: This argument reverses the concept that OS's should largely be ignorant of device internals. As devices with different internal structures have proliferated over the years -- and will continue to do so with SSD's -- such OS differentiation is costly to support.
Parag. 3 and onwards: Herein lies the problem: we want to save wear by not re-writing files to make them contiguous, but we now have a situation where wear and erase times could be considerably reduced by having those files be contiguous. A 2MB file fragmented randomly in 4KB clusters will result in around 500 erase cycles when it's deleted; if stored contiguously, that would only require 4-5 erase cycles (of 512KB SSD-blocks)... a 100:1 reduction in erases/wear.
It would be nice to get the SSD blocks down to 4KB in size, but I have to infer there are counter arguments or it would've been done already.
With current SSDs, I'd explore using larger cluster sizes -- and here we have a clash with MS [big surprise]. IIRC, NTFS clusters cannot exceed 4KB [for something to do with file compression!]. That makes it possible that FAT32 with 32KB clusters [IIRC clusters must be less than 64KB for all system tools to properly function] might be the best choice for systems actively rewriting large files. I'm unfamiliar with FAT32 issues that argue against this, but if the SSD's allocate clusters contiguously, wouldn't this reduce erases by a factor of 8 for large file deletions? 32KB clusters might ham-string caching efficiency and result in more disk accesses, but it might speed-up linear reads and s/w loads.
The impact of very small file/directory usage and for small incremental file changes [like appending to logs] wouldn't be reduced -- it might be increased as data-transfer sizes would increase -- so the overall gain for having fewer clusters-per-SSD-block is hard to intuit, and it would vary in different environments.
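The roughly 100:1 figure above is just the ratio of cluster count to erase-block count for that file; reproducing the comment's arithmetic (assuming 512 KiB erase blocks and the worst case where every 4 KiB cluster lands in a different block):

```python
KiB, MiB = 2**10, 2**20
file_bytes, cluster_bytes, erase_block_bytes = 2 * MiB, 4 * KiB, 512 * KiB

clusters = file_bytes // cluster_bytes                    # 512 clusters
worst_case_erases = clusters                              # each cluster in its own block
best_case_erases = -(-file_bytes // erase_block_bytes)    # 4 blocks when contiguous
print(worst_case_erases, best_case_erases)                # -> 512 4
```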
GourdFreeMan - Wednesday, March 18, 2009
RE Parag. 1: As I understand it, the entire 512 KiB block must always be erased if there is even a single page of valid data written to it... hence my concerns. You may save time reading and writing data if the device could know a block were partially full, but you still suffer the 2ms erase penalty. Please correct me if I am mistaken in my assumption.
RE Parag. 2: The problem is the SSD itself only knows the physical map of empty and used space. It doesn't have any knowledge of the logical file system. NTFS, FAT32, ext3 -- it doesn't matter to the drive; that is the OS'es responsibility.
RE Parag. 3: I would hope that reducing the physical block size would also reduce the block erase time from 2ms, but I am not a flash engineer and so cannot comment. One thing I can state for certain, however, is that moving to smaller physical block sizes would not increase wear across the surface of the drive, except possibly for the necessity of keeping track of a map of used blocks. Rewriting 128 blocks on a hypothetical SSD with 4 KiB blocks versus one 512 KiB block still erases 512 KiB of disk space (excepting the overhead of tracking which blocks are filled).
Regarding using large filesystem clusters: 4 KiB clusters offer a nice tradeoff between filesystem size, performance and slack (space lost due to cluster size). If you wanted to make an SSD look artificially good versus a hard drive, a 512 KiB cluster size would do so admirably, but no one would use such a large cluster size except for a data drive used to store extremely large files (e.g. video) exclusively. BTW, in case you are unaware, you can format a non-OS partition with NTFS to cluster sizes other than 4 KiB. You can also force the OS to use a different cluster size by first formatting the drive for the OS as a data drive with a different cluster size under Windows and then installing Windows on that partition. I have a 2 KiB cluster size on a drive that has many hundreds of thousands of small files. However, I should note that since virtual memory pages are 4 KiB by default (another compelling reason for the 4 KiB default cluster size), most people don't have a use for other cluster sizes if they intend to have a page file on the drive.
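The slack tradeoff mentioned above is easy to estimate for a given set of files; a small sketch (the file sizes below are made up purely for illustration):

```python
def slack_bytes(file_sizes, cluster_bytes):
    # Space wasted by rounding each file up to a whole number of clusters.
    return sum(-(-size // cluster_bytes) * cluster_bytes - size
               for size in file_sizes)

KiB, MiB = 2**10, 2**20
files = [300, 5 * KiB, 37 * KiB, 2 * MiB]          # hypothetical file sizes in bytes
for cluster in (2 * KiB, 4 * KiB, 32 * KiB, 512 * KiB):
    print(cluster // KiB, "KiB clusters ->", slack_bytes(files, cluster), "bytes of slack")
```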
ssj4Gogeta - Wednesday, March 18, 2009
Thanks for the wonderful article. And yes, I read every single word. LOL
rudolphna - Wednesday, March 18, 2009
Hey Anand, on page 3 the random read latency graph numbers are mixed up; it's listed as the WD VelociRaptor having a .11ms latency. I think you might want to fix that. :)