
XFS: the filesystem of the future?


By Jonathan Corbet
January 20, 2012
Linux has a lot of filesystems, but two of them (ext4 and btrfs) tend to get most of the attention. In his 2012 linux.conf.au talk, XFS developer Dave Chinner served notice that he thinks more users should be considering XFS. His talk covered work that has been done to resolve the biggest scalability problems in XFS and where he thinks things will go in the future. If he has his way, we will see a lot more XFS around in the coming years.

XFS is often seen as the filesystem for people with massive amounts of data. It serves that role well, Dave said, and it has traditionally performed well for a lot of workloads. Where things have tended to fall down is in the writing of metadata; support for workloads that generate a lot of metadata writes has been a longstanding weak point for the filesystem. In short, metadata writes were slow, and did not really scale past even a single CPU.

How slow? Dave put up some slides showing fs-mark results compared to ext4. XFS was significantly worse (as in half as fast) even on a single CPU; the situation just gets worse up to eight threads, after which ext4 hits a cliff and slows down as well. For I/O-heavy workloads with a lot of metadata changes - unpacking a tarball was given as an example - Dave said that ext4 could be 20-50 times faster than XFS. That is slow enough to indicate the presence of a real problem.

Delayed logging

The problem turned out to be journal I/O; XFS was generating vast amounts of journal traffic in response to metadata changes. In the worst cases, almost all of the actual I/O traffic was for the journal - not the data the user was actually trying to write. Solving this problem took multiple attempts over years, one major algorithm change, and a lot of other significant optimizations and tweaks. One thing that was not required was any sort of on-disk format change - though that may be in the works in the future for other reasons.

Metadata-heavy workloads can end up changing the same directory block many times in a short period; each of those changes generates a record that must be written to the journal. That is the source of the huge journal traffic. The solution to the problem is simple in concept: delay the journal updates and combine changes to the same block into a single entry. Actually implementing this idea in a scalable way took a lot of work over some years, but it is now working; delayed logging will be the only XFS journaling mode supported in the 3.3 kernel.
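To make that concrete, here is a minimal sketch of the idea, not the actual XFS implementation; the block numbers, structure names, and checkpoint trigger are invented for the example. Repeated changes to one directory block stay a single dirty entry in memory and become a single journal record when a checkpoint is written:

```c
#include <stdio.h>
#include <string.h>

#define NBLOCKS    16        /* hypothetical number of metadata blocks */
#define BLOCK_SIZE 4096

struct pending_log_item {
    int  dirty;                  /* changed since the last checkpoint? */
    char shadow[BLOCK_SIZE];     /* latest contents to be journaled */
};

static struct pending_log_item pending[NBLOCKS];

/* Record a metadata change: update the in-memory copy, do no journal I/O yet. */
static void log_metadata_change(int blockno, const char *newdata)
{
    memcpy(pending[blockno].shadow, newdata, BLOCK_SIZE);
    pending[blockno].dirty = 1;  /* many changes to one block remain one entry */
}

/* Checkpoint: emit one journal record per dirty block, however often it changed. */
static void checkpoint(void)
{
    for (int i = 0; i < NBLOCKS; i++) {
        if (pending[i].dirty) {
            printf("journal write: block %d\n", i);
            pending[i].dirty = 0;
        }
    }
}

int main(void)
{
    char buf[BLOCK_SIZE] = "directory entry v1";

    /* Three changes to the same directory block... */
    log_metadata_change(3, buf);
    strcpy(buf, "directory entry v2");
    log_metadata_change(3, buf);
    strcpy(buf, "directory entry v3");
    log_metadata_change(3, buf);

    /* ...produce a single journal record instead of three. */
    checkpoint();
    return 0;
}
```

With the old scheme each of the three changes would have generated its own journal record; with delayed logging the journal sees only the final state of the block.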

The actual delayed logging technique was mostly stolen from the ext3 filesystem. Since that algorithm is known to work, a lot less time was required to prove that it would work well for XFS as well. Along with its performance benefits, this change resulted in a net reduction in code. Those wanting details on how it works should find more than they ever wanted to know in filesystems/xfs-delayed-logging-design.txt in the kernel documentation tree.

Delayed logging is the big change, but far from the only one. The log space reservation fast path is a very hot path in XFS; it is now lockless, though the slow path still requires a global lock at this point. The asynchronous metadata writeback code was creating badly scattered I/O, reducing performance considerably. Now metadata writeback is delayed and sorted prior to writing out. That means that the filesystem is, in Dave's words, doing the I/O scheduler's work. But the I/O scheduler works with a request queue that is typically limited to 128 entries while the XFS delayed metadata writeback queue can have many thousands of entries, so it makes sense to do the sorting in the filesystem prior to I/O submission. "Active log items" are a mechanism that improves the performance of the (large) sorted log item list by accumulating changes and applying them in batches. Metadata caching has also been moved out of the page cache, which had a tendency to reclaim pages at inopportune times. And so on.
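As a rough illustration of that sorting step (a standalone sketch, not the kernel's delayed-write code; the structure and disk addresses are invented), the filesystem can sort its large queue of dirty metadata buffers by disk address before submitting them, so the block layer receives mostly sequential I/O even though its own request queue is comparatively small:

```c
#include <stdio.h>
#include <stdlib.h>

struct delwri_buf {
    unsigned long long daddr;   /* hypothetical disk address of the buffer */
    /* buffer contents would live here */
};

static int cmp_daddr(const void *a, const void *b)
{
    const struct delwri_buf *x = a, *y = b;

    if (x->daddr < y->daddr)
        return -1;
    if (x->daddr > y->daddr)
        return 1;
    return 0;
}

static void submit_delayed_writes(struct delwri_buf *bufs, size_t n)
{
    /* Sort the (potentially thousands of) queued buffers up front... */
    qsort(bufs, n, sizeof(*bufs), cmp_daddr);

    /* ...then submit them in ascending disk order. */
    for (size_t i = 0; i < n; i++)
        printf("submit buffer at disk address %llu\n", bufs[i].daddr);
}

int main(void)
{
    struct delwri_buf bufs[] = { {9000}, {12}, {4711}, {13}, {8999} };

    submit_delayed_writes(bufs, sizeof(bufs) / sizeof(bufs[0]));
    return 0;
}
```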

[Benchmark plot: how the filesystems compare]

So how does XFS scale now? For one or two threads, XFS is still slightly slower than ext4, but it scales linearly up to eight threads, while ext4 gets worse, and btrfs gets a lot worse. The scalability constraints for XFS are now to be found in the locking in the virtual filesystem layer core, not in the filesystem-specific code at all. Directory traversal is now faster even for one thread and much faster for eight. These are, he suggested, not the kind of results that the btrfs developers are likely to show people.

The scalability of space allocation is "orders of magnitude" better than what ext4 offers now. That changes a bit with the "bigalloc" feature added in 3.2, which improves ext4 space allocation scalability by two orders of magnitude if a sufficiently large block size is used. Unfortunately, it also increases small-file space usage by about the same amount, to the point that 160GB are required to hold a kernel tree. Bigalloc does not play well with some other ext4 options and requires complex configuration questions to be answered by the administrator, who must think about how the filesystem will be used over its entire lifetime when the filesystem is created. Ext4, Dave said, is suffering from architectural deficiencies - using bitmaps for space tracking, in particular - that are typical of a 1980s-era filesystem. It simply cannot scale to truly large filesystems.

Space allocation in Btrfs is even slower than with ext4. Dave said that the problem was primarily in the walking of the free space cache, which is currently CPU-intensive. This is not an architectural problem in btrfs, so it should be fixable, but some optimization work will need to be done.

The future of Linux filesystems

Where do things go from here? At this point, metadata performance and scalability in XFS can be considered to be a solved problem. The performance bottleneck is now in the VFS layer, so the next round of work will need to be done there. But the big challenge for the future is in the area of reliability; that may require some significant changes in the XFS filesystem.

Reliability is not just a matter of not losing data - hopefully XFS is already good at that - it is really a scalability issue going forward. It just is not practical to take a petabyte-scale filesystem offline to run a filesystem check and repair tool; that work really needs to be done online in the future. That requires robust failure detection built into the filesystem so that metadata can be validated as correct on the fly. Some other filesystems are implementing validation of data as well, but that is considered to be beyond the scope of XFS; data validation, Dave said, is best done at either the storage array or the application levels.

"Metadata validation" means making the metadata self describing to protect the filesystem against writes that are misdirected by the storage layer. Adding checksums is not sufficient - a checksum only proves that what is there is what was written. Properly self-describing metadata can detect blocks that were written in the wrong place and assist in the reassembly of a badly broken filesystem. It can also prevent the "reiserfs problem," where a filesystem repair tool is confused by stale metadata or metadata found in filesystem images stored in the filesystem being repaired.

Making the metadata self-describing involves a lot of changes. Every metadata block will contain the UUID of the filesystem to which it belongs; there will also be block and inode numbers in each block so the filesystem can verify that the metadata came from the expected place. There will be checksums to detect corrupted metadata blocks and an owner identifier to associate metadata with its owning inode or directory. A reverse-mapping allocation tree will allow the filesystem to quickly identify the file to which any given block belongs.
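A hypothetical header along these lines might look like the sketch below; the field names and layout are invented for illustration and are not the actual XFS on-disk format, but they show why a block can be rejected even when its checksum is correct:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Invented self-describing metadata header; not the real XFS format. */
struct meta_block_header {
    uint8_t  fs_uuid[16];   /* UUID of the filesystem this block belongs to */
    uint64_t blockno;       /* where the block believes it lives on disk */
    uint64_t owner_ino;     /* inode or directory that owns this metadata */
    uint32_t crc;           /* checksum over the rest of the block */
};

/*
 * A checksum alone only proves the block was written intact somewhere;
 * the UUID and block number catch stale images and misdirected writes.
 */
static int verify_meta_block(const struct meta_block_header *hdr,
                             const uint8_t expected_uuid[16],
                             uint64_t expected_blockno,
                             uint32_t computed_crc)
{
    if (memcmp(hdr->fs_uuid, expected_uuid, 16) != 0)
        return 0;   /* belongs to another filesystem (e.g. a stored image) */
    if (hdr->blockno != expected_blockno)
        return 0;   /* intact data, wrong place: a misdirected write */
    if (hdr->crc != computed_crc)
        return 0;   /* corrupted on the way to or from the media */
    return 1;
}

int main(void)
{
    uint8_t expected_uuid[16] = { 0xab };   /* example filesystem UUID */
    struct meta_block_header hdr = { { 0xab }, 42, 7, 0x1234 };

    printf("block accepted: %d\n",
           verify_meta_block(&hdr, expected_uuid, 42, 0x1234));
    return 0;
}
```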

Needless to say, the current XFS on-disk format does not provide for the storage of all this extra data. That implies an on-disk format change. The plan, according to Dave, is to not provide any sort of forward or backward format compatibility; the format change will be a true flag day. This is being done to allow complete freedom in designing a new format that will serve XFS users for a long time. While the format is being changed to add the above-described reliability features, the developers will also add space for d_type in the directory structure, NFSv4 version counters, the inode creation time, and, probably, more. The maximum directory size, currently a mere 32GB, will also be increased.

All this will enable a lot of nice things: proactive detection of filesystem corruption, the location and replacement of disconnected blocks, and better online filesystem repair. That means, Dave said, that XFS will remain the best filesystem for large-data applications under Linux for a long time.

What are the implications of all this from a btrfs perspective? Btrfs, Dave said, is clearly not optimized for filesystems with metadata-heavy workloads; there are some serious scalability issues getting in the way. That is only to be expected for a filesystem at such an early stage of development. Some of these problems will take some time to overcome, and the possibility exists that some of them might not be solvable. On the other hand, the reliability features in btrfs are well developed and the filesystem is well placed to handle the storage capabilities expected in the coming few years.

Ext4, instead, suffers from architectural scalability issues. According to Dave's results, it is not the fastest filesystem anymore. There are few plans for reliability improvements, and its on-disk format is showing its age. Ext4 will struggle to support the storage demands of the near future.

Given that, Dave had a question of sorts to end his presentation with. Btrfs will, thanks to its features, soon replace ext4 as the default filesystem in many distributions. Meanwhile, ext4 is being outperformed by XFS on most workloads, including those where it was traditionally stronger. There are scalability problems that show up even on smaller server systems. It is "an aggregation of semi-finished projects" that do not always play well together; ext4, Dave said, is not as stable or well-tested as people think. So, he asked: why do we still need ext4?

One assumes that ext4 developers would have a robust answer to that question, but none were present in the room. So this seems like a discussion that will have to be continued in another setting; it should be interesting to watch.

[ Your editor would like to thank the linux.conf.au organizers for their assistance with his travel to the conference. ]

Index entries for this article
Kernel: Filesystems/XFS
Conference: linux.conf.au/2012



XFS: the filesystem of the future?

Posted Jan 20, 2012 20:40 UTC (Fri) by rfunk (subscriber, #4054) [Link]

"Reliability is not just a matter of not losing data - hopefully XFS is already good at that"

Heh, having lost an entire filesystem (and zeroed-out files on another filesystem) to XFS a while back, I find that wording grimly amusing. In my experience it really was quite good at losing data.

But seriously, XFS has long had a reputation for being more likely to eat your data than its Linux competition, at least in non-server-room use cases. Have they really fixed that? I mean, yes "hopefully" they have, but it would be good to know beyond just hope before we start switching to it based on speed.

XFS: the filesystem of the future?

Posted Jan 20, 2012 20:55 UTC (Fri) by flashydave (guest, #29267) [Link]

The other issue that has bitten me numerous times is a kernel stack overflow when using xfs and NFS together, which results in a system lockup. On a given production system it occurs every 2 to 4 weeks - enough to be embarrassing. There was a developers' thread discussing this a while back but I have not heard of any fixes subsequently. This is the one issue that makes me cautious about using xfs in future. Can anyone bring me up to date?

XFS: the filesystem of the future?

Posted Jan 21, 2012 20:02 UTC (Sat) by dmcguicken (guest, #57851) [Link]

!

Do you have any other details on this? I have a sneaking suspicion this is EXACTLY what I was seeing on a little VIA box of mine that I blamed on a faulty I/O daughterboard.

I was using a 1/2 TB external USB drive with XFS for my MythTV recordings, shared over the LAN via NFSv4... and I saw hard lockups every few weeks, pretty reliably.

XFS: the filesystem of the future?

Posted Jan 23, 2012 10:21 UTC (Mon) by flashydave (guest, #29267) [Link]

XFS: the filesystem of the future?

Posted Jan 22, 2012 12:04 UTC (Sun) by dgc (subscriber, #6611) [Link]

Hi Flashydave,

If you don't report problems like this to the developers, then we don't know you are having them. The reason the stack overflow fix has not been pushed is that it might have unexpected performance issues due to moving allocation work into workqueues where they have a full stack to work with. Seeing that there have only been a couple of reports of stack overflows on the list in the past 2 years, it doesn't appear to be a widely occurring problem. Hence the urgency for the fix does not appear to be that great and so fixing it can wait until we fully understand the implications of the proposed fix.

IOWs, the frequency or likelihood of occurrence of a problem greatly influences decisions on whether to push fixes right now or wait for more testing. So, don't assume that we know how much your systems are affected by the problem even though we might be discussing a possible fix - report them to xfs@oss.sgi.com so we are guaranteed to know about them and can take that into account.

As it is, I have been testing the fix for some time now so I'm now pretty confident it doesn't cause any regressions. I'm definitely considering re-proposing it for the next merge cycle now that I have a lot more testing done on it...

Dave.

XFS: the filesystem of the future?

Posted Jan 23, 2012 10:19 UTC (Mon) by flashydave (guest, #29267) [Link]

Sorry - I was under the impression from the threads I was reading it was a well documented issue and that it was not particularly helpful to simply say "me too".

XFS: the filesystem of the future?

Posted Jan 30, 2012 0:11 UTC (Mon) by sbergman27 (guest, #10767) [Link]

"""
I was under the impression from the threads I was reading it was a well documented issue and that it was not particularly helpful to simply say "me too".
"""

Nevertheless, it was negligent of you not to jump through all the required hoops to register your bug report. Just because a hundred people before you had already reported it does *not* mean that the devs would necessarily have noted the problem then, or admit to it existing now. Your data loss was your fault. Because you did not bother to file a bug.

FS developers toil day and night, and the only payment they get is fame, glory, and money. Any data loss is the fault of the user.

XFS: the filesystem of the future?

Posted Jan 30, 2012 10:59 UTC (Mon) by nix (subscriber, #2304) [Link]

FS developers toil day and night, and the only payment they get is fame, glory, and money.
What world do you live in, and how can I move there?

XFS: the filesystem of the future?

Posted Feb 6, 2012 3:16 UTC (Mon) by chloe_zen (subscriber, #8258) [Link]

I gave up on XFS when it started leaving me files of NULs after crashes. I understand they fixed that problem, but until iops are my limiting factor why should I leave ext4?

XFS: the filesystem of the future?

Posted Feb 7, 2012 16:26 UTC (Tue) by phoenix (guest, #73532) [Link]

Which version of the kernel, XFS, and NFS?

We use XFS and NFS on ~50 servers, serving 1-2 TB of RAID/LVM to 100-400 diskless stations each, without running into any kernel stack overflows or other lockups.

64-bit Debian 5.0.

If this is something new (since kernel 2.6.32), we'd like to know. :)

XFS: the filesystem of the future?

Posted Jan 20, 2012 22:14 UTC (Fri) by ricwheeler (subscriber, #4980) [Link]

I think that you need to put some analysis out there.

Did you have bad storage? Misconfigured write cache? Did you run xfs_repair? Did you (or a vendor) do any analysis of the errors?

In my experience, XFS has been quite reliable and is the most common file system used in many storage appliances.

XFS: the filesystem of the future?

Posted Jan 28, 2012 23:12 UTC (Sat) by sbergman27 (guest, #10767) [Link]

"Did you have bad storage? Misconfigured write cache?"

Yes. There must be some way to blame this on the user or his hardware.

What do you mean by "misconfigured write cache"? What special thing is it that you are supposed to do with the drive's write cache to keep XFS from eating your data?

XFS: the filesystem of the future?

Posted Jan 29, 2012 0:47 UTC (Sun) by dlang (guest, #313) [Link]

if you have a drive that is caching data in RAM on the drive and then writing it out later, you will lose data in a power failure.

many drives have been found to lie to the OS about when the data is saved (they report it being saved when it's only in the cache, not when it's actually saved to the platter)

if you have a drive like this, then you will lose data, no matter what filesystem you use, and even if you use a high-end raid controller.

XFS: the filesystem of the future?

Posted Jan 29, 2012 1:36 UTC (Sun) by sbergman27 (guest, #10767) [Link]

"if you have a drive like this, then you will loose data, no matter what filesystem you use, and even if you use a high-end raid controller."

Especially if the filesystem itself is already being cavalier with the data, by holding it in the page/buffer caches as long as it can, and playing Russian roulette with "features" like delayed allocation. All in the name of good benchmark numbers.

Drive caches are a small factor in comparison. We didn't really even use to worry, or even think about them. Now, they seem to be the preferred scapegoat for Linux filesystem devs when data loss occurs. I *know* how reliable things were before we had barriers and FUA. And I know what I'm seeing now. With all due respect, I'm just not buying this explanation.

XFS: the filesystem of the future?

Posted Jan 30, 2012 17:35 UTC (Mon) by Otus (subscriber, #67685) [Link]

> Drive caches are a small factor in comparison. We didn't really even use to worry, or even think about them.

They used to be small. They've grown approximately at the same speed as HDD sizes, which is significantly faster than throughput, not to mention seek time.

A consumer HDD from c. 2000 might have a 2 MB cache and 40 MB/s throughput, so a full cache empties from sequential data in 50 ms best case. Current 2-3 TB drives have a 64 MB cache and 100-150 MB/s throughput, so a full cache takes around 500 ms minimum to empty.

For non-sequential data it's much worse.
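As a quick back-of-the-envelope check on those figures (a sketch using the numbers quoted above and assuming purely sequential writeback):

```c
#include <stdio.h>

int main(void)
{
    /* c. 2000: 2 MB of cache, 40 MB/s sequential throughput */
    printf("old drive: %.0f ms to drain\n", 2.0 / 40.0 * 1000.0);

    /* today: 64 MB of cache, 100-150 MB/s sequential throughput */
    printf("new drive: %.0f-%.0f ms to drain\n",
           64.0 / 150.0 * 1000.0, 64.0 / 100.0 * 1000.0);
    return 0;
}
```

That works out to 50 ms for the old drive and roughly 430-640 ms for a current one, before any seeking is taken into account.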

XFS: the filesystem of the future?

Posted Feb 2, 2012 16:57 UTC (Thu) by jd (subscriber, #26381) [Link]

Cache would not be a problem if:

(a) it was battery-backed, and
(b) was write-through

Battery-backed doesn't have to mean the whole drive has to remain powered-up, it just has to mean the DRAM gets enough juice to keep refreshing until regular power is restored *if* there is any unwritten content in it. In other words, if everything is flushed to disk then you don't need to keep the drive's RAM powered. If drive manufacturers were *really* clever, then only those blocks of RAM with unflushed content would need to remain powered.

It's hard to get a frame of reference, as most devices with RAM and a modern Li-ion battery also have a power-hungry CPU and an even hungrier RF system to feed. Here, you only need to keep selective RAM chips powered, no processing is required. I have absolutely no idea what kind of leakage of charge good batteries suffer, but it is probably small. Just keeping DRAM alive doesn't take a vast amount of power. This solution should be adequate to handle even Katrina-length power outages. Beyond that, disk corruption is unlikely to be your major concern.

XFS: the filesystem of the future?

Posted Feb 5, 2012 19:45 UTC (Sun) by rilder (guest, #59804) [Link]

If you need data integrity (like database commits) you will need to enforce it from application with fsync etc or have a SSD which provides such guarantees.

Speaking of the write cache, the advice is to disable it on the disk if you have another battery-backed write cache sitting behind it; as for laptops, you can leave it enabled since they are battery-backed. No one says to disable caching completely.

You can start reading about it here -- http://xfs.org/index.php/XFS_FAQ#Q:_What_is_the_problem_w...

Speaking of caching in the page cache, it is just there to provide better I/O locality, as mentioned in the talk, and it is flushed periodically; if it were flushed as and when required, you would end up with a seek nightmare on the disk.

XFS: the filesystem of the future?

Posted Feb 6, 2012 2:52 UTC (Mon) by dlang (guest, #313) [Link]

> If you need data integrity (like database commits) you will need to enforce it from application with fsync etc or have a SSD which provides such guarantees.

having an SSD or battery-backed cache does not replace doing fsyncs. If you don't do the fsync you don't know that the data is being written from the OS cache to the disk subsystem.
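For reference, the pattern applications typically use to durably replace a file is write-to-a-temporary-name, fsync(), then rename(). A minimal sketch (paths are examples, error handling is minimal, and a fully careful version would also fsync the containing directory after the rename):

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int replace_file_durably(const char *path, const char *tmppath,
                                const char *data, size_t len)
{
    int fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -1;

    /* Real code would loop to handle short writes. */
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    close(fd);

    /* Switch names only after the data is known to be on stable storage. */
    return rename(tmppath, path);
}

int main(void)
{
    const char msg[] = "important data\n";

    return replace_file_durably("data.txt", "data.txt.tmp",
                                msg, strlen(msg)) ? 1 : 0;
}
```

The fsync() is what gives the OS (and, with barriers enabled, the drive) a chance to get the data onto stable media before the rename makes the new file visible.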

Shared pain

Posted Jan 20, 2012 22:16 UTC (Fri) by gwolf (subscriber, #14632) [Link]

I was also an XFS enthusiast a while back (~5 years). In my experience, XFS does not lose data under normal use, but a hardware crash (power outage, etc.) leaves many zeroed files. What I read back then is that the filesystem was sure to be coherent (the structure would never be compromised), but the data itself was left for later, whenever there was time to flush the caches.

Shared pain

Posted Jan 20, 2012 22:59 UTC (Fri) by zlynx (guest, #2285) [Link]

EXT4 acts the same way, creating files with zeros when run in writeback journaling mode. I run mine this way, although I do make sure that auto_da_alloc is turned on so that data is flushed when doing file replacement via rename.

I'd much rather have the performance.
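For reference (a sketch only; the device, mount point, and choice of options are examples, not recommendations), that kind of configuration corresponds to mounting with an options string along these lines:

```c
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* Example paths; needs root, and the right device, to actually run. */
    if (mount("/dev/sdb1", "/mnt/data", "ext4", 0,
              "data=writeback,auto_da_alloc") != 0) {
        perror("mount");
        return 1;
    }
    return 0;
}
```

The same options can of course be set in /etc/fstab; auto_da_alloc is the option that makes ext4 force out delayed-allocation data when a file is replaced via rename.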

Shared pain

Posted Jan 21, 2012 12:45 UTC (Sat) by dany (guest, #18902) [Link]

> I'd much rather have the performance.

It's OK that you would, but would your employer/customers also prefer better performance over reliability? There is a reason that the default ext4 mount mode in RHEL is ordered.

Shared pain

Posted Jan 21, 2012 17:15 UTC (Sat) by ricwheeler (subscriber, #4980) [Link]

You don't need to give up reliability for performance in either ext4 or xfs.

Eric and I are both with the Red Hat file system team (as is Dave Chinner) and we would not be supporting XFS if it was not solid and reliable as well as high performance.

What you do need, as Eric mentioned, is to keep your box properly configured and to have applications that do the right things to persist data.

Jeff Moyer (also a Red Hat file system person) wrote up a nice article for LWN a few months back on best practices for data integrity.

Shared pain - article link

Posted Jan 22, 2012 21:01 UTC (Sun) by ndye (guest, #9947) [Link]

Jeff Moyer (also a Red Hat file system person) wrote up a nice article for LWN a few months back on best practices for data integrity.

This article, I presume?

Shared pain

Posted Jan 28, 2012 23:16 UTC (Sat) by sbergman27 (guest, #10767) [Link]

So you can either use Ext4 mounted with the nodelalloc option, or Ext3 mounted data=ordered, and sleep well at night. Or... you can use XFS and commission a code audit for every piece of important software you run, specifically checking for proper fsync usage. Cross your fingers, hoping the auditors didn't miss anything, and try to sleep well at night.

Shared pain

Posted Jan 29, 2012 0:44 UTC (Sun) by dlang (guest, #313) [Link]

you may sleep well at night, but it will be the sleep of someone who has been fooled about the reliability of their data.

skipping fsync is not safe on any filesystem that's not mounted -sync

this is true for every OS

Shared pain

Posted Jan 29, 2012 1:53 UTC (Sun) by sbergman27 (guest, #10767) [Link]

And again, you are glossing over the matter of the relative likelihood of data loss with various filesystems. It suits your purposes to turn it into a black and white issue. "Ext3 can lose your data too!", you cry.

Well, both my mattress and J.P. Morgan *could* lose my money. So putting my money in either place represents equal risk. If I understand you correctly, you are saying that Ext3 mounted data=ordered, Ext4 mounted with the defaults, and XFS mounted with the defaults all represent equal risk to our data because any one of them *could* conceivably lose our data.

Again, I'm not buying it.

Shared pain

Posted Jan 30, 2012 21:49 UTC (Mon) by dlang (guest, #313) [Link]

replying to multiple comments in one reply

I am not saying that the risk is equal, I am disputing the statement that ext3 is rock solid and won't lose your data without you needing to do anything.

Ext3 is one of the worst possible filesystems to use if you really care about your data not getting lost (and therefore implement the fsync dance to make sure you don't lose data), because its fsync performance is so horrid.

The applications are not keeping the data in buffers 'as long as they can', they are keeping the data in buffers for long enough to be able to optimize disk activity.

The first is foolishly risking data for no benefit, the second is taking a risk for a direct benefit. These are very different things. Ext3 also keeps data in buffers and delays writing it out in the name of performance. Every filesystem available on every OS does so by default; they may differ in how long they will buffer the data, and what order they write things out, but they all buffer the data unless you explicitly mount the filesystem with the sync option to tell it not to.

You say that people will always pick reliability over performance, but time and time again this is shown to not be the case. As pointed out by another poster, MySQL grew almost entirely on its "performance over reliability" approach; only after it was huge did they start pushing reliability. The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

This extends beyond the computing field. There are no fields where the objective is to eliminate every risk, no matter what the cost. It is always a matter of balancing the risk with the costs of eliminating them, or with the benefits of accepting the risk.

That's the problem...

Posted Jan 30, 2012 22:54 UTC (Mon) by khim (subscriber, #9252) [Link]

In theory, theory and practice are the same. In practice, they are not.

Ext3 is one of the worst possible filesystems to use if you really care about your data not getting lost (and therefore implement the fsync dance to make sure you don't lose data), because its fsync performance is so horrid.

Right, but that's the problem: most developers don't care about these things (most just don't think about the problem at all, others just hope it all will work... somehow). Most users do. Thus we have a strange fact: in theory ext3 is the worst possible FS from "lost data" POV, in practice it's one of the best.

That's the problem...

Posted Jan 30, 2012 23:16 UTC (Mon) by dlang (guest, #313) [Link]

trust me, users trying to run high performance software that implements data safety (databases, mail servers, etc) care about this problem as well.

For other developers, the fact that fsync performance is so horrible on the default filesystem for many distros has trained a generation of programmers to NOT use fsync (because it kills performance in ways that users complain about)

That's the problem...

Posted Feb 2, 2012 3:45 UTC (Thu) by tconnors (guest, #60528) [Link]

> For other developers, the fact that fsync performance is so horrible on the default filesystem for many distros has trained a generation of programmers to NOT use fsync (because it kills performance in ways that users complain about)

Then there's the fact that fsync will spin up your disks if you were trying to keep them spun down (to the point where on laptops, I try to use 30 minute journal commit times, and manually invoke sync when I absolutely want something committed). I don't want or need an absolute guarantee that the new file has hit the disk consistent with metadata. I want an absolute guarantee that /either/ the new file or the old file is there, consistent with the relevant metadata. ext3 did this. It's damn obvious what rename() means - there should be no need for every developer to go through all code in existence and change the semantics of code that used to work well *in practice*. XFS loses files every time power fails *in practice*. If I need to compare to backup *every time* power fails, then I might as well be writing all my data to volatile RAM and do away with spinning rust altogether, because that's all that XFS is good for.

Another pathological (but instructive) case...

Posted Jan 30, 2012 23:03 UTC (Mon) by khim (subscriber, #9252) [Link]

Similar story happens with USB sticks: most users believe FAT (on Windows) is super-safe, NTFS (on Windows) is horrible - and Linux is awful no matter what. Why? Delayed write defaults. FAT on Windows is tuned to flush everything on ANY close(2) call. NTFS works awfully slow in this mode thus it uses more aggressive caching. And on Linux caching is always on.

And users just snatch USB stick the very millisecond program window is closed (well... most do... the cautious ones wait one or two seconds). They feel it's their unalienable right. In these circumstances suddenly the oldest and the most awful filesystem of them all becomes the clear winner!

Exactly because "I care about my data" does not automatically imply "I'll do what I'm told to do to keep it".

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 4, 2012 13:08 UTC (Sat) by Wol (subscriber, #4433) [Link]

Don't get me started ... :-)

I work with Pick (a noSQL db), and a LOT of the reliability problems that ACID is meant to fix, just *can't* *happen* in Pick.

Well, they can if the database was badly designed, but you can get similar pain in relational databases too...

Relational is a lovely mathematical design - I would use it to design a database without a second thought - but I would then convert that design to NF2 (non-first-normal-form) and implement it in Pick. Because 90% of ACID's benefits would then be redundant, and the database would be so much faster too.

You've heard my war-story of a Pentium90/Pick combo outperforming an Oracle/twinXeon800, I'm sure ...

Cheers,
Wol

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 4, 2012 13:27 UTC (Sat) by gioele (subscriber, #61675) [Link]

> I work with Pick (a noSQL db), and a LOT of the reliability problems that ACID is meant to fix, just *can't* *happen* in Pick.

This is getting off-topic, but could you explain which kinds of reliability problems that ACID DBs are meant to fix cannot happen in Pick, and why?

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 7, 2012 23:20 UTC (Tue) by Wol (subscriber, #4433) [Link]

Okay, let's do a data analysis. In Pick, you would do an EAR.

Look for what I call "real world primary keys" - an invoice has a number, a person has a name. Now we've got a primary key, we work out all the attributes that belong to that key. We can now do a relational analysis on those attributes. (forget that real-world primary keys aren't always unique and you might have to create a GUID etc.)

With an invoice, in relational you'll end up with a bunch of rows spread across several tables for each invoice. IN PRACTICE, ACID is mostly used to make sure all those rows pass successfully through the database from application to disk and back again.

In Pick, however, you then coalesce all those tables (2-dimensional arrays) together into one n-dimensional Pick FILE. And you coalesce all those rows together into one Pick RECORD. With the result that there is no need for the database to make sure the transaction is atomic. All the data is held as a single atom in the application, and is passed through the database to and from disk as a single atom.

That's also why Pick smokes relational for speed - access any attribute of an object, and all attributes get pulled into cache. Try doing that for a complex object with relational !!! :-) (Plus Pick is self-optimising, and when optimised it takes, on average, just over one disk seek per primary key to find the data it's looking for on disk!)

The problem I see with relational is it is MATHS and, to reference Einstein, therefore has no basis in reality. Pick is based on solid engineering, and when "helped" with relational theory really flies. Relational practice actually FORBIDS a lot of powerful optimising techniques.

And if designed properly, a Pick database is normalised therefore it can look like a relational database, only superfast. I always compare Pick and relational to C and Pascal. Pick and C give you all the rope you need to seriously shoot yourself in the foot. Relational and Pascal have so many safety catches, it's damn hard to actually do any real work.

(And because foreign keys are attributes of a primary key, you can also trap errors in the application. For example, the client's key is a mandatory element of an invoice so it belongs in the invoice. Okay, it's the app's job to make sure the record isn't filed without a client, whereas in relational you can leave it to the DB, but in Pick it's easy enough to add a business layer between the app and the DB that does this sort of thing.)

Cheers,
Wol

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 7, 2012 23:26 UTC (Tue) by Wol (subscriber, #4433) [Link]

Just to add, look at it this way ...

In relational, attributes that are tied together can be spread (indeed, for a complex object MUST be spread) across multiple tables and rows. Column X in table Y is meaningless without column A in table B.

In Pick, that data would share the same primary key, and would be stored in one FILE, in one RECORD. Delete the primary key and both cells vanish together. Create the primary key, and both cells appear waiting to be filled.

As far as Pick is concerned, a RECORD is a single atom to be passed through from disk to app and vice versa. From what I can make out, in relational you can't even guarantee a row is a single atom!

Cheers,
Wol

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 8, 2012 0:45 UTC (Wed) by dlang (guest, #313) [Link]

this has nothing to do with the ACID guarantees. the ACID guarantees have to do with what happens when you start modifying the datastore, specifically what happens if the modification doesn't complete (including that the system crashes in the middle of an update)

ACID is the sequence of modifying the file on disk so that you have either the new data or the old data at all times, and if the application says that the transaction is done, there's no way for it to disappear.

what you are talking about with Pick is a way to bundle related things together so that it's easier to be consistent. That doesn't mean that writing your record will happen in an atomic manner on the filesystem (if the record is large enough, this won't be an atomic action)

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 8, 2012 14:52 UTC (Wed) by Wol (subscriber, #4433) [Link]

So ACID is there to cope with the fact that, what the DB sees as a single transaction, the operating system and filestore doesn't, so it sits between the database and OS, and makes sure that the multiple success/fails returned by the OS are returned to the database as a single success or fail.

AND THAT IS MY POINT. In Pick, 90% of the time, this is unnecessary and wasteful, because what is a single transaction as seen by the DB is also a single transaction as seen by the OS and disk subsystem!

I agree with you, if the record is too big, it won't go onto the file store in one piece, but the point is that with a relational DB you can pretty much guarantee that, in practice, a transaction will never go onto the filestore in one piece. So ACID is needed. But in Pick it's unusual for it NOT to go on the filesystem in one piece. So most of the time ACID is an unnecessary complexity.

I'm ignoring file system failures like XFS/ext4 zeroing out your table in a crash :-) because I don't see how ACID can protect against the OS trashing your data :-)

What you need to do is realise that ACID sits between the database and disk. As you say, it guarantees that the database in memory is (a) consistent, and (b) accurately represented on disk. And because, *in* *the* *real* *world*, pretty much any change in a relational database requires multiple changes in multiple places on disk, ACID is a necessity.

But in the real world, most changes in a Pick database only involve a *single* change in a *single* place on disk to ensure consistency. So a "write successful" from the OS is all that's needed to provide a "good enough" implementation of ACID. (And if the OS lies to your ACID layer, you're sunk even if you've got ACID. See all the other posts in this thread about disks lying to the OS!)

(This has other side effects. Yes, a Pick database can get into an inconsistent state. But that inconsistent state MIRRORS REALITY. A Pick database can lose the connection between a person and his house. Or a car and its owner. But in reality a person can lose their house. A car can lose its owner. It's far too easy to assume in a relational database that everyone has a home, and next thing you can't put some poor vagrant in your database when you need to ...)

Cheers,
Wol

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 8, 2012 22:17 UTC (Wed) by dlang (guest, #313) [Link]

you don't seem to understand that writes to the filesystem are not atomic in just about every case, let alone dealing with the rest of ACID

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 9, 2012 0:17 UTC (Thu) by Wol (subscriber, #4433) [Link]

Writes to the file system where? At the db/OS interface? At the OS/disk interface?

Because if it's at the OS/disk interface, what the heck is ACID doing in the database? It can't provide ANY guarantees, because it's too remote from the action.

And if it's at the db/OS interface, well as far as Pick is concerned, most transactions are near-enough atomic that the overhead isn't worth the cost (that was my comment about "90% of the time").

Your relational bias is clouding your thinking (although Pick might be clouding mine :-) But just because relational cannot do atomic transactions to disk doesn't mean Pick can't. As far as Pick is concerned, that transaction is atomic right up to the point that the OS code actually puts the data onto the disk. And if the OS screws that up, ACID isn't going to save you ...

Think of a "begin transaction" / "end transaction" pair. It's almost impossible for that transaction to truly be atomic in a relational database - you will invariably need to update multiple rows. In Pick, it's more than possible for that transaction to be truly atomic at the point where the db hands it over to the OS. ACID enforces atomicity between the OS and the db. Pick doesn't need it.

What guarantees does ACID provide over and above data consistency? Because a well-designed Pick app guarantees "if it's there it's consistent". And if the OS screws up and corrupts it, neither Pick nor ACID will save you.

Cheers,
Wol

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 9, 2012 0:54 UTC (Thu) by dlang (guest, #313) [Link]

ACID has nothing to do with relational algebra

ACID is a feature that SQL databases have had, but you don't need to abandon SQL to abandon ACID and you don't need to have SQL to have ACID

Berkeley DB is ACID, but not SQL, MySQL was SQL but not ACID with the default table types for many years.

ACID involves the database application doing a lot of stuff to provide the ACID guarantees to users by using the features of the OS and hardware. If the OS/hardware lies to the database application about when something is actually completed then the database cannot provide ACID guarantees.

It appears that you have an odd interpretation about what ACID means, so reviewing

Atomicity

A transaction is either completely implemented or not implemented at all. For changes to a single record this is relatively easy to do, but if a transaction involves changing multiple records (subtract $10 from account A and add $10 to account B) it's not as simple as atomically writing one record. Remember that even a single write() call in C is not guaranteed to be atomic (it's not even guaranteed to succeed fully, you may be able to write part of it and not other parts; see the short sketch at the end of this comment)

Consistency

this says that at any point in time the database will be consistent, by whatever rules the database chooses to enforce. Berkeley DB has very trivial consistency checks, the records must all be complete. Many SQL databases have far more complex consistency requirements (foreign keys, triggers, etc)

Isolation

This says that one transaction cannot affect another transaction happening at the same time

Durability

This says that once a transaction is reported to succeed then nothing, including a system crash at that instant (but excluding something writing over the file on disk) will cause the transaction to be lost

What you are describing about Pick makes me think that it has very loose consistency and isolation requirements, but to get Atomicity and Durability the database needs to be very careful about how it writes changes.

It cannot overwrite an existing record (because the write may not complete), and it must issue appropriate system calls (fsync and similar) to the OS, and watch for the appropriate results, to know when the data has actually been written to disk and will not change.

It's getting this last part done that really differentiates similar database engines from each other. There are many approaches to doing this and they all have their performance trade-offs. If you are willing to risk your data by relaxing these requirements a database becomes trivial to implement and is faster by several orders of magnitude.

note how the only SQL concept that is involved here is the concept of a transaction in changing the data.
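To illustrate the atomicity point about write() above (a small standalone sketch, not taken from any particular database): a careful writer has to loop over short writes, and even then all it gets is "every byte was eventually handed to the kernel", not an atomic or durable update; that still takes the journaling and fsync machinery discussed in this thread.

```c
#include <errno.h>
#include <unistd.h>

/* Keep calling write() until the whole buffer is out or a real error occurs. */
static ssize_t write_all(int fd, const char *buf, size_t len)
{
    size_t done = 0;

    while (done < len) {
        ssize_t n = write(fd, buf + done, len - done);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing anything: retry */
            return -1;          /* real error */
        }
        done += (size_t)n;      /* short write: continue where we stopped */
    }
    return (ssize_t)done;
}

int main(void)
{
    const char msg[] = "hello\n";

    return write_all(STDOUT_FILENO, msg, sizeof(msg) - 1) < 0;
}
```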

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 9, 2012 20:22 UTC (Thu) by Wol (subscriber, #4433) [Link]

Yup. I am being far looser in my requirements for ACID for Pick, but the reason is that Pick is far more ACID by accident than relational.

Atomic: as I said, a transaction in relational will pretty much inevitably be split across multiple, often many, tables. In Pick, all dependent attributes (excluding foreign-key links) will be updated as a single transaction right down to the file-system layer. So, as an example, if I have separate FILEs for people and buildings, it's possible I'll corrupt "where someone lives" if I update the person and fail to create the building, but I won't have inconsistent person or building data.

Consistency: IF designed properly, a Pick database should be consistent within entities. All data associated with an individual "real world primary key". Relations between entities could get corrupted, but that *should* be solved with good programming practice - in my example above, "lives at" is an attribute of person, so you update building then person.

Isolation: I don't quite understand that, so I won't comment.

Durability: Well, when I tried to write a Pick engine, my first reaction to actually writing FILEs to disk was "copy on write seems pretty easy...". And there comes a point where you have to take the OS on trust.

So I think my premise still stands - a LOT of the requirement for ACID is actually *caused* by the rigorous separation demanded by relational between the application and the database. By allowing the application to know about (and work with) the underlying database structure you can get all the advantages of relational's rigorous analysis, all the advantages of a strong ACID setup, and all the advantages of noSQL's speed. But it depends on having decent programmers (cue my previous comment about Pick and C giving you all the rope you need ...)

And one of the reasons I wanted to write that Pick db engine was so I could put in - as *optional* components - loads of stuff that enforced relational constraints to try and rein in the less-competent programmers! I want a Modula-2 sort of Pick, that by default protects you from yourself, but where the protections can be turned off.

Cheers,
Wol

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 9, 2012 20:36 UTC (Thu) by dlang (guest, #313) [Link]

atomic, your scheme won't work if you need to make changes to two records (the ever popular "subtract $10 from account A, add $10 to account B" example)

consistency, what if part of your updates get to disk and other parts don't? what if the OS (or drive) re-orders your updates so that the write to the record for person happens before the write to building?

As far as durability goes, if you don't tell the OS to flush its buffers (which is what fsync does), then in a crash you have no idea what may have made it to disk and what didn't.

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 10, 2012 16:17 UTC (Fri) by Wol (subscriber, #4433) [Link]

The ever popular "subtract $10, add $10" ...

Well, if you define the transaction as an entity, then it gets written to its own FILE. If the system crashes then you get a discrepancy that will show up in an audit. It makes sense to define it as an entity - it has its own "primary key" ie "time X at teller Y". Okay, you'll argue that I have to run an integrity check after a crash (true) while you don't, but I can probably integrity-check the entire database in the time it takes you to scan one big table :-)

Consistency? Journalling a transaction? Easily done.

And yes, your point about flushing buffers is good, but that really should be the OS's problem, not the app (database) sitting on top. Yes I know, I used the word *should* ...

Look at it from an economic standpoint :-) If my database (on equivalent hardware) is ten times faster than yours, and I can run an integrity check after a crash without impinging on my users, and I can guarantee to repair my database in hours, which is the economic choice?

Marketing 101 - proudly announce your weaknesses as a strength. The chances of a crash occurring at the "wrong moment" and corrupting your database are much higher with SQL, because any given task will typically require between 10s and 100s more transactions between the db and OS than Pick. So SQL needs ACID. With Pick, the chances of a crash happening at the wrong moment and corrupting data are much, much lower. So expensive strong ACID actually has a prohibitive cost. Especially if you can get 90% of the benefits for 10% of the effort.

I'm not saying ACID isn't a good thing. It's just that the cost/benefit equation for Pick says strong ACID isn't worth it - because the benefits are just SO much less. (Like query optimisers. Pick doesn't have an optimiser because it's pretty much a dead cert the optimser will save less than it costs!)

Cheers,
Wol

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 10, 2012 18:43 UTC (Fri) by dlang (guest, #313) [Link]

so that means that you don't have any value anywhere in your database that says "this is the amount of money in account A", instead you have to search all transactions by all tellers to find out how much money is in account A

that doesn't sound like a performance win to me.

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 11, 2012 2:30 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link]

Well, git works exactly the same way. Is it fast enough for you?

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 11, 2012 5:48 UTC (Sat) by dlang (guest, #313) [Link]

what gives you reasonable performance for a version control system with a few updates per minute is nowhere close to being reasonable for something that measures its transaction rate in thousands per second.

besides, git tends to keep the most recent version of a file uncompressed, it's only when the files are combined into packs that things need to be reconstructed, and even there git only lets the chains get so long.

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 11, 2012 13:44 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link]

git/svn/... store intermediate versions of the source code, so that applying all patches becomes O(log N) instead of O(N). But that's just an optimization.

NoSQL systems work in a similar way - they can store the 'tip' of the data, so that they don't have to reapply all the patches all the time. However, the latest data view can be rebuilt if required.

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 12, 2012 15:57 UTC (Sun) by nix (subscriber, #2304) [Link]

Actually, even the most recent stuff is compressed. It just might not be deltified in terms of other blobs (which is what you meant, I know).

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 12, 2012 18:29 UTC (Sun) by dlang (guest, #313) [Link]

yes, everything stored in git is compressed, but it only gets deltafied when it gets packed.

and it's frequently faster to read a compressed file and uncompress it than it is to read the uncompressed equivalent (especially for highly compressible text like code or logs), I've done benchmarks on this within the last year or so

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 12, 2012 13:38 UTC (Sun) by Wol (subscriber, #4433) [Link]

Okay, it would need a little bit of coding, but I'd do the following ...

Each month, when you run end-of-month statements, you save that info. When you update an account you keep a running total.

If the system crashes you then do "set corruptaccount = true where last-month plus transactions-this-month does not equal running balance". At which point you can do a brute force integrity check on those accounts.

(If I've got a 3rd state of that flag, undefined, I can even bring my database back on line immediately I've run a "set corruptaccount to undefined" command!)

And in Pick, that query will FLY! If I've got a massive terabyte database that's crashed, it's quite likely going to take a couple of hours to reboot the OS (I just rebooted our server at work - 15-20 mins to come up including disk checks etc). What's another hour running an integrity check on the data? And I can bring my database back on line immediately that query (and others like it) have completed. Tough luck on the customer whose account has been locked ... but 99% of my customers can have normal service resume quickly.

Thing is, I now *know* after a crash that my data is safe, I'm not trusting the database company and the hardware. And if my system is so much faster than yours, once the system is back I can clear the backlog faster than you can. Plus, even if ACID saves your data, I've got so much less data in flight and at risk.

But this seems to be mirroring the other debate :-) the moan about "fsync and rename" was that fsync was guaranteeing (at major cost) far more than necessary. The programmer wanted consistency, but the only way he could get it was to use fsync, which charged a high price for durability. If I really need ACID I can use BEGIN/END TRANSACTION in Pick. But 99% of the time I don't need it, and can get 90% of its benefits with 10% of its cost, just by being careful about how I program. At the end of the day, Pick gives me moderate ACID pretty much by default. Why should I have to pay the (high) price for strong ACID when 90% of the time, it is of no benefit whatsoever? (And how many SQL programmers actually use BEGIN/END TRANSACTION, even when they should?)

Cheers,
Wol

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 8, 2012 14:08 UTC (Wed) by nix (subscriber, #2304) [Link]

From what I can make out, in relational you can't even guarantee a row is a single atom!
Well, the relational algebra does not discuss storage at all, and does not stipulate where relations might reside on permanent storage (nor *which* might: you could perfectly well store join results permanently for all it cares).

But in practice, in SQL... just try INSERTing half a row. You can't. Atomicity at the row level is guaranteed. I hate SQL, but at least it does this right.

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 8, 2012 15:00 UTC (Wed) by Wol (subscriber, #4433) [Link]

Well, the relational algebra does not discuss storage at all, and does not stipulate where relations might reside on permanent storage

Which is exactly my beef with relational databases. C&D FORBID you from telling the database where relations should be stored for efficiency. But in REALITY it is highly probable that, if you access one attribute associated with my primary key, you will want to access others. But it's a complete gamble retrieving the same attribute associated with other primary keys. Because Pick guarantees (by accident, admittedly) that all attributes are stored in the same atom as the primary key they describe, all those attributes you are statistically most likely to want are exactly the attributes that coincidentally get retrieved together.

Cheers,
Wol

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 8, 2012 15:05 UTC (Wed) by Wol (subscriber, #4433) [Link]

Atomicity at the row level IN THE DATABASE is guaranteed, yes.

What I meant was it's not guaranteed at the physical level in the datastore. Two cells in the same row could be stored in completely different "buckets" in the database, for example the data is stored in an index with a pointer from the row. I know that probably doesn't happen but if the guy who designed the database engine thinks it's more efficient there's nothing stopping him.

So even if you the database programmer *think* an operation should be atomic right down to the disk, there's no guarantee.

Cheers,
Wol

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 8, 2012 21:53 UTC (Wed) by nix (subscriber, #2304) [Link]

It happens quite a lot, increasingly often now that databases are lifting the horrible restrictions many of them had on the total amount of data stored per row (Oracle and MySQL had limits low enough that you could hit them in real systems quite easily).

If it matters that data is written to the disk atomically, you have already lost, because *nothing* is written to the disk atomically, not least because you invariably have to update metadata, and secondly because no disk will guarantee what happens to partial writes in the case of power failure. So, as long as you have to keep a journal or a writeahead log to deal with that, why not allow arbitrarily large amounts of data to appear to be written atomically? Hence, transactions.

It is true that programs that truly use transactions are relatively rare: in one of my least proud moments I accidentally changed the rollback operation in one fairly major financial system to do a commit and it was a year before anyone noticed. However, when you *do* have code that uses transactions, the effect on the code complexity and volume can be dramatic. As a completely random example, I've written backtracking searchers that relied on rollback in about 200 lines before, because I could rely on the database's transaction system to do nearly all the work for me.

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 9, 2012 20:30 UTC (Thu) by Wol (subscriber, #4433) [Link]

> It happens quite a lot, increasingly often now that databases are lifting the horrible restrictions many of them had on the total amount of data stored per row (Oracle and MySQL had limits low enough that you could hit them in real systems quite easily).

Sorry, I have to laugh here. It's taken Pick quite a while to get rid of the 32K limit, but that limit does date from the age of the dinosaur when computers typically came with 4K of core ...

And there's no limit on the size of individual items, or on the number of items in a FILE.

Cheers,
Wol

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 9, 2012 20:38 UTC (Thu) by dlang (guest, #313) [Link]

if a single item is larger than the track size of a drive, it is physically impossible for the write to be atomic. You don't need to get this large to run into problems though; any write larger than a block runs the risk of being split across different tracks (or in a RAID setup, across different drives). If you don't tell the filesystem that you care about this, the filesystem will write these blocks in whatever order is most efficient for it.

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 10, 2012 16:25 UTC (Fri) by Wol (subscriber, #4433) [Link]

:-)

Look at the comment you're replying to :-) In early Pick systems I believe it was possible for a single item to be larger than available memory ...

Okay, it laid the original systems wide open to serious problems if something went wrong, but as far as users were concerned Pick systems didn't have a disk. It was just "permanent memory". And Pick was designed to "store all its data in ram and treat the disk as a huge virtual memory". I believe they usually got round any problem by flushing changes from core to disk as fast as possible, so in a crash they could just restore state from disk.

Cheers,
Wol

Shared pain

Posted Feb 2, 2012 1:06 UTC (Thu) by darrint (guest, #673) [Link]

"Well, Both my mattress and J.P. Morgan *could* lose my money. So putting my money either place represents equal risk."

I'm not sure how that metaphor is supposed to work. A few years ago when the U.S. government passed new credit reforms they were time delayed by several months at the request of the targeted banks, supposedly to give them time to carefully update their big and very important computer systems. The reality was the banks used those several months to burn and pillage the assets of people like me. I was in debt, my own stupidity of course, and I probably lost over a thousand dollars in fees alone due to a few financial institutions playing shenanigans with the ragged edges of my account terms.

At the time I thought wistfully of how much more secure my money would be if I could just collect my pay in cash and drive it to my house.

Shared pain

Posted Feb 2, 2012 14:08 UTC (Thu) by rwmj (subscriber, #5474) [Link]

You were in debt ...

Does your house loan you money? What does this negative money look like that you keep under your mattress?

Shared pain

Posted Jan 29, 2012 2:02 UTC (Sun) by sbergman27 (guest, #10767) [Link]

"I'd much rather have the performance."

Why? And have you actually measured the "performance" that you are sacrificing reliability for? I'd certainly want to quantify what I was getting in return for the reduced reliability. In testing I've done, I couldn't tell the difference between data=ordered and data=writeback with ext4. Or for that matter, between default and nodelalloc.

Even for a single user desktop, I just don't see that the trade-off is a win. YMMV, I suppose. But I would encourage you to run some objective tests.

Shared pain

Posted Feb 2, 2012 2:44 UTC (Thu) by tconnors (guest, #60528) [Link]

> I'd much rather have the performance.

Really? Why bother writing your data to disk at all then? RAM is *super* fast! Me, when I call rename(), I damn well expect either the previous file or the current version has hit the platter and is consistent with metadata. ext3 does this, and ext4 does this with some tweaks. rename() has been implied by programmers for generations to mean an atomic barrier.

XFS has never done this. That is why I don't use XFS. Because they have a damn stubborn following that insists that the perfectly reasonable semantic of close();rename(); is "wrong wrong wrongity wrong, burn you evil data hater!"
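
For reference, a minimal sketch of the close();rename(); pattern being defended here - a hypothetical save routine using plain POSIX/stdio calls, not code from any particular application:

    #include <stdio.h>
    #include <unistd.h>

    /* Write the new contents to a temporary file, then rename() it over
     * the old one.  The rename is atomic in the namespace, but nothing
     * here asks for the data to reach stable storage before the metadata
     * change does - which is exactly the point in dispute. */
    int save_file(const char *path, const char *tmp, const char *data)
    {
        FILE *f = fopen(tmp, "w");

        if (f == NULL)
            return -1;
        if (fputs(data, f) == EOF) {
            fclose(f);
            unlink(tmp);
            return -1;
        }
        if (fclose(f) == EOF) {
            unlink(tmp);
            return -1;
        }
        if (rename(tmp, path) != 0) {
            unlink(tmp);
            return -1;
        }
        return 0;
    }

The question in the rest of this thread is what state "path" may be in if the system crashes shortly after this returns.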

Shared pain

Posted Feb 2, 2012 22:13 UTC (Thu) by dlang (guest, #313) [Link]

no, rename is an atomic barrier from the point of view of your software on the running machine.

however if the filesystem does not get unmounted cleanly, all guarantees are off. This has always been the case in Unix.

Shared pain

Posted Feb 2, 2012 23:27 UTC (Thu) by khim (subscriber, #9252) [Link]

Unix is dead, sorry. On Linux you have filesystems which were not crash-safe at all (ext, ext2, etc) or which guarantee atomicity across reboots (ext3, ext4, btrfs). Oh, and there is XFS, too - looks like its developers finally understood that filesystems exist to support applications, not the other way around (even if XFS fanbois didn't)

P.S. Yes, ext4 and btrfs also had the problem under discussion. But they were quickly fixed.

Shared pain

Posted Feb 2, 2012 22:50 UTC (Thu) by dgc (subscriber, #6611) [Link]

> rename() has been implied by programmers for generations to mean an
> atomic barrier

Not true. rename is atomic, but it is not a barrier and never has implied that one exists. rename() has been around for 3 times longer than ext3, so I don't really see how ext3 behaviour can possibly be what generations of programmers expect to see....

Indeed, ext3 has unique rename behaviour as a side effect of data=ordered mode - it flushes the data before flushing the metadata, and so appears to give rename "barrier" semantics. It's the exception, not the rule.

> XFS has never done this. That is why I don't use XFS.

Using data=writeback mode on ext3 makes it behave just like XFS. So ext3 is just as bad as XFS - you shouldn't use ext3 either! :P

> they have a damn stubborn following that insists that the perfectly
> reasonable semantic of close();rename(); is "wrong wrong wrongity wrong,
> burn you evil data hater!"

That's a bit harsh.

There are many good reasons for not doing this - lots of applications don't need or want barrier semantics to rename, or are cross platform and can't rely on implementation specific behaviours for data safety. e.g. rsync is a heavy user of rename, but adding barrier semantics to the way it uses rename would slow it down substantially. Further, rsync doesn't need barrier semantics to guarantee that data has been copied and safely overwritten - it's written to be safe with current rename behaviour because it is both operating system and filesystem independent.

There have also been good arguments put forward for making this change, such as from Val Aurora (who I also quoted in my talk):

http://lwn.net/Articles/351422/

However, no-one has ever followed up on such discussions with patches to the VFS to make this a standard behaviour that you can rely on all linux filesystems to support. I'm certainly not opposed to such changes if the consensus is that this is what we should be doing - I might argue to maintain the status quo (e.g. because rsync performance is extremely important for backups on large filesystems) but that doesn't mean I don't see or understand the benefits of such a change.

Indeed, adding a new rename syscall with the desired semantics rather than changing the existing one is a compromise everyone would agree with. Perhaps you could write patches to propose this, seeing as you seem to care about such things?

Dave.

Shared pain

Posted Feb 2, 2012 23:41 UTC (Thu) by khim (subscriber, #9252) [Link]

> rename is atomic, but it is not a barrier and never has implied that one exists. rename() has been around for 3 times longer than ext3, so I don't really see how ext3 behaviour can possibly be what generations of programmers expect to see....

Easy: most currently active programmers have never seen a Unix with a journalling FS and without the ability to safely use rename across reboots. Actually they very much insist on such an ability - and it looks like XFS developers are trying to provide the capability. But it's not clear if you can trust them: clearly they value POSIX compatibility and benchmarks more than the needs of real users (who need working applications, after all; filesystem needs are just a minor implementation detail for them).

> Indeed, ext3 has unique rename behaviour as a side effect of data=ordered mode - it flushes the data before flushing the metadata, and so appears to give rename "barrier" semantics. It's the exception, not the rule.

When "exception" happens in 90% cases it becomes a new rule - it's as simple as that.

> However, no-one has ever followed up on such discussions with patches to the VFS to make this a standard behaviour that you can rely on all linux filesystems to support.

That's because we already have a solution: don't use XFS and you are golden. OSes exist to support applications - as you've succinctly shown above with the rsync example. The only problem: I'm not all that concerned with rsync speed. I need mundane things: fast compilation (solved with gobs of RAM and an SSD; the filesystem is a minor issue after that), reliable work with a bunch of desktop applications (which don't issue fsync(2) before rename(2), obviously). Since I already have a solution I don't see why I should push the patches. If you want to advocate XFS - then you must fix its problems. I'm happy with ext3/ext4 (which may contain bugs but which at least don't try to play the "all your apps are broken, you should just fix them" card).

Shared pain

Posted Feb 3, 2012 4:42 UTC (Fri) by raven667 (subscriber, #5198) [Link]

> rename() has been around for 3 times longer than ext3, so I don't really see how ext3 behaviour can possibly be what generations of programmers expect to see

I'm going to go out on a limb and say that there are more people who are familiar with expected ext3 behavior than the entire number of people who have ever run UNIX, so I do think that ext3-like behavior is what programmers in general expect these days.

Shared pain

Posted Feb 3, 2012 5:01 UTC (Fri) by neilbrown (subscriber, #359) [Link]

This doesn't change the fact that the ext3 behaviour is a mistake, was not designed, was never universal and so should not be seen as a desirable standard.

Yes, there is room for improvement - there always is. Copying a mistake because it has some good features is not a wise move.

As Dave said - if there is a problem, let's fix it properly.

(and yes, my beard is gray (or heading that way)).

Shared pain

Posted Feb 3, 2012 5:16 UTC (Fri) by raven667 (subscriber, #5198) [Link]

That the behavior was created by accident doesn't mean that it's not a good idea or that it hasn't become a de-facto standard expectation. Why else would there have been so much noise with ext4?

Shared pain

Posted Feb 3, 2012 5:25 UTC (Fri) by dlang (guest, #313) [Link]

so you are saying that because ext3 gives better behavior for people who don't code carefully, its behavior is the gold standard, even though there is still room for data loss, and the same ext3 mistake that gave you the better reliability if you are careless also gives you horrid performance if you try to be careful and make sure your data is really safe.

if you could get the advantages without the drawbacks, of course it would be nice, but the same flaw in the ext3 logic that gives you one also gives you the other.

Shared pain

Posted Feb 3, 2012 5:49 UTC (Fri) by raven667 (subscriber, #5198) [Link]

> ext3 gives better behavior for people who don't code carefully, its behavior is the gold standard

It's not even about coding carefully; doing the "correct" thing is not even possible in many of the use cases which are protected by the default ext3 behavior, such as atomically updating a file from a program which is not in C, such as a shell script. I learned, along with many admins, to use the atomic rename behavior to implement "safe" updates, which may have been a misunderstanding at the time but can now be considered the new requirement.

At the time this issue was discovered with ext4 there was a frank exchange of ideas and the realization that the expected rename behavior is beneficial to overall reliability and we should make it work properly. I'd be interested in seeing this kind of thing handled at the VFS layer so that the behavior is consistent across all filesystems; that sounds like a great idea.

Shared pain

Posted Feb 6, 2012 23:33 UTC (Mon) by dlang (guest, #313) [Link]

the rename behavior was only 'usually safe' without fsyncs (like from scripts), and you could always have a script call 'sync' (a sledgehammer to swat a fly, yes, but in the case of ext3 it generated the same disk I/O that an fsync on an individual file would)

yes, we can look at changing the standard, but the way to do that is to talk about changing the standard, not insist that the behavior of one filesystem is the only 'correct' way to do things and that all filesystem developers don't care about your data.

Shared pain

Posted Feb 7, 2012 23:47 UTC (Tue) by Wol (subscriber, #4433) [Link]

I think this argument has long been hashed out, but the point is the unwanted behaviour is pathological.

And IT IS LOGICALLY IMPOSSIBLE if the computer actually does what the programmer asked it to. THAT is the problem - the computer ends up in an "impossible" state.

And if it is logically impossible to end up there, at least in the programmer's mind, it is also logically impossible to make allowances for it and fix the system!

The state, as per the program's world view, is
(a) old file exists
(b) new file is written
(c) new file replaces old file

If the computer crashes in the middle of this we "magically" end up in state (d): old file is full of zeroes.

How do you program to fix a state that it is not logically possible to get to? In such a way that the program is actually guaranteed to work properly and portably?

Cheers,
Wol

Shared pain

Posted Feb 7, 2012 23:53 UTC (Tue) by neilbrown (subscriber, #359) [Link]

Seems like the programmer has an incorrect model of the world.

Writing to a file has never made the data safe in the event of a crash. fsync is needed for that.

If the programmer did not issue 'fsync' but still expected the data to be safe after a crash, then the programmer made a programming error. It really is that simple.

Incorrectly written programs often produce pathological behaviour - it shouldn't surprise you.
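
For concreteness, a minimal sketch of the same replace-by-rename update done the way described here, with the fsync that makes the new contents stable before they become visible (the names are illustrative, not from any particular program):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Replace-by-rename with the fsync step: the new data is forced to
     * stable storage before the rename makes it visible under the old
     * name, so a crash leaves either the old or the new contents. */
    int save_file_sync(const char *path, const char *tmp, const char *data)
    {
        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0)
            return -1;
        if (write(fd, data, strlen(data)) != (ssize_t)strlen(data) ||
            fsync(fd) != 0) {           /* data hits the disk first */
            close(fd);
            unlink(tmp);
            return -1;
        }
        if (close(fd) != 0) {
            unlink(tmp);
            return -1;
        }
        return rename(tmp, path);       /* only then the atomic replace */
    }

The cost of that fsync() on ext3 in data=ordered mode is the other half of this thread.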

Shared pain

Posted Feb 8, 2012 3:41 UTC (Wed) by mjg59 (subscriber, #23239) [Link]

I appreciate that ordering has never been guaranteed by POSIX, but let's limit it to the actual argument rather than an obvious straw man. The desired behaviour was never for a rename to guarantee that the new data had hit disk. The desired behaviour was for it to be guaranteed that *either* the old data or the new data be present. fsync provides guarantees above and beyond that which weren't required in this particular use case. It's unhelpful to simply tell application developers that they should always fsync when we've just spent the best part of a decade using a filesystem that crippled any application that did.

Shared pain

Posted Feb 8, 2012 4:08 UTC (Wed) by neilbrown (subscriber, #359) [Link]

> we've just spent the best part of a decade using a filesystem that crippled any application that did.

That's the heart of the matter to me.... but now XFS - a filesystem that didn't cripple correct applications - is getting a hard time because it doesn't follow the lead of a filesystem that did.

And yes, I know, technical excellence doesn't determine market success, and even the best contender must adapt or die when faced with an ill-informed market. So maybe XFS should adopt the extX model for rename even though it hurts performance in some cases - because if it doesn't, people might choose not to use it - and who wants to be the best filesystem that nobody uses (though XFS is a long way from that fate).

So I'm just being a lone voice trying to teach the history and show people that the feature they like so much was originally a mistake, that the programs that use it are actually incorrect (or at least not portable), and that maybe there are hidden costs in the thing they keep asking for...

I don't expect to be particularly successful, but that is no justification for being silent.

Shared pain

Posted Feb 8, 2012 12:38 UTC (Wed) by mjg59 (subscriber, #23239) [Link]

Arguing "The specification allows us to do this" isn't something that convinces the people who consume your code. Arguing "Our design makes it difficult" is more convincing, but implies that your design stage ignored your users. "We made this tradeoff for these reasons" is something that people understand, but isn't something I've seen clearly articulated in most of these discussions. It just usually ends up with strawman arguments about well how did you expect this stuff to end up on disk when you didn't fsync, which just makes people feel like you don't even care about pretending to understand what they're actually asking for.

(Abstract you throughout)

Shared pain

Posted Feb 8, 2012 13:24 UTC (Wed) by nye (guest, #51576) [Link]

>That's the heart of the matter to me.

Then you have misunderstood the nature of the problem.

The problem is that there are cases when atomicity is required but durability is not so important. With ext3 (et al.) it is possible to get one without the other, but with XFS (et al.) atomicity can only be gained as a side-effect of durability, which is more expensive.

Thus, ext3 provides a feature which XFS does not - one which filesystem developers, as a rule, don't seem to care about, but application developers, as a rule, do. The characterisation of anyone who actually cares for that feature as 'ill-informed' is grating, even offensive to many.

General addendum, not targeted at you specifically: falling back to the observation that XFS's behaviour is POSIX-compliant is pointless because - though true - it is vacuous. In fact POSIX doesn't specify anything in the case of power loss or system crashes, hence it would be perfectly legal for a POSIX-compliant filesystem to fill your hard drive with pictures of LOLcats.

Shared pain

Posted Feb 8, 2012 22:29 UTC (Wed) by dlang (guest, #313) [Link]

and with ext3 it's not possible to get durability without a huge performance impact

with any filesystem you have atomic renames IF THE SYSTEM DOESN'T CRASH before the data is written out, that's what the POSIX standard provides.

ext3 gains its 'atomic renames' as a side effect of a bug: it can't figure out what data belongs to what, so if it's trying to make sure something gets written out it must write out ALL pending data, no matter what the data is part of. That made it so that if you are journaling the rename, all the writes prior to that had to get written out first (making the rename 'safe'), but the side effect is that all other pending writes, anywhere in the filesystem, also had to be written out, and that could cause tens of seconds of delay.

for the casual user, you argue that this is "good enough", but anyone who actually wants durability, not merely atomicity, in the face of a crash has serious problems.

ext4 has a different enough design that they can order the rename after the write of the contents of THAT ONE file, so they can provide some added safety at relatively little cost

you also need to be aware that without the durability, you can still have corrupted files in ext3 after a crash; all it takes is any application that modifies a file in place, including just appending to the end of the file

Shared pain

Posted Feb 8, 2012 19:48 UTC (Wed) by Wol (subscriber, #4433) [Link]

Let's just say that governments (and businesses) have wasted billions throwing away applications where the application met the spec but in practice was unfit for purpose.

And a filesystem that throws away user data IS unfit for purpose. After all, what was the point of journalling? To improve boot times after a crash and get the system back into production quicker. If you need to do a data integrity check on top of your filesystem check, you've just made your reboot times far WORSE - a day or two would not be atypical after a crash!

Cheers,
Wol

Shared pain

Posted Feb 8, 2012 20:51 UTC (Wed) by raven667 (subscriber, #5198) [Link]

The hyperbole is getting a little out of control. Journaled filesystems have traditionally only journaled the metadata so any file data in-flight at the time of a crash would be lost and corruption would be the result. Pre-journaling any filesystem with a write cache would be susceptible to losing in-flight data and corrupting metadata leading to long fsck times after crash to repair the damage. All filesystems lose data in those circumstances, that doesn't mean that all filesystems are unfit for any purpose or that computers are fundamentally unfit for any purpose. The current state of the art is to be safer with regular data writes, even to the point of checksumming everything, that's nice but the world didn't end when this wasn't the case.

Shared pain

Posted Feb 8, 2012 15:13 UTC (Wed) by Wol (subscriber, #4433) [Link]

"fsync is needed for that"

And what is the poor programmer to do if he doesn't have access to fsync?

Or what are the poor lusers supposed to do as their system grinds to a halt with all the disk io as programs hang waiting for the disk?

Following the spec is not an end in itself. Getting the work done is the end. And if the spec HINDERS people getting the work done, then it's the spec that needs to change, not the people.

THAT is why Linux is so successful. Linus is an engineer. He understands that. "DO NOT UNDER ANY CIRCUMSTANCES WHATSOEVER break userspace" is the mantra he lives by. And filesystems eating your data while making everything *appear* okay is one of the most appalling breaches of faith by the computer that it could commit!

Cheers,
Wol

Shared pain

Posted Feb 9, 2012 1:26 UTC (Thu) by dlang (guest, #313) [Link]

> And what is the poor programmer to do if he doesn't have access to fsync?

use a language that gives them access to data integrity tools like fsync.

for shell scripts, either write an fsync wrapper, or use the sync command (which does exactly the same thing as fsync on ext3)
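
A minimal sketch of the kind of fsync wrapper being suggested, so a script can push one file (or directory) to disk rather than syncing the whole system; the program name is made up:

    /* fsync-file: fsync each path given on the command line. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int ret = 0;

        for (int i = 1; i < argc; i++) {
            int fd = open(argv[i], O_RDONLY);

            if (fd < 0 || fsync(fd) != 0) {
                perror(argv[i]);
                ret = 1;
            }
            if (fd >= 0)
                close(fd);
        }
        return ret;
    }

A script could then do something like "fsync-file new.tmp && mv new.tmp current", and "fsync-file ." afterwards if the rename itself has to survive a crash.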

> Or what are the poor lusers supposed to do as their system grinds to a halt with all the disk io as programs hang waiting for the disk?

use a better filesystem that doesn't have such horrible performance problems with applications that try and be careful about their data.

> Following the spec is not an end in itself.

True, but what you are asking for is for the spec to be changed, no matter how much it harms people who do follow the spec (application programmers and users who care about durability)

There is no filesystem that you can choose to use that will not lose data if the system crashes. If you are expecting something different, you need to change your expectation.

Shared pain

Posted Feb 9, 2012 7:39 UTC (Thu) by khim (subscriber, #9252) [Link]

Somehow you've forgotten about the most sane alternative:
Remove XFS from all the computers and use sane filesystems (extX, btrfs once it's more stable) exclusively.

In a battle between applications and filesystems applications win 10 times out of 10 because without applications filesystems are pointless (and applications are pointless without the user's data).

The whole discussion just highlights that XFS is categorically, absolutely, totally unsuitable for use as a general-purpose FS. And when you don't care about data integrity then ext4 without journalling is actually faster (see Google datacenters, for example).

> True, but what you are asking for is for the spec to be changed, no matter how much it harms people who do follow the spec

Yes.

> application programmers and users who care about durability

Applications don't follow the spec. When they do they are punished and fixed. Thus users who care about durability need to use filesystems which work correctly given the existing applications.

Is it fair? No. It's a classic vicious cycle. But said cycle is a fact of life. Ignore it at your peril.

I, for one, have a strict policy to never use XFS and to not even consider bugs which cannot be reproduced with other filesystems. Exactly because XFS developers think specs trump reality for some reason.

> There is no filesystem that you can choose to use that will not lose data if the system crashes. If you are expecting something different, you need to change your expectation.

That's irrelevant. True, the loss of data in the case of a system crash is unavoidable. I don't care if the window I opened in Firefox right before the crash is reopened or not. I understand that spinning rust is slow and can lose such info. But if the windows which were opened an hour before that are lost because XFS replaced the saved-state file with zeros, then such a filesystem is useless in practice. A long time ago XFS was prone to such data loss even if fsync was used and the data was "saved" to disk days before the crash. After a lot of work it looks like the XFS developers have fixed this issue, but now they are stuck with the next step: atomic rename. It should be implemented for the FS to be suitable for real-world applications. There are even some hints that XFS has implemented it, but as long as XFS developers exhibit this "specs are important, real applications aren't" pathological thinking it's way too dangerous to even try to use XFS.

Shared pain

Posted Feb 9, 2012 9:12 UTC (Thu) by dlang (guest, #313) [Link]

if you use applications that follow the specs (for example, just about every database, or mailserver), then XFS/ext4/btrfs/etc are very reliable.

what you seem to be saying is that these classes of programs should be forced to use filesystems that give them huge performance penalties to accommodate other programs that are more careless, so that those careless programs lose less data (not no data loss, just less)

Shared pain

Posted Feb 9, 2012 9:19 UTC (Thu) by dlang (guest, #313) [Link]

by the way, I've done benchmarks on applications that do the proper fsync dance needed for the data to actually be safe (durable, not just atomic filesystem renames that may or may not get written to disk), and even on an otherwise idle system ext3 was at least 2x slower, and if you have other disk activity going on at the same time the problem only gets worse (if you have another process writing large amounts of data, your critical app can easily be 40x slower on ext3)
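
For reference, a sketch of the sort of "fsync dance" being benchmarked here, assuming a plain POSIX replace-by-rename update; the final fsync of the containing directory is what makes the rename itself durable rather than merely atomic:

    #define _GNU_SOURCE             /* for O_DIRECTORY */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Durable replace-by-rename: fsync the new file, rename it into
     * place, then fsync the parent directory so the rename survives
     * a crash.  Roughly what a careful database or mail server does. */
    int durable_rename(const char *tmp, const char *path, const char *dir)
    {
        int fd = open(tmp, O_RDONLY);

        if (fd < 0 || fsync(fd) != 0) {     /* 1: data on stable storage */
            if (fd >= 0)
                close(fd);
            return -1;
        }
        close(fd);

        if (rename(tmp, path) != 0)         /* 2: atomic namespace switch */
            return -1;

        int dfd = open(dir, O_RDONLY | O_DIRECTORY);

        if (dfd < 0 || fsync(dfd) != 0) {   /* 3: make the rename durable */
            if (dfd >= 0)
                close(dfd);
            return -1;
        }
        close(dfd);
        return 0;
    }

On ext3 in data=ordered mode each of those fsync calls can force out the filesystem's whole backlog of dirty data, which is where the slowdowns described above come from.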

Shared pain

Posted Feb 9, 2012 17:37 UTC (Thu) by khim (subscriber, #9252) [Link]

Exactly. This is part of the very simple proof sequence.

Fact 1: any application which calls fsync is very slow in ext3. You've just observed it.
Conclusion: most applications don't call fsync.
Fact 2: most systems out there are either "small" (where a lot of applications share one partition) or huge (where reliability of filesystem does not matter because there are other ways to keep data around like GFS).
Conclusion: any real-world filesystem needs to support all the applications which are "wrong" and don't call fsync, too.
Fact 3: XFS does not provide these guarantees (and tries to cover it with POSIX, etc).
Conclusion: XFS? Fuhgeddaboudit.

Yes, it's not fair to XFS. No, I don't think being fair is guaranteed in real world.

Shared pain

Posted Feb 9, 2012 19:26 UTC (Thu) by dlang (guest, #313) [Link]

sorry, on my systems I'm not willing to tolerate a 50x slowdown just to make badly written apps be a little less likely to be confused after a power outage.

and I think that advocating that you have the right to make this choice for everyone else is going _way_ too far.

when I have applications that lose config data after a problem happens (which isn't always a system crash; apps that have this sort of problem usually have it after the application crashes as well), my solution is backups of the config (ideally into something efficient like git), not crippling the rest of the system to band-aid the bad app.

Shared pain

Posted Feb 9, 2012 20:44 UTC (Thu) by Wol (subscriber, #4433) [Link]

And what is "badly written" about an app that expects the computer to do what was asked of it?

I know changing things around for the sake of it doesn't matter when everything goes right, but if I tell the computer "do this, *then* that, *followed* by the other", well, if I told an employee to do it and they did things in the wrong order and screwed things up as a *direct* *result* of messing with the order, they'd get the sack.

The only reason we're in this mess, is because the computer is NOT doing what the programmer asked. It thinks it knows better. And it screws up as a result.

And the fix isn't that hard - just make sure you flush the data before the metadata (or journal the data too), which is pretty much (a) sensible, and (b) what every user would want if they knew enough to care.

Cheers,
Wol

Shared pain

Posted Feb 9, 2012 20:52 UTC (Thu) by dlang (guest, #313) [Link]

it is badly written because you did not tell the computer that you wanted to make sure that the data was written to the drive in a particular order.

If the system does not crash, the view of the filesystem presented to the user is absolutely consistent, and the rename is atomic.

The problem is that there are a lot of 'odd' situations that you can have where data is written to a file while it is being renamed that make it non-trivial to "do the right thing" because the system is having to guess at what the "right thing" is for this situation.

try running a system with every filesystem mounted with the sync option; that will force the computer to do exactly what the application programmers told it to do, writing all data exactly when they tell it to, even if this means writing the same disk sector hundreds of times as small writes happen. The result will be unusable.

so you don't _really_ want the computer doing exactly what the programmer tells it to, you only want it to do so some of the time, not the rest of the time.

Shared pain

Posted Feb 9, 2012 21:13 UTC (Thu) by khim (subscriber, #9252) [Link]

> so you don't _really_ want the computer doing exactly what the programmer tells it to, you only want it to do so some of the time, not the rest of the time.

Sure. YMMV as I've already noted. A good filesystem for USB sticks must flush on the close(2) call. A good general-purpose filesystem must guarantee rename(2) atomicity in the face of a system crash.

You can use whatever you want for your own system - it's your choice. But when the question is about a replacement for extX… that's another thing entirely. To recommend a filesystem which likes to eat users' data is simply irresponsible.

Shared pain

Posted Feb 14, 2012 16:16 UTC (Tue) by nye (guest, #51576) [Link]

>when I have applications that lose config data after a problem happens (which isn't always a system crash; apps that have this sort of problem usually have it after the application crashes as well)

That can't possibly be the case. You must be talking about applications which do something like truncate+rewrite, which is entirely orthogonal to the discussion (and is pretty clearly a bug).

I suspect you haven't understood the issue at hand.

Shared pain

Posted Feb 9, 2012 17:25 UTC (Thu) by khim (subscriber, #9252) [Link]

> What you seem to be saying is that these classes of programs should be forced to use filesystems that give them huge performance penalties to accommodate other programs that are more careless, so that those careless programs lose less data

In a word: yes.

> not no data loss, just less

Always and forever. No matter what filesystem you are using, your data is toast in the case of a RAID failure or a lightning strike. This means that we are always talking about probabilities.

This leads us to a detailed explanation of the aforementioned phenomenon: in most cases you cannot afford dedicated partitions for your database or mailserver, and in that world a filesystem without suitable reliability guarantees (like atomic rename in the crash case without fsync) is pointless. When your system grows it becomes a good idea to dedicate a server to just being a mailserver or just being a database server. But the window of opportunity is quite small, because when you go beyond a handful of servers you need to develop plans which will keep your business alive in the face of a hard crash (HDD failure, etc). And if you've designed your system for such a case then all these journalling efforts in a filesystem are just useless overhead (see Google, which switched from ext2 to ext4 without a journal).

I'm not saying XFS is always useless. No, there exist cases where you can use it effectively. But these cases are rare, thus XFS will always be undertested. And this, in turn, usually means you should stick with extX/btrfs.

Shared pain

Posted Feb 3, 2012 10:19 UTC (Fri) by khim (subscriber, #9252) [Link]

> Yes, there is room for improvement - there always is. Copying a mistake because it has some good features is not a wise move.

This depends on your goal, actually. If your goal is something theoretically sound, then no, it's not a wise move. If your goal is the creation of something which will actually be used by real users, then it's the only possible move.

> (and yes, my beard is gray (or heading that way)).

My beard is not yet gray, but I've been around long enough to see where the guys who made the "wise move" ended up. I must admit that they make really TRULY nice exhibits in the Computer History Museum. Meanwhile the creations of the "unwise" guys are used for real work.

If your implementation unintentionally introduced some property and people started depending on it - that's the end of the story: you are doomed to support said property forever. If you want to keep these people, obviously. If your goal is just to create something nice for the sake of art or science, then the situation is different, of course.

This is a basic fact of life and it's truly sad to see that so many Linux developers (especially the desktop guys) don't understand that.

XFS: the filesystem of the future?

Posted Jan 20, 2012 22:39 UTC (Fri) by sandeen (subscriber, #42852) [Link]

Regarding data loss on a crash;

#1, if your application follows proper data integrity practices (fsync et al) and your storage is properly configured, you will not lose data on a crash on xfs. Losing buffered data on a crash is expected with any filesystem, though there are differences in that behavior between various filesystems. Ext3 tended to push data out more regularly, which had its pros and cons, but people came to expect their data to be safe(er) even if it wasn't explicitly flushed. But nothing is really safe until it's explicitly synced.

#2, the un-synced "null files" behavior has been fixed since this commit from 2007 ... http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-...

XFS: the filesystem of the future?

Posted Jan 20, 2012 22:46 UTC (Fri) by dlang (guest, #313) [Link]

and the ext3 behavior was linked to the pathologically bad fsync behavior that ext3 had, where an fsync could take tens of seconds to complete.

this combination of behavior has probably done more to harm data reliability (by encouraging bad programming practices and discouraging good programming practices) than anything else I can think of.

XFS: the filesystem of the future?

Posted Jan 28, 2012 3:36 UTC (Sat) by sbergman27 (guest, #10767) [Link]

"#1, if your application follows proper data integrity practices (fsync et al)"

And if it doesn't happen to? How many filesystem users do code audits of all the programs their organization runs, specifically looking for good practice regarding fsync? It's much better to just use a reliable filesystem.

"and your storage is properly configured, you will not lose data on a crash on xfs."

That's reassuring. I note that your recommended finger-pointing re: data loss is now distributed between user and vendor. But it's definitely not XFS's fault!

"Losing buffered data on a crash is expected with any filesystem,"

No. Losing data on a crash is not "expected with any filesystem". (I've never had a customer quiz me as to whether lost data was "buffered" or not.) EXT3 was rock solid before they ruined it around 2.6.31 or so in the name of performance. A manual adjustment to data=journal should still restore it to its former glory. Though I fear that this new "performance over reliability" mindset in the Linux FS world might be eroding it in other ways, too.

"Ext3 tended to push data out more regularly"

EXT3 tended to keep your data rock solid safe.

"which had its pros and cons"

If I sent out a survey to my customers asking their opinion on reliability vs performance on filesystems, I already know what they would say. Reliability over performance: 100%. Performance over reliability: 0%.

It seems to me that, these days, Linux FS devs either live in ivory towers of performance benchmarks, or in California prisons. But none seem to live anywhere near me. We need more Stephen Tweedies.

-Steve Bergman

XFS: the filesystem of the future?

Posted Jan 28, 2012 5:53 UTC (Sat) by dlang (guest, #313) [Link]

ext3 has always had the possibility of losing data in an unclean shutdown if you didn't do the fsync dance.

the window of loss was smaller, and most people don't actually have unclean shutdowns that frequently, so I believe you when you say you never experienced it.

but that doesn't mean that ext3 was 'rock solid' in the face of poorly written applications. There are plenty of people who lost data on ext3.

XFS: the filesystem of the future?

Posted Jan 28, 2012 16:37 UTC (Sat) by sbergman27 (guest, #10767) [Link]

The window was *far* smaller. However, the question is not whether anyone ever suffered any data loss. It's how many people and how much data, in comparison to other filesystems. I know of no other filesystem as resilient at its default settings as was the pre-2.6.31 (or so) ext3. I also never noticed any performance problems. Certainly nothing that my customers cared about.

And of course, we can still mount ext3 explicitly data=ordered.

Oh for the days when Linux filesystem devs cared more about reliability than benchmarks... no matter whose fault the data loss might be. Things have changed since the Tweedie days.

Today, I use ext4 with nodelalloc. I tried ext4 at its defaults. But the first time power was lost (and the UPS failed) we had to rebuild a bunch of C/ISAM files. Never in our long history with ext3 did we *ever* have anything like that happen. (And my customers are bad about letting their UPS's go.) Personally, I think nodelalloc should be default, with delalloc available as a mount option.

Current Linux filesystem devs just seem reckless to me.

XFS: the filesystem of the future?

Posted Jan 29, 2012 0:53 UTC (Sun) by dlang (guest, #313) [Link]

the current filesystem devs are the same people as they were in the 'good old days'

go talk to a good database admin (or especially a database developer), they will tell you that all filesystems have always had these problems, and they can probably tell you horror stories about ext3 and its "worst in the industry" fsync performance (it's been documented for an fsync to take tens of seconds on ext3 vs milliseconds for other filesystems under the same workload)

you can disable all caching by mounting a filesystem with the sync option, but the performance is going to be _so_ horrible (unless you have drives that lie about when the data is safe) that you will end up changing back.

XFS: the filesystem of the future?

Posted Jan 29, 2012 3:09 UTC (Sun) by sbergman27 (guest, #10767) [Link]

"the current filesystem devs are the same people as they were in the 'good old days'"

No. In the Ext world, I hardly ever hear the name "Stephen Tweedie" anymore. And he's the main developer I trusted. A very careful, conservative, and patient developer. Took forever to get Ext3 out, which annoyed me at the time. But I've learned to appreciate his "good things come to those who wait" philosophy. Today, it's Ted Ts'o at the forefront of Ext. A *very* different person.

"go talk to a good database admin (or especially a database developer), they will tell you that all filesystems have always had these problems"

And there you go again. Casting it as "black and white". If Ext3 ever lost a byte of data on someone's machine in Kenya, then by the gods, it's just as bad as the current crop of linux data sieve filesystems.

Hey, if you can keep doing the black and white thing, I can compensate by adjusting the contrast knob a bit. ;-) Current linux filesystems may not be, exactly, data sieves. But they are a far cry from the halcyon days of Ext3 pre 2.6.31.

XFS: the filesystem of the future?

Posted Jan 29, 2012 6:45 UTC (Sun) by raven667 (subscriber, #5198) [Link]

> If I sent out a survey to my customers asking their opinion on reliability vs performance on filesystems, I already know what they would say. Reliability over performance: 100%. Performance over reliability: 0%.

I don't believe that is true. If you ask them then sure, they will pick reliability, but in actual operations they will pick performance first most of the time. Witness the popularity of MySQL/MyISAM, which makes that exact trade-off; if reliability were the most important thing everyone would use PostgreSQL and no one would have even heard of MySQL. What people actually do and what they say are often diametrically opposed.

XFS: the filesystem of the future?

Posted Jan 29, 2012 20:51 UTC (Sun) by sbergman27 (guest, #10767) [Link]

"I don't believe that is true. If you ask them then sure, they will pick reliability but in actual operations they will performance first most of the time."

No, they would not. None have any complaints about our filesystem performance, either on my CentOS 4 machines using Ext3 mounted data=ordered, or on my later servers with Ext4 mounted nodelalloc. Benchmarking of Ext4 with and without nodelalloc always comes out pretty much a wash. I've never understood what the fuss was about. Delayed allocation just doesn't improve performance noticeably. Nor did Ext3 with data=ordered in any of the server scenarios I have been involved with. And I see no appreciable difference regarding fragmentation rates, either.

If you've done your own benchmarks with your own server workloads which disagree with mine, I would be interested in hearing about them.

But regarding delayed allocation in ext4 and xfs, and the "wonders" of making data=writeback the default for ext3, I must observe that the Emperor has no clothes.

XFS: the filesystem of the future?

Posted Jan 21, 2012 10:13 UTC (Sat) by jengelh (subscriber, #33263) [Link]

>having lost an entire filesystem (and zeroed-out files on another filesystem) to *blip*. But seriously, *blip* has long had a reputation for being more likely to eat your data

Yes yes, the usual complaints. "omg X suxx, and Y is so much better." Until Y ate your data, then they are all "omg Y suxx, Z is so much better."

XFS: the filesystem of the future?

Posted Jan 29, 2012 3:13 UTC (Sun) by sbergman27 (guest, #10767) [Link]

'"omg X suxx, and Y is so much better." Until Y ate your data, then they are all "omg Y suxx, Z is so much better."'

I've never wavered from my stance that Ext3 was the most reliable filesystem ever. Not in 11 years.

XFS: the filesystem of the future?

Posted Jan 22, 2012 11:52 UTC (Sun) by pkolloch (subscriber, #21709) [Link]

Well, I had that zeroing out of data on a small scale with ext4. Since it happened in the context of system freezes, something else might be to blame, though.

XFS: the filesystem of the future?

Posted Feb 8, 2012 15:25 UTC (Wed) by Wol (subscriber, #4433) [Link]

No. ext4 and xfs both had the same bug. So basically anything that stopped disk i/o and forced a reboot could cost you data.

Basically what happened was

(a) user writes file
(b) filesystem writes metadata (file header) to journal
(c) in no particular order, system crashes and journal is flushed to disk

What should, of course, happen next is that the file data is flushed to disk, but because it hasn't been journalled and the system has crashed, it's not there to be written.

And that's why the rename gave you a file full of zeros, because the new file header overwrote the old one, but the new file contents hadn't been flushed to disk.

The fix, with "mode=journal" or whatever, basically blocks the metadata from updating until after the data has updated, by whatever means seems most suitable.

Cheers,
Wol

XFS: the filesystem of the future?

Posted Jan 23, 2012 15:36 UTC (Mon) by cruff (subscriber, #7201) [Link]

I successfully ran 100 TB of XFS file systems under Irix for years without any data loss as the disk cache for the NCAR Mass Store. The performance was great with the middle of the road RAID hardware that we could afford. Of course, that wasn't Linux, so YMMV. :-)

XFS: the filesystem of the future?

Posted Jan 23, 2012 15:52 UTC (Mon) by rfunk (subscriber, #4054) [Link]

Yeah, I used XFS on Irix for a while before it came to Linux, which is why I used to trust it.

One issue, however, was that XFS was designed with server room assumptions. So when I used it in a Linux desktop context, with a less-reliable power supply, bad things happened. One of those bad things was losing an entire filesystem once; another was the zeroed files that people have mentioned.

Of course this was probably around 2005, so I wouldn't be surprised if it's improved since then, but I'm also not going to assume that it has.

XFS: the filesystem of the future?

Posted Jan 23, 2012 16:06 UTC (Mon) by wazoox (subscriber, #69624) [Link]

I've been using XFS for 15 years, and currently manage 250 systems in the 10 - 200 TB range; I've lost files once over a known XFS bug, and though I hear here and there about "XFS zeroing files" I never encountered the problem.

I'm afraid it's in the same ballpark as people who have 3 hard drives and say "OMG this brand sucks because one of my drives failed". I have several thousand spinning hard drives under my guard, and I think I have rather better statistics on what works and what doesn't; the same goes for filesystems: how many filesystems do you manage? If it's less than a couple of hundred, the fact that you encountered a particular problem with a particular filesystem once doesn't carry much significance.

XFS: the filesystem of the future?

Posted Jan 23, 2012 16:25 UTC (Mon) by rfunk (subscriber, #4054) [Link]

I said I was talking about non-server contexts. XFS was designed for server rooms, where the power never goes out. It sounds like that's where you're using it.

XFS: the filesystem of the future?

Posted Jan 23, 2012 18:40 UTC (Mon) by wazoox (subscriber, #69624) [Link]

Well, I use it on all of my desktop systems too, and my work desktop has an uptime of 176 days. I don't reboot often :)

XFS: the filesystem of the future?

Posted Jan 29, 2012 3:47 UTC (Sun) by sbergman27 (guest, #10767) [Link]

"Well, I use it on all of my desktop systems too, and my work desktop has an uptime of 176 days. I don't reboot often :)"

I hope you don't connect to the Internet with that security-hole ridden kernel you're running. You should reboot after kernel updates.

XFS: the filesystem of the future?

Posted Jan 30, 2012 8:03 UTC (Mon) by youareretarded (guest, #82640) [Link]

are you retarded? nobody gets exploited by missing a kernel update? it sounds like you don't know shit about the kernel if you think someone is going to be remotely exploited through a kernel bug introduced in the last 180 days. it's always the software, never the kernel.

XFS: the filesystem of the future?

Posted Jan 23, 2012 21:46 UTC (Mon) by dgc (subscriber, #6611) [Link]

Actually, XFS was designed for storage subsystems that didn't lie to it, not "server storage". You can say exactly the same for ext3, ext4, btrfs, etc.

"Consumer storage" violates the write ordering guarantees that these filesystems require to have journal recovery work because they have volatile write caches. That's why we have write barriers and use them by default on these filesystems these days. XFS was the first filesystem to enable them by default, another reason it was always slower on metadata intensive workloads than ext3/4.

"server storage" doesn't violate write ordering in an effort to improve performance, so XFS has always worked fine and performed well on that class of storage.

Dave.

XFS: the filesystem of the future?

Posted Jan 31, 2012 17:36 UTC (Tue) by Cato (guest, #7643) [Link]

Do you have evidence that all "consumer storage" devices violate write ordering guarantees, or simply don't flush pending writes on request?

As far as I can tell, some consumer drives do lie about when writes have been flushed, and write back caching is the default anyway.

Some relevant links:

http://serverfault.com/questions/15404/sata-disks-that-ha...

http://brad.livejournal.com/2116715.html - disk testing tool

https://lwn.net/Articles/351521/

XFS: the filesystem of the future?

Posted Jan 31, 2012 19:30 UTC (Tue) by dlang (guest, #313) [Link]

I don't think anyone is meaning to say that all consumer storage devices are broken, but I also don't think that there's much dispute that some are.

XFS: the filesystem of the future?

Posted Feb 2, 2012 21:19 UTC (Thu) by dgc (subscriber, #6611) [Link]

> Do you have evidence that all "consumer storage" devices violate write
> ordering guarantees,

Any device with a volatile write cache tells the OS that the write IO has been completed before it actually is written to stable storage. IO completion is supposed to mean "the IO is complete" and any device with a volatile write cache is actually lying - the write is not yet on stable storage, so is "lying" about the completion status of the IO to the OS. Pretty much all consumer devices ship with a volatile write cache enabled by default for performance reasons.

Barriers and cache flushes were introduced to provide a mechanism that allows filesystems to force such drives to order writes correctly, the way the filesystem wants. The original barrier mechanism was "cache flush, write, cache flush" and could make the drive slower than not caching in the first place depending on the workload. More recently we just use the FUA mechanism if the drive supports that, and that has negligible performance overhead.
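
From user space, the usual way to reach this machinery is fsync()/fdatasync() (or O_SYNC/O_DSYNC writes): on a filesystem with barriers enabled, that is what ends up issuing the cache flush - or the FUA write, where supported - described above. A minimal sketch with an illustrative file name:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Append a record and wait until the drive says it is on stable
     * storage; with barriers enabled this is where the cache flush
     * (or FUA write) gets sent to the device. */
    int main(void)
    {
        const char record[] = "commit\n";
        int fd = open("journal.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);

        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (write(fd, record, strlen(record)) < 0 || fdatasync(fd) != 0) {
            perror("write/fdatasync");
            close(fd);
            return 1;
        }
        close(fd);
        return 0;
    }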

> As far as I can tell, some consumer drives do lie about when writes
> have been flushed, and write back caching is the default anyway

If drives lie about cache flush or FUA completion on volatile writeback caches, then that's a bug in the disk firmware.

FWIW, the difference with server storage (SAS drives) is that most ship with the volatile write cache turned off by default. They don't need it for performance because the SCSI/SAS protocol is much more efficient than SATA and so in most cases a write cache isn't necessary. You can turn it on, but you don't need to in order to reach full disk performance....

Indeed, it's not just filesystems that don't like volatile write caches. if you turn on volatile write caching on disks behind a RAID controller, the disk will now violate the write ordering guarantees that the RAID controller relies on to maintain data safety (exactly the same as for filesystems). You still lose data or corrupt filesystems on power loss in this case, even though the OS and RAID controller are behaving correctly.

Dave.

XFS: the filesystem of the future?

Posted Feb 3, 2012 4:34 UTC (Fri) by raven667 (subscriber, #5198) [Link]

> They don't need it for performance because the SCSI/SAS protocol is much more efficient than SATA and so in most cases a write cache isn't necessary.

I think you are correct on every other point but I don't think this is right. SATA is pretty much the SCSI protocol, as is SAS; they are only slightly incompatible, for marketing rather than technical reasons. The big performance difference historically between consumer (IDE) and enterprise (SCSI) drives was tagged command queuing, which is now very common in SATA drives as well, although it wasn't so common in early SATA implementations. A tagged command queue allows the drive to implement an elevator, which is a big win against a naive implementation without one.

XFS: the filesystem of the future?

Posted Feb 3, 2012 12:35 UTC (Fri) by Jonno (subscriber, #49613) [Link]

Actually, SATA (Serial ATA) uses a slightly extended ATA command set over a serial bus, while SAS (Serial Attached Scsi) uses the SCSI command set over the same serial bus.

The SCSI command set is generally considered "better" than the ATA command set, though the difference isn't quite as large as the grandparent suggests. Write caches are still beneficial for SCSI (including SAS) performance, but the difference is not quite as large with SCSI as with ATA. That, as well as the fact that the average enterprise customer is more concerned about reliability than the average home user, is the reason that most SAS drives have write cache disabled by default, while most SATA drives have write cache enabled by default.

XFS: the filesystem of the future?

Posted Feb 3, 2012 17:30 UTC (Fri) by raven667 (subscriber, #5198) [Link]

The "extended ATA" command set that's used on SATA devices not operating in Legacy IDE mode is the SCSI command set. This goes all the way back to ATAPI which is the SCSI command set encapsulated with the IDE bus protocol. Drives and controllers are capable of speaking either SATA-II or SAS protocols without any cost difference AFAICT but don't, for largely marketing reasons rather than engineering ones. As I was saying before, having a command queue on the drive allows for the drive to have an IO elevator which is _the_ big performance win, details about how commands are named and whatnot is not really an important factor.

scsi misinformation

Posted Feb 3, 2012 14:08 UTC (Fri) by quanstro (guest, #77996) [Link]

the argument here seems to be circular. sas is faster because it's sas.

sata and sas send the same data in the same size fises/frames to the drive. neither is wire-speed limited. they're spin/seek limited; physics limited.

could you please explain the mechanism whereby sas is going to be faster than sata?

scsi misinformation

Posted Feb 3, 2012 19:00 UTC (Fri) by raven667 (subscriber, #5198) [Link]

> the argument here seems to be circular. sas is faster because it's sas

That's the power of branding, replacing rational thought with mental shortcuts which put things in "good" or "bad" boxes.

XFS: the filesystem of the future?

Posted Feb 4, 2012 13:00 UTC (Sat) by zomonto (guest, #82108) [Link]

> The original barrier mechanism was "cache flush, write, cache flush" and could make the drive slower than not caching in the first place depending on the workload. More recently we just use the FUA mechanism if the drive supports that, and that has neglible performance overhead.

No, libata always disables FUA by default. You can enable it with a kernel parameter, though.
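For reference, a minimal sketch of what that looks like (the parameter name is from memory, so treat it as an assumption rather than gospel):

# kernel command line fragment to re-enable FUA support in libata
libata.fua=1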

XFS: the filesystem of the future?

Posted Feb 8, 2012 13:09 UTC (Wed) by yungchin (guest, #72949) [Link]

> Pretty much all consumer devices ship with a volatile write cache enabled by default for performance reasons.

Dave, I was wondering: given the optimisations you discussed in the talk, where lots of merging and reordering now happens before anything is sent to the I/O scheduler, do you still expect much performance improvement from these hardware caches? (Or, and that's of course the hidden question here, should we from now on happily disable them, at least for most use cases?) Thanks.
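For anyone who wants to experiment, disabling the volatile cache is a one-liner on most drives (device names are illustrative, and some USB bridges ignore these commands):

# turn off the volatile write cache on an ATA/SATA drive
hdparm -W 0 /dev/sda
# the SCSI/SAS equivalent, clearing the WCE bit in the caching mode page
sdparm --clear=WCE /dev/sda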

XFS: the filesystem of the future?

Posted Feb 1, 2012 11:33 UTC (Wed) by Cato (guest, #7643) [Link]

Maybe I didn't quite get your point - it would be good to understand exactly what write ordering guarantees are provided by server storage but not by consumer storage. Is this just that a consumer hard drive's write cache will reorder writes without respecting the kernel's write barriers?

XFS: the filesystem of the future?

Posted Feb 1, 2012 19:28 UTC (Wed) by dlang (guest, #313) [Link]

Some consumer drives lie about when the data has actually been written to disk (making write barriers ineffective). In those cases the OS will send more writes to the drive, and the drive will go ahead and re-order them with the other writes that are in its buffer.

XFS: the filesystem of the future?

Posted Feb 1, 2012 20:13 UTC (Wed) by raven667 (subscriber, #5198) [Link]

And as someone else pointed out, drives these days seem to have up to 64MB write buffers, so that could be a lot of corruption if that data goes missing in flight after the OS was told that it was permanently committed.

XFS: the filesystem of the future?

Posted Feb 1, 2012 22:20 UTC (Wed) by magila (guest, #49627) [Link]

While there's been a lot of talk about consumer devices "lying" in this thread, they really don't behave any differently from server drives with the same cache settings. Both consumer and server drives provide an option to turn on write caching. Both will lie about writes completing when write caching is enabled. Both will only signal command completion when data has gone to disk if write caching is disabled[1]. The only real difference is the default: most (but not all) enterprise drives ship with the write cache disabled, while all consumer drives ship with it enabled. Both give the option of changing the setting to whatever the user pleases.

[1] I vaguely remember hearing a story several years ago that a handful of ATA drive models were not respecting write cache settings. This was an isolated incident. Newer drives can reasonably be assumed to handle write caching correctly.

XFS: the filesystem of the future?

Posted Feb 2, 2012 1:39 UTC (Thu) by dlang (guest, #313) [Link]

When I talk about a drive lying, I'm not talking about normal write caching; I'm talking about it either not respecting write cache settings, or lying about data integrity commands that are supposed to work even in the face of write caching (cache flush commands, for example).

Most consumer drives don't have these problems, but a few have been found to have them.

Unfortunately you cannot just assume that newer drives will not have the problem. On the database mailing lists you see a couple of drive models every year where someone runs across the problem yet again.

XFS: the filesystem of the future?

Posted Feb 2, 2012 2:40 UTC (Thu) by magila (guest, #49627) [Link]

"On the database mailing lists you see a couple drive models every year where someone runs across the problem yet again."

I'd be rather surprised if that were the case. The code that handles cache flushing isn't something which usually changes between models. If a manufacturer's firmware had a bug in that area I'd expect to see it across the board, not just randomly popping up periodically on different SKUs.

XFS: the filesystem of the future?

Posted Feb 2, 2012 13:11 UTC (Thu) by cladisch (✭ supporter ✭, #50193) [Link]

> The code that handles cache flushing isn't something which usually changes between models. If a manufacturer's firmware had a bug …

You won't get any manufacturer to admit it, but this is not a bug, it's a feature (to get higher benchmark numbers).

XFS: the filesystem of the future?

Posted Feb 2, 2012 17:38 UTC (Thu) by magila (guest, #49627) [Link]

You might not believe it, but I can say based on first-hand experience that hard drive manufacturers take data integrity very seriously. None of them would risk losing customer data just to gain extra performance. The potential backlash from data loss would be far worse than scoring lower on a benchmark.

Plus, the people running benchmarks, especially for tier-1 OEMs, aren't stupid. Lying about cache flushes is pretty easy to detect, so the likelihood of getting away with it is low right from the start. Pissing off OEMs is another thing hard drive manufacturers would never, ever take risks with.

XFS: the filesystem of the future?

Posted Jan 20, 2012 20:59 UTC (Fri) by zomonto (guest, #82108) [Link]

Nice article.
Is there a video (or a pdf) of Dave's presentation available?

XFS: the filesystem of the future?

Posted Jan 20, 2012 21:02 UTC (Fri) by corbet (editor, #1) [Link]

Videos from the talks are going up on this youtube page, but they don't yet seem to have gotten to Wednesday, which is when this talk was given.

XFS: the filesystem of the future?

Posted Jan 20, 2012 22:43 UTC (Fri) by sandeen (subscriber, #42852) [Link]

Links to 119 LCA2012 videos so far

Posted Jan 21, 2012 17:43 UTC (Sat) by dowdle (subscriber, #659) [Link]

I like to download various Linux/FLOSS conference talks from YouTube, convert them to webm and then post them to archive.org (assuming the licensing allows it). Here are direct links for anyone who cares:

# Using Open Source to Build a Gravitational Wave Observatory - Elizabeth Garbee
youtube-dl -t 'http://www.youtube.com/watch?v=cjNkrDDtWiY'

# Keynote - Paul Fenwick
youtube-dl -t 'http://www.youtube.com/watch?v=KV1iUmDVsM4'

# Keynote - Bruce Perens
youtube-dl -t 'http://www.youtube.com/watch?v=Uoum-DHO7S8'

# Creating the Open Source Academy - Ian Beardslee
youtube-dl -t 'http://www.youtube.com/watch?v=_OOmsrDWi10'

# OGPC - One Geek Per Classroom - Thomas Sprinkmeier
youtube-dl -t 'http://www.youtube.com/watch?v=mUO29GdElEk'

# Helping your audience learn - Jacinta Richardson
youtube-dl -t 'http://www.youtube.com/watch?v=S7-tP_olziM'

# Testing CTDB - not necessarily trivial - Martin Schwenke,Ronnie Sahlberg
youtube-dl -t 'http://www.youtube.com/watch?v=tlLNBys04uA'

# Ending Software Patents in Australia - Ben Sturmfels
youtube-dl -t 'http://www.youtube.com/watch?v=mzz-w55D9vM'

# EFI and Linux: the future is here, and it's awful - Matthew Garrett
youtube-dl -t 'http://www.youtube.com/watch?v=V2aq5M3Q76U'

# Erlang in production: "I wish I'd known that when I started" - Bernard Duggan
youtube-dl -t 'http://www.youtube.com/watch?v=G0eBDWigORY'

# Scaling OpenStack Development with git, Gerrit and Jenkins
youtube-dl -t 'http://www.youtube.com/watch?v=ARtkLcVxSTo'

# Serval Maps - Building Collaborative Infrastructure Independent Maps on Mobile - Romana Challans, Paul Gardner-Stephen
youtube-dl -t 'http://www.youtube.com/watch?v=bIVtXkDAIQ8'

# Optimizing Web Performance with TBB - Nicolas Erdody, Lenz Gschwendtner
youtube-dl -t 'http://www.youtube.com/watch?v=Qia7lBE-L4Y'

# XFS: Recent and Future Adventures in Filesystem Scalability - Dave Chinner
youtube-dl -t 'http://www.youtube.com/watch?v=FegjLbCnoBw'

# Smashing a square peg into a round hole: Automagically building and configuring - David Basden, Christopher Collins
youtube-dl -t 'http://www.youtube.com/watch?v=3acclV9y-4c'

# Keynote - Karen Sandler
youtube-dl -t 'http://www.youtube.com/watch?v=5XDTQLa3NjE'

# Efficient multithreading with Qt - Dario Freddi
youtube-dl -t 'http://www.youtube.com/watch?v=MMFhc2jXzgw'

# Codec 2 - Open Source Speech Coding at 2400 bit/s and Below - David Rowe
youtube-dl -t 'http://www.youtube.com/watch?v=KsywWf8dQgU'

# The Kernel Report - Jonathan Corbet
youtube-dl -t 'http://www.youtube.com/watch?v=elRCAD3sPEk'

# Desktop Home Hacks - Allison Randal
youtube-dl -t 'http://www.youtube.com/watch?v=a8asl5SsGy4'

# Challenges for the Linux plumbing community - Jonathan Corbet
youtube-dl -t 'http://www.youtube.com/watch?v=dNXggr8ycNE'

# Operating System Support for the Heterogeneous OMAP4430: A Tale of Two Micros - Etienne Le Sueur
youtube-dl -t 'http://www.youtube.com/watch?v=2GjSdhJtYOU'

# Beginning with the Shell - Peter Chubb
youtube-dl -t 'http://www.youtube.com/watch?v=Sye3mu-EoTI'

# Freedom, Out of the Box! - Bdale Garbee
youtube-dl -t 'http://www.youtube.com/watch?v=z-P2Jaeg0aQ'

# Linux as a Boot Loader - Peter Chubb
youtube-dl -t 'http://www.youtube.com/watch?v=pteHg54WBbQ'

# I Can't Believe This is Butter! A tour of btrfs - Avi Miller
youtube-dl -t 'http://www.youtube.com/watch?v=hxWuaozpe2I'

# This Old Code, or Renovating Dusty Old Open Source For Fun and Profit - Greg Banks
youtube-dl -t 'http://www.youtube.com/watch?v=mpLHm5sSmSs'

# Extracting metrics from logs for realtime trending and alerting - Jamie Wilkinson
youtube-dl -t 'http://www.youtube.com/watch?v=JhCwsXdIaFM'

# Opus, the Swiss Army Knife of Audio Codecs - Jean-Marc Valin
youtube-dl -t 'http://www.youtube.com/watch?v=iaAD71h9gDU'

# Ganeti: Clustered Virtualization on Commodity Hardware - Ben Kero
youtube-dl -t 'http://www.youtube.com/watch?v=aQc8GcedfEU'

# The Samba tour of scripting languages - Andrew Bartlett, Amitay Isaacs
youtube-dl -t 'http://www.youtube.com/watch?v=NFdHTXJJ6Go'

# antiSMASH: Searching for New Antibiotics Using Open Source Tools - Kai Blin
youtube-dl -t 'http://www.youtube.com/watch?v=WpybrLh_Kp8'

# Cheap tabloid tricks: The truth about Linux, open source and the media - Angus Kidman
youtube-dl -t 'http://www.youtube.com/watch?v=JRLgD1jW-Fs'

# Data mining packages to assess update risks - Kate Stewart
youtube-dl -t 'http://www.youtube.com/watch?v=RBKokCCFD7Y'

# The best Software Freedom Day in the world - and how you can do it too! - Kathy Reid
youtube-dl -t 'http://www.youtube.com/watch?v=C5u3ez0VQXg'

# Moving Day: Migrating Big Data from A to B - Laura Thomson, Shyam Mani, Justin Dow
youtube-dl -t 'http://www.youtube.com/watch?v=3XRkeP6fkWc'

# Scaling web applications with message queues - Lenz Gschwendtner
youtube-dl -t 'http://www.youtube.com/watch?v=aOrGq9yb6og'

# Mentoring: We're Doing It Wrong - Leslie Hawthorn
youtube-dl -t 'http://www.youtube.com/watch?v=ydS4vXNzN0I'

# Android Accessories Made Easy With Arduino - Philip Lindsay
youtube-dl -t 'http://www.youtube.com/watch?v=4yBkSwP9x7s'

# Ubuntu ARM from netbook to Server, the journey from the beginning and where it - David Mandala
youtube-dl -t 'http://www.youtube.com/watch?v=LRWpuJRrTn4'

# IPv6 Dynamic Reverse Mapping - the magic, misery and mayhem - Robert Mibus
youtube-dl -t 'http://www.youtube.com/watch?v=JsAUXuL6IrY'

# The Serval Project presents Rhizome - Self Replicating Software and Data Distri - Corey Wallis, Jeremy Lakeman
youtube-dl -t 'http://www.youtube.com/watch?v=u-v4yhTyP_c'

# where is your data cached and where should it be cached - Sarah Novotny
youtube-dl -t 'http://www.youtube.com/watch?v=ge_Xybwab5M'

# Design your own Printed Circuit Board using FOSS - Scott Finneran
youtube-dl -t 'http://www.youtube.com/watch?v=GcrxdFGbrwU'

# Mistakes were made - Selena Deckelmann
youtube-dl -t 'http://www.youtube.com/watch?v=SL7pbj7B1hk'

# The Web as an Application Development Platform - Shane Stephens, Mike Lawther
youtube-dl -t 'http://www.youtube.com/watch?v=L-toi4RuSk4'

# Multi-tenancy, multi-master, Sharding, scaling and analytics with Drizzle - Stewart Smith
youtube-dl -t 'http://www.youtube.com/watch?v=3-t7KRAIwwA'

# Making video streaming interactive, heckling user groups from the clouds! - Tim Ansell
youtube-dl -t 'http://www.youtube.com/watch?v=rCoCRmcrPlM'

# A (Mostly) Gentle Introduction to Computer Security - Todd Austin
youtube-dl -t 'http://www.youtube.com/watch?v=0BHn4Su2qEo'

# Low-hanging Fruit vs. Micro-optimization, Creative Techniques for Loading Web P - Trevor Parscal
youtube-dl -t 'http://www.youtube.com/watch?v=YRGO3n-ggT0'

# Hack everything: re-purposing everyday devices - Matt Evans
youtube-dl -t 'http://www.youtube.com/watch?v=VY9SBPo1Oy8'

# World domination and party tricks with the Android Open ADK - Jonathan Oxer
youtube-dl -t 'http://www.youtube.com/watch?v=cixG5-jPjQw'

# 1,000,000 Watchpoints, 20 Applications, 1 Driver, 0 Kernel Modifications - Todd Austin
youtube-dl -t 'http://www.youtube.com/watch?v=PS5idj8BO7E'

# Conference Opening
youtube-dl -t 'http://www.youtube.com/watch?v=S2y07sMVp-4'

# Tutorial 1 (hardware assembly)
youtube-dl -t 'http://www.youtube.com/watch?v=gbga1XY3I3w'

# Pebble v2 Software - Andy Gelme, Luke Weston
youtube-dl -t 'http://www.youtube.com/watch?v=Mqv36MVmr24'

# TOPCAT: Arduino in Space - Mark Jessop
youtube-dl -t 'http://www.youtube.com/watch?v=KfZmGxGN-2A'

# Building TeleMetrum companion boards - Bdale Garbee
youtube-dl -t 'http://www.youtube.com/watch?v=J6-nKXzPTfU'

# AVR XMEGA internals - David Zanetti [Altus Metrum, Arduino and... - Bdale Garbee]
youtube-dl -t 'http://www.youtube.com/watch?v=uUuroWJRpsI'

# Lego + Kids + Arduino - James Muraca
youtube-dl -t 'http://www.youtube.com/watch?v=RdLEh89fIUc'

# Web interaction with physical objects - Andrew Fisher
youtube-dl -t 'http://www.youtube.com/watch?v=CixYrA4rm_c'

# Lightning Talks: project showcase
youtube-dl -t 'http://www.youtube.com/watch?v=pVXYtFUO9-A'

# The Javascript testing toolbox - Malcolm Locke
youtube-dl -t 'http://www.youtube.com/watch?v=BOSDRrqHnlg'

# AltJS - Brian McKenna
youtube-dl -t 'http://www.youtube.com/watch?v=Grgz5yBhvRo'

# Migrating to PHP 5.4 - Adam Harvey
youtube-dl -t 'http://www.youtube.com/watch?v=QnCd0rG4Fvo'

# Finding vulnerabilities in PHP code (via static code analysis) - Peter Serwylo
youtube-dl -t 'http://www.youtube.com/watch?v=zrXFGjJyP8M'

# CSS Progress Goes Boink (with the right timing function) - Adam Harvey
youtube-dl -t 'http://www.youtube.com/watch?v=NI6kiRQ3dtI'

# Application programming in Lua: experiences through LOMP - daurnimator
youtube-dl -t 'http://www.youtube.com/watch?v=0ddeyLueduc'

# 7 Networking things all Systems Administrators should know - Julien Goodwin
youtube-dl -t 'http://www.youtube.com/watch?v=TYxPLEmbCuk'

# Using Performance Co-Pilot to monitor SNMP devices - Hamish Coleman
youtube-dl -t 'http://www.youtube.com/watch?v=2azBcj8QUdI'

# Stress and Performance Testing in Virtual Environments - Rodger Donaldson, Aneel Hay
youtube-dl -t 'http://www.youtube.com/watch?v=hF3jHrCod3U'

# You can't spell KABOOM without OOM - Anthony Towns
youtube-dl -t 'http://www.youtube.com/watch?v=p_u6BDFkybE'

# Time to harden up - SELinux is no longer an option - Steven Ellis
youtube-dl -t 'http://www.youtube.com/watch?v=dtclmj3H7ZU'

# Lazy Security in a Large Gateway - Mark Suter
youtube-dl -t 'http://www.youtube.com/watch?v=JIQa1Avn_bY'

# Easy Platform as a Service - Mark Atwood
youtube-dl -t 'http://www.youtube.com/watch?v=GUxUoVaNVPs'

# Storage Replication in High-Performance High-Availability (HPHA) Environments - Florian Haas
youtube-dl -t 'http://www.youtube.com/watch?v=l910kiEuHOM'

# Building a Non-Shared Storage HA Cluster with Pacemaker and PostgreSQL 9.1 - Keisuke Mori
youtube-dl -t 'http://www.youtube.com/watch?v=ON4QGfDkqwg'

# Extend Pacemaker to Support Geographically Distributed Clustering - Tim Serong on behalf of Jiaju Zhang
youtube-dl -t 'http://www.youtube.com/watch?v=S3DB_DSVI_A'

# Adventures in Logo Design: One Coder's Pain is Your Gain - Jon Cruz
youtube-dl -t 'http://www.youtube.com/watch?v=vmbvArpJ8Z8'

# High Availability Sprint: from the brink of disaster to the Zen of Pacemaker - Florian Haas
youtube-dl -t 'http://www.youtube.com/watch?v=3GoT36cK6os'

# mitmproxy - use and abuse of a hackable SSL-capable man-in-the-middle proxy - Jim Cheetham
youtube-dl -t 'http://www.youtube.com/watch?v=kQ1-0G90lQg'

# BITS: Running Python in GRUB to test BIOS and ACPI - Josh Triplett
youtube-dl -t 'http://www.youtube.com/watch?v=36QIepyUuhg'

# Tux in Space: High altitude ballooning - Joel Stanley, Mark Jessop
youtube-dl -t 'http://www.youtube.com/watch?v=rb8XOwacRKA'

# Australia's Toughest Linux Deployment - Sridhar Dhanapalan
youtube-dl -t 'http://www.youtube.com/watch?v=mWji2O3p-9s'

# How good are you, really? Improving your technical writing skills - Lana Brindley
youtube-dl -t 'http://www.youtube.com/watch?v=VePt7jcrs0M'

# What is in a tiny Linux installation - Malcolm Tredinnick
youtube-dl -t 'http://www.youtube.com/watch?v=4UU0Dd4dQ1I'

# POLICY CIRCLES - Freedom to Think Aloud - Dan McGarry
youtube-dl -t 'http://www.youtube.com/watch?v=GliEqba3loE'

# Developing accessible web applications - how hard can it be? - Silvia Pfeiffer, Alice Boxhall
youtube-dl -t 'http://www.youtube.com/watch?v=sVZ3tJj8DxI'

# An Introduction to Open vSwitch - Simon Horman
youtube-dl -t 'http://www.youtube.com/watch?v=_PCRNUB7oNw'

# HiPBX - HiAv VoIP with Open Source Software and 5000 Lines of Bash - Rob Thomas
youtube-dl -t 'http://www.youtube.com/watch?v=CpMifzcYSdU'

# Squashing SPOFs with Common Sense, Velcro, and a Hammer - Rob Thomas
youtube-dl -t 'http://www.youtube.com/watch?v=6mQ65Flmri8'

# CTDB Overview - Ronnie Sahlberg
youtube-dl -t 'http://www.youtube.com/watch?v=L7-QSbEEjS0'

# High Availability Login Services with Samba4 Active Directory - Kai Blin
youtube-dl -t 'http://www.youtube.com/watch?v=-EeqYbEwJU8'

# HA Lessons Learned from Darth Vader - Ronnie Sahlberg
youtube-dl -t 'http://www.youtube.com/watch?v=tnBz8212X5M'

# MySQL for the Developer in a Post-Oracle World - Adam Donnison
youtube-dl -t 'http://www.youtube.com/watch?v=oJ9HnFgC48s'

# MySQL and Postgres Cloud Offerings - Stewart Smith, Selena Deckelmann
youtube-dl -t 'http://www.youtube.com/watch?v=UFTp0zA4Mx8'

# Scaling Data: Postgres, The Stack and the Future of Replication - Selena Deckelmann
youtube-dl -t 'http://www.youtube.com/watch?v=Pdgzy7KoGWU'

# Rusty's Welcome - Rusty Russel
youtube-dl -t 'http://www.youtube.com/watch?v=xH3HgXZlsGk'

# Swift 101 - Monty Taylor
youtube-dl -t 'http://www.youtube.com/watch?v=mX25RtDvf8E'

# MySQL Web Infra Scaling and Keeping it Online, Cheaply - Arjen Lentz
youtube-dl -t 'http://www.youtube.com/watch?v=A4K-ZDDBRHI'

# Linux Australia AGM - Linux Australia Board
youtube-dl -t 'http://www.youtube.com/watch?v=XOWZa517pBI'

# VCS Interoperability - David Barr
youtube-dl -t 'http://www.youtube.com/watch?v=0hVuv-wv4Dw'

# Cloud meets Word Processor -- RDF and abiword in the Browser - Ben Martin
youtube-dl -t 'http://www.youtube.com/watch?v=RJ5EeLuMaAk'

# Android is not vi: mobile user experience for geeks - Paris Buttfield-Addison
youtube-dl -t 'http://www.youtube.com/watch?v=9zrZylL98k8'

# Torturing OpenSSL - Valeria Bertacco
youtube-dl -t 'http://www.youtube.com/watch?v=xKBlB8tejjI'

# Women in open technology and culture worldwide - Valerie Aurora, Mary Gardiner
youtube-dl -t 'http://www.youtube.com/watch?v=9LY5bo0JULU'

# Creating social applications with Telepathy and Libsocialweb - Dario Freddi
youtube-dl -t 'http://www.youtube.com/watch?v=3XxbqVqo83I'

# Lightning talks
youtube-dl -t 'http://www.youtube.com/watch?v=CoZmUsN91Xs'

# Keynote - Jacob Appelbaum
youtube-dl -t 'http://www.youtube.com/watch?v=GMN2360LM_U'

# Guerrilla Data Liberation - Henare Degan
youtube-dl -t 'http://www.youtube.com/watch?v=gFZReZk_KqE'

# The copyright safe harbour is no longer safe - Ben Powell
youtube-dl -t 'http://www.youtube.com/watch?v=wFqszZ8LCvM'

# Samba4: After the merge, ready for the real world - Andrew Bartlett, Andrew Tridgell
youtube-dl -t 'http://www.youtube.com/watch?v=zVjdOFjNNRQ'

# Best Of #1
youtube-dl -t 'http://www.youtube.com/watch?v=9bQc_z-Cb7E'

# Bloat: How and Why UNIX Grew Up (and Out) - Rusty Russell, Matt Evans
youtube-dl -t 'http://www.youtube.com/watch?v=Nbv9L-WIu0s'

# Best Of #2
youtube-dl -t 'http://www.youtube.com/watch?v=7y6CHpMauHw'

# Gang Scheduling in Linux Kernel Scheduler - Nikunj A Dadhania
youtube-dl -t 'http://www.youtube.com/watch?v=4SdzmT9gfQI'

# Best Of #3
youtube-dl -t 'http://www.youtube.com/watch?v=B_hueOIsdys'

# Best Of #4
youtube-dl -t 'http://www.youtube.com/watch?v=IfKF7mEY5Dc'

# Rescuing Joe - Andrew Tridgell
youtube-dl -t 'http://www.youtube.com/watch?v=ML__e_ZcWiQ'

Links to 119 LCA2012 videos so far

Posted Jan 21, 2012 19:24 UTC (Sat) by kragilkragil2 (guest, #76172) [Link]

Excellent! Thanks. That was fast.
The desktop summit should take note. Gnome events always take forever to post their videos.

Links to 119 LCA2012 videos so far

Posted Jan 23, 2012 14:42 UTC (Mon) by tsdgeos (guest, #69685) [Link]

The Desktop Summit is not a Gnome event ;-)

Links to 119 LCA2012 videos so far

Posted Jan 23, 2012 15:08 UTC (Mon) by halla (subscriber, #14185) [Link]

I was told that the videos for the 2011 desktop summit actually have all been lost. (And even though I was there, I wasn't able to attend all that many talks, so I had rather counted on being able to check out some presentations later on.)

Links to 119 LCA2012 videos so far

Posted Jan 25, 2012 7:04 UTC (Wed) by kragilkragil2 (guest, #76172) [Link]

Let me guess: Gnome people lost them, right? They seem to have a habit of doing so. Just try to find videos for their events (there are only two categories: either not there or years late).

Links to 119 LCA2012 videos so far

Posted Jan 26, 2012 9:03 UTC (Thu) by job (guest, #670) [Link]

<irony> Well, almost all users come to watch only the latest talks, so let's just remove the old ones to make the site easier to use! No options are required then, hardly even a navigation bar! </irony>

Links to 119 LCA2012 videos so far

Posted Feb 1, 2012 1:43 UTC (Wed) by keeperofdakeys (guest, #82635) [Link]

Here are the official videos, transcoded directly from the DV sources: http://linux.conf.au/wiki/index.php/Video. These were up within half a week of the conference, and most of the YouTube videos were up within two days afterwards. Considering that they had to do editing, cutting, etc., they did quite a good job.

Not unexpected

Posted Jan 20, 2012 23:41 UTC (Fri) by dcg (subscriber, #9198) [Link]

I don't find the claims of better performance over Ext4 surprising - the overall better performance of XFS had always been a given, or so I thought.

I will play the devil's advocate here: that Ext4 has improved so much in recent years in data-oriented workloads (and despite the huge shortcomings in the Ext design!) says a lot of good things about Ext.

Not unexpected

Posted Jan 21, 2012 1:04 UTC (Sat) by dlang (guest, #313) [Link]

XFS has been ahead of ext* for many workloads for a long time, but not for all. In general the more drives you had on a system, and the larger the files, the more likely it was that XFS would be ahead. It's been slower in dealing with lots of small files on a single drive.

That being said, you really do need to test your workload on various filesystems. I've done testing that showed a 4x performance difference between ext2 and ext3 on a particular workload (fsync-heavy small writes; ext2 was the clear winner), so it may not be what you expect.

As filesystems get more complex, the 'best' filesystem for a particular use case will not always be the same one.

XFS: the filesystem of the future?

Posted Jan 21, 2012 0:31 UTC (Sat) by hechacker1 (guest, #82466) [Link]

Any thoughts on which FS is better for a CPU starved router? I'm currently using ext4 with "journal_async_commit" and it can barely manage 15MB/s. The USB disk is capable of 30MB/s.

With these XFS changes, perhaps it will work better for cpu limited scenarios?

I did try JFS on it, and it was slightly faster, but it had a bad behavior of having to run fsck on it every time it was powered off (this router tends to get reset a lot). Ext4 in comparison recovers gracefully without needing to do a fsck. Not sure how XFS would respond to that.

The data isn't critical, since it's mostly just used as a shared temp storage.

XFS: the filesystem of the future?

Posted Jan 21, 2012 7:32 UTC (Sat) by jmalcolm (subscriber, #8876) [Link]

According to benchmarks from back in the 2.6.36 days, XFS does pretty well:

http://free.linux.hp.com/~enw/ext4/2.6.36-rc6/large_file_...

XFS: the filesystem of the future?

Posted Jan 21, 2012 21:52 UTC (Sat) by runekock (subscriber, #50229) [Link]

I don't think that the FS will matter much for you. If your CPU is maxed out, it is probably because the USB driver is inefficient.

XFS: the filesystem of the future?

Posted Jan 21, 2012 9:04 UTC (Sat) by jmalcolm (subscriber, #8876) [Link]

This article is timely for me as I just brought up a new Scientific Linux 6.2 (RHEL clone) server last week that uses XFS for most of the storage. This is my first new attempt to use XFS in years.

At one point, I tried very much to make a go of XFS, but I always ran into compatibility issues that made me regret it. Thankfully, I never had any of the reliability problems that XFS used to suffer from when running on commodity (unreliable) hardware. It feels, though, that XFS on Linux has finally grown up.

SGI Altix customers, like NASA, run XFS systems in the hundreds of terabytes (although I am sure some of those are CXFS). Also, XFS is a fully supported filesystem in RHEL6 (including xfsprogs). My understanding is that Red Hat now employs the majority of the XFS developers.

Filesystem of the future? Btrfs and ZFS are more feature rich although adding LVM2 and mdraid to XFS closes the gap. Of course, even that setup lacks deduplication. That said, given the performance and current stability of XFS, perhaps it is the right filesystem for today.

XFS: the filesystem of the future?

Posted Jan 21, 2012 10:45 UTC (Sat) by drag (guest, #31333) [Link]

dedupe is overrated.

From what I understand, on Solaris, in order to have efficient dedupe you need to be able to maintain a table of the 'deduped' items in RAM. That way, when you need to access a file, the filesystem knows where the actual bits are located without having to look them up. Something like that.

http://constantin.glez.de/blog/2011/07/zfs-dedupe-or-not-...

It does not seem to be a significant advantage, as in many situations 5GB of RAM is considerably more expensive than 1TB of disk space.

The amount of RAM required to keep ZFS happy can be staggering sometimes.
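For what it's worth, ZFS can estimate that table before you commit to it; a minimal sketch, assuming a pool named "tank" (the pool name and the per-entry cost are assumptions based on commonly quoted guidance, roughly 320 bytes per DDT entry):

# simulate dedup and print a DDT histogram without actually enabling it
zdb -S tank
# rough RAM estimate: total DDT entries x ~320 bytes each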

However, the kick-ass things that more modern filesystems bring are online compression, checksumming, RAID-like features, and easy subvolumes. That sort of thing is very nice to have from an administrative, integrity, and availability viewpoint...

XFS: the filesystem of the future?

Posted Jan 21, 2012 18:13 UTC (Sat) by jmalcolm (subscriber, #8876) [Link]

Indeed. I have also read that deduplication on ZFS is 'broken' but I did not have a link to back that up. Still, technology has a way of improving such that 'resource usage' and maturity concerns of today become unimportant in 'the future'.

I agree that snapshotting/cloning are exciting features of systems like ZFS. The 'time-slider' that Sun added to Nautilus in OpenSolaris invoked quite a lot of jealousy in me. I also thought that Nexenta integrating ZFS into 'apt-get' with 'apt-clone' was simply brilliant. (Note: on Ubuntu, 'apt-clone' is something else)

As a developer, I sometimes do silly things like building or installing a bleeding-edge version of an important library, which I later regret. I would love to have a simple and seamless way to roll back the clock, or to easily hit the save button just before I do something stupid. Version control is great for code repositories, but it does not really help me when I mess up my filesystem or install a broken version of an IDE. Not that I do those kinds of things, of course...

XFS: the filesystem of the future?

Posted Jan 26, 2012 11:53 UTC (Thu) by jospoortvliet (guest, #33164) [Link]

Actually, openSUSE and SLE do this using btrfs. It's built into the zypper package manager and additionally has command-line (snapper) and GUI (in YaST) interfaces.

Based on btrfs, a time slider in a GUI file manager would be possible too, I'm sure, either using btrfs directly or as a GUI to snapper (but that would be (open)SUSE-specific unless other distros pick up on snapper).
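To give a flavour of the command-line side, a minimal snapper sketch (the description and snapshot numbers are illustrative):

# take a snapshot before doing something risky
snapper create --description "before hacking on libfoo"
# list snapshots, then roll back the changes made between two of them
snapper list
snapper undochange 1..2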

XFS: the filesystem of the future?

Posted Feb 7, 2012 1:38 UTC (Tue) by jmalcolm (subscriber, #8876) [Link]

I did not know that about SUSE. Thanks.

XFS: the filesystem of the future?

Posted Jan 21, 2012 23:10 UTC (Sat) by cmccabe (guest, #60281) [Link]

Yeah, I agree that dedupe is overrated, for most applications.

Also keep in mind that the more compressed and de-duped your data is, the more likely it is that you'll lose data when there's a hardware problem. Some filesystems, like HDFS, actually write out the data three times or more, which is a kind of anti-deduplication.

XFS: the filesystem of the future?

Posted Jan 23, 2012 14:29 UTC (Mon) by jezuch (subscriber, #52988) [Link]

> Yeah, I agree that dedupe is overrated, for most applications.

On the other hand, cp --reflink is quite awesome.
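For anyone who hasn't tried it, a minimal example on a btrfs filesystem (file names are just placeholders):

# instant "copy" that shares extents until either copy is modified
cp --reflink=always vm-image.raw vm-image-clone.raw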

> Also keep in mind that the more compressed and de-duped your data is, the more likely it is that you'll lose data when there's a hardware problem. Some filesystems, like HDFS, actually write out the data three times or more, which is a kind of anti-deduplication.

I guess that native RAID-ing in the filesystem is expected to offset this risk, in any "normal" situation at least.

XFS: the filesystem of the future?

Posted Jan 23, 2012 18:48 UTC (Mon) by martinfick (subscriber, #4455) [Link]

It seems unfair to say that dedupe is overrated if you are only basing this on a single implementation (ZFS). There are many ways to dedupe which do not suffer from the same RAM problem (COW comes to mind), and I suspect that many more will be implemented in the future.

Also, I suspect that you may not have considered that, while RAM is indeed expensive compared to disks, deduping files will, if implemented properly, actually save RAM when a single file can be cached instead of many. Vserver unification, while not a full-featured dedup, does allow for this RAM saving, which can be huge in virtualised environments (and more).

XFS: the filesystem of the future?

Posted Jan 24, 2012 21:42 UTC (Tue) by wazoox (subscriber, #69624) [Link]

Generally speaking, dedupe trades CPU and RAM for storage space. It also has the serious drawback of making many sequential I/Os random. It probably makes sense when your storage stack is horribly expensive, or when you really need to squeeze some more bandwidth out of a replicated system, etc. However, given current hard drive prices (even with the current 50% price hike) and subsystem performance (any 500-buck RAID card can do 1GB/s), it's almost always a gain only for the vendor.

XFS: the filesystem of the future?

Posted Jan 24, 2012 22:36 UTC (Tue) by khim (subscriber, #9252) [Link]

On the other hand, dedupe is a pretty good fit for SSDs. SSDs are expensive (albeit less expensive than RAM) and seeks are not as important.

XFS: the filesystem of the future?

Posted Jan 25, 2012 7:48 UTC (Wed) by wazoox (subscriber, #69624) [Link]

That's true, but so far dedupe is mostly touted for secondary-level storage, so SSDs are a bit of a stretch.

XFS: the filesystem of the future?

Posted Jan 28, 2012 20:36 UTC (Sat) by robbe (guest, #16131) [Link]

> in many situations 5GB is considerably more expensive then 1TB of disk
> space.

How do you figure? That amount of memory sets me back less than double the cost of the disk space, but *only* if comparing ECC RAM to cheap & big SATA storage. Go to SAS, as used in many servers, and the ratio is more like 1:1.

And that's not even considering RAID, where net capacity is not 100%.

Of course, in many environments, you need to get RAM and disks from your server vendor, and they mark up prices arbitrarily ... so your numbers can come out completely different.

XFS: the filesystem of the future?

Posted Jan 22, 2012 13:47 UTC (Sun) by dgc (subscriber, #6611) [Link]

> Filesystem of the future? Btrfs and ZFS are more feature rich although
> adding LVM2 and mdraid to XFS closes the gap. Of course, even that setup
> lacks deduplication.

I address this point in the presentation - XFS is not trying to replace BTRFS, as XFS has a fundamentally different view of data from BTRFS and ZFS. That is, XFS does not "transform" user data (e.g. CRC, encrypt, compress or dedupe) as it passes through the filesystem. All XFS does is provide an extremely large pipe to move data between the application and the storage hardware. There is no way we can scale to tens of GB/s of data throughput if we have to run CPU-based calculations on every piece of data that passes through the filesystem.

This is a fundamental limitation of filesystems like BTRFS and ZFS - they assume that there is CPU and memory available to burn for the transformations and that they scale arbitrarily well. If you are limited on your CPU or memory (e.g. your application is using it!) then hardware offload is the only way you can scale such data transforms. At that point, you may as well be using XFS.

I.e. BTRFS, with all its features enabled, will only scale up to a certain point, but there are already many people out there with performance requirements well above that cross-over point. It's above that cross-over point that I see XFS as "the filesystem of the future". Indeed, I expect the combination of BTRFS for system/binary/home filesystems and XFS for production data filesystems to become a quite common server configuration in the not too distant future....

Dave.

XFS: the filesystem of the future?

Posted Jan 23, 2012 14:24 UTC (Mon) by masoncl (subscriber, #47138) [Link]

Dave gave us (the btrfs list) a chance to optimize things for his runs a few weeks ago. We've got patches in hand that do make it much faster, but the biggest improvement is just using a larger btree block size. That lets us dramatically reduce the metadata required to track the extents.

XFS is putting out awesome numbers in these workloads, well done.

XFS: the filesystem of the future?

Posted Feb 2, 2012 12:21 UTC (Thu) by ArbitraryConstant (guest, #42725) [Link]

> This is a fundamental limitation of filesystems like BTRFS and ZFS - they assume that there is CPU and memory available to burn for the
> transformations and that they scale arbitrarily well. If you are limited on your CPU or memory (e.g. your application is using it!) then hardware offload
> is the only way you can scale such data transforms. At that point, you may as well be using XFS.

Doesn't that more or less depend on the storage/volume management though?

With thin provisioning on a SAN, many of the potential gains from btrfs are already covered and there's no sense paying for them twice. Reasonable data integrity protection is available if appropriately configured. In that case, yes, XFS has a lot going for it.

But on local disk, none of the options look that great. LVM can't do non-crap thin provisioning. In that case, if you selectively set nodatacow on btrfs and succeed in having it act like other filesystems, that's really a huge win. Nodatacow is useful for things like databases that frequently (e.g. mysql/innodb) implement their own CRCs, while filesystem CRCs remain available for other applications on the same storage pool.
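As an aside, per-file or per-directory nodatacow is normally set via the NOCOW file attribute; a minimal sketch, with the path purely illustrative (the attribute has to be set before any data is written, e.g. on an empty directory so new files inherit it):

# files created in this directory will be nodatacow on btrfs
mkdir /srv/mysql
chattr +C /srv/mysql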

The btrfs guys seem pretty focused on fsck for now, but RAID5/6 and subvolume/file-level RAID are in the works. There's no non-crap way to do this on local disk, certainly not in any way that's easy to resize or thin provision, but mixing RAID levels is no problem for a SAN.

You suggest XFS shouldn't be regarded as targeted towards big iron because its performance is relevant to current and future inexpensive hosts, but it still seems pretty specialized in that direction if inexpensive hosts need high-end storage to get important features. Btrfs brings SAN functionality within the ambit of cheap local storage.

Having CPU to burn won't always be true, but we're getting a lot of cores these days, we're getting them cheaper than most storage solutions, and they're getting CRC acceleration instructions. I wouldn't be surprised if CRC performance on one of these 16+ thread CPUs were indistinguishable from memory bandwidth.

XFS: the filesystem of the future?

Posted Jan 21, 2012 11:55 UTC (Sat) by and (guest, #2883) [Link]

> For one or two threads, XFS is still slightly slower than ext4, but it
> scales linearly up to eight threads, while ext4 gets worse, and btrfs gets
> a lot worse

I don't know which benchmark result motivated this statement, but the image on the right still shows ext4 being twice as fast as the new XFS, at least for one and two threads...

XFS: the filesystem of the future?

Posted Jan 21, 2012 16:29 UTC (Sat) by hmh (subscriber, #3838) [Link]

The thing is, on any server worthy of notice you will have 8 to 32 cores, and mkfs.xfs will usually give you at least 4 allocation groups per filesystem.

AFAIK, that means you'll be using XFS with at least 4 threads per filesystem on servers. Also, the lower IOPS matters a lot.

Desktops and laptops might well be best served by a different fs, which is fine. Use the best tool for the job, and the tools need not all be the same tool.

XFS: the filesystem of the future?

Posted Jan 21, 2012 22:20 UTC (Sat) by csamuel (✭ supporter ✭, #2624) [Link]

Maybe I'm misunderstanding, but I always thought you would want as many IOPS as possible from your storage system. Otherwise you're going to bottleneck horribly on some codes (yes, badly written Java bioinformatics codes with your 1-byte synchronous I/Os, I'm looking at you).

XFS: the filesystem of the future?

Posted Jan 21, 2012 23:57 UTC (Sat) by dlang (guest, #313) [Link]

Some software is limited by IOPS, other software is limited by throughput. You can't say that one is all that matters.

XFS: the filesystem of the future?

Posted Jan 22, 2012 10:56 UTC (Sun) by ttonino (guest, #4073) [Link]

I think the 'IOPS' in the above graphs are the result of the benchmarking itself. So from 2 to 4 threads, the ext4 benchmark result goes up a little, but the number of I/Os hitting the disks explodes tenfold.

Which means that ext4 starts to produce inefficient I/O patterns with multiple threads, while XFS is better at combining the I/Os.

Compare it to the CPU load while running a disk benchmark: you want that to be as low as possible relative to throughput.

XFS: the filesystem of the future?

Posted Jan 23, 2012 0:26 UTC (Mon) by dgc (subscriber, #6611) [Link]

Hi hmh,

> The thing is, on any server worth of notice you will have 8 to 32 cores,
> and mkxfs will usually give you at least 4 aggregation groups per
> filesystem.

The default for non-RAID devices (single drives or hardware RAID that does not advertise its configuration) is 4 AGs below 4TB, scaling at 1 AG per TB beyond that. I was testing on a 17TB volume, so the default mkfs configuration gave 17 AGs, because the virtio device doesn't pass on RAID alignment from the host.

I even pointed out in the talk some performance artifacts in the distribution plots that were a result of separate threads lock-stepping at times on AG resources, and that increasing the number of AGs solves the problem (and makes XFS even faster!). E.g. at 8 threads, XFS unlink is about 20% faster when I increase the number of AGs from 17 to 32 on the same test rig.

If you have a workload with heavy concurrent metadata modification, then increasing the number of AGs might be a good thing. I tend to use 2x the number of CPU cores as a general rule of thumb for such workloads, but the best tunings are highly dependent on the workload, so you should start just by using the defaults. :)
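For concreteness, a minimal sketch of that tuning at mkfs time (the device name and count are illustrative; as noted above, start with the defaults):

# 32 allocation groups for a heavily concurrent metadata workload on a 16-core box
mkfs.xfs -d agcount=32 /dev/vdb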

> Desktops and laptops might well be best served by a different fs, which is fine.

Which for me all run XFS.

Indeed, laptops and desktops are significantly more powerful than you give them credit for. My desktop has 8 CPU threads, 500MB/s of IO bandwidth and can do 70,000 random 4k write IOPS, and it cost less than $AU1500 when I bought it a couple of years ago. That's a serious amount of capability at very low cost, yet ext4 struggles to make use of all that capability. XFS, OTOH, is a perfect fit for such configurations, especially if you are running highly parallel applications (like kernel builds) all the time on your desktop....

This is one of the things I'm trying to make people aware of - that high-performance (SSD) and large-scale storage (4TB drives) are here right now and are affordable on your desktop and laptop. Filesystem concurrency and high throughput are not a supercomputer or high-end server problem anymore - what you need from your desktop filesystem to use the maximum potential of your storage hardware is very different from what was needed 5 years ago....

> Use the best tool for the job, and the tools need not all be the same
> tool.

I couldn't have said it any better myself. :)

Dave.

XFS: the filesystem of the future?

Posted Feb 7, 2012 2:14 UTC (Tue) by jmalcolm (subscriber, #8876) [Link]

This is really one of my favourite things about Open Source. Much of the amazing technology that allows me to get the most out of my modest hardware was originally designed for very high-end workstation and data-center use. As commodity hardware approaches the capabilities of yesterday's high end, the benefits of the software trickle down to the rest of us.

XFS was not designed for my home media server or my development laptop, but I sure enjoy it anyway.

XFS: the filesystem of the future?

Posted Jan 22, 2012 12:59 UTC (Sun) by dgc (subscriber, #6611) [Link]

Hi and,

Scalability is not about peak single-thread throughput - it's about how much of that per-CPU throughput is maintained as concurrency increases. Also, there were more results in the presentation than just those in the article above.

As I explained in the talk, these are simple workloads designed to demonstrate fundamental differences in behaviour between the filesystems. The absolute numbers don't matter - it's the trends in performance that are important. To summarise, XFS shows linear scaling in both throughput and IO patterns from 1 to 8 threads for file creation, traversal and removal. ext4 goes extremely non-linear at the IO level as the thread count increases and that's what causes throughput to suffer. BTRFS has a mix of CPU usage and IO scalability issues - sometimes it is CPU bound, other times it is IO bound like ext4 - and so goes non-linear for different reasons.

FWIW, I didn't run numbers for 16 threads for the presentation simply because I didn't have a week to wait for ext4 and BTRFS to create, traverse and unlink the 400 million files such a workload would have created. If they ran as fast as XFS (roughly 6 hours for the complete 16-thread workloads), then I would have presented them....

Dave.

XFS: the filesystem of the future?

Posted Jan 22, 2012 13:30 UTC (Sun) by and (guest, #2883) [Link]

Hi Dave,

my comment was not meant as an attack on XFS (I think it's a really nice file system and I even used it for a while); I was just wondering about the (IMHO) glaring discrepancy between the article's text and the actual results shown.

On a different matter: does XFS still scale (almost) linearly for 16 or 32 threads? If yes, then probably neither XFS nor the VFS layer is hitting the scalability wall yet.

To summarize how I understood you: you recommend using XFS for "big storage" systems, while -- for the time being -- the desktop use case is still better served by ext4?

XFS: the filesystem of the future?

Posted Jan 22, 2012 15:06 UTC (Sun) by dgc (subscriber, #6611) [Link]

> my comment was not meant as an attack on XFS

I didn't take it that way - I was trying to explain what it meant ;)

> (I think it's a really
> nice file system and I even used it for a while), I was just wondering
> about the IMHO glaring discrepancy between the article's text and the
> actual results shown.

There's always the problem of context. Jon has done a pretty good job of communicating my message, but it's hard to convey all the context that I put around those charts. IOWs, for explanations of the finer details of the charts you really should watch the video first - I do actually explain that ext4 is faster for 1-2 threads on most of the workloads I presented and I explain why that is the case, too...

> On a different matter: does XFS still scale (almost) linearly for
> 16 or 32 threads? if yes, neither XFS nor the VFS layer are probably
> hitting the scalability wall.

That's in the talk, too, and I think Jon mentioned it in the article. ;)

For some raw numbers at 16 threads, XFS performance increases from about 100k file creates/s to about 130k file creates/s, but CPU usage doubles and most of it is wasted spinning on the contended VFS inode and dentry cache LRU locks. I have some patches that I'm working on to minimise that problem.

> to summarize how I understood you: you recommend using XFS for "big
> storage" systems, while -- for the time being -- the desktop use-case
> is still better served by ext4?

No, that's exactly the opposite of what I am saying. My point is that, even for desktop use cases, XFS is now so close to ext4 performance on single-threaded, metadata-intensive workloads that ext4 has lost the one historical advantage it held over XFS.

The example I use in the talk is untarring a kernel tarball - XFS used to take a minute, ext4 about 3s. That's one of the common "20x slower" workloads that people saw all the time. Now XFS will do that same untar in 4s. And the typical "50x slower" workload was then doing an 'rm -rf' on that unpacked kernel tarball; XFS has gone from about a minute down to 3s, compared to 2s for ext4....
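For anyone wanting to reproduce that kind of comparison, a rough sketch of the workload being described (the kernel version and paths are illustrative, and the numbers will obviously vary with hardware):

# metadata-heavy "untar then remove" micro-benchmark
time tar xf linux-3.2.tar.bz2
sync
time rm -rf linux-3.2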

While these XFS numbers are still slightly slower from a "benchmark perspective", in practice most people will consider them both to be "instantaneous" because they now both complete faster than the time it takes to type your next command. IOWs, users will notice no practical difference between the performance of the filesystems for common desktop/workstation usage.

Let's now fast forward a few months: your desktop and server system filesystems will be BTRFS, whilst XFS is fast enough and scales well enough for everything else that BTRFS can't be used for. I can't see where ext4 fits into this picture because, AFAIC, XFS is now the better choice for just about every workload you wouldn't use BTRFS for....

So I'm not sure ext4 has a future - ext4 was always intended as a stop-gap measure until BTRFS is ready to take over as the default Linux filesystem. That milestone is rapidly approaching, so now is a good time to look at the future of the other filesystems that are typically used on systems. Everyone knows what I think now, so I'm very interested in what users and developers think about what I've said and the questions I've posed. The upcoming LSF workshop could be very interesting. :)

Dave.

XFS: the filesystem of the future?

Posted Jan 21, 2012 22:35 UTC (Sat) by dcg (subscriber, #9198) [Link]

I'm curious - what are the potentially unsolvable problems of Btrfs?

XFS: the filesystem of the future?

Posted Jan 22, 2012 15:31 UTC (Sun) by dgc (subscriber, #6611) [Link]

Hi DiegoCG,

There are several issues that I can see. The big one is that BTRFS metadata trees grow very large, and as they grow larger they get slower, because it takes more IO to get to any given piece of metadata in the filesystem. When you have a metadata tree that contains 150GB of metadata (which is what the 8-thread benchmarks I was running ended up with), finding things can take some time and burn a lot of IO and CPU.

This shows up with workloads like the directory traversal - btrfs has a lot more dependent reads to get the directory data from the tree than XFS or ext4 and so is significantly slower at such operations. Whether that can be fixed or not is an open question. Rebalancing (expensive) and larger btree block sizes (mkfs option) are probably ways of reducing the impact of this problem, but metadata tree growth can't actually be avoided.

Another problem is that as a BTRFS filesystem ages, it becomes fragmented due to all the COW that is done. Sequential read IO performance will degrade over time as the data gets moved around more widely. Indeed, as the filesystem fills from the bottom up, the distance between where the file data was first written and where the next COW block is written will increase. Hence, on spinning rust, seek times when reading will also increase, as the physical distance between logically sequential data grows as the filesystem ages. Automatic defrag is the usual way to fix this, but that can be expensive if it occurs at the wrong time...

Then there is the amount of IO that BTRFS does - for a COW filesystem that is supposed to be able to do sequential write IOs, it does an awful lot of small writes and a lot of seeks. Indeed, the limiting factor in all my testing was that BTRFS rapidly became IOPS bound at about 6000 IOPS - sometimes even on single-threaded workloads. Part of that is the RAID1 metadata, but even when I turned that off it still drove the disk way harder than XFS and was IOPS bound more than half the time. I'm sure this is fixable to some extent, but I'd suggest there's lots of work to be done here, because it ties into the transaction reservation subsystem and how it drives writeback.

[ As an aside, that was one of the big changes I talked about for XFS - making metadata writeback scale. In most cases for XFS, that is driven by the transaction reservation subsystem, just as it is in BTRFS. It's not a simple problem to solve :/ ]

The last thing I'll mention only briefly, because I've already said some stuff about it, is the scalability of the data transformation algorithms in BTRFS. There is already considerable effort going into reducing the overhead of the transformations, but the problem may not be solvable for everyone - you can only make compression/CRCs/etc. so fast while using only so much memory.

I could keep going, but this will give you an idea of some of the problems that are apparent from the scalability testing I was doing....

Dave.

XFS: the filesystem of the future?

Posted Jan 22, 2012 20:53 UTC (Sun) by dcg (subscriber, #9198) [Link]

I didn't know that Btrfs needed more IO for metadata... I'm not sure why. Maybe the generic btree and the extra fields needed for the btrfs features make the btree less compact? Or does the double indexing of the contents of a directory double the size of the metadata?

I understand the concerns about fragmentation of data due to COW - my workstation runs on top of Btrfs and some files are so fragmented that they don't seem like files anymore (.mozilla/firefox/loderdap.default/urlclassifier3.sqlite: 1145 extents found).
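For reference, that extent count is the sort of thing filefrag reports; an illustrative invocation (the profile directory is a placeholder):

# show how many extents a file occupies
filefrag ~/.mozilla/firefox/*/urlclassifier3.sqlite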

But COW data fragmentation isn't just one side of the coin - I guess you could say that non-COW filesystems such as XFS also suffer "write fragmentation" (although I don't know how much of a real problem that is). From this point of view, using COW or not for data may be mostly a matter of policy. And since Btrfs can disable data COW not just for entire filesystems, but for individual files/directories/subvolumes, it doesn't really seem like a real problem - "if it hurts, don't do it". And the same applies to data checksums and the rest of the data transformations.

As for the issue of making metadata writeback scalable, that was probably the most interesting part of your talk. I imagined it as a particular case of soft updates.

XFS: the filesystem of the future?

Posted Jan 22, 2012 23:49 UTC (Sun) by dgc (subscriber, #6611) [Link]

> I didn't know that Btrfs needed more IO for metadata...

It might be butter, but it is not magic. :)

> But COW data fragmentation isn't just the reverse of a coin - I guess you
> could say that non-COW filesystem such as XFS also suffer "write
> fragmentation" (although I don't know how much of a real problem is).

What I described is more of an "overwrite fragmentation" problem, which non-COW filesystems do not suffer from at all for data or metadata. They just overwrite in place, so if the initial allocation is contiguous, it remains that way for the life of the file/metadata. Hence you don't get the same age-based fragmentation and the related metadata explosion problems on non-COW filesystems.

> From this point of view, using COW or not for data may be mostly a matter
> of policy. And since Btrfs can disable data COW not just for the entire
> filesystems, but for individual files/directories/subvolumes, it doesn't
> really seem a real problem - "if it hurts, don't do it". And the same
> applies for data checksums and the rest of data transformations.

Sure, you can use nodatacow on BTRFS, but then you are overwriting in place and BTRFS cannot do snapshots or any data transforms (even CRCs, IIRC) on such files. IOWs, you have a file that behaves exactly as it would on a traditional filesystem, and you have none of the features or protections that made you want to use BTRFS in the first place. IOWs, you may as well use XFS to store nodatacow files, because it will be faster and scale better. :P

Dave.

XFS: the filesystem of the future?

Posted Jan 23, 2012 14:38 UTC (Mon) by masoncl (subscriber, #47138) [Link]

Most of the metadata reads come from tracking the extents. All the backrefs we store for each extent get expensive in these workloads. We're definitely reducing this, starting with just using bigger btree blocks. Longer term, if that doesn't resolve things, we'll make an optimized extent record just for the metadata blocks.

I only partially agree on the CRCs. The Intel crc32c optimizations do make it possible for a reasonably large server to scale to really fast storage. But the part where we hand IO off to threads introduces enough latency to be noticeable in some benchmarks on fast SSDs.

Also, since we have to store the CRC for each 4KB block, we end up tracking much more metadata for a file with CRCs on (this is a much bigger factor than the computation time).

With all of that said, there's no reason Btrfs with CRCs off can't be as fast as XFS for huge files on huge arrays. Today, though, XFS has decades of practice and infrastructure in those workloads.

XFS: the filesystem of the future?

Posted Jan 22, 2012 20:45 UTC (Sun) by kleptog (subscriber, #1183) [Link]

One problem is the copy-on-write nature of the filesystem. While for many usages this is not a problem, and possibly even a benefit, it makes the filesystem completely unsuitable to run a database on.

Databases care about reliability and speed. One thing they do is preallocate space for journals to ensure it exists after a crash; data is then updated in place. On a COW filesystem these otherwise efficient methods turn your nicely sequentially allocated tables into enormously fragmented files. Of course, SSDs will make fragmentation moot, but the $/GB for spinning disks is still a lot lower.
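To illustrate the preallocation half of that: databases typically call fallocate()/posix_fallocate() internally, and the shell equivalent below (with an illustrative file name) shows the same idea:

# reserve 1GB of space up front, to be overwritten in place later
fallocate -l 1G wal-segment.preallocated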

I guess this is because databases and filesystems are trying to solve some of the same problems. A few years ago I actually expected filesystems to export transaction-like features to userspace programs (apparently NTFS does, but that's no good on Linux), but I see no movement on that front.

For example, the whole issue of whether to flush files on rename becomes moot if the program can simply make clear that this is supposed to be an atomic update of the file. This would give the filesystem the necessary information to know that it can defer the writes to the new file, just as long as the rename comes after. Right now there's no way to indicate that.

If you have transactions you don't need to rename at all, just start a transaction, rewrite the file and commit. Much simpler.

XFS: the filesystem of the future?

Posted Jan 22, 2012 23:46 UTC (Sun) by coolcold (guest, #82499) [Link]

Can you please specify the kernel versions for vanilla/RHEL you are talking/writing about? As far as I could understand it is 3.0.x for vanilla, but my English is not good enough to catch what was said in http://www.youtube.com/watch?feature=player_detailpage&...

It would be nice to add the version to the slides too.

XFS: the filesystem of the future?

Posted Jan 23, 2012 0:25 UTC (Mon) by dgc (subscriber, #6611) [Link]

Hi CoolCold,

The upstream kernel that was used for most of the testing was 3.2-rc6, and the version of RHEL that has all the XFS improvements in it is RHEL 6.2.

Dave.

XFS: the filesystem of the future?

Posted Jan 23, 2012 12:44 UTC (Mon) by wdaniels (guest, #80192) [Link]

"It just is not practical to take a petabyte-scale filesystem offline to run a filesystem check and repair tool; that work really needs to be done online in the future."

It's also not practical sometimes to backup/format/restore just to shrink some much smaller volumes. It takes time to copy 6TB of data even over a gigabit link. The last time I tried to use XFS I got caught out because I didn't know that you can't just shrink an XFS filesystem the way you can with ext4.

Has this changed? If not, is it ever likely to?

"So, he asked: why do we still need ext4?"

The problem for me is that the choice of filesystem is not _usually_ significant _enough_ for whatever I'm doing to justify researching the differences in much depth. I used to be more inclined to experiment, but got caught out too many times and ended up losing time on projects because of it.

It's not often very important to be able to easily shrink a filesystem, so long as you know in advance that you're not going to be able to do it.

I pretty much know where I am with ext so that is always my first choice until I next encounter some particular requirement for best performance in some way or another.

So that really is the point of ext4 as far as I'm concerned...fewer surprises for people with other priorities. And I think that reasoning holds up well for choosing a default filesystem in distros.

XFS: the filesystem of the future?

Posted Jan 23, 2012 17:35 UTC (Mon) by sandeen (subscriber, #42852) [Link]

IIRC Dave addressed a similar question about shrink in the talk, with something like "datasets generally get bigger, not smaller", and when pressed, suggested that one should (could?) instead use thin provisioning to manage dynamically changing space requirements.

As you point out, there is value to familiarity, but it's also worth poking at the familiar now and then, to see if that familiarity is enough to warrant automatic selection...

XFS: the filesystem of the future?

Posted Jan 23, 2012 22:41 UTC (Mon) by dgc (subscriber, #6611) [Link]

[stuff about shrinking filesystems]

There is no reason why we can't shrink an XFS filesystem - it's not rocket science but there's quite a bit of fiddly work to do and validate:

http://xfs.org/index.php/Shrinking_Support

If you really want shrink support, there's nothing stopping you from doing the work - we'll certainly help as needed, and test and review the changes. That invitation is extended to anyone who wants to help implement it and write all the tests needed to validate the implementation. I'd estimate about a man-year of work is needed to get it production ready.

However...

The reason it hasn't been done is that there is basically no demand for shrinking large filesystems. Storage is -cheap-, and in most environments data sets and capacity only grow.

I mentioned thin provisioning in my talk when asked about shrinking - it makes shrinking a redundant feature. All you need to do is run fstrim on the filesystem to tell the storage what regions are unused and all that unused space is returned to the storage free space pool. The filesystem has not changed at all, but the amount of space it consumes is now only the allocated blocks. It will free up more space than even shrinking the filesystem will....
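
In practice that is a single command; as a minimal sketch (the mount point here is just an example):

fstrim -v /data     # report how much unused space was discarded back to the pool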

Further, shrinking via thin provisioning is completely filesystem independent, so the "shrink" method is common across all filesystems that support discard operations. IOWs, there's less you need to know about individual filesystem functionality...

In comparison, shrinking is substantially more complex, requires moving data, inodes, directories and other metadata around (i.e. new transactions), requires some tricky operations (like moving the journal!), invalidates all your incremental backups (because inode numbers change), and on top of it all you have to be prepared for a shrink to fail. This means you need to take a full backup before running the operation. If a shrink operation fails you could be left in an unrecoverable situation, requiring a mkfs/restore to recover from. At that point, you may as well just do a dump/mkfs/restore cycle...

IOWs, shrinking is not a simple operation, it has a considerable risk associated with it, and requires a considerable engineering and validation effort to implement it. Those are good arguments for not supporting it, especially as thin provisioning is a more robust and faster way of managing limited storage pools.

Dave.

XFS: the filesystem of the future?

Posted Jan 23, 2012 23:10 UTC (Mon) by dlang (guest, #313) [Link]

shrinking can fail, that I agree with.

but shrinking should never fail in a way that leaves the filesystem invalid.

block numbers will need to change, but I don't see why inode numbers would have to change (and if you don't change those, then lots of other problems vanish), they are already independent of where on the disk the data lives.

this seems fairly obvious to me, what am I missing that makes the simple approach of

identify something to move
copy the data blocks
change the block pointers to the new blocks
free the old blocks
repeat until you have moved everything

not work? (at least for the file data)

if you try to do this on a live filesystem, then you need to do a lot of locking and other changes to make sure the data doesn't change under you (and that new data doesn't go into the space you are trying to free), but if the filesystem is offline for the shrink this shouldn't be an issue.

moving metadata will be more complex, but the worst case should be that something can't be moved, and so you can't shrink the filesystem beyond that point, but there should still be no risk.

XFS: the filesystem of the future?

Posted Jan 24, 2012 0:03 UTC (Tue) by dgc (subscriber, #6611) [Link]

> block numbers will need to change, but I don't see why inode numbers
> would have to change (and if you don't change those, then lots of other
> problems vanish), they are already independent of where on the disk the
> data lives.

Inode numbers in XFS are an encoding of their location on disk. To shrink, you have to physically move inodes and so their number changes.

> this seems fairly obvious to me, what am I missing that makes the
> simple approach of

[snip description of what xfs_fsr does for files]

> not work? (at least for file data)

Moving data and inodes is trivial - most of that is already there with the [almost finished] xfs_reno tool (which moves inodes) and the xfs_fsr tool (which moves data). It's all the other corner cases that are complex and very hard to get right.

The "identify something to move" operation is not trivial in the case of random metadata blocks in the regions that will be shrunk. A file may have all it's data in a safe location, but it may have metadata in some place that needs to be moved (e.g. an extent tree block). Same for directories, symlinks, attributes, etc. That currently requires a complete metadata tree walk which is rather expensive. It will be easier and much faster when the reverse mapping tree goes in, though.

The biggest piece of work is metadata relocation. For each different type of metadata that needs to be relocated, the action is different - reallocation of the metadata block and then updating all the sibling, parent and multiple index blocks that point to it is not a simple thing to do. It's easy to get wrong and hard to validate. And there are a lot of different types. e.g. there are 6 different types of metadata blocks with multiply interconnected indexes in the directory structure alone.

> if you try to do this on a live filesystem

If we want it to be a fail-safe operation then it can only be done online. xfs_fsr and xfs_reno already work online and are fail-safe. Essentially, every metadata change must be atomic and recoverable and that means it has to be done through the transaction subsystem. We don't have a transaction subsystem implemented for offline userspace utilities, so a failure during an offline shrink would almost certainly result in a corrupted filesystem or data loss. :(

In case you hadn't guessed by now, one of the reasons we haven't implemented shrinking is that we know *exactly* how complex it actually is to get it right. We're not going to support a half-baked implementation that screws up, so either we do it right the first time or we don't do it at all. But if someone wants to step up to do it right then they'll get all the help they need from me. ;)

Dave.

XFS: the filesystem of the future?

Posted Jan 24, 2012 0:41 UTC (Tue) by dlang (guest, #313) [Link]

> Inode numbers in XFS are an encoding of their location on disk. To shrink, you have to physically move inodes and so their number changes.

If I understand this correctly, this means that a defrag operation would have the same problems. Does this mean that there is no way (other than backup/restore) to defrag XFS?

as for the rest of the problems (involving moving metadata), would a data-only shrink that couldn't move metadata make any sense at all?

XFS: the filesystem of the future?

Posted Jan 24, 2012 2:04 UTC (Tue) by dgc (subscriber, #6611) [Link]

xfs_fsr doesn't change the inode number. It copies the data to another temporary file and if the source file hasn't changed once the copy is complete, it atomically swaps the extents between the two inodes via a special transaction. It uses invisible IO, so not even the timestamps on the inode being defragged get changed.
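
For reference, typical xfs_fsr usage is simple; the paths here are just examples:

xfs_fsr -v /mnt/data/bigfile    # defragment a single file in place
xfs_fsr -v /mnt/data            # or reorganise a whole mounted XFS filesystem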

As to data only shrink, that makes no sense because metadata like directories will pin blocks high up in the filesystem, and so you won't be able to shrink it anyway....

XFS: the filesystem of the future?

Posted Jan 24, 2012 8:13 UTC (Tue) by tialaramex (subscriber, #21167) [Link]

OK, so XFS doesn't support full defrag, it can't move metadata to improve performance - but it does have a data-only defrag which will be enough for some people.

XFS: the filesystem of the future?

Posted Jan 26, 2012 4:58 UTC (Thu) by sandeen (subscriber, #42852) [Link]

The other thing to consider about shrinking is that without a LOT of work, it will almost certainly give you a "best fit" into your new space, not an optimal layout. I've seen extN filesystems that have gone through a lot of shrink/grow/shrink/grow and the result is quite a mess, allocation wise. That's not really even a dig at ext4; if you are constantly rescrambling any filesystem like that, you're going to stray from any optimal allocations you may have had before you started...

XFS: the filesystem of the future?

Posted Jan 24, 2012 2:06 UTC (Tue) by wdaniels (guest, #80192) [Link]

Hi Dave,

I would not want to argue that shrinking support is actually needed, and that was not my intention, though fstrim is certainly new to me and I thank you for that pointer. Let me explain my XFS problem more precisely, in case you would like to understand the kinds of things that happen to people and discourage them from moving away from the familiar:

I was provisioning a new 8TB server (4 x 2TB physical disks) that was to serve as a storage area for a number of different systems. Some peculiarities of the applications meant that separate partitions were desirable. Overall, I needed a number of small (~50-250GB) volumes and to be able to utilise the remaining space for VM images. Since I wasn't sure about how many of the smaller partitions I needed, I used LVM to create one 8TB PV with LVs allocated upon that to suit.

When it came to creating the largest logical volume for all the remaining free space, I decided to try XFS because I had heard it was more efficient with large files and I was slightly worried about the performance impact of LVM (it was my first time playing with LVM also).

So I provisioned around 6TB that remained for the large XFS partition and soon filled it up. Then came the half-expected requirement to add another couple of smaller partitions. No problem I thought, LVM to the rescue. Or it would have been if I could have shrunk the XFS filesystem to truncate the logical volume!

I may well misunderstand the subtleties of XFS block allocations over LVM's physical extent mappings, sparse file allocation and the like for thin-provisioning (I often do misunderstand such things) but I don't think fstrim would have helped me there even had I known about it at the time.

I only had a 100Mbps NIC on the server so it took quite some time (days if I recall) to backup all that data, recreate the filesystem as ext4 and copy it all back.

It may well be that my use case was highly unusual, my research insufficient, my knowledge limited and/or my strategy idiotic. I'm a programmer first and a reluctant sysadmin. But this is not at all unusual outside of groups of experienced experts such as you'd find at LWN.

One reason for my posting about the shrinking issue was that I hadn't seen it mentioned yet (sorry, I did not watch the video of the full talk), but really my point was that I ended up causing myself a great deal of trouble, which only reinforced for me the wisdom of the tech dinosaurs I have worked with in my time: you should avoid deviating from what you know and trust without sufficiently good reason.

There are many who take this view, at least enough that it seems improbable to me that you will succeed in convincing the partly-informed, risk-averse and time-constrained majority to displace ext4 for XFS as a default choice in their minds.

I think it's great that you are taking the time to promote the benefits of XFS and to keep improving on it. I read this article precisely because I was made aware through my previous screw-ups that I need to invest more time learning about different filesystems, but I generally feel better knowing about where I'm likely to face problems as much as what I have to gain. And for that reason I still find it useful when people pick up on minor detractions, even if they seem unimportant in the grand scheme of things.

Hope you understand!
-Will

XFS: the filesystem of the future?

Posted Jan 24, 2012 4:49 UTC (Tue) by raven667 (subscriber, #5198) [Link]

Here's some advice. Shrinking a filesystem, any filesystem, is generally very problematic and often not supported, because it is a very complicated operation, as detailed elsewhere in the comments. My policy is to use LVM to size the filesystem for your immediate needs plus a small amount of breathing room, and extend as necessary; good, modern filesystems can safely extend while mounted. With that policy you would never have the situation you describe: you would either add space where it is needed, or there just physically isn't enough space and you need to buy more. Those are the only options, and both are easy to support.
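
As a rough sketch of that policy in practice (the volume group, LV and mount point names are invented):

lvextend -L +50G /dev/vg0/data    # grow the logical volume when space runs low
xfs_growfs /srv/data              # then grow the mounted XFS filesystem to match
# (resize2fs does the same job for a mounted ext4 filesystem)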

XFS: the filesystem of the future?

Posted Jan 27, 2012 0:59 UTC (Fri) by dgc (subscriber, #6611) [Link]

Hi Will,

Your problem is a poster-child case for why you should use thin provisioning. Make the filesystems as large as you want, and let the actual usage of the filesystems determine where the space is used. When you then realise that the 6TB XFS volume was too large, remove stuff from it and run fstrim on it to release the free space back to the thinp pool, where it is then available to be used by the other filesystems that need space. No need to shrink at all, and you have an extremely flexible solution across all your volumes and filesystems.

And if you want to limit an XFS filesystem to a specific, lesser amount of space than the entire size it was made with (after releasing all the free space), you could simply apply a directory tree quota to /...
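
To make that concrete, a rough sketch with invented names and limits, assuming the filesystem is mounted with project quotas enabled:

mount -o prjquota /dev/vg0/data /data
xfs_quota -x -c 'project -s -p /data 42' /data    # tag the whole tree as project 42
xfs_quota -x -c 'limit -p bhard=4t 42' /data      # hard-cap that tree at 4TB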

Dave.

XFS: the filesystem of the future?

Posted Feb 7, 2012 0:22 UTC (Tue) by ArbitraryConstant (guest, #42725) [Link]

> Your problem is a poster-child case for why you should use thin provisioning.

hm... It doesn't seem consistent to talk about how xfs is well suited to inexpensive servers, but then require features not available from inexpensive servers for important functionality.

XFS: the filesystem of the future?

Posted Feb 7, 2012 3:22 UTC (Tue) by dgc (subscriber, #6611) [Link]

> hm... It doesn't seem consistent to talk about how xfs is well
> suited to inexpensive servers, but then require features not
> available from inexpensive servers for important functionality.

Thin provisioning is available to any linux system via the Device Mapper module dm-thinp. You don't need storage hardware that supports this functionality any more - all recent kernels support it.
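
A minimal sketch of what that looks like with the LVM tooling (names and sizes invented, and assuming an lvm2 recent enough to have thin support):

lvcreate -L 200G --thinpool pool0 vg0      # carve a thin pool out of the volume group
lvcreate -V 6T --thin -n data vg0/pool0    # a 6TB thin volume backed by that pool
mkfs.xfs /dev/vg0/data                     # only blocks actually written consume pool space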

Dave.

XFS: the filesystem of the future?

Posted Feb 7, 2012 3:40 UTC (Tue) by ArbitraryConstant (guest, #42725) [Link]

Interesting - that's a very useful feature! Thanks

XFS: the filesystem of the future?

Posted Jan 23, 2012 16:02 UTC (Mon) by Nelson (subscriber, #21712) [Link]

XFS has been incredibly robust and reliable in my experience with it. I won't go into details, but I shipped an appliance with it and also used it on several servers from when it became available until just a couple of years ago. If you lose data you lose data, though, and that's always touchy and tough to get over - not that I can say I ever did.

What stood out to me was that XFS wasn't a first-class filesystem in greater Linuxdom; that's why I switched. The Ext2/3/4 family was, and BTRFS has been chosen to be the next one, should they get it worked out. I remember doing security upgrades and XFS would panic at boot time, and with extended attributes being in different states in different distributions and such, extra security features might or might not work. I also got a pretty overwhelming feeling that JFS is all but dead, and that XFS was getting bug fixes but was, for the most part, "done." Running XFS effectively took you off main street and at some level made more work for you. Not sure how much that is still the case.

XFS: the filesystem of the future?

Posted Jan 23, 2012 17:45 UTC (Mon) by sandeen (subscriber, #42852) [Link]

It's very much NOT the case that xfs is in bugfix-only mode.

Just looking at commit logs, fs/xfs has had 830 commits since 2.6.32; fs/ext4 & fs/jbd2 have had 826. They are both very actively developed.

fs/jfs has had 114, for reference.

Commits are a coarse measure, but it should give you some idea of the activity going on.
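
For anyone who wants to reproduce the numbers, something along these lines against a kernel git tree should be close (the exact range is up to you):

git log --oneline v2.6.32..HEAD -- fs/xfs | wc -l
git log --oneline v2.6.32..HEAD -- fs/ext4 fs/jbd2 | wc -l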

XFS has never been as "main street" as extN, but that's a bit of a circular argument; fewer people use xfs because it's less familiar because fewer people use xfs because ...

XFS2?

Posted Jan 23, 2012 19:07 UTC (Mon) by martinfick (subscriber, #4455) [Link]

> That implies an on-disk format change. The plan, according to Dave, is to not provide any sort of forward or backward format compatibility; the format change will be a true flag day. This is being done to allow complete freedom in designing a new format that will serve XFS users for a long time.

Why not take a leaf out of the EXT book, and simply fork the filesystem at this point and call it XFS2?

XFS2?

Posted Jan 23, 2012 21:16 UTC (Mon) by dgc (subscriber, #6611) [Link]

No fork is needed because 99% of the code that does all the work will be common to both formats. The majority of change is in the routines that read and write the disk format, and that's a very small amount of code that is mostly isolated.

Dave.

XFS2?

Posted Jan 24, 2012 4:36 UTC (Tue) by raven667 (subscriber, #5198) [Link]

There is a lot of code sharing between the ext* filesystems as well, and even compatibility of disk structures, but they are still named differently.

XFS2?

Posted Jan 24, 2012 10:18 UTC (Tue) by dgm (subscriber, #49227) [Link]

Also, it will make life easier for everybody, especially kernel developers trying to bisect a kernel (look at what happened when btrfs changed disk format).

You can always deprecate XFS1 format in the future if you wish.

XFS2?

Posted Jan 26, 2012 4:59 UTC (Thu) by sandeen (subscriber, #42852) [Link]

> You can always deprecate XFS1 format in the future if you wish.

Just like we successfully deprecated ext2 and ext3 - right? ;)

There is no reason to fork XFS, IMHO.

Get your facts straight

Posted Jan 26, 2012 8:27 UTC (Thu) by khim (subscriber, #9252) [Link]

Well, the original extfs was eventually deprecated, and we had no way to deprecate ext2 and/or ext3 until ext4 was mature enough. In fact, the ext2/ext3 switch shows perfectly why such changes are better done that way: when you introduce a disk format change, you usually do so to provide some new features, and if people don't need those features they can continue to use the old format and the old, stable codebase.

Get your facts straight

Posted Jan 27, 2012 1:07 UTC (Fri) by dgc (subscriber, #6611) [Link]

> In fact, the ext2/ext3 switch shows perfectly why such changes are better
> done that way: when you introduce a disk format change, you usually do so
> to provide some new features, and if people don't need those features they
> can continue to use the old format and the old, stable codebase.

Actually, the ext2/3/4 splits show exactly why this model doesn't work. Instead of having a single code base to maintain, you have independent code bases that have to be maintained and fixes ported across all trees. What has really happened is that the "old stable" code bases have become "old stale" code bases as ext4 has moved on.

IOWs, history has already shown that we (developers) are poor at pushing fixes made in the ext4 code base back to the ext3 and ext2 code bases, and that's often because the person that made the fix is completely unaware that the problem also exists in ext2/3. There's a reason that the "use ext4 for ext2/3" config option exists - so that one code base can be used to support all three different filesystem types.

That's a big reason for not forking or renaming XFS just for changing 1% of the code base - ongoing maintenance is far simpler and less burdensome when only one code base is used for all different versions of the filesystem....

Dave.

XFS2?

Posted Jan 23, 2012 21:26 UTC (Mon) by sandeen (subscriber, #42852) [Link]

Note also that this is not the first XFS disk format change; XFS can handle such changes gracefully. But not all changes are compatible in both directions; trying to maintain forward & backwards compatibility across all features is a trail of woe.

However the statement "no forward or backward compatibility" sounds a bit extreme; old filesystems can still be mounted & run by the newer code, the new features just won't be there, and there won't be a facility to "upgrade" an older filesystem.

Newer filesystems with newer features won't, however, be mountable by older kernels. Again, not the first time XFS has gone down this path, no forks required.

Extreme?

Posted Jan 23, 2012 22:08 UTC (Mon) by corbet (editor, #1) [Link]

His slide read "No attempt to provide backwards or forwards compatibility for format changes." The next one read "Flag day!". So it may be extreme, but I don't think the extremeness came from me...:)

Not Extreme

Posted Jan 23, 2012 22:11 UTC (Mon) by sandeen (subscriber, #42852) [Link]

I didn't say it came from you, Jonathan ;)

I just wanted to make sure people knew that old xfs filesystems were not being left in the dust (or, left only to old kernels).

Extreme?

Posted Jan 23, 2012 23:26 UTC (Mon) by dgc (subscriber, #6611) [Link]

I guess I didn't explain that particularly well :/

By "flag day" I mean this is a complete superblock version bump, not a new set of feature bits that some subset of the new features can be selected at mkfs time. IOWs, the on-disk format changes are an all-or-nothing change that will all land in a single release, so the functionality is either supported or it isn't....

By "No attempt to provide backwards or forwards compatibility for format changes" I mean that there is no attempt to:

- constrain the format change so that older kernels can still read the metadata, even in read-only mode (no backwards compatibility)
- locate (and potentially limit) the new metadata fields so that an old format can be upgraded in place just via an on-the-fly RMW cycle (no forwards compatibility)

Hence we can simply modify the metadata block headers or records to add the new information rather than have to try to find random unused holes or padding in the metadata to locate everything. And because all XFS metadata has magic numbers in it, we can easily tell the difference between metadata formats by changing the magic numbers - that's another reason why self-describing metadata is important...

FWIW, the "find and use holes/padding" is the approach ext4 is taking to adding CRCs, so in some places they are having to use a weaker CRC (CRC16 instead of CRC32c) because there is only 16 bits of unused space in the metadata block. This is being done because the ext4 developers want to support offline addition of CRCs to a filesystem.

This, however, is exactly the sort of "compromised on-disk format" we want to avoid because it makes the implementation more complex and it doesn't provide all the information we need to detect errors that we know can happen and need to protect against. That's not an acceptable trade-off for the storage capacities we expect to be supporting in 5 years time....

I hope it's a bit clearer what I meant now ;)

Dave.

Extreme?

Posted Jan 23, 2012 23:39 UTC (Mon) by dlang (guest, #313) [Link]

so restating what I think you are saying

old kernels will not be able to use the new format (even in read-only mode)

new kernels will be able to use both the old and new format

once a filesystem is converted, there will be no way to 'unconvert' it.

This sounds reasonable (although may still be reason enough to call it XFS2 even if it's the same codebase supporting both on-disk formats)

what you initially said sounded more like

old kernels will not be able to use the new format

new kernels will not be able to use the old format

when you upgrade the kernel you will be forced to convert the filesystem, and there is no way to unconvert it.

This is not acceptable and is the reason people were concerned.

Extreme?

Posted Jan 24, 2012 0:18 UTC (Tue) by dgc (subscriber, #6611) [Link]

> old kernels will not be able to use the new format (even in read-only mode)

Correct.

> new kernels will be able to use both the old and new format

Correct.

> once a filesystem is converted, there will be no way to 'unconvert' it.

There is no "convert" or "unconvert" operation - the format is selected at mkfs time and it is fixed for the life of the filesystem. That's an explicit design decision (as I've previously described)....

> This sounds reasonable (although may still be reason enough to call it
> XFS2 even if it's the same codebase supporting both on-disk formats)

A name change would only cause confusion. XFS has a history of mkfs-only selectable format changes over time and this change is being handled in exactly the same manner as all the previous ones that have been made. If XFS was renamed every time the on-disk format changed then it would be XFS 17 by now. ;)

Dave.

XFS: the filesystem of the future?

Posted Jan 25, 2012 18:00 UTC (Wed) by sbergman27 (guest, #10767) [Link]

The video is available on Youtube. I would suggest watching it. Then watching it with the sound turned down, looking at the graphs and asking "Which filesystem looks best for *my* workloads?", at each step of the way. Even with all the new improvements to XFS, and despite the common disparagement of EXT4 as being old fashioned, EXT4 still beats XFS handily in Chinner's own graphs for 1, 2, and 4 threads. XFS's domain begins at 8 threads. Why do we need EXT4? The question is almost absurd. We need it because it's still the best fs out there for most machines today and for some finite time into the future.

Sometimes I wish that certain people could get their heads out of their Fortune 100 Datacenters for a bit and take a look at how the rest of us live.

For balance, I'd recommend Avi Miller's "Best of #1" talk "I can't believe this is Butter", about btrfs. One of his messages is "use the right fs for the job". Far less defensive, dismissive, and confrontational than Dave Chinner's talk. And perhaps a bit more pertinent to those of us out here managing a few Dell T310's without fancy petabyte RAID arrays.

XFS: the filesystem of the future?

Posted Jan 26, 2012 4:12 UTC (Thu) by dgc (subscriber, #6611) [Link]

> Then watching it with the sound turned down

That's just absurd. When you strip something of all context, you can make it mean anything you want regardless of what the original message was. I was expecting to be quoted badly out of context, but that just takes the cake....

> And perhaps a bit more pertinent to those of us out here managing a
> few Dell T310's without fancy petabyte RAID arrays.

That's *exactly* the point of my talk - to smash this silly stereotype that XFS is only for massive, expensive servers and storage arrays. It is simply not true - there are more consumer NAS devices running XFS in the world than there are servers running XFS. Not to mention DVRs, or the fact that even TVs these days run XFS.

But to address your real world concern, all those benchmarks were run on a low-spec Dell R510 that cost a bit over $AU10,000 _two years ago_. You find them in datacenters everywhere. Hence those results are completely relevant to the low end server users that only have a handful of disks. The graphs clearly show that ext4 cannot scale to the capabilities of even low end servers and storage, let alone the massive servers where XFS dominates.

What you get in a $10000 server these days is the equivalent of a half-million-dollar supercomputer from five years ago. XFS worked (and still works) better than anything else on those 5 year old supercomputers. The talk showed that XFS also works better than anything else on new, low end hardware that, just coincidentally, has capability equivalent to that five year old supercomputer....

The point is I then extrapolated from there - if ext4 can't handle current low end hardware, then what about the sort of hardware that will be available in 5 years time? There is no coherent plan to achieve this for ext4, while for XFS we are already running on supercomputers that will be the low end of the server market in 5 years time.

The difference is that the XFS developers tend to look years ahead to what we'll need in 5 years time to continue to be relevant to our users. Much of what I talked about I documented in mid-2008 on the XFS wiki. At the time I knew there was 5-10 years' worth of work in everything I documented, but that it would be necessary to implement it in that same time frame.

In contrast, the ext4 guys are still trying to play catch-up with what we could do more than 10 years ago and are now trying to work around architectural deficiencies to keep up. Do we really need to keep throwing good money after bad trying to re-invent the wheel with ext4, or should we just cut our losses and move on?

That's the question I'm asking, and to just look at a few of the graphs and ignore the rest of the context of the talk is doing everyone a disservice. We need to talk about issues, not perpetuate myths and stereotypes that are long out of date....

Dave.

XFS: the filesystem of the future?

Posted Jan 26, 2012 15:51 UTC (Thu) by SLi (subscriber, #53131) [Link]

I doubt that 5 years from now the norm for *desktops* will be 12 disks on a RAID and 8 threads writing on it.

I think <8 threads writing concurrently is going to be the normal case for desktop systems for quite a while, and I'm not sure any amount of hand-waving is going to dismiss XFS being 50% slower compared to ext4 in your benchmarks in the case that is important to desktop users.

No, 3 seconds compared to 2 seconds doesn't sound so bad, but I'd expect that to also mean that the operations that now take 2 minutes on my ext4 will take 3 minutes on XFS. A 50% difference certainly means it is appreciably slower.

Simply put, I don't think scalability in the way it is exercised in your tests is relevant to desktop users currently or in the near future. More than 8 threads may be imaginable in the future, but 12 disks is not. At least currently I care way more about a single-threaded workload being much slower on XFS (per your benchmarks). It *is* the absolute numbers that matter, after all.

XFS: the filesystem of the future?

Posted Jan 26, 2012 20:33 UTC (Thu) by dlang (guest, #313) [Link]

today most SSD drives internally act as if they are a RAID set, so saying that in 5 years RAID will be a common desktop/laptop feature is actually a fair statement.

as for multiple threads writing to it, that's more common than you think as well (media player streaming, desktop environment saving state, word processor saving state, spreadsheet saving state, ....)

XFS: the filesystem of the future?

Posted Jan 28, 2012 2:48 UTC (Sat) by sbergman27 (guest, #10767) [Link]

"today most SSD drives internally act as if they are a raid set"

Maybe. But SSD drives act, externally, as 1 drive. A single drive with excellent seek time and a sometimes very good data rate. If the drive can do parallelization of a single stream of data that means the FS and block i/o layers don't have to be so fancy. +1 for EXT4.

Regarding multiple threads writing... do you really think media player streaming and word processor state saving are going to require a half gig a second via 5+ threads for extended periods in the near future? If so, I think I'll go dust off my TV and typewriter. +1 for Magnavox and Royal.

XFS: the filesystem of the future?

Posted Jan 27, 2012 0:50 UTC (Fri) by dgc (subscriber, #6611) [Link]

> I doubt that 5 years from now the norm for *desktops* will be 12 disks
> on a RAID and 8 threads writing on it.

That device used RAID to get the capacity over 16TB, and 512MB of BBWC to keep the IOPS and bandwidth high to/from spinning disk storage. It sustained up to 10,000 IOPS and 600MB/s on some of those workloads. That's the only reason RAID was used.

To put that IO capability in context, my desktop already has 8 processor threads and a pair of SSDs that sustain 70,000 random write IOPS and 500MB/s. That cost me about $1500 a couple of years ago. It has more capability from the IO perspective than the hardware that I ran the tests on (except for capacity). And I do 8-way kernel builds on it, so I'm already writing with 8 threads to the filesystem on my desktop machine.

IOWs, the test system and tests I described are relevant not only to people with current low end servers but also relevant to those with current medium-to-high-end desktops and laptops. As I've said previously, that was the whole point of the exercise - to show how capable cheap, low end storage is these days and how only XFS can make full use of it....

> No, 3 seconds compared to 2 seconds doesn't sound so bad, but I'd expect
> that to also mean that the operations that now take 2 minutes on my
> ext4 will take 3 minutes on XFS. A 50% difference certainly means it is
> appreciably slower.

You simply can't make that sort of generic extrapolation from one workload to all workloads. Because the main tests I ran reflected one -specific- aspect of metadata workloads (creates in isolation, read in isolation, unlink in isolation) you can use those numbers to get an idea of how a workload that is heavy in those behaviours might perform and scale. But you cannot say that all workloads will end up with a specific performance differential just because one workload had that differential...

Dave.

XFS: the filesystem of the future?

Posted Jan 28, 2012 2:32 UTC (Sat) by sbergman27 (guest, #10767) [Link]

"That's just absurd. When you strip something of all context, you can make it mean anything you want regardless of what the original message was."

I suggested watching the video with the sound up first. And then turned down to remove the salesman's pitch and allow one to concentrate on the actual data being presented. Nothing absurd about that.

I administer point-of-sale and accounting servers that typically handle ~100 users or registers, and also servers which handle ~80 NX desktop users. Not one of them could benefit from >4 thread write performance. More than anything, I could use improved fs performance for 1 or 2 threads.

Mine are not the sorts of customers who pay Red Hat the big bucks. But customers *like mine* outnumber Red Hat's customers.

I don't really care how XFS might perform on hardware that will be available in 5 years time. I care what makes sense for our current hardware. In 5 years, we'll decide what FS is suitable for our new hardware at that time. And as always, I will consider XFS along with the other options. Just because it has never been quite suitable before does not mean that it can't be next time.

-Steve

XFS: the filesystem of the future?

Posted Jan 28, 2012 17:01 UTC (Sat) by sbergman27 (guest, #10767) [Link]

"there are more consumer NAS devices running XFS in the world than there are servers running XFS"

BTW, can you point me to the reasons for choosing XFS in such an environment? The kernel module for XFS is huge: 1.7MB on the SL 6.1 machine in front of me. By "consumer" are you referring to the sub-$200 stuff that can be bought at, say, BEST BUY? I would be very interested in the rationale. I have a few Buffalo router/NAS devices. 32MB flash/64MB ram. USB 2.0. What would you say the advantages of running XFS rather than EXT4 might be on this device?

-Steve

XFS: the filesystem of the future?

Posted Feb 6, 2012 1:05 UTC (Mon) by dgc (subscriber, #6611) [Link]

> BTW, can you point me to the reasons for choosing XFS in such an
> environment?

XFS is very reliable and performs extremely consistently over the expected life of such devices? And XFS tends to perform better on NAS workloads than ext3/4 on the same hardware? And mkfs doesn't make the user wait for a *long time* before they can use the device? And there are tools in xfsprogs designed to optimise the manufacturing process (e.g. xfs_copy), and so on?
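
To give a concrete example of that last point: xfs_copy can image one source filesystem onto several target devices in a single pass, which is handy on a manufacturing line (the device names here are invented):

xfs_copy /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1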

> By "consumer" are you referring to the sub-$200 stuff that can be
> bought at, say, BEST BUY?

Yes, exactly. Brands like thecus, netgear, dlink, etc either use XFS by default or recommend it over ext3/4 in most situations.

> I have a few Buffalo router/NAS devices.

I'm pretty sure they use XFS, too. Certainly google indicates that even the low end buffalo NAS boxes go faster with XFS on them....

Dave.

XFS: the filesystem of the future?

Posted Jan 26, 2012 9:27 UTC (Thu) by razb (guest, #43424) [Link]

NFSv4 ? Why is there a need for the underlying file system to know about nfs ?

XFS: the filesystem of the future?

Posted Jan 26, 2012 10:53 UTC (Thu) by mpr22 (subscriber, #60784) [Link]

Sometimes a software stack operates more smoothly and effectively if its designers allow themselves to commit carefully selected layering violations.

XFS: the filesystem of the future?

Posted Jan 26, 2012 18:41 UTC (Thu) by nix (subscriber, #2304) [Link]

Because NFS is stateless, and does everything via cookies, including its readdir() operation, the filesystem has to be able to implement seekdir()/telldir() properly. NFS is pretty much the only thing that actually needs this: pretty much nobody uses seekdir() or telldir() themselves in real code. (There are other similar strangenesses hidden away in there as well, but this is the most obvious.)

XFS: the filesystem of the future?

Posted Jan 27, 2012 1:18 UTC (Fri) by dgc (subscriber, #6611) [Link]

> Because NFS is stateless

Right and wrong.

The NFSv3 server is stateless, but the NFSv4 server is stateful, to support functionality like file delegations to clients. Such functionality may require storing per-file state, e.g. so that the server can revoke a file delegation from a client when another client wants to access it.

Dave.

XFS: the filesystem of the future?

Posted Jan 27, 2012 22:01 UTC (Fri) by nix (subscriber, #2304) [Link]

True. I'd forgotten all about NFSv4, and you're right that statefulness actually imposes *more* requirements on the underlying fs than statelessness does, because that state has to get stored somewhere, and it has to survive server reboots...

XFS: the filesystem of the future?

Posted Jan 27, 2012 1:12 UTC (Fri) by dgc (subscriber, #6611) [Link]

> NFSv4 ? Why is there a need for the underlying file system to know about
> nfs ?

Because NFSv4 requires certain functionality from the underlying filesystem to be able to detect that a file has changed. This has to be valid across NFS server restarts (e.g. power off/on), so it needs to be stored in stable storage, i.e. in the underlying filesystem. ext4 already has this functionality, as its on-disk format was changed at about the same time the NFSv4 functionality was introduced. Now that we are doing a major XFS format change, we can add this functionality as well.

Dave.

XFS: the filesystem of the future?

Posted Jan 27, 2012 14:38 UTC (Fri) by bfields (subscriber, #19510) [Link]

I doubt ext4's implementation is actually completely right, by the way; we need to take another look at it.

XFS: the filesystem of the future?

Posted Jan 29, 2012 0:58 UTC (Sun) by sbergman27 (guest, #10767) [Link]

No implementation is ever perfect. Do you have some specific reason for thinking the Ext4 implementation is bad? If so, what is it? Or was this just an opportunity to disparage the competition?

XFS: the filesystem of the future?

Posted Jan 29, 2012 2:32 UTC (Sun) by jrn (subscriber, #64214) [Link]

Check who you're replying to?

XFS: the filesystem of the future?

Posted Jan 29, 2012 4:15 UTC (Sun) by sbergman27 (guest, #10767) [Link]

J. Bruce Fields of Red Hat, if I'm not mistaken.

XFS: the filesystem of the future?

Posted Jan 29, 2012 10:49 UTC (Sun) by nix (subscriber, #2304) [Link]

Or, perhaps, he just has an uneasy feeling about it because it's changed a lot without a good look over. No need for conspiracy theories. Uneasy feelings can be surprisingly reliable when you're familiar enough with a codebase.

XFS: the filesystem of the future?

Posted Jan 29, 2012 23:14 UTC (Sun) by sbergman27 (guest, #10767) [Link]

"Or, perhaps, he just has an uneasy feeling about it because it's changed a lot without a good look over."

Perhaps. I've spent enough time refuting baseless conspiracy theories that I am not about to promote one. I fear that I let myself become a bit of an Ext4 cheerleader, there.

Not intended. As I'm keenly interested in Ext4, Btrfs, and XFS. Clearly, all three of those have parts to play in the coming years. That Ext4 happens to be at center stage now should not detract from the fact that the trio is a team.

That's the message which I would like to convey.

Odd iozone results

Posted Jan 28, 2012 19:30 UTC (Sat) by sbergman27 (guest, #10767) [Link]

I decided to take Dave's claim of XFS being good for small systems seriously. The small system I have handy today is my Scientific Linux 6.1 desktop machine. Q660 Quad core processor. 4GB ram. A single 1.5TB 7200rpm seagate SATA disk. Scientific Linux 6.1 uses the latest RHEL 6.2 kernel (2.6.32-220.4.1.el6.x86_64), which Dave says has all the patches he covered in his talk.

I put it into runlevel 3 and disabled all unnecessary services. Set up a 25GB logical volume formatted XFS. Mounted with Dave's suggested options: inode64,logbsize=262144

I ran:

iozone -i0 -i1 -i2 -i4 -i8 -l 4 -u 4 -F 1 2 3 4 -s 2g

Which is 4 threads each working with a 2GB file, for a total of 8GB of data.

I got the following throughput numbers:

59794 KB/s Write
60376 KB/s Rewrite
64557 KB/s Read
66706 KB/s Reread
----- Mixed Workload

I stopped the mixed workload test after 45 minutes. vmstat was reporting about 700k/s in, 2k/s out. The drive light was on solid. Iozone was using only about 2% of the processor. This seemed quite puzzling. Tried again. Same result. Switching from CFQ to Deadline or Noop i/o scheduler made no difference. No error messages in the log. And I was on a text console, so would have seen any kernel messages immediately.

I reformatted to ext4 and mounted with default options. I got these results:

68559 KB/s Write
60561 KB/s Rewrite
67697 KB/s Read
69353 KB/s Reread
----- Mixed workload

Better performance than XFS in the other tests. But same problem with Mixed workload.

Obviously, the drive was getting seeked to death. (What else *could* it be?)

So I ran:

iozone -i0 -i8 -l 1 -u 1 -F 1 -s 8g

which runs a single thread on a single 8GB file. Same problem.

I used this same iozone on a Scientific Linux 6.1 server running 3 drive software RAID1 just a few days ago and got really nice numbers for all tests, including mixed workload.

What in the world is going on here???

-Steve

Odd iozone results

Posted Jan 30, 2012 0:15 UTC (Mon) by dgc (subscriber, #6611) [Link]

Hi Steve,

> I decided to take Dave's claim of XFS being good for small systems
> seriously.
....

> I ran:
>
> iozone -i0 -i1 -i2 -i4 -i8 -l 4 -u 4 -F 1 2 3 4 -s 2g

That is a data IO only workload - it has no metadata overhead at all. You're comparing apples to oranges considering the talk was all about metadata performance....

> I stopped the mixed workload test after 45 minutes.
.....
> What in the world is going on here???

The mixed workload does random 4KB write IO - it even tells you that when it starts. Your workload doesn't fit in memory, so it will run at disk speed. 700KB/s is 175 4k IOs per second, which is about 6ms per IO. That's close to the average seek time of a typical 7200rpm SATA drive. Given that your workload is writing 8GB of data at that speed, it will take around 3 hours to run to completion.
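
The arithmetic, for anyone who wants to check it (numbers taken straight from the vmstat output above):

echo "scale=1; 700/4" | bc                  # ~175 random 4k IOs per second
echo "scale=1; 1000/175" | bc               # ~5.7ms per IO
echo "scale=1; 8*1024*1024/700/3600" | bc   # ~3.3 hours to push 8GB at 700KB/s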

IOWs, there is nothing wrong with your system - it's writing data in exactly the way you asked it to do, albeit slowly. This is not the sort of problem that an experienced storage admin would fail to diagnose....

Dave.

Odd iozone results

Posted Jan 30, 2012 2:48 UTC (Mon) by raven667 (subscriber, #5198) [Link]

> The mixed workload does random 4KB write IO - it even tells you that
> when it starts. Your workload doesn't fit in memory, so it will run at
> disk speed. 700KB/s is 175 4k IOs per second, which is about 6ms per
> IO. That's close to the average seek time of a typical 7200rpm SATA
> drive. Given that your workload is writing 8GB of data at that speed,
> it will take around 3 hours to run to completion.
>
> IOWs, there is nothing wrong with your system - it's writing data in
> exactly the way you asked it to do, albeit slowly. This is not the
> sort of problem that an experienced storage admin would fail to
> diagnose....

In my experience, the average server admin wouldn't be able to characterize or understand this issue. I've seen a lot of magical thinking from admins and developers when it comes to storage - sometimes for networking as well, although the diagnostics tend to be better there (wireshark, tcpdump).

Odd iozone results

Posted Jan 30, 2012 6:10 UTC (Mon) by sbergman27 (guest, #10767) [Link]

Nice jab. But you'll want to see my update to Dave.

Odd iozone results

Posted Jan 30, 2012 23:08 UTC (Mon) by raven667 (subscriber, #5198) [Link]

Sorry, I didn't mean to sound insulting but on re-reading I can see how it could be read that way. I'm not trying to jab to score points or anything, just trying to have an interesting conversation.

Odd iozone results

Posted Jan 30, 2012 22:07 UTC (Mon) by sbergman27 (guest, #10767) [Link]

As per my request to you in the /usr thread, I'm picking up here.

Firstly, for context, please read my most recent post to Dave.

I/O scheduler issues aside, the idea of the problem being, fundamentally, raw seek time is out the window.

Otherwise, how would it perform as well as it does with 16k records rather than 4k records? A 4x improvement might be expected, not 60+ times.

-Steve

Odd iozone results

Posted Jan 30, 2012 23:15 UTC (Mon) by raven667 (subscriber, #5198) [Link]

Am I misreading the numbers? I don't see a 60x difference anywhere; it looks like about 60000-70000KB/s in all cases, with minor differences between xfs and ext4.

Odd iozone results

Posted Jan 31, 2012 3:49 UTC (Tue) by sbergman27 (guest, #10767) [Link]

You're looking at the numbers after I changed the record size to 16k. See the original, running at the default record size of 4k:

http://lwn.net/Articles/477865/

I say 60x because vmstat was reporting 700k to 1000k per second. Real throughput may or may not have been far worse. It looked like it was going to take hours so I stopped it after 45 minutes.

The track to track seek on this drive is ~0.8ms. And /sys/block/sdX/queue/nr_requests is at the default 128. (Changing it to 200,000 doesn't help.) At 128 one would expect one pass over the drive platter surface, read/writing random requests, to take less than 0.1 seconds and read/write 1 MB for a read/write total of ~ 10,000KB/s. This assumes all requests to be on different cylinders.

Clearly, however, Dave is incorrect about Mixed Workload being a random read/write test. There is no way I would be seeing 60,000KB/s throughput with a 16k record size on a dataset 2x the size of ram on such a workload. And in the 4k record case I would expect to see a lot better than I do.

Clearly there is something odd going on. I'm curious what it might be.

-Steve

Odd iozone results

Posted Jan 31, 2012 4:56 UTC (Tue) by raven667 (subscriber, #5198) [Link]

Yes I see that and the numbers in

https://lwn.net/Articles/478014/

But I don't see where you get 60x. The second run might be completing 4x more work per IO, leading to fewer IOs and fewer expensive seeks needed to complete the test, if the total data size it is re/writing is fixed.

The track-to-track seek is 0.8ms, but that is the minimum, for tracks which are adjacent; the latency goes up the farther the seek distance. The average is probably closer to 6ms. Your estimate of a full seek across the disk is also way off; that is probably closer to 15ms or more, and that's if you don't stop and write data 128 times along the way, like stopping at every bathroom on a road trip. That's what we mean when we say that the drive can probably only do about 175 of these a second, which is the limiting factor; the time it actually takes to read/write a track is less (but of course not zero).

I haven't thought about why you are getting the specific numbers you see - I have mostly used sysbench rather than iozone - but the vmstat numbers seem fine for a random IO workload.

Odd iozone results

Posted Jan 31, 2012 5:43 UTC (Tue) by sbergman27 (guest, #10767) [Link]

Look at it this way. You have 128 requests in a sorted queue. On average, each seek is going to be over about 1/128th of the platter surface. About 0.8%. Completely random seeks are going to average a distance of ~50% of the platter surface. The completely random seek time for this drive is about 8ms. The track to track is ~0.8ms according to the manufacturer's specs. Seeking over 0.8% of the platter surface is going to take a lot closer to the track to track seek time than to the completely random seek time.

The elevator algorithm basically takes you from the worst case, totally random 8ms per request time, down closer to the 0.8ms track to track seek time. That's the whole purpose of the elevator algorithm, and it works no matter how large your dataset is in relation to system ram.

Each read/write total on the affected tracks is going to get you 4k of read and 4k of write, for a total of 8k per track. 128 seeks at 0.8ms per seek works out to 0.1 seconds for a pass over the surface. And 128 requests at 8k r/w per request works out to ~10mb/s. Now, I probably should have included an extra 1/120th second to account for an extra rotation of the platter between read and write. It really all depends upon exactly what the random r/w benchmark is doing. If we do that now, we get 5MB/s.

What vmstat is showing is 600k/s - 1000k/s read, and about 2k/s write. You don't think that's odd? And doesn't it seem odd that increasing the record size by 4x increases the throughput by 60+ times? (Again, I stopped the 4k test after 45 minutes and am basing that, optimistically, upon the vmstat numbers, so the reality is uncertain. The 16k record Mixed Workload test takes 2.5 minutes to complete.)

At any rate, the Mixed Workload test is *not* a random read/write benchmark. Somewhere Dave got the idea that it said it was when it started. But it does not. And that does not even make sense in light of the *name* of the benchmark. Unfortunately, the man page for iozone is of little help. It just says it's a "Mixed Workload".

I get similar results when running Mixed Workload with 4k records on the server rather than the desktop machine.

Are you beginning to appreciate the puzzle now? I've tried to make it as clear as possible. If I have not been clear enough on any particular point, please let me know and I will explain further.

-Steve

Odd iozone results

Posted Jan 31, 2012 6:55 UTC (Tue) by jimparis (guest, #38647) [Link]

> The elevator algorithm basically takes you from the worst case, totally random 8ms per request time, down closer to the 0.8ms track to track seek time.

You're forgetting about rotational latency. After arriving at a particular track, you still need the platter to rotate to the angle at which your requested data is actually stored. For a 7200 RPM disk, that's an average of 4ms (half a rotation). I don't think you can expect to do better than that.

> 128 seeks at 0.8ms per seek works out to 0.1 seconds for a pass over the surface. And 128 requests at 8k r/w per request works out to ~10mb/s.

With 4.8ms per seek, that same calculation gives ~1600kb/s. Which is only twice what you were seeing, and it was assuming that you're really hitting that 0.8ms track seek time. I'd really expect that the time for a seek over 0.1% of the disk surface is actually quite a bit higher than the time for seek between adjacent tracks (which is something like ~0.0001% of the disk surface, assuming 1000 KTPI and 1 inch usable space).

Odd iozone results

Posted Jan 31, 2012 7:59 UTC (Tue) by sbergman27 (guest, #10767) [Link]

That's true enough. I had not actually forgotten about it, but when I did the calculation I was off by a factor of 10. (Still getting used to the fact that my new HP50g sometimes drops keystrokes if I enter too rapidly.) It didn't seem worth messing with.

Doing a linear interpolation between 0.8ms to seek over 0.8% of the platter and 8ms to seek over 50% of the platter yields ~0.9ms for the seek. (And then we can add the 4ms for 1/2 rotation.) Whether using a linear interpolation is justified here is another matter.

However, let's not forget that there is *no* reason to think that the "Mixed Workload" phase is pure random read/write. The whole random seek thing is a separate side-question WRT the actual benchmark numbers I get. If the numbers *are* purely random access time for the drive, why do 4k records get me such incredibly dismal results, while 16k records get me near the peak sequential read/write rate of the drive? One would expect the random seek read/write rate to scale linearly with the record size.

BTW, thanks for the post. This has been a very interesting exercise.

-Steve

Odd iozone results

Posted Feb 2, 2012 20:52 UTC (Thu) by dgc (subscriber, #6611) [Link]

> However, let's not forget that there is *no* reason to think that the
> "Mixed Workload" phase is pure random read/write

You should run strace on it and see what the test is really doing. That's what I did, and it's definitely doing random writes followed by random reads at the syscall level....
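
Something along these lines will show it; the trace file path is arbitrary, and use whichever iozone invocation you are curious about:

strace -f -o /tmp/iozone.trace iozone -i0 -i8 -l 1 -u 1 -F 1 -s 8g
less /tmp/iozone.trace    # look at the offsets passed to the read/write calls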

Dave.

Odd iozone results

Posted Jan 31, 2012 7:36 UTC (Tue) by raven667 (subscriber, #5198) [Link]

Maybe I missed it but I think thats the first time I have seen mention of the runtime for the 16kb workload 2.5m is smaller than 3h or whatever the estimate was for the previous test. Tat would be interesting to characterize but It might also be an error in testing. I notice that Jim has responded and pointed out hat some of your estimates of seek time are orders of magnitude high which is causing some of the misunderstanding.

Odd iozone results

Posted Jan 31, 2012 7:40 UTC (Tue) by raven667 (subscriber, #5198) [Link]

Gah! Spelling errors from writing late on a touchscreen, don't judge too harshly 8-)

Odd iozone results

Posted Jan 31, 2012 8:30 UTC (Tue) by sbergman27 (guest, #10767) [Link]

Let he who has never made a spelling blunder cast the first stone. Earlier, I was posting from the default Firefox in SL 6.1. Expecting it to catch my speeling errors inline. I made several posts before realizing that the damned thing was completely nonfunctional, and was too mortified to go back and look.

I'm in Chrome now. And it caught "speeling" quite handily.

RickMoen mentioned something about a "sixty second window" for LWN.net posts. I've never noticed any sort of window at all. As soon as I hit "Publish comment" my mistakes are frozen for all eternity.

Sometimes it's a comfort to know that no one really cares about my posts. ;-)

-Steve

Odd iozone results

Posted Jan 31, 2012 16:28 UTC (Tue) by raven667 (subscriber, #5198) [Link]

I think that was a different thing, IIUC Rick just posted something incorrect and someone called him on it 60s before he replied to his own post with the correction and he seemed a little embarrassed and miffed.

Odd iozone results

Posted Jan 31, 2012 8:19 UTC (Tue) by sbergman27 (guest, #10767) [Link]

We're probably focusing a bit too much on random read/write times when there is no evidence or rationale for thinking that the Mixed Workload phase is random read/write. Dave threw that in for reasons which are unclear. There is no evidence from the iozone output, or from its man page, to support the assertion.

If random i/o were the major factor in Mixed Workload, why does moving from a 4k block size take me from such a dismal data rate to nearly the peak sequential read/write data rate for the drive? It makes no sense.

That's the significant question. Why does increasing the record size by a factor of 4 result in such a dramatic increase in throughput?

And BTW, I have not yet explicitly congratulated the XFS team on fact that XFS is now nearly as fast as EXT4 on small system hardware. At least in this particular benchmark. So I will offer my congratulations now.

-Steve

Odd iozone results

Posted Jan 31, 2012 10:59 UTC (Tue) by dlang (guest, #313) [Link]

one thing to remember, in the general case writes can be cached, but the application pauses for reads.

If you do a sequential read test, readahead comes to your rescue, but if you are doing a mixed test, especially where the working set exceeds ram, you can end up with the application stalling waiting for a read and thus introducing additional delays between disk actions.

I don't know exactly what iozone is doing for it's mixed test, but this sort of drastic slowdown is not uncommon.

Also, if you are using a journaling filesystem, there are additional delays as the journal is updated (each write that touches the journal turns in to at least two writes, potentially with an expensive seek between them)

I would suggest running the same test on ext2 (or ext4 with the journal disabled) and see what you get.

Odd iozone results

Posted Jan 31, 2012 17:55 UTC (Tue) by jimparis (guest, #38647) [Link]

> We're probably focusing a bit too much on random read/write times when there is no evidence or rationale for thinking that the Mixed Workload phase is random read/write. Dave threw that in for reasons which are unclear. There is no evidence from the iozone output, or from its man page, to support the assertion.

From what I can tell by reading the source code, it is a mix where half of the threads are doing random reads, and half of the threads are doing random writes.

Odd iozone results

Posted Jan 30, 2012 6:08 UTC (Mon) by sbergman27 (guest, #10767) [Link]

Hi Dave,

Thank you for the reply.

No. There is something else going on, here. If I add "-r 16384" to use 16k records rather than 4k records, the mixed workload test flies:

XFS:
59985 KB/s Write
60182 KB/s Rewrite
58812 KB/s Mixed Workload

Ext4:
69673 KB/s Write
60372 KB/s Rewrite
60678 KB/s Mixed Workload

The 16k record case is at least 60x faster than the 4k record case. Probably more. I find this to be very odd.

You had already covered the metadata case. I didn't want to replicate that. So the fact that this is not metadata intensive was intentional. This is closer XFS's traditional stomping grounds, but with small system hardware rather than large RAID arrays.

"""
The mixed workload does random 4KB write IO - it even tells you that when it starts.
"""

Check again. It doesn't say that at all. Some confusion may have been caused by my cutting and pasting the wrong command line into my post. You can safely ignore the "-i 4" which I believe *does* do random read/writes.

I should also mention that what prompted me to try 16k records was reviewing my aforementioned server benchmark, where it turns out I had specified 16k records.

Rerunning the benchmark on the 8GB ram server with 3 drive RAID1 and with a 16GB dataset, I see similar "seek death" behavior.

-Steve

Odd iozone results

Posted Jan 30, 2012 6:49 UTC (Mon) by sbergman27 (guest, #10767) [Link]

Addendum to my previous post: All the benchmarks I have mentioned use a dataset size that is 2x ram size. And the benchmark I reran on the server is identical to the original, except for the use of 4k records rather than 16k records. That may not have been clear in my previous post.

XFS: the filesystem of the future?

Posted Feb 8, 2012 2:23 UTC (Wed) by koguma (guest, #82796) [Link]

XFS is an absolute win for DB servers. I don't know what I'd do without LVM+XFS. If you're using any of the ext*'s for a db server, good luck with that.

For further reading: http://www.mysqlperformanceblog.com/2011/06/09/aligning-i...

XFS+LVM pretty much blows away anything else out there. If you're in the cloud, LVM+XFS is the only way to safely use EBS volumes.

-Kogs

XFS: the filesystem of the future?

Posted Mar 1, 2012 15:35 UTC (Thu) by XTF (guest, #83255) [Link]

Could any of the fsync advocates post real code that does the atomic variant of open, write, close?

Hint: it's not possible without tons of regressions.

Linux devs should really provide a proper solution (like O_ATOMIC) instead of blaming app devs for not doing the impossible.

XFS: the filesystem of the future?

Posted Mar 2, 2012 21:29 UTC (Fri) by dlang (guest, #313) [Link]

>Could any of the fsync advocates post real code that does the atomic variant of open, write, close?
>
> Hint: it's not possible without tons of regressions.

depending on how much you want to write, and what your definition of atomic is, what you are asking for may not be possible.

it also depends on what you mean about 'tons of regressions'

but what you can do is

open temp file
write to temp file
close temp file
fsync temp file (may require fsync of directory)
mv temp file to name of real file
if you want to be sure the change has taken place, fsync directory.

if you are doing this in a shell script, you have to do 'sync' instead of fsync (on ext3, the result is the same, on other filesystems fsync is significantly faster)

no, this isn't "real code", but translating it into your language of choice is not that hard.

There are a lot of programs out there that do this right today. Most mail, nntp, and database apps do this right because their users are unwilling to loose data, and they need to work across different flavors of Unix.

for the longer answer, lookup the lwn.net article on safely saving data from a few months back

XFS: the filesystem of the future?

Posted Mar 2, 2012 22:18 UTC (Fri) by mathstuf (subscriber, #69389) [Link]

The problem is that if the target is a symlink or a hardlink, it gets clobbered by this process and the link is destroyed. Applications that I use that do this are harder to use with my dotfiles system because of it (finch, gnupg, and others as well). What needs to happen is that the target file has readlink() done to its path to get the *actual* file that is wanted before this process starts.

XFS: the filesystem of the future?

Posted Mar 2, 2012 23:33 UTC (Fri) by dlang (guest, #313) [Link]

doing an open, write, close on an existing file that is a hardlink may not be able to be atomic (depending on the size of the write, the hardware, etc) in any case.

so I think that your requirement results in error O_PONY

While something along the lines of what you are talking about may be able to be made to work in some more limited cases, arguing that it's a requirement because the temp-file approach doesn't work in all cases is a bad argument.

In fact, in the cases where the file is a symlink or hardlink, it's questionable as to what the 'right' think to do is.

In some cases you should follow the link and replace the 'master' file, but in other cases you should not. arguably, breaking the links is the safer thing to do (the system doesn't know what the effect of the changes are ot the other things accessing the file), but there's no question that sometimes you wish it did something different.

XFS: the filesystem of the future?

Posted Mar 5, 2012 14:34 UTC (Mon) by XTF (guest, #83255) [Link]

> depending on how much you want to write, and what your definition of atomic is, what you are asking for may not be possible.

Why not?
At the hardware level it's quite simple: write the new data to new blocks, then write the meta-data.

> it also depends on what you mean about 'tons of regressions'

Assuming the target is not a symlink to a different volume
Assuming you are allowed to create the tmp file
Assuming you are allowed to overwrite an existing file having the same name as your tmp file
Assuming it's ok to reset meta-data, like file owner, permissions, acls, creation timestamp, etc.
Assuming the performance regression due to fsync is ok (request was for atomic, not durable)

> fsync temp file (may require fsync of directory)

How does one check that requirement in a portable way?

> if you want to be sure the change has taken place, fsync directory.

> no, this isn't "real code", but translating it into your language of choice is not that hard.

But it's far from trivial either.

> There are a lot of programs out there that do this right today.

But there are even more ones that don't. There's also no tool to detect these ones (AFAIK).

> lookup the lwn.net article on safely saving data from a few months back

It also failed to address any of the assumptions / regressions.

XFS: the filesystem of the future?

Posted Mar 5, 2012 20:34 UTC (Mon) by dlang (guest, #313) [Link]

> At the hardware level it's quite simple: write the new data to new blocks, then write the meta-data.

writing the data to the new blocks may not be atomic.

writing the metadata to disk may not be atomic.

If your writes to disk can't be atomic, how can the entire transaction?

some of the data you write to a file may be visible before the metadata gets changed, which would make the change overall not be atomic.

XFS: the filesystem of the future?

Posted Mar 5, 2012 21:04 UTC (Mon) by nybble41 (subscriber, #55106) [Link]

The only part that really needs to be atomic is the metadata update. That's not usually a problem so long as your on-disk inodes, or at least those fields relating to top-level data block allocation, fit within one physical sector:

- Write new data to disk blocks which are unallocated on disk, but allocated in memory.
- Allocate new blocks on disk.
- Force in-order I/O (e.g. flush the disk cache).
- Atomically update inode to point to new data blocks, in memory and on disk.
- Force in-order I/O (e.g. flush the disk cache).
- De-allocate old data blocks on disk.

Between the "Allocated on disk" and "De-allocate on disk" steps there are potentially two version of the file data, only one of which is connected to a real file. A basic fsck tool or journal feature can clean up the disconnected version in the event that the process is interrupted.

This does assume _complete_ replacement, i.e. O_ATOMIC implies O_TRUNC. A copy-on-write filesystem could implement O_ATOMIC efficiently without truncation, but not all filesystems have the necessary flexibility in the on-disk format for copy-on-write.

XFS: the filesystem of the future?

Posted Mar 5, 2012 21:36 UTC (Mon) by dlang (guest, #313) [Link]

The process you list will badly fragment the file on disk, destroying performance (unless you re re-writing the entire file, in which case just writing a temporary file and renaming it will work)

It's also not clear that this is what the original poster was looking for when he said "Linux devs should really provide a proper solution (like O_ATOMIC) instead of blaming app devs for not doing the impossible."

I've been trying to make two points in this thread

1. it's not obvious what the "proper solution" that the kernel should provide looks like

2. it's not impossible to do this today (since there are many classes of programs that do this)

XFS: the filesystem of the future?

Posted Mar 5, 2012 22:45 UTC (Mon) by nybble41 (subscriber, #55106) [Link]

>> This does assume _complete_ replacement, i.e. O_ATOMIC implies O_TRUNC.
> The process you list will badly fragment the file on disk, destroying performance (***unless you re re-writing the entire file***, in which case just writing a temporary file and renaming it will work)

Emphasis added. Yes, I assumed the entire file was being rewritten. If not, one could add an online defragmentation step after the metadata update, though online defragmentation introduces atomicity issues of its own.

Creating and renaming a temporary file has its own issues, which have already been mentioned, particularly relating to symlinks and hard links. Even for ordinary files, it's not guaranteed that there exists a directory on the same filesystem where you have permission to create a temporary file. Bind mounts (which can apply to individual files) are another potential sore spot. How much work should applications be expected to do just to find a place to put their temporary file such that rename() can be guaranteed atomic?

An O_ATOMIC option to open() would ensure that you are really replacing the original file, and that the temporary space comes from the same filesystem.

1) No, there are no obvious solutions yet, but there have been several reasonable proposals.

2) Current applications can do atomic replacement in common but limited circumstances using rename(). They all depend on creating a temporary file on the same filesystem, which assumes both that you can locate that filesystem (see: bind mounts) and that you can create new files there. They also tend to break hard links, which may or may not be a desired behavior. There is no general, straightforward solution to the problem of atomically replacing the data associated with a specific inode.

XFS: the filesystem of the future?

Posted Mar 5, 2012 23:37 UTC (Mon) by dlang (guest, #313) [Link]

> it's not guaranteed that there exists a directory on the same filesystem where you have permission to create a temporary file.

this sounds like a red hearing to me.

If you aren't allowed to create a new file in the directory of the file, are you sure you have permission to overwrite the file you are trying to modify?

as for the symlink/hardlink 'issue', in my sysadmin experience, more problems are caused by editors that modify files in place (not using temp files and renaming them) than by breaking links. Editors that break links when modifying a file are referred to as 'well behaved' in this area.

XFS: the filesystem of the future?

Posted Mar 6, 2012 0:28 UTC (Tue) by nybble41 (subscriber, #55106) [Link]

> If you aren't allowed to create a new file in the directory of the file, are you sure you have permission to overwrite the file you are trying to modify?

These are orthogonal permissions. To overwrite the file you need write permission on the file. To create a new file in the same directory you need write permission on the directory. It's easily possible to have one without the other:

root# mkdir a
root# touch a/b
root# chown user a/b

user$ echo test > a/b # no error
user$ touch a/c
touch: cannot touch `a/c': Permission denied

> as for the symlink/hardlink 'issue', in my sysadmin experience, more problems are caused by editors that modify files in place (not using temp files and renaming them) than by breaking links.

Obviously that would remain an option. The point is that it should be an *option*, alongside the ability to atomically update a file in place. Symlinks are often used to add version control over configuration files (without putting the entire home / etc directory in the repository), while bind mounts are more often used with namespaces and chroot environments. Usually you want the latter to be read-only, but if the file is read/write it makes sense to allow it to be updated without breaking the link. (You can't just delete or rename over a bind target, either; it has to be unmounted.)

XFS: the filesystem of the future?

Posted Mar 5, 2012 22:32 UTC (Mon) by XTF (guest, #83255) [Link]

> The only part that really needs to be atomic is the metadata update. That's not usually a problem so long as your on-disk inodes, or at least those fields relating to top-level data block allocation, fit within one physical sector:

They don't and that's not how atomicity is guaranteed. Atomicity is guaranteed via the journal.

> This does assume _complete_ replacement, i.e. O_ATOMIC implies O_TRUNC.

Not really, actually. You merely have to ensure that the old state / blocks remain valid, so you have to do all writes to new blocks.

XFS: the filesystem of the future?

Posted Mar 5, 2012 22:55 UTC (Mon) by nybble41 (subscriber, #55106) [Link]

>> The only part that really needs to be atomic is the metadata update. That's not usually a problem so long as your on-disk inodes, or at least those fields relating to top-level data block allocation, fit within one physical sector:

> They don't and that's not how atomicity is guaranteed. Atomicity is guaranteed via the journal.

I can only assume you have a particular filesystem in mind. It is possible to arrange for inodes (or at least the data block portions) to fit within one sector, and to have atomic metadata updates without a journal. If you have a journal, great; atomic updates shouldn't be a problem. However, this system can also be retrofitted onto filesystem which do not support journals.

>> This does assume _complete_ replacement, i.e. O_ATOMIC implies O_TRUNC.

> Not really, actually. You merely have to ensure that the old state / blocks remain valid, so you have to do all writes to new blocks.

True, if you want to implement full copy-on-write semantics. I was going for a simpler approach which can be implemented by almost any filesystem with no on-disk structure changes.

XFS: the filesystem of the future?

Posted Mar 6, 2012 0:10 UTC (Tue) by XTF (guest, #83255) [Link]

> I can only assume you have a particular filesystem in mind

Not really

> True, if you want to implement full copy-on-write semantics. I was going for a simpler approach which can be implemented by almost any filesystem with no on-disk structure changes.

What do you mean by full copy-on-write semantics?
What on-disk structure changes would be required to do this in ext4 for example?

XFS: the filesystem of the future?

Posted Mar 6, 2012 0:49 UTC (Tue) by nybble41 (subscriber, #55106) [Link]

>> I can only assume you have a particular filesystem in mind
> Not really

In that case, why did you say that on-disk inodes do not fit within one physical sector? That is filesystem-specific, and I certainly know of some where the full inode size is less than or equal to 512 bytes; ext2 is at least capable of being configured that way.

>> True, if you want to implement full copy-on-write semantics. I was going for a simpler approach which can be implemented by almost any filesystem with no on-disk structure changes.
> What do you mean by full copy-on-write semantics?
> What on-disk structure changes would be required to do this in ext4 for example?

Perhaps none. I didn't mean to imply that it was impossible to implement atomic replacement of partial files without changing the on-disk structure; I simply hadn't proved to myself that it could be done easily. In retrospect it probably could be done, though you would run into the aforementioned fragmentation issues common to most C.O.W. filesystems.

The biggest complication is not on disk but in memory; you would need to modify the filesystem code to account for the shared data blocks, of which there may be as many alternate versions as there are O_ATOMIC file descriptors--unless O_ATOMIC is exclusive, of course.

XFS: the filesystem of the future?

Posted Mar 6, 2012 12:55 UTC (Tue) by XTF (guest, #83255) [Link]

> In that case, why did you say that on-disk inodes do not fit within one physical sector?

You're right, inodes typically do fit in a sector.
What I wanted to say is that a meta-data transaction usually involves multiple parts / sectors. Providing consistency guarantees after a crash is hard without a journal.

> unless O_ATOMIC is exclusive, of course.

Having a reader and a writer or multiple writers at the same time is always problematic.

XFS: the filesystem of the future?

Posted Mar 6, 2012 21:36 UTC (Tue) by dgc (subscriber, #6611) [Link]

> - Write new data to disk blocks which are unallocated on disk, but
> allocated in memory.

open(tmpfile)
write(tmpfile)

> - Allocate new blocks on disk.
> - Force in-order I/O (e.g. flush the disk cache).

fsync(tmpfile)

> - Atomically update inode to point to new data blocks, in memory and
> on disk.
> - Force in-order I/O (e.g. flush the disk cache).
> - De-allocate old data blocks on disk.

rename(tmpfile, destination)

It seems that there are lots of people with ideas of how to "improve" overwrites but few of those people really understand the mechanisms that filesystems already provide via the POSIX interface. If you ever wonder why filesystem developers are a little bit sick of this topic, your post is a perfect example.

Dave.

XFS: the filesystem of the future?

Posted Mar 6, 2012 21:43 UTC (Tue) by dlang (guest, #313) [Link]

per the other messages in this thread, the OP is concerned about rather esoteric situations where

the file is a bind mount

the program doesn't have permission to make a new file in the directory (only to modify an existing one)

where the file is a special link (symlink, multiple hardlinks, bind mounted, etc) and the 'right thing' is to maintain those links instead of breaking them to have the modified version be a local copy

these situations do exist, but they are rather rare and the 'right thing' to do is not always and obvious and consistent thing.

XFS: the filesystem of the future?

Posted Mar 6, 2012 22:09 UTC (Tue) by XTF (guest, #83255) [Link]

> about rather esoteric situations where

Since when is not reseting meta-data (like file owner, permissions, acls, creation timestamp, etc) an esoteric situation

Each case by itself might be insignificant, but all cases together are not, IMO:
Assuming the target is not a symlink to a different volume
Assuming you are allowed to create the tmp file
Assuming you are allowed to overwrite an existing file having the same name as your tmp file
Assuming it's ok to reset meta-data, like file owner, permissions, acls, creation timestamp, etc.
Assuming the performance regression due to fsync is ok (request was for atomic, not durable)

XFS: the filesystem of the future?

Posted Mar 7, 2012 11:19 UTC (Wed) by dgc (subscriber, #6611) [Link]

> the OP is concerned about rather esoteric situations

I know. But what was described was a specific set of block and disk cache manipulations. It wasn't a set of requirements; all I was doing is pointing out if I treated it as a set of requirements then it can be implemented with the existing POSIX API.

> These situations do exist, but they are rather rare and the 'right
> thing' to do is not always and obvious and consistent thing.

And that is precisely why the problem can't be solved by some new magic filesystem operation until the desired behaviour is defined and specified. The weird corner cases that make it hard for userspace also make it just as hard to implement it in the kernel. I see lots of people pointing out why using rename is hard, but their underlying assumption is that moving all this into the kernel will solve all these problems. Well, it doesn't. For example: how does moving the overwrite into the kernel make it any easier to decide whether we should break hard links or not?

IOWs, the issue here is that nobody has defined what the operations being asked for are supposed to do, the use cases, the expected behaviour, constraints, etc.

Before asking kernel developers to do something the problem space needs to be scoped and specified. For the people that want this functionality in the kernel: write a man page for the syscall. Refine it. Implement it as a library function and see if you can use effectively for your use cases. Work out all the kinks in the syscall API. Ask linux-fsdevel for comments and if it can be implemented. Go back to step 1 with all the review comments you receive, then continue looping until there's consensus and somebody implements the syscall with support from a filesystem or two. Once the syscall is done, over time more filesystems will then implement support and then maybe 5 years down the track we can assume this functionality is always available.

But what really needs to happen first is that someone asking for this new kernel functionality steps up and takes responsibility for driving this process.....

Gesta non verba.

XFS: the filesystem of the future?

Posted Mar 7, 2012 16:54 UTC (Wed) by XTF (guest, #83255) [Link]

> but their underlying assumption is that moving all this into the kernel will solve all these problems. Well, it doesn't.

Why not? Some of the issues should be easy to avoid in kernel space.

> For example: how does moving the overwrite into the kernel make it any easier to decide whether we should break hard links or not?

What does the non-atomic case do? I'd do the same in the atomic case.

> IOWs, the issue here is that nobody has defined what the operations being asked for are supposed to do, the use cases, the expected behaviour, constraints, etc.

My request was quite clear: Could any of the fsync advocates post real code that does the atomic variant of open, write, close?
Isn't that quite well-defined?

> Before asking kernel developers to do something the problem space needs to be scoped and specified

Before that, we should agree that these are valid issues / assumptions / regressions.

> But what really needs to happen first is that someone asking for this new kernel functionality steps up and takes responsibility for driving this process.....

That'd be great, but at the moment most (FS) devs (and others) just declare these non-issues and refuse to discuss.

XFS: the filesystem of the future?

Posted Mar 6, 2012 22:00 UTC (Tue) by XTF (guest, #83255) [Link]

> It seems that there are lots of people with ideas of how to "improve" overwrites but few of those people really understand the mechanisms that filesystems already provide via the POSIX interface. If you ever wonder why filesystem developers are a little bit sick of this topic, your post is a perfect example.

Other people appear to have trouble reading. ;)
Your code was posted already and didn't suffice.

XFS: the filesystem of the future?

Posted Mar 6, 2012 22:05 UTC (Tue) by nybble41 (subscriber, #55106) [Link]

> open(tmpfile)
> write(tmpfile)
> fsync(tmpfile)
> rename(tmpfile, destination)

Now find a way to do that *without* the need for a temporary file, and you might have something relevant to contribute to the thread.

A temporary file is not always an acceptable option; it presumes that you know of a directory in which you can create a temporary file guaranteed to be on the same filesystem as the file that it's replacing, so that rename() can be implemented atomically, and that either there are no hard links to the file or that those links should be broken by the rename(). Moreover, the rename() method resets portions of the security context of the original file, including ownership and security labels, which you can't restore without superuser capabilities.

This thread exists because people who do know plenty about the POSIX interfaces also know that they don't provide a general solution.

XFS: the filesystem of the future?

Posted Mar 5, 2012 22:18 UTC (Mon) by XTF (guest, #83255) [Link]

> writing the data to the new blocks may not be atomic.

It doesn't need to be

> writing the metadata to disk may not be atomic.

That's why you use a journal

> If your writes to disk can't be atomic, how can the entire transaction?

Heard of TCP? It creates a reliable connection over an unreliable network.
Or databases? Atomic transactions on unatomic disks are very possible.

> some of the data you write to a file may be visible before the metadata gets changed, which would make the change overall not be atomic.

No, because you write the new data to *new* blocks. Blocks not references by any file yet.

XFS: the filesystem of the future?

Posted Mar 30, 2018 7:19 UTC (Fri) by vcmohans (guest, #123360) [Link]

I am trying to achieve the best performance to find a string from the 12 different files with 12 dedicated threads.For that I have configured 120GB of two samsung SSDs in RAID0(striping) configuration with XFS file system.But the result is same as EXT4 file system.Why the performace is not improved?


Copyright © 2012, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds