
Linux Filesystem, Storage, and Memory Management Summit, Day 2


By Jonathan Corbet
April 6, 2011
This article covers the second day of the 2011 Linux Filesystem, Storage, and Memory Management Summit, held on April 5, 2011 in San Francisco, California. Those who have not yet seen the first day coverage may want to have a look before continuing here.

The opening plenary session was led by Michael Cornwall, the global director for technology standards at IDEMA, a standards organization for disk drive manufacturers. His talk, which was discussed in a separate article, covered the changes that are coming in the storage industry and how the Linux community can get involved to make things work better.

I/O resource management

The main theme of the memory management track often appeared to be "control groups"; for one session, though, the entire gathering got to share the control group fun as Vivek Goyal, Fernando Cao, and Chad Talbott led a discussion on I/O bandwidth management. There are two I/O bandwidth controllers in the kernel now: the throttling controller (which can limit control groups to an absolute bandwidth value) and the proportional controller (which divides up the available bandwidth between groups according to an administrator-set policy). Vivek was there to talk about the throttling controller, which is in the kernel and working, but which still has a few open issues.
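
The throttling interface itself is simple: a "major:minor bytes-per-second" rule written into a control group's blkio.throttle.read_bps_device (or write_bps_device) file. Here is a minimal sketch in C, assuming a /sys/fs/cgroup/blkio mount, a group called "mygroup", and /dev/sdb (8:16) as the device; all of those names are examples only.

    /* Cap one group's reads from device 8:16 at 1MB/s; the mount point,
     * group name, and device numbers are examples only. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const char *path =
            "/sys/fs/cgroup/blkio/mygroup/blkio.throttle.read_bps_device";
        FILE *f = fopen(path, "w");

        if (!f) {
            perror("fopen");
            return EXIT_FAILURE;
        }
        fprintf(f, "8:16 1048576\n");   /* "major:minor bytes-per-second" */
        if (fclose(f) != 0) {
            perror("fclose");
            return EXIT_FAILURE;
        }
        return EXIT_SUCCESS;
    }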

One of those is that the throttling controller does not play entirely well with journaling filesystems. I/O ordering requirements will not allow the journal to be committed before other operations have made it to disk; if some of those other operations have been throttled by the controller, the journal commit stalls and the whole filesystem slows down. Another is that the controller can only manage synchronous writes; writes which have been buffered through the page cache have lost their association with the originating control group and cannot be charged against that group's quota. There are patches to perform throttling of buffered writes, but that is complicated and intrusive work.

Another problem was pointed out by Ted Ts'o: the throttling controller applies bandwidth limits on a per-device basis. If a btrfs filesystem is in use, there may be multiple devices which make up that filesystem. The administrator would almost certainly want limits to apply to the volume group as a whole, but the controller cannot do that now. A related problem is that some users want to be able to apply global limits - limits on the amount of bandwidth used on all devices put together. The throttling controller also does not work with NFS-mounted filesystems; they have no underlying device at all, so there is no place to put a limit.

Chad Talbott talked about the proportional bandwidth controller; it works well with readers and synchronous writers, but, like the throttling controller, it is unable to deal with asynchronous writes. Fixing that will require putting some control group awareness into the per-block-device flushing threads. The system currently maintains a set of per-device lists containing inodes with dirty pages; those lists need to be further subdivided into per-control-group lists to enable the flusher threads to write out data according to the set policy. This controller also does not yet properly implement hierarchical group scheduling, though there are patches out there to add that functionality.

The following discussion focused mostly on whether the system is accumulating too many control groups. Rather than a lot of per-subsystem controllers, we should really have a cross-subsystem controller mechanism. At this point, though, we have the control groups (and their associated user-space API which cannot be broken) that are in the kernel. So, while some (like James Bottomley) suggested that we should maybe dump the existing control groups in favor of something new which gets it right, that will be a tall order. Beyond that, as Mike Rubin pointed out, we don't really know how control groups should look even now. There has been a lack of "taste and style" people to help design this interface.

Working set estimation

Back in the memory management track, Michel Lespinasse discussed Google's working set estimation code. Google has used this mechanism for some time as a way of optimally placing new jobs in its massive cluster. By getting a good idea of how much memory each job is really using, they can find the machines with the most idle pages and send new work in that direction. Working set estimation, in other words, helps Google to make better decisions on how to overcommit its systems.

The implementation is a simple kernel thread which scans through the physical pages on the system, every two minutes by default. It looks at each page to determine whether it has been touched by user space or not and remembers that state. The whole idea is to try to figure out how many pages could be taken away from the system without causing undue memory pressure on the jobs running there.

The kernel thread works by setting a new "idle" flag on each page which looks like it has not been referenced. That bit is cleared whenever an actual reference happens (as determined by looking at whether the VM subsystem has cleared the "young" bit). Pages which are still marked idle on the next scan are deemed to be unused. The estimation code does not take any action to reclaim those pages; it simply exports statistics on how many unused pages there are through a control group file. The numbers are split up into clean, swap-backed dirty, and file-backed dirty pages. It's then up to code in user space to decide what to do with that information.
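
The bookkeeping can be pictured with a small user-space model (an illustration of the state machine only, not Google's kernel code): each page carries a referenced ("young") bit and an "idle" bit, a reference clears the idle bit, and the scanner counts pages that stayed idle across a whole interval.

    /* Toy model of the idle-page scan: pages still idle at the next scan
     * are counted as unused.  Not the kernel implementation. */
    #include <stdbool.h>
    #include <stdio.h>

    #define NPAGES 8

    struct page_state {
        bool young;   /* set when the page is referenced */
        bool idle;    /* set by the scanner, cleared on reference */
    };

    static void touch(struct page_state *p)
    {
        p->young = true;
        p->idle = false;              /* a reference clears "idle" */
    }

    static unsigned scan(struct page_state *pages, unsigned n)
    {
        unsigned unused = 0;

        for (unsigned i = 0; i < n; i++) {
            if (pages[i].idle && !pages[i].young)
                unused++;                    /* idle since the previous scan */
            pages[i].idle = !pages[i].young; /* mark for the next interval */
            pages[i].young = false;          /* rearm reference detection */
        }
        return unused;
    }

    int main(void)
    {
        struct page_state pages[NPAGES] = { 0 };

        scan(pages, NPAGES);          /* first pass: mark everything idle */
        touch(&pages[0]);             /* pages 0 and 1 get used... */
        touch(&pages[1]);
        printf("unused pages: %u\n", scan(pages, NPAGES));  /* ...6 stay idle */
        return 0;
    }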

There were questions about the overhead of the page scanning; Michel said that scanning every two minutes required about 1% of the available CPU time. There were also questions about the daemon's use of two additional page flags; those flags are a limited resource on 32-bit systems. It was suggested that a separate bitmap outside of the page structure could be used. Google runs everything in 64-bit mode, though, so there has been little reason to care about page flag exhaustion so far. Rik van Riel suggested that the feature could simply not be supported on 32-bit systems. He also suggested that the feature might be useful in other contexts; systems running KVM-virtualized guests could use it to control the allocation of memory with the balloon driver, for example.

Virtual machine sizing

Rik then led a discussion on a related topic: allocating the right amount of memory to virtual machines. As with many problems, there are two distinct aspects: policy (figuring out what the right size is for any given virtual machine) and mechanism (actually implementing the policy decisions). There are challenges on both sides.

There are a number of mechanisms available for controlling the memory available to a virtual machine. "Balloon drivers" can be used to allocate memory in guests and make it available to the host; when a guest needs to get smaller, the balloon "inflates," forcing the guest to give up some pages. Page hinting is a mechanism by which the guest can inform the host that certain pages do not contain useful data (for example, they are on the guest's free list). The host can then reclaim memory used for so-hinted pages without the need to write them out to backing store. The host can also simply swap the guest's pages out without involving the guest operating system at all. The KSM mechanism allows the kernel to recover pages which contain duplicated contents. Compression can be used to cram data into a smaller number of pages. Page contents can also simply be moved around between systems or stashed into some sort of transcendent memory scheme.

There seem to be fewer options on the policy side. The working set estimation patches are certainly one possibility. One can control memory usage simply through decisions on the placement of virtual machines. The transcendent memory mechanism also allows the host to make policy decisions on how to allocate its memory between guests.

One interesting possibility raised by Rik was to make the balloon mechanism better. Current balloon drivers tend to force the release of random pages from the guest; that leads to fragmentation in the host, thwarting attempts to use huge pages. A better approach might be to use page hinting, allowing the guest to communicate to the host which pages are free. The balloon driver could then work by increasing the free memory thresholds instead of grabbing pages itself; that would force the guest to keep more pages free. Even better, memory compaction would come into play, so the guest would be driven to free up contiguous ranges of pages. Since those pages are marked free, the host can grab them (hopefully as huge pages) and use them elsewhere. With this approach, there is no need to pass pages directly to the host; the hinting is sufficient.
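
The "raise the free memory thresholds" idea can be pictured with a knob that already exists: a larger value in the guest's /proc/sys/vm/min_free_kbytes forces it to keep more memory free and gives reclaim and compaction a target to work toward. The sketch below pokes that file from user space purely as an illustration; the mechanism described here would live in the balloon driver, and the 128MB figure is arbitrary.

    /* Illustration only: raise the guest's free-memory watermark so that
     * more (hopefully contiguous) memory stays free.  The value is an
     * arbitrary example; the proposed balloon driver would do the
     * equivalent in the kernel rather than via this sysctl. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        FILE *f = fopen("/proc/sys/vm/min_free_kbytes", "w");

        if (!f) {
            perror("fopen");
            return EXIT_FAILURE;
        }
        fprintf(f, "%d\n", 128 * 1024);   /* keep at least 128MB free */
        fclose(f);
        return EXIT_SUCCESS;
    }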

There are other reasons to avoid the direct allocation of pages in balloon drivers; as Pavel Emelyanov pointed out, that approach can lead to out-of-memory situations in the guest. Andrea Arcangeli stated that, when balloon drivers are in use, the guest must be configured with enough swap space to avoid that kind of problem; otherwise things will not be stable. The policy implemented by current balloon drivers is also entirely determined by the host system; it's not currently possible to let the guest decide when it needs to grow.

There is also a simple problem of communication; the host has no comprehensive view of the memory needs of its guest systems. Fixing that problem will not be easy; any sort of intrusive monitoring of guest memory usage will fail to scale well. And most monitoring tends to fall down when a guest's memory usage pattern changes - which happens frequently.

Few conclusions resulted from this session. There will be a new set of page hinting patches from Rik in the next few weeks; after that, thought can be put into doing ballooning entirely through hinting without having to call back to the host.

Dirty limits and writeback

The memory management track had been able to talk for nearly a full hour without getting into control groups, but that was never meant to last; Greg Thelen brought the subject back during his session on the management of dirty limits within control groups. He made the claim that keeping track of dirty memory within control groups is relatively easy, but then spent the bulk of his session talking about the subtleties involved in that tracking.

The main problem with dirty page tracking is a more general memory controller issue: the first control group to touch a specific page gets charged for it, even if other groups make use of that page later. Dirty page tracking makes that problem worse; if control group "A" dirties a page which is charged to control group "B", it will be B which is charged with the dirty page as well. This behavior seems inherently unfair; it could also perhaps facilitate denial of service attacks if one control group deliberately dirties pages that are charged to another group.

One possible solution might be to change the ownership of a page when it is dirtied - the control group which is writing to the page would then be charged for it thereafter. The problem with that approach is pages which are repeatedly dirtied by multiple groups; that could lead to the page bouncing back and forth. One could try a "charge on first dirty" approach, but Greg was not sure that it's all worth it. He does not expect that there will be a lot of sharing of writable pages between control groups in the real world.

The bigger problem is what to do about control groups which hit their dirty limits. Presumably they will be put to sleep until their dirty page counts go below the limit, but that will only work well if the writeback code makes a point of writing back pages which are associated with those control groups. Greg had three possible ways of making that happen.

The first of those involved creating a new memcg_mapping structure which would take the place of the address_space structure used to describe a particular mapping. Each control group would have one of these structures for every mapping in which it has pages. The writeout code could then use these mappings to locate the specific pages which need to be written back to disk. This solution would work, but is arguably more complex than is really needed.

An approach which is "a little dumber" would have the system associating control groups with inodes representing pages which have been dirtied by those control groups. When a control group goes over its limit, the system could just queue writeback on the inodes where that group's dirty pages reside. The problem here is that this scheme does not handle sharing of inodes well; it can't put an inode on more than one group's list. One could come up with a many-to-one mechanism allowing the inode to be associated with multiple control groups, but that code does not exist now.

Finally, the simplest approach is to put a pointer to a memory control group into each inode structure. When the writeback code scans through the list of dirty inodes, it could simply skip those which are not associated with control groups that have exceeded their dirty limit. This approach, too, does not do sharing well; it also suffers from the disadvantage that it causes the inode structure to grow.
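
A rough rendering of that third option, with stand-in types and a hypothetical memcg_over_dirty_limit() helper in place of the real kernel interfaces, might look like this:

    /* Sketch of the "pointer in the inode" approach; struct inode_like and
     * memcg_over_dirty_limit() are stand-ins, not actual kernel code. */
    struct mem_cgroup;                      /* opaque memory control group */

    struct inode_like {
        struct mem_cgroup *i_memcg;         /* group charged for dirty pages */
        /* ... the rest of the inode ... */
    };

    /* Hypothetical helper: has this group exceeded its dirty limit? */
    extern int memcg_over_dirty_limit(struct mem_cgroup *memcg);

    /* The writeback scan would skip inodes whose group is under its limit. */
    static inline int inode_needs_writeback(const struct inode_like *inode)
    {
        return inode->i_memcg && memcg_over_dirty_limit(inode->i_memcg);
    }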

Few conclusions were reached in this session; it seems clear that this code will need some work yet.

Kernel memory accounting and soft limits

The kernel's memory control group mechanism is concerned with limiting user-space memory use, but kernel memory can matter too. Pavel Emelyanov talked briefly about why kernel memory is important and how it can be tracked and limited. The "why" is easy; processes can easily use significant amounts of kernel memory. That usage can impact the system in general; it can also be a vector for denial of service attacks. For example, filling the directory entry (dentry) cache is just a matter of writing a loop running "mkdir x; cd x". For as long as that loop runs, the entire chain of dentries representing the path to the bottommost directory will be pinned in the cache; as the chain grows, it will fill the cache and prevent anything else from performing path lookups.
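
The loop in question is trivial; the same "mkdir x; cd x" attack, rendered as a small C program (to be run only in a scratch directory, since it creates directories until something gives out), looks like this:

    /* The dentry-cache pinning loop described above: keep creating and
     * descending into nested "x" directories, so that every component of
     * the ever-deeper path stays pinned in the dentry cache. */
    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void)
    {
        for (;;) {
            if (mkdir("x", 0700) != 0) {
                perror("mkdir");
                return 1;
            }
            if (chdir("x") != 0) {
                perror("chdir");
                return 1;
            }
        }
    }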

Tracking every bit of kernel data used by a control group is a difficult job; it also becomes an example of diminishing returns after a while. Much of the problem can be solved by looking at just a few data structures. Pavel's work has focused on three structures in particular: the dentry cache, networking buffers, and page tables. The dentry cache controller is relatively straightforward; it can either be integrated into the memory controller or made into a separate control group of its own.

Tracking network buffers is harder due to the complexities of the TCP protocol. The networking code already does a fair amount of tracking, though, so the right solution here is to integrate with that code to create a separate controller.

Page tables can occupy large amounts of kernel memory; they present some challenges of their own, especially when a control group hits its limit. There are two ways a process can grow its page tables; one is via system calls like fork() or mmap(). If a limit is hit there, the kernel can simply return ENOMEM and let the process respond as it will. The other way, though, is in the page fault handler; there is no way to return a failure status there. The best the controller can do is to send a segmentation fault signal; that usually just results in the unexpected death of the program which incurred the page fault. The only alternative would be to invoke the out-of-memory killer, but that may not even help: the OOM killer is designed to free user-space memory, not kernel memory.

Pavel plans to integrate the page table tracking into the memory controller; patches are forthcoming.

Ying Han got a few minutes to discuss the implementation of soft limits in the memory controller. As had been mentioned on the first day, soft limits differ from the existing (hard) limits in that they can be exceeded if the system is not under global memory pressure. Once memory gets tight, the soft limits will be enforced.

That enforcement is currently suboptimal, though. The code maintains a red-black tree in each zone containing the control groups which are over their soft limits, even though some of those groups may not have significant amounts of memory in that specific zone. So the system needs to be taught to be more aware of allocations in each zone.

The response to memory pressure is also not perfect; the code picks the control group which has exceeded its soft limit by the largest amount and beats on it until it goes below the soft limit entirely. It would probably be better to add some fairness to the algorithm and spread the pain among all of the control groups which have gone over their limits. Some sort of round-robin algorithm which would cycle through those groups would probably be a better way to go.
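
In outline, such a pass might look like the toy model below; reclaim_some_pages_from() is hypothetical, and the real reclaim path is, of course, far more involved.

    /* Toy round-robin soft-limit reclaim: rather than hammering the worst
     * offender, cycle through every group over its soft limit and take a
     * little from each.  reclaim_some_pages_from() is hypothetical. */
    struct memcg_soft {
        unsigned long usage;       /* pages charged to the group */
        unsigned long soft_limit;  /* pages allowed once memory gets tight */
    };

    extern unsigned long reclaim_some_pages_from(struct memcg_soft *memcg,
                                                 unsigned long nr_pages);

    unsigned long soft_limit_reclaim(struct memcg_soft *groups,
                                     unsigned int ngroups,
                                     unsigned long target)
    {
        static unsigned int next;   /* remember where the last pass stopped */
        unsigned long reclaimed = 0;

        for (unsigned int i = 0; i < ngroups && reclaimed < target; i++) {
            struct memcg_soft *memcg = &groups[(next + i) % ngroups];

            if (memcg->usage > memcg->soft_limit)
                reclaimed += reclaim_some_pages_from(memcg, 32);
        }
        next = (next + 1) % ngroups;        /* spread the pain next time too */
        return reclaimed;
    }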

There was clearly more to discuss on this topic, but time ran out and the discussion had to end.

Transparent huge page improvements

Andrea Arcangeli had presented the transparent huge page (THP) patch set at the 2010 Summit and gotten some valuable feedback in return. By the 2011 event, that code had been merged for the 2.6.38 kernel; it still had a number of glitches, but those have since been fixed up. Since then, THP has gained some improved statistics support under /proc; there is also an out-of-tree patch to add some useful information to /proc/vmstat. Some thought has been put into optimizing libraries and applications for THP, but there is rarely any need to do that; applications can make good use of the feature with no changes at all.
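
For the rare application that does want to help THP along, the hint amounts to little more than an aligned allocation plus madvise(MADV_HUGEPAGE), which was merged alongside THP in 2.6.38; a minimal sketch:

    /* Optional application-side THP hint: allocate a 2MB-aligned region and
     * mark it MADV_HUGEPAGE.  With THP enabled system-wide this is not
     * required; it merely widens the cases where huge pages get used. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    #define HPAGE_SIZE (2UL * 1024 * 1024)   /* 2MB on most configurations */

    int main(void)
    {
        void *buf;
        size_t len = 64 * HPAGE_SIZE;        /* 128MB working area */
        int err = posix_memalign(&buf, HPAGE_SIZE, len);

        if (err != 0) {
            fprintf(stderr, "posix_memalign: %s\n", strerror(err));
            return EXIT_FAILURE;
        }
    #ifdef MADV_HUGEPAGE
        if (madvise(buf, len, MADV_HUGEPAGE) != 0)
            perror("madvise");               /* non-fatal: it is only a hint */
    #endif
        /* ... use buf; khugepaged can now back it with huge pages ... */
        free(buf);
        return EXIT_SUCCESS;
    }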

There are a number of future optimizations on Andrea's list, though he made it clear that he does not plan to implement them all himself. The first item, though - adding THP support to the mremap() system call - has been completed. Beyond that, he would like to see the process of splitting huge pages optimized to remove some unneeded TLB flush operations. The migrate_pages() and move_pages() system calls are not yet THP-aware, so they split up any huge pages they are asked to move. Adding a bit of THP awareness to glibc could improve performance slightly.

The big item on the list is THP support for pages in the page cache; currently only anonymous pages are supported. There would be some big benefits beyond another reduction in TLB pressure; huge pages in the page cache would greatly reduce the number of pages which need to be scanned by the reclaim code. It is, however, a huge job which would require changes in all filesystems. Andrea does not seem to be in a hurry to jump into that task. What might happen first is the addition of huge page support to the tmpfs filesystem; that, at least, would allow huge pages to be used in shared memory applications.

Currently THP only works with one size of huge pages - 2MB in most configurations. What about adding support for 1GB pages as well? That seems unlikely to happen anytime soon. Working with those pages would be expensive - a copy-on-write fault on a 1GB page would take a long time to satisfy. The code changes would not be trivial; the buddy allocator cannot handle 1GB pages, and increasing MAX_ORDER (which determines the largest chunk managed by the buddy allocator) would not be easy to do. And, importantly, the benefits would be small to the point that they would be difficult to measure. 2MB pages are enough to gain almost all of the performance benefits which are available, so supporting larger page sizes is almost certainly not worth the effort. The only situation in which it might happen is if 2MB pages become the basic page size for the rest of the system.

Might a change in the primary page size happen? Not anytime soon. Andrea actually tried it some years ago and ran into a number of problems. Among other things, a larger page size would change a number of system call interfaces in ways which would break applications. Kernel stacks would become far more expensive; their implementation would probably have to change. A lot of memory would be wasted in internal fragmentation. And a lot of code would have to change. One should not expect a page size change to happen in the foreseeable future.

NUMA migration

Non-uniform memory access systems are characterized by the fact that some memory is more expensive to access than the rest. For any given node in the system, memory which is local to that node will be faster than memory found elsewhere in the system. So there is a real advantage to keeping processes and their memory together. Rik van Riel made the claim that this is often not happening. Long-running processes, in particular, can have their memory distributed across the system; that can result in a 20-30% performance loss. He would like to get that performance back.

His suggestion was to give each process a "home node" where it would run if at all possible. The home node differs from CPU affinity in that the scheduler is not required to observe it; processes can be migrated away from their home node if necessary. But, when the scheduler performs load balancing, it would move processes back to their homes whenever possible. Meanwhile, the process's memory allocations would be performed on the home node regardless of where the process is running at the time. The end result should be processes running with local memory most of the time.
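
A crude user-space approximation of the idea already exists in libnuma: numa_set_preferred() gives a process a preferred node for its allocations (falling back elsewhere if that node fills up), and numa_run_on_node() moves it onto that node's CPUs. The sketch below uses node 0 as an arbitrary example; note that numa_run_on_node() is a hard CPU binding, which is exactly the rigidity the home-node scheme is trying to avoid, so this is only a rough stand-in for manual tuning.

    /* Rough user-space stand-in for a "home node": prefer node 0 for memory
     * and run on node 0's CPUs.  Build with -lnuma; node 0 is an example. */
    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int home = 0;

        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return EXIT_FAILURE;
        }
        numa_set_preferred(home);         /* soft memory policy: prefer, not bind */
        if (numa_run_on_node(home) != 0)  /* unlike the proposal, a hard CPU binding */
            perror("numa_run_on_node");

        /* ... run the workload; its memory and CPU now tend to stay together ... */
        return EXIT_SUCCESS;
    }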

There are some practical difficulties with this scheme, of course. The system may end up with a mix of processes which all got assigned to the same home node; there may then be no way to keep them all there. It's not clear what should happen if a process creates more threads than can be comfortably run on the home node. There were also concerns about predictability; the "home node" scheme might create wider variability between identical runs of a program. The consensus, though, was that speed beats predictability and that this idea is worth experimenting with.

Stable pages

What happens if a process (or the kernel) modifies the contents of a page in the time between when that page is queued for writing to persistent storage and when the hardware actually performs the write? Normally, the result would be that the newer data is written, and that is not usually a problem. If, however, something depends on the older contents, the result could be problematic. Examples which have come up include checksums used for integrity checking or pages which have been compressed or encrypted. Changing those pages before the I/O completes could result in an I/O operation failure or corrupted data - neither of which is desirable.

The answer to this problem is "stable pages" - a rule that pages which are in flight cannot be changed. Implementing stable pages is relatively easy (with one exception - see below). Pages being written back to persistent storage are already marked read-only by the kernel; if a process tries to write to such a page, the kernel will catch the fault, mark the page (once again) dirty, then allow the write to proceed. To implement stable pages, the kernel need only force that process to block until any outstanding I/O operations have completed.
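
In outline, the change amounts to adding a wait in the write-fault path; the fragment below is only a sketch, with stand-in helpers loosely modeled on the kernel's PageWriteback() and wait_on_page_writeback(), not the actual filesystem code.

    /* Stable-pages sketch for the write-fault path: if the page is under
     * writeback, block until the I/O completes before letting the store go
     * through.  The page_* helpers below are stand-ins, declared only so
     * that the sketch is self-contained. */
    struct page;

    extern int  page_under_writeback(const struct page *page);
    extern void wait_for_writeback(struct page *page);  /* sleeps until I/O is done */
    extern void mark_page_dirty(struct page *page);

    void handle_write_fault(struct page *page)
    {
        /* Stable pages rule: never modify a page while it is in flight. */
        while (page_under_writeback(page))
            wait_for_writeback(page);

        mark_page_dirty(page);   /* re-dirty the page, then let the write proceed */
    }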

The btrfs filesystem implements stable pages now; it needs them for a number of reasons. Other filesystems do not have stable pages, though; xfs and OCFS implement them for metadata only, and the rest have no concept of stable pages at all. There has been some resistance to the idea of adding stable pages because there is some fear that performance could suffer; processes which could immediately write to pages under I/O would slow down if they are forced to wait.

The truth of the matter seems to be that most of the performance worries are overblown; in the absence of a deliberate attempt to show problems, the performance degradation is not measurable. There are a few exceptions; applications using the Berkeley database manager seem to be one example. It was agreed that it would be good to have some better measurements of potential performance issues; a tracepoint may be placed to allow developers to see how often processes are actually blocked waiting for pages under I/O.

It turns out that there is one place where implementing stable pages is difficult. The kernel's get_user_pages() function makes a range of user-space pages accessible to the kernel. If write access is requested, the pages are made writable at the time of the call. Some time may pass, though, before the kernel actually writes to those pages; in the meantime, some of them may be placed under I/O. There is currently no way to catch this particular race; it is, as Nick Piggin put it, a real correctness issue.

There was some talk of alternatives to stable pages. One is to use bounce buffers for I/O - essentially copying the page's contents elsewhere and using the copy for the I/O operation. That would be expensive, though, so the idea was not popular. A related approach would be to use copy-on-write: if a process tries to modify a page which is being written, the page would be copied at that time and the process would operate on the copy. This solution may eventually be implemented, but only after stable pages have been shown to be a real performance problem. Meanwhile, stable pages will likely be added to a few other filesystems, possibly controlled by a mount-time option.

Closing sessions

Toward the end of the day, Qian Cai discussed the problem of sustainable testing. There are a number of ways in which our testing is not as good as it could be. Companies all have their own test suites; they duplicate a lot of effort and tend not to collaborate on the development of the tests or the sharing of the results. There are some public test suites (such as xfstests and the Linux Testing Project), but they don't work together and each has its own approach to things. Some tests need specific hardware which may not be generally available. Other tests need to be run manually, reducing the frequency with which they are run.

The subsequent discussion ranged over a number of issues without resulting in any real action items. There was some talk of error injection; that was seen as a useful feature, but a hard thing to implement well. It was said that our correctness tests are in reasonably good shape, but that there are fewer stress tests out there. The xfstests suite does some stress testing, but it runs for a relatively short period of time so it cannot catch memory leaks; xfstests is also not very useful for catching data corruption problems.

The biggest problem, though, is one which has been raised a number of times before: we are not very good at catching performance regressions. Ted Ts'o stated that the "dirty secret" is that kernel developers do not normally stress filesystems very much, so they tend not to notice performance problems.

In the final set of lightning talks, Aneesh Kumar and Venkateswararao Jujjuri talked about work which is being done with the 9p filesystem. Your editor has long wondered why people are working on this filesystem, which originally comes from the Plan9 operating system. The answer was revealed here: 9p makes it possible to export filesystems to virtualized guests in a highly efficient way. Improvements to 9p have been aimed at that use case; it now integrates better with the page cache, uses the virtio framework to communicate with guests, can do zero-copy I/O to guests running under QEMU, and supports access control lists. The code for all this is upstream and will be shipping in some distributions shortly.

Amir Goldstein talked about his snapshot code, which now works with the ext4 filesystem. The presentation consisted mostly of benchmark results, almost all of which showed no significant performance costs associated with the snapshot capability. The one exception appears to be the postmark benchmark, which performs a lot of file deletes.

Mike Snitzer went back to the "advanced format" discussion from the morning's session on future technology. "Advanced format" currently means 4k sectors, but might the sector size grow again in the future? How much pain would it take for Linux to support sector sizes which are larger than the processor's page size? Would the page size have to grow too?

The answer to the latter question seems to be "no"; there is no need or desire to expand the system page size to support larger disk sectors. Instead, it would be necessary to change the mapping between pages in memory and sectors on the disk; in many filesystems, this mapping is still done with the buffer head structure. There are some pitfalls, including proper handling of sparse files and efficient handling of page faults, but that is just a matter of programming. It was agreed that it would be nice to do this programming in the core system instead of having each filesystem solve the problems in its own way.

The summit concluded with an agreement that things had gone well, and that the size of the event (just over 70 people) was just about right. The summit, it was said, should be considered mandatory for all maintainers working in this area. It was also agreed that the memory management developers (who have only been included in the summit for the last couple of meetings) should continue to be invited. That seems inevitable for the next summit; the head of the program committee, it was announced, will be memory management hacker Andrea Arcangeli.


Linux Filesystem, Storage, and Memory Management Summit, Day 2

Posted Apr 6, 2011 16:56 UTC (Wed) by scotthall (guest, #73671)

What about mmap_sem scalability which Nick discussed last time? Is he still working on it?

Linux Filesystem, Storage, and Memory Management Summit, Day 2

Posted Apr 6, 2011 23:18 UTC (Wed) by walken (subscriber, #7089)

There were a couple of aspects to that work. One is to reduce mmap_sem hold times - I have done some work in that area, which got into 2.6.37 and 2.6.38. There are still a few situations where mmap_sem is held while we wait for a disk access, but it's much less common than it was last year.

The other direction Nick proposed last year was to have smaller granularity for the mmap_sem - AFAIK there has been no major work in that direction.

Home nodes

Posted Apr 6, 2011 17:46 UTC (Wed) by andikleen2 (guest, #52506)

I already tried home nodes in 2.4. They didn't work. There was also another implementation from NEC. They saw some success on very large systems -- with large NUMA factors -- but they did poorly on the more common low-NUMA-factor two- and four-socket servers.

The reason is that on these systems not using a core is always much worse than using remote memory. And if you give the scheduler too many conflicting inputs it will become schizo and schedule poorly and not use all cores well anymore.

This is worst on dynamic workloads; for more static workloads it's not quite as bad.

A better approach is some form of automatic migration, e.g. as implemented by Lee Schermerhorn:
http://permalink.gmane.org/gmane.linux.kernel.numa/590
This can actually fix up imbalances and also allow some other optimizations. Unfortunately it also doesn't work for all workloads, so it would need to be an optional knob.

-Andi

Home nodes

Posted Apr 6, 2011 20:45 UTC (Wed) by riel (subscriber, #3142)

The solutions we tried in the past seem to be "big hammer" style solutions, that try to be fairly rigid in what the kernel is allowed to do.

I want to see how little change we can get away with, and still get a decent performance improvement. A home node would only be the node that memory allocations start on, and that the process is preferentially run on - the CPU scheduler does need to be able to run processes elsewhere temporarily.

Only when a node is permanently overloaded, is it time to move some tasks elsewhere and eventually migrate over some of their memory (maybe with Lee's patches, or something based on them).

My plan is to start small and only add things as needed, trying to stay away from a large, complete & heavy plan.

Home nodes

Posted Apr 6, 2011 23:13 UTC (Wed) by andikleen2 (guest, #52506)

Actually the home nodes patches were quite simple.

Good luck reinventing the flat tire.

Home nodes

Posted Apr 6, 2011 23:51 UTC (Wed) by martinfick (subscriber, #4455)

While I have no idea if you are right or not, don't you think that attempting to improve where others have failed is potentially worthy of many tries? Especially if there is no fundamental proof that something won't work? And even more so when someone is attempting to solve a real, unsolved problem?

The analogy to reinventing the wheel is inappropriate, since in the case of a working solution, it is a waste of time to reinvent it. But, in the case of failures, "reinventing it" (and potentially no longer failing), should be praised, not ridiculed, no? (again, with the proof caveat above, and even then some... proofs can sometimes be disproved)

Home nodes

Posted Apr 7, 2011 20:49 UTC (Thu) by cmccabe (guest, #60281)

Are there any tools out there to show Linux programmers a timeline of when their threads have been migrated between CPUs?

I know that valgrind can show you cache misses, but I'm not really aware of any tools that can display where the scheduler has put your threads over time. Maybe LTTng?

Tools

Posted Apr 7, 2011 23:02 UTC (Thu) by corbet (editor, #1)

perf timechart can generate some nice output which shows thread migration.

Tools

Posted Apr 8, 2011 14:01 UTC (Fri) by sbohrer (guest, #61058)

I personally much prefer to use kernelshark for this.

Stable pages at Linux Filesystem, Storage, and Memory Management Summit, Day 2

Posted Apr 6, 2011 23:01 UTC (Wed) by neilbrown (subscriber, #359)

> Examples which have come up include checksums used for integrity checking or pages which have been compressed or encrypted.

Checksums are an obvious need, but I don't understand the reference to compressed or encrypted pages.
Surely if the kernel is compressing or encrypting it has to do it to a bounce buffer (or maybe extract the page from the page-cache completely before transforming it). So I don't see what this has to do with stable pages...

Of course RAID5 has always used bounce buffers to get the stability needed for xor calculations. If an incoming dirty page was known to be stable already, I could avoid the copy operation!! So a bio flag saying "these pages are stable" would be good.

And I think this is an amusing juxtaposition:

> The truth of the matter seems to be that most of the performance worries are overblown; in the absence of a deliberate attempt to show problems, the performance degradation is not measurable.

> Ted Ts'o stated that the "dirty secret" is that kernel developers do not normally stress filesystems very much, so they tend not to notice performance problems.

The performance issue caused by stable pages would be increased latency for a 'write' system call when there is lots of free memory. With delayed allocation this should not normally need to wait for IO at all, so occasionally having to wait for IO could cause unwelcome latency spikes in some applications. However I suspect most current filesystems already have latency spikes for write for one reason or another, so I suspect it would be hard to notice a regression here.

Linux Filesystem, Storage, and Memory Management Summit, Day 2

Posted Apr 10, 2011 16:08 UTC (Sun) by ccurtis (guest, #49713)

Page tables can occupy large amounts of kernel memory; [...] when a control group hits its limit. [...] The other way, though, is in the page fault handler; there is no way to return a failure status there. The best the controller can do is to send a segmentation fault signal; [...]

As an application programmer I'd hate to see a SEGV if the system was unable to page in memory. I think most people equate SEGV with a wild pointer and chances seem good that a debugger would show a pointer access somewhere nearby in an application backtrace.

Wouldn't SIGBUS be a more appropriate signal to send in this case?

Linux Filesystem, Storage, and Memory Management Summit, Day 2

Posted Apr 11, 2011 17:35 UTC (Mon) by jospoortvliet (guest, #33164)

Despite my lack of deep technical knowledge, it is articles like these which make me a happy paid subscriber. Awesome, mr C!

Linux Filesystem, Storage, and Memory Management Summit, Day 2

Posted Apr 15, 2011 22:01 UTC (Fri) by oak (guest, #2786)

"Greg was not sure that it's all worth it. He does not expect that there will be a lot of sharing of writable pages between control groups in the real world."

On a typical desktop Linux computer most of that is probably GL buffers / memory shared between applications, the X server, and the compositor... the X server and compositor might be in a separate group from the applications. But I don't think the amounts are normally that significant compared to the whole memory in the computer.

Linux Filesystem, Storage, and Memory Management Summit, Day 2

Posted May 6, 2011 23:15 UTC (Fri) by zlynx (guest, #2285)

Not sure if it is still true with more recent Firefoxes but I remember that the Firefox web browser around version 2 would easily use over 300 MB of X memory on heavy image sites. On a 512 MB machine this was more than 50%.

I wasn't using control groups back then of course.

