
2012 Linux Storage, Filesystem, and Memory Management Summit - Day 1


By Jake Edge
April 3, 2012

Day one of the Linux Storage, Filesystem, and Memory Management Summit (LSFMMS) was held in San Francisco on April 1. What follows is a report on the combined and memory management (MM) sessions from that day, largely based on Mel Gorman's write-ups, with some editing and additions from my own notes. In addition, James Bottomley sat in on the filesystem and storage discussions and his (lightly edited) reports are included as well. The plenary session from day one, on runtime filesystem consistency checking, was covered in a separate article.

Writeback

Fengguang Wu began by enumerating his work on improving the writeback situation and instrumenting the system to get better information on why writeback is initiated. James Bottomley quickly pointed out that writeback has been discussed at LSFMMS for several years now and asked specifically where things stand today. Unfortunately, many people spoke at the same time, some without microphones, making the discussion difficult to follow. It did focus on how and when sync takes place, what impact it has, and whether anyone should care about how dd benchmarks behave. The bulk of the comments focused on the fairness of dealing with multiple syncs coming from multiple sources. Ironically, despite the clarity of the question, the discussion was vague; since audience members did not offer concrete examples, the only conclusion that could be drawn was that "on some filesystems for some workloads depending on what they do, writeback may do something bad".

Wu brought the discussion back on topic by focusing on I/O-less dirty throttling and the complexities that it brings. The intention is to minimize seeks, reduce lock contention, and provide low latency. He maintains that there were some impressive performance gains, with some minor regressions. There are issues around integration with the task/cgroup I/O controllers but, considering the current state of the I/O controllers, this was somewhat expected.

Bottomley asked how much complexity this added; Dave Chinner pointed out that the complexity of the code was irrelevant because the focus should be on the complexity of the algorithm. Wu countered that the coverage of his testing was pretty comprehensive, covering a wide range of hardware, filesystems, and workloads.

For dirty reclaim, there is now a greater focus on pushing pageout work to the flusher threads, with some effort to improve interactivity by focusing dirty reclaim on the tasks doing the dirtying. He stated that dirty pages reaching the end of the LRU are still a problem and suggested the creation of a dirty LRU list. With current kernels, dirty pages are skipped over by direct reclaimers, which increases CPU cost; how much of a problem that is varies between kernel versions. Moving such pages to a separate list unfortunately requires a page flag, which is not readily available.
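To illustrate the problem (this is a toy user-space model, not the kernel's reclaim code), the sketch below shows why dirty pages at the tail of the LRU waste effort: a direct reclaimer can only skip them and leave the actual writeback to the flusher threads, so every scan revisits them. A separate dirty LRU list would keep them out of this loop, at the cost of a page flag.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    struct toy_page {
        bool dirty;
    };

    /*
     * Scan a window of pages from the LRU tail; free clean pages, skip
     * dirty ones (leaving them for the flusher threads). "skipped" shows
     * the wasted work that a separate dirty LRU list would avoid.
     */
    static size_t toy_direct_reclaim(struct toy_page *lru, size_t n, size_t *skipped)
    {
        size_t freed = 0;

        *skipped = 0;
        for (size_t i = 0; i < n; i++) {
            if (lru[i].dirty) {
                (*skipped)++;   /* revisited on every scan */
                continue;
            }
            freed++;
        }
        return freed;
    }

    int main(void)
    {
        struct toy_page lru[8] = { {true}, {false}, {true}, {true},
                                   {false}, {true}, {false}, {true} };
        size_t skipped;
        size_t freed = toy_direct_reclaim(lru, 8, &skipped);

        printf("freed %zu, skipped %zu dirty pages\n", freed, skipped);
        return 0;
    }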

Memory control groups bring their own issues with writeback, particularly around flusher fairness. This is currently beyond control; only coarse options are available, such as limiting the number of operations that can be performed on a per-inode basis or limiting the amount of I/O that can be submitted. There was mention of throttling based on the amount of I/O a process has completed, but it was not clear how this would work in practice.

The final topic was the block cgroup (blkcg) I/O controller and the different approaches to throttling based on I/O operations per second (IOPS) or on access to disk time. Buffered writes are a problem, as is how they could possibly be handled via balance_dirty_pages(). A big issue with throttling buffered writes is still identifying the I/O owner and throttling it at the time the I/O is queued, which happens after the owner has already executed a read() or write(). There was a request to clarify what the best approach might be, but there were few responses. As months, if not years, of discussion on the mailing lists imply, it is just not a straightforward topic; it was suggested that a spare slot be stolen to discuss it further (see the follow-up in the filesystem and storage sessions below).
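A rough sketch of the ownership idea follows, as a self-contained toy model rather than real kernel code: the owning cgroup is recorded when a page is dirtied, so throttling can be applied in a balance_dirty_pages()-style path while the owner is still in context, rather than later when the flusher submits the I/O and the owner is unknown.

    #include <stdint.h>
    #include <stdio.h>

    struct toy_cgroup {
        const char *name;
        uint64_t dirty_pages;
        uint64_t dirty_limit;
    };

    struct toy_page {
        struct toy_cgroup *owner;   /* recorded when the page is dirtied */
    };

    /* Called from the write() path, where the owner is still known. */
    static void toy_account_dirty(struct toy_page *page, struct toy_cgroup *cg)
    {
        page->owner = cg;
        cg->dirty_pages++;
    }

    /* Crude stand-in for a per-cgroup balance_dirty_pages()-style check. */
    static int toy_should_throttle(const struct toy_page *page)
    {
        const struct toy_cgroup *cg = page->owner;

        return cg->dirty_pages > cg->dirty_limit;
    }

    int main(void)
    {
        struct toy_cgroup cg = { "builder", 0, 4 };
        struct toy_page pages[6];

        for (int i = 0; i < 6; i++) {
            toy_account_dirty(&pages[i], &cg);
            if (toy_should_throttle(&pages[i]))
                printf("write %d: throttle cgroup %s\n", i, pages[i].owner->name);
        }
        return 0;
    }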

At the end, Bottomley wanted an estimate of how close writeback was to being "done". After some hedging, Wu estimated that it was 70% complete.

Stable pages

The problems surrounding stable pages were the next topic under discussion. As Ted Ts'o noted, making processes that want to modify a page wait for writeback of that page to complete can lead to unexpected and rather long latencies, which may be unacceptable for some workloads. Stable pages are only really needed on systems where things like checksums calculated on the page require that the page be unchanged when it actually gets written.

Sage Weil and Boaz Harrosh listed the three options for handling the problem. The first was to reissue the write for pages that have changed while they were undergoing writeback, but that can confuse some storage systems. Waiting on the writeback (which is what is currently done) or doing a copy-on-write (COW) of the page under writeback were the other two. The latter option was the initial focus of the discussion.

James Bottomley asked if the cost of COW-ing the pages had been benchmarked; Weil said that it hadn't been. Weil and Harrosh are interested in which workloads really require stable writes and whether they are truly affected by waiting for the writeback to complete. Weil noted that Ts'o can simply turn off stable pages, which fixes his problem; Bottomley asked whether there could just be a mount flag to turn off stable pages. Another way to approach that might be to have the underlying storage system inform the filesystem whether it needs stable writes or not.
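A toy model of that last idea follows; the flag name is hypothetical and this is user-space pseudo-kernel logic rather than anything merged. The point is simply that the wait would only happen when the backing device says it needs stable pages.

    #include <stdbool.h>
    #include <stdio.h>

    struct toy_backing_dev {
        bool requires_stable_pages;     /* set by the storage driver */
    };

    struct toy_page {
        bool under_writeback;
    };

    static void toy_wait_for_writeback(struct toy_page *page)
    {
        /* stand-in for blocking until writeback completes */
        page->under_writeback = false;
    }

    static void toy_modify_page(struct toy_page *page, struct toy_backing_dev *bdev)
    {
        if (bdev->requires_stable_pages && page->under_writeback)
            toy_wait_for_writeback(page);   /* the latency Ts'o objects to */
        /* ... modify page contents ... */
    }

    int main(void)
    {
        struct toy_backing_dev bdev = { .requires_stable_pages = true };
        struct toy_page page = { .under_writeback = true };

        toy_modify_page(&page, &bdev);
        printf("stable wait done: under_writeback=%d\n", page.under_writeback);
        return 0;
    }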

Since waiting on writeback for stable pages introduces a number of unexpected issues, there is a question of whether replacing it with something with a different set of issues is the right way to go. The COW proposal may lead to problems because it results in there being two pages for the same storage location floating around. In addition, there are concerns about what would happen for a file that gets truncated after its pages have been copied, and how to properly propagate that information.

It is unclear whether COW would always be a win over waiting, so Bottomley suggested that the first step should be to get some reporting added into the stable writeback path to gather information on which workloads are being affected and what those effects are. After that, someone could flesh out a proposal on how to implement the COW solution that describes how to work out the various problems and corner cases that were mentioned.

Memory vs. performance

While the topic name of Dan Magenheimer's slot, "Restricting Memory Usage with Equivalent Performance", was not of his choosing, that didn't deter him from presenting a problem for memory management developers to consider. He started by describing a graph of the performance of a workload as the amount of RAM available to it increases. Adding RAM reduces the amount of time the workload takes, to a certain point. After that point, adding more memory has no effect on the performance.

It is difficult or impossible to know the exact amount of RAM required to optimize the performance of a workload, he said. Two virtual machines on a single host share the available memory, but one VM may need additional memory that the other does not really need. Some kind of balance point between the workloads being handled by the two VMs needs to be found. Magenheimer has some ideas on ways to think about the problem, which he described in the session.

He started with an analogy of two countries, one of which wants resources that the other has. Sometimes that means they go to war, especially in the past, but more recently economic solutions have been used rather than violence to allocate the resource. He wonders if a similar mechanism could be used in the kernel. There are a number of sessions in the memory management track that are all related to the resource allocation problem, he said, including memory control groups soft-limits, NUMA balancing, and ballooning.

The top-level question is how to determine how much memory an application actually needs versus how much it wants. The idea is to try to find the point where giving some memory to another application has a negligible performance impact on the giver while the other application can use it to increase its performance. Beyond tracking the size of the application, Magenheimer posited that one could calculate the derivative of the size growth to gain an idea of the "velocity" of the workload. Rik van Riel noted that this information could be difficult to track when the system is thrashing, but Magenheimer thought that tracking refaults could help with that problem.
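A minimal sketch of refault detection as such a signal, as a self-contained toy (the kernel's eventual working-set tracking is considerably more involved): remember when a page was evicted and count a refault if the same offset is needed again soon, which suggests the workload's memory was shrunk too far.

    #include <stdio.h>

    #define SLOTS 16

    static long shadow_eviction_time[SLOTS];    /* 0 = no recent eviction */
    static long refaults;

    static void toy_evict(long offset, long now)
    {
        shadow_eviction_time[offset % SLOTS] = now;
    }

    static void toy_fault(long offset, long now, long window)
    {
        long evicted = shadow_eviction_time[offset % SLOTS];

        if (evicted && now - evicted <= window)
            refaults++;     /* page was wanted again soon after eviction */
        shadow_eviction_time[offset % SLOTS] = 0;
    }

    int main(void)
    {
        toy_evict(3, 100);
        toy_fault(3, 105, 50);      /* refault: counted */
        toy_fault(7, 106, 50);      /* cold fault: not counted */
        printf("refaults: %ld\n", refaults);
        return 0;
    }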

Ultimately, Magenheimer wants to apply these ideas to RAMster, which allows machines to share "unused" memory between them. RAMster would allow machines to negotiate storing pages for other machines. For example, in an eight-machine system, seven machines could treat the remaining machine as a memory server, offloading some of their pages to it.

Workload size estimation might help, but the discussion returned to the old chestnut of trying to shrink memory to find at what point the workload starts "pushing" back by either refaulting or beginning to thrash. This would allow the issue to be expressed in terms of control theory. A crucial part of using control theory is having a feedback mechanism. By and large, virtual machines have almost non-existent feedback mechanisms for establishing the priority of different requests for resources. Further, performance analysis on resource usage is limited.

Glauber Costa pointed out that potentially some of this could be investigated using memory cgroups that vary in size to act as a type of feedback mechanism even if it lacked a global view of resource usage.

In the end, this session was a problem statement: what feedback mechanisms does a VM need to assess how much memory the workload on a particular machine requires? This is related to estimating a workload's working set size, but that is sufficiently different from Magenheimer's requirement that the two may not have much in common.

Ballooning for transparent huge pages

Rik van Riel began by reminding the audience that transparent huge pages (THP) gave a large performance gain in virtual machines by virtue of the fact that VMs use nested page tables, which doubles the normal cost of translation. Huge pages, by requiring far fewer translations, can make much of the performance penalty associated with nested page tables go away.

Once ballooning enters the picture, though, it rains on the parade by fragmenting memory and reducing the number of huge pages that can be used. The obvious approach is to balloon in 2M contiguous chunks, but this has its own problems because compaction can only do so much. If a guest must give up half of its memory, ballooning may consume all of the regions that are capable of being defragmented, which would reduce or eliminate the number of 2M huge pages that could be used.

Van Riel's solution requires that balloon pages become movable within the guest, which requires changes to both the balloon driver and, potentially, the hypervisor; no one in the audience saw a problem with this as such. Balloon pages are not particularly complicated, because they have just one reference. They need a new page mapping with a migration callback to release the reference to the page; since the contents do not need to be copied, there is an optimization available there.
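The following toy model (all names are illustrative, not the balloon driver's real API) shows why the migration case is simple: the balloon just takes its single reference on the new page and drops the old one, with no data copy.

    #include <stdio.h>

    struct toy_page {
        int refcount;
    };

    struct toy_balloon {
        struct toy_page *pages[64];
        int nr;
    };

    static int toy_balloon_migrate(struct toy_balloon *b, int idx,
                                   struct toy_page *newpage)
    {
        struct toy_page *old = b->pages[idx];

        newpage->refcount = 1;      /* balloon takes its one reference */
        b->pages[idx] = newpage;    /* no data copy is needed */
        old->refcount = 0;          /* old page is now free to reuse */
        return 0;
    }

    int main(void)
    {
        struct toy_page oldp = { .refcount = 1 }, newp = { .refcount = 0 };
        struct toy_balloon b = { .pages = { &oldp }, .nr = 1 };

        toy_balloon_migrate(&b, 0, &newp);
        printf("old ref %d, new ref %d\n", oldp.refcount, newp.refcount);
        return 0;
    }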

Once that is established, it would also be nice to keep balloon pages within the same 2M regions. Dan Magenheimer mentioned a user that has a similar type of problem, one that is very closely related to what CMA does. It was suggested that Van Riel may need something very similar to MIGRATE_CMA, except that, where MIGRATE_CMA forbids unmovable pages within its pageblocks, balloon drivers would simply prefer that unmovable pages not be allocated there. This would allow further concentration of balloon pages within 2M regions without using compaction aggressively.

There was no resistance to the idea in principle so one would expect that some sort of prototype will appear on the lists during the next year.

Finding holes for mmap()

Rik van Riel started a discussion on the problem of finding free virtual areas quickly during mmap() calls. Very simplistically, an mmap() requires a linear search of the virtual address space by virtual memory area (VMA), with some minor optimizations for caching holes and scan pointers. However, there are some workloads that use thousands of VMAs, so this scan becomes expensive.

VMAs are already organized by a red-black tree (RB tree). Andrea Arcangeli had suggested that information about free areas near a VMA could be propagated up the RB tree toward the root. Essentially it would be an augmented RB tree that stores both allocated and free information. Van Riel was considering a simpler approach using a callback on a normal RB tree to store the hole size in the VMA. Using that, each RB node would know the total free space below it in an unsorted fashion.

That potentially introduces fragmentation as a problem, but Van Riel sees that as inconsequential in comparison to the case where a hole of a particular alignment is required. Peter Zijlstra maintained that augmented trees should be usable to do this, but Van Riel disputed that, saying that augmented RB tree users have significant implementation responsibilities, so this detail needs further research.
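As a rough illustration of the augmented-tree idea (using a plain binary tree rather than a real red-black tree, with invented names), each node remembers the largest free gap in its subtree so that a search for a hole can prune whole subtrees instead of walking every VMA:

    #include <stddef.h>
    #include <stdio.h>

    struct toy_vma {
        unsigned long start, end;   /* [start, end) of this mapping */
        unsigned long gap_before;   /* hole between the previous VMA and this one */
        unsigned long max_gap;      /* largest gap anywhere in this subtree */
        struct toy_vma *left, *right;
    };

    /* Recompute the augmented value after a subtree changes. */
    static void toy_update_max_gap(struct toy_vma *vma)
    {
        unsigned long m = vma->gap_before;

        if (vma->left && vma->left->max_gap > m)
            m = vma->left->max_gap;
        if (vma->right && vma->right->max_gap > m)
            m = vma->right->max_gap;
        vma->max_gap = m;
    }

    /* Find a VMA preceded by a hole of at least "size" bytes, pruning subtrees. */
    static struct toy_vma *toy_find_hole(struct toy_vma *vma, unsigned long size)
    {
        if (!vma || vma->max_gap < size)
            return NULL;            /* nothing big enough below here */
        if (vma->left && vma->left->max_gap >= size)
            return toy_find_hole(vma->left, size);
        if (vma->gap_before >= size)
            return vma;
        return toy_find_hole(vma->right, size);
    }

    int main(void)
    {
        struct toy_vma low  = { 0x1000, 0x2000, 0x1000, 0, NULL, NULL };
        struct toy_vma high = { 0x9000, 0xa000, 0x5000, 0, NULL, NULL };
        struct toy_vma root = { 0x3000, 0x4000, 0x1000, 0, &low, &high };

        toy_update_max_gap(&low);
        toy_update_max_gap(&high);
        toy_update_max_gap(&root);

        struct toy_vma *vma = toy_find_hole(&root, 0x4000);
        printf("hole of 0x4000 found before VMA at %#lx\n", vma ? vma->start : 0);
        return 0;
    }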

Again, there was little resistance to the idea in principle but there are likely to be issues during review about exactly how it gets implemented.

AIO/DIO in the kernel

Dave Kleikamp talked about asynchronous I/O (AIO) and how it is currently used for user pages. He wants to be able to initiate AIO from within the kernel, so he wants to convert struct iov_iter to contain either an iovec or bio_vec and then convert the direct I/O path to operate on iov_iter. He maintains that this should be a straightforward conversion based on the fact that it is the generic code that does all the complicated things with the various structures.
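A rough user-space sketch of the proposed structure is shown below; field names are illustrative and the real kernel structure differs in detail. The point is that generic code can iterate over either user iovecs or kernel bio_vec-style segments through the same interface.

    #include <stddef.h>
    #include <sys/uio.h>        /* struct iovec */

    struct toy_bio_vec {
        void *page;             /* stand-in for struct page * */
        unsigned int len;
        unsigned int offset;
    };

    struct toy_iov_iter {
        int is_bvec;            /* which member of the union is live */
        union {
            const struct iovec *iov;        /* user-space memory */
            const struct toy_bio_vec *bvec; /* kernel pages */
        } u;
        unsigned long nr_segs;
        size_t iov_offset;      /* progress within the current segment */
        size_t count;           /* bytes remaining in the iterator */
    };

    /* Generic code can ask how much is left without caring which kind it holds. */
    size_t toy_iov_iter_count(const struct toy_iov_iter *i)
    {
        return i->count;
    }

    int main(void)
    {
        char buf[64];
        struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
        struct toy_iov_iter it = {
            .is_bvec = 0, .u.iov = &iov, .nr_segs = 1,
            .iov_offset = 0, .count = sizeof(buf),
        };

        return toy_iov_iter_count(&it) == sizeof(buf) ? 0 : 1;
    }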

He tested the API change by converting the loop device to use O_DIRECT and submit I/O via AIO. This eliminated caching in the underlying filesystem and assured consistency of the mounted file.

He sent out patches a month ago but did not get much feedback and was looking to figure out why that was. He was soliciting input on the approach and how it might be improved but it seemed like many had either missed the patches or otherwise not read them. There might be greater attention in the future.

The question was asked whether it would be a compatible interface for swap-over-arbitrary-filesystem. The latest swap-over-NFS patches introduced an interface for pinning pages for kernel I/O but Dave's patches appear to go further. It would appear that swap-over-NFS could be adapted to use Dave's work.

Dueling NUMA migration schemes

Peter Zijlstra started the session by talking about his approach to improving performance on NUMA machines. Simplistically, it assigns processes to a home node; allocation policies prefer to allocate from that node and load-balancer policies try to keep threads near the memory they are using. System calls are introduced to allow assignment of thread groups and VMAs to nodes; applications must be aware of the API to take advantage of it.

Once the decision has been made to migrate threads to a new node, their pages are unmapped and migrated as they are faulted, minimizing the number of pages to be migrated and correctly accounting for the cost of the migration to the process moving between nodes. As file pages may potentially be shared, the scheme focuses on anonymous pages. In general, the scheme is expected to work well for the case where the working set fits within a given NUMA node but be easier to implement than the hard binding support currently offered by the kernel. Preliminary tests indicate that it does what it is supposed to do for the cases it handles.
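The sketch below is a toy model of that lazy migration (not Zijlstra's actual patches): changing the home node only unmaps the pages, and each page is migrated when, and only if, it is faulted on again, so the cost lands on the task that moved.

    #include <stdbool.h>
    #include <stdio.h>

    struct toy_task {
        int home_node;
    };

    struct toy_page {
        int node;
        bool mapped;
    };

    static void toy_change_home_node(struct toy_task *t, struct toy_page *pages,
                                     int n, int new_node)
    {
        t->home_node = new_node;
        for (int i = 0; i < n; i++)
            pages[i].mapped = false;    /* force a fault on the next access */
    }

    static void toy_handle_fault(struct toy_task *t, struct toy_page *page)
    {
        if (page->node != t->home_node)
            page->node = t->home_node;  /* migrate on demand */
        page->mapped = true;
    }

    int main(void)
    {
        struct toy_task t = { .home_node = 0 };
        struct toy_page pages[2] = { { .node = 0, .mapped = true },
                                     { .node = 0, .mapped = true } };

        toy_change_home_node(&t, pages, 2, 1);
        toy_handle_fault(&t, &pages[0]);    /* only this page gets migrated */
        printf("page0 node %d, page1 node %d\n", pages[0].node, pages[1].node);
        return 0;
    }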

One key advantage Zijlstra cited for his approach was that he maintains information based on thread and VMA, which is predictable. In contrast, Andrea Arcangeli's approach requires storing information on a per-page basis and is much heavier in terms of memory consumption. There were few questions on the specifics of how it was implemented with comments from the room focusing instead on comparing Zijlstra and Arcangeli's approaches.

Hence, Arcangeli presented AutoNUMA, which consists of a number of components. The first is knuma_scand, a page table walker that tracks the RSS usage of processes and the location of their pages. To track reference behavior, a NUMA page-fault-hinting component temporarily changes page table entries (PTEs) in an arrangement that is similar, but not identical, to PROT_NONE. The resulting faults are then used to record which process is using a given page in memory. knuma_migrateN is a per-node thread that is responsible for migrating pages if a process should move to a new node. Two further components move threads near the memory they are using or, alternatively, move memory to the CPU that is using it. Which option is taken depends on how memory is currently being used by the processes.
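As a toy model of the hinting-fault mechanism (structure and names are illustrative only, not AutoNUMA's code): a periodic scan disables access to pages, and the resulting faults record on which node each task's pages currently live, feeding the placement decision.

    #include <stdio.h>

    #define NODES 2

    struct toy_page {
        int node;
        int access_disabled;    /* stand-in for the PROT_NONE-like PTE state */
    };

    struct toy_task_stats {
        long faults_on_node[NODES]; /* where this task's pages live */
    };

    static void toy_scan(struct toy_page *pages, int n)
    {
        for (int i = 0; i < n; i++)
            pages[i].access_disabled = 1;   /* knuma_scand-style pass */
    }

    static void toy_hinting_fault(struct toy_task_stats *stats, struct toy_page *page)
    {
        stats->faults_on_node[page->node]++;    /* record where the data is */
        page->access_disabled = 0;              /* restore normal access */
    }

    int main(void)
    {
        struct toy_page pages[3] = { { .node = 0 }, { .node = 1 }, { .node = 1 } };
        struct toy_task_stats stats = { { 0, 0 } };

        toy_scan(pages, 3);
        for (int i = 0; i < 3; i++)
            toy_hinting_fault(&stats, &pages[i]);
        printf("faults: node0=%ld node1=%ld\n",
               stats.faults_on_node[0], stats.faults_on_node[1]);
        return 0;
    }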

There are two types of data being maintained for these decisions. The first, sched_autonuma, works on a task_struct basis; its data is collected by the NUMA hinting page faults. The second, mm_autonuma, works on an mm_struct basis and gathers information on the working set size and the location of the pages it has mapped; that information is generated by knuma_scand.

[Slide: AutoNUMA workflow]

The details of how it decides whether to move threads or memory to different NUMA nodes are involved, but Arcangeli expressed a high degree of confidence that it could make close-to-optimal decisions on where threads and memory should be located; his slide describing the AutoNUMA workflow is shown above.

When it happens, migration is based on per-node queues and care is taken to migrate pages at a steady rate to avoid bogging the machine down copying data. While Arcangeli acknowledged the overall concept was complicated, he asserted that it was relatively well-contained without spreading logic throughout the whole of MM.

As with Zijlstra's talk, there were few questions on the specifics of how it was implemented, implying that not many people in the room have reviewed the patches, so Arcangeli moved on to explaining the benchmarks he ran. The results looked as if performance was within a few percent of manually binding memory and threads to local nodes. It was interesting to note that, for one benchmark (specjbb), how well AutoNUMA does clearly varies, which shows its non-deterministic behavior; its performance never dropped below the base performance, though. He explained that the variation could be partially explained by the fact that AutoNUMA currently does not migrate THP pages; instead, it splits them and migrates the individual pages, depending on khugepaged to collapse the huge pages again.

Zijlstra pointed out that, for some of the benchmarks that were presented, his approach potentially performed just as well without the algorithm complexity or memory overhead. He asserted this was particularly true for KVM-based workloads as long as the workload fits within a NUMA node. He pointed out that the history of memcg led to a situation where it had to be disabled by default in many situations because of the overhead and that AutoNUMA was vulnerable to the same problem.

When it got down to it, the points discussed were not massively different from the discussions on the mailing list, except perhaps in tone. Unfortunately, there was little discussion of whether there was any compatibility between the two approaches and what logic could be shared. This was due to time limitations, but future reviewers may have a clearer view of the high-level concepts.

Soft limits in memcg

Ying Han began by introducing soft reclaim and stated she wanted to find what blockers existed for merging parts of it. It has reached the point where it is getting sufficiently complicated that it is colliding with other aspects of the memory cgroup (memcg) work.

Right now, the implementation of soft limits allows memcgs to grow above their soft limit in the absence of global memory pressure. In the event of global memory pressure, memcgs get shrunk if they are above their soft limit. The results for shrinking are similar to hierarchical reclaim for hard limits. In a superficial way, this concept is similar to what Dan Magenheimer wanted for RAMster, except that it applies to cgroups instead of machines.
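For reference, the soft limit itself is set through the memcg control file memory.soft_limit_in_bytes; the mount point and group name in this small example are assumptions about the local setup, so adjust them for how cgroups are mounted on the system.

    #include <stdio.h>

    int main(void)
    {
        const char *path =
            "/sys/fs/cgroup/memory/mygroup/memory.soft_limit_in_bytes";
        FILE *f = fopen(path, "w");

        if (!f) {
            perror("fopen");
            return 1;
        }
        /* Allow the group to grow past 512MB only while memory is plentiful. */
        fprintf(f, "%llu\n", 512ULL << 20);
        fclose(f);
        return 0;
    }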

Rik van Riel pointed out that it is possible for a task to fit within a node and be within its soft limit. If there are other cgroups on the same node, the aggregate soft limit can be above the node size and, in some cases, that cgroup should be shrunk even if it is below its soft limit. This has a quality-of-service impact; Han recognizes that this needs to be addressed. It is somewhat of an administrative issue: the total of all hard limits can exceed physical memory, with the impact being that global reclaim shrinks cgroups before they hit their hard limits, which may be undesirable from an administrative point of view. For soft limits, it makes even less sense if the total of the soft limits exceeds physical memory, as that would be functionally similar to not setting soft limits at all.

The primary issue was deciding what ratio of pages to reclaim from cgroups. If there is global memory pressure and all cgroups are under their soft limits, a situation potentially arises whereby reclaim is retried indefinitely without forward progress. Hugh Dickins pointed out that soft reclaim has no requirement that cgroups under their soft limit never be reclaimed; instead, reclaim from such cgroups should simply be resisted, and the question is how it should be resisted. That may require that all cgroups be scanned just to discover that they are all under their soft limits, and then require burning more CPU rescanning them. Throttling logic is required, but ultimately this is not dissimilar to how kswapd or direct reclaimers get throttled when scanning too aggressively. As with many things, memcg is similar to the global case but the details are subtly different.

Even then, there was no real consensus on how much memory should be reclaimed from cgroups below their soft limit. There is an inherent fairness issue here that does not appear to have changed much between different discussions. Unfortunately, discussions related to soft reclaim are separated by a lot of time and people need to be reminded of the details. This meant that little forward progress was made on whether to merge soft reclaim or not, but there were no specific objections during the session. Ultimately, this is still seen as being a little Google-specific, particularly as some of the shrinking decisions were tuned based on Google workloads. New use cases are needed to tune the shrinking decisions and to support the patches being merged.

Kernel interference

Christoph Lameter started by stating that each kernel upgrade resulted in slowdowns for his target applications (which are for high-speed trading). This generates a lot of resistance to kernels being upgraded on their platform. The primary sources of interference were from faults, reclaim, inter-processor interrupts, kernel threads, and user-space daemons. Any one of these can create latency, sometimes to a degree that is catastrophic to their application. For example, if reclaim causes an additional minor fault to be incurred, it is in fact a major problem for their application.

This happens because of a few trends. Kernels are simply more complex, with more causes of interference, leaving less processor time available to the user. Other trends that affect these applications are larger memory sizes, leading to longer reclaim, and more processors, meaning that for-all-CPUs loops take longer.

One possible measure would be to isolate OS activities to a subset of CPUs, possibly including interrupt handling. Andi Kleen pointed out that, even with CPU isolation, unrelated processes sharing the same socket can interfere with each other. Lameter maintained that, while this was true, such isolation was still of benefit to them.
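A sketch of the user-space side of such isolation: assuming the system was booted with, say, isolcpus=2,3 to keep the scheduler from placing other tasks there, a latency-sensitive process can pin itself onto those CPUs with sched_setaffinity(); interrupt affinity would still have to be steered away separately (for example via /proc/irq/<irq>/smp_affinity).

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t mask;

        CPU_ZERO(&mask);
        CPU_SET(2, &mask);
        CPU_SET(3, &mask);

        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        printf("pinned to isolated CPUs 2-3\n");
        /* ... run the latency-sensitive work here ... */
        return 0;
    }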

For some of the examples brought up, there are people working on the issues, but that work is still in progress and has not been merged. The fact of the matter is that the situation is less than ideal with today's kernels. This is forcing his users to fully isolate some CPUs and bypass the OS as much as possible, which turns Linux into a glorified boot loader. It would be in the interest of the community to reduce such motivations by watching the kernel's overhead, he said.

Filesystem and storage sessions

Copy offload

Frederick Knight, who is the NetApp T10 (SCSI) standards guy, began by describing copy offload, which is a method for allowing SCSI devices to copy ranges of blocks without involving the host operating system. Copy offload is designed to be a lot faster for large files because wire speed is no longer the limiting factor. In fact, in spite of the attention now, offloaded copy has been in SCSI standards in some form or other since the SCSI-1 days. EXTENDED COPY (abbreviated as XCOPY) takes two descriptors for the source and destination and a range of blocks. It is then implemented in a push model (source sends the blocks to the target) or a pull model (target pulls from source) depending on which device receives the XCOPY command. There's no requirement that the source and target use SCSI protocols to effect the copy (they may use an internal bus if they're in the same housing) but should there be a failure, they're required to report errors as if they had used SCSI commands.

A far more complex command set is token-based copy. The idea here is that the token contains a ROD (Representation of Data), which allows arrays to give you an identifier for what may be a snapshot. A token represents a device and a range of sectors that the device guarantees to be stable. However, if the device does not support snapshotting and the region gets overwritten (or, in fact, for any other reason), it may decline to accept the token and mark it invalid. This, unfortunately, means you have no real idea of the token lifetime, and every time the token goes invalid, you have to do the data transfer by other means (or renew the token and try again).
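The lifetime problem shapes how any consumer of token-based copy has to be written; the sketch below is a toy model with hypothetical function names, not a real SCSI or kernel API, showing the renew-or-fall-back logic.

    #include <stdio.h>

    enum copy_status { COPY_OK, COPY_TOKEN_INVALID };

    /* Pretend the array invalidates the token on the first attempt. */
    static enum copy_status toy_offloaded_copy(int token, int attempt)
    {
        (void)token;
        return (attempt == 0) ? COPY_TOKEN_INVALID : COPY_OK;
    }

    static int toy_renew_token(void)
    {
        return 42;  /* a fresh token from the source device */
    }

    static void toy_fallback_copy(void)
    {
        /* ordinary read()/write() loop between source and destination */
    }

    int main(void)
    {
        int token = toy_renew_token();

        for (int attempt = 0; attempt < 2; attempt++) {
            if (toy_offloaded_copy(token, attempt) == COPY_OK) {
                printf("offloaded copy succeeded\n");
                return 0;
            }
            token = toy_renew_token();  /* token went invalid; retry once */
        }
        toy_fallback_copy();            /* give up on offload */
        printf("fell back to host-driven copy\n");
        return 0;
    }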

There was a lot of debate on how exactly we'd make use of this feature and whether tokens would be exposed to user space. They're supposed to be cryptographically secure, but a lot of participants expressed doubt on this and certainly anyone breaking a token effectively has access to all of your data.

NFS and CIFS are starting to consider token-based copy commands, and the token format would be standardized, which would allow copies from a SCSI disk token into an NFS/CIFS volume.

Copy offload implementation

The first point made by Hannes Reinecke is that identification of the source and target for tokens is a nightmare if everything is done in user space. Obviously, there is a need to flush the source range before constructing the token; then we could possibly use FIEMAP to get the sectors. Chris Mason pointed out this wouldn't work for Btrfs, and after further discussion the concept of a ref-counted FIETOKEN operation emerged instead.

Consideration then moved to hiding the token in some type of reflink() and splice()-like system calls. There was a lot more debate on the mechanics of this, including whether the token should be exposed to user space (unfortunately, yes, since NFS and CIFS would need it). Discussion wrapped up with the thought that we really needed to understand the user-space use cases of this technology.

RAID unification

pNFS is beginning to require complex RAID-ed objects, which call for advanced RAID topologies. This means that pNFS implementations need an advanced, generic, composable RAID engine that can implement any topology in a single compute operation. MD was rejected because composition requires layering within the MD system, which means that advanced topologies cannot be done in a single operation.

This proposal was essentially for a new interface that would unify all the existing RAID systems by throwing them away and writing a new one. Ted Ts'o pointed out that filesystems making use of this engine don't want to understand how to reconstruct the data, so the implementation should "just work" for the degraded case. If we go this route, we definitely need to ensure that all existing RAID implementations work as well as they currently do.

The action summary was to start with MD and then look at Btrfs. Since we don't really want new administrative interfaces exposed to users, any new implementation should be usable by the existing LVM RAID interfaces.

Testing

Dave Chinner reminded everyone that the methodology behind xfstest is "golden output matching". That means that all XFS tests produce output which is then filtered (to remove extraneous differences like timestamps or, rather, to fill them in with X's) and the success or failure indicated by seeing if the results differ from the expected golden result file. This means that the test itself shouldn't process output.

Almost every current filesystem is covered by xfstest in some form, and the XFS code is tested at 75-80% coverage. (Dave said the code-coverage tools need to be run to determine what the coverage of the tests on other filesystems actually is.) Ext4, XFS, and Btrfs regularly have the xfstest suite run as part of their development cycle.

Xfstest consists of ~280 tests, which run in 45-60 minutes depending on disk speed and processing power. Of these tests, about 100 are filesystem-independent. One of the problems is that the tests are highly dependent on the output format of the tools, so, if that changes, the tests report false failures. On the other hand, that is easily fixed by constructing a new golden output file for the tests.

One of the maintenance nightmares is that the tests are numbered rather than named (which means everyone who writes a new test adds it as number 281 and Dave has to renumber). This should be fixed by naming tests instead. The test space should also become hierarchical (grouping by function) rather than the current flat scheme. Keeping a matrix of test results over time allows far better data mining and makes it easier to dig down and correlate reasons for intermittent failures, Chinner said.

Flushing and I/O back pressure

This was a breakout session to discuss some thoughts that arose during the general writeback session (reported above).

The main concept is that writeback limits are trying to limit the amount of time (or IOPS, etc.) spent in writeback. However, the flusher threads are currently unlimited because we have no way to charge the I/O they do to the actual tasks. Also, we have problems accounting for metadata (filesystems with journal threads) and there are I/O priority-inversion problems (a high-priority task can't be left blocked because writeout has been halted on a low-priority task that is being charged for it).

There are three problems:

  • Problems between CFQ and block flusher. This should now be solved by tagging I/O with the originating cgroup.

  • CFQ throws all I/O into a single queue (Jens Axboe thinks this isn't a problem).

  • Metadata ordering causes priority inversion.

On the last, the thought was that we could use transaction reservations as an indicator of whether we had to complete the entire transaction (or just throttle it entirely) regardless of the writeback limits, which would avoid the priority inversions caused by incomplete writeout of transactions. For dirty data pages, we should hook writeback throttling into balance_dirty_pages(). For the administrator, the system needs to be simple, so there needs to be a single writeback "knob" to adjust.

Another problem is that we can throttle a process which uses buffered I/O but not if it uses AIO or direct I/O (DIO), so we need to come up with a throttle that works for all I/O.




memory vs performance

Posted Apr 5, 2012 8:34 UTC (Thu) by dlang (guest, #313) [Link]

This is a really hard problem to identify; on many systems there is no thrashing to indicate that more memory will help.

I recently upgraded the memory on a machine based on my judgement that it would help (and it did, a nightly processing run went from 10+ hours to <2 hours), but this system had no paging taking place; the only performance stats that changed were disk utilization (and I/O wait) and the fact that the CPU utilization shot up.

I've run into this same sort of situation multiple times where the real underlying cause is that the disk cache no longer holds the working set, so disk I/O skyrockets and performance crashes. This is sometimes complicated by the fact that the 'working set' may require looking at a very long timeframe. In my most recent situation it was analysing log files for the day, so data untouched for the last 24 hours was still part of the overall 'working set'.

This is a very hard problem

memory vs performance

Posted Apr 5, 2012 12:45 UTC (Thu) by alankila (guest, #47141) [Link]

Perhaps the problem could be diagnosed by observing the read/write ratio. A fully cached scheme results in no read I/O whatsoever, while the uncached case could issue far more reads than writes. In any case, one has to form some kind of estimate to determine what the acceptable read/write ratio should be for that particular job.

memory vs performance

Posted Apr 5, 2012 20:21 UTC (Thu) by dlang (guest, #313) [Link]

the problem is how the system can tell whether the reads are needed because the data hasn't been available, or whether they are wasteful because the data was available in the past (and then the question is how far in the past; in my case it was 24 hours + processing time, which is an eternity to the system)

memory vs performance

Posted Apr 6, 2012 18:26 UTC (Fri) by alankila (guest, #47141) [Link]

I'm not suggesting this is an automatic scheme. I doubt it is even possible to design. I'm just suggesting that if you have an understanding of the behavior you can use this sort of metric to determine if the system is behaving poorly or well.

memory vs performance

Posted Apr 6, 2012 19:39 UTC (Fri) by dlang (guest, #313) [Link]

ahh, I misunderstood you. Yes, watching for this sort of thing is part of tuning the system.

2012 Linux Storage, Filesystem, and Memory Management Summit - Day 1

Posted Apr 5, 2012 14:14 UTC (Thu) by etienne (guest, #25256) [Link]

> Stable pages

I am not a specialist, but doesn't any journaled filesystem need stable pages to store the journal itself?

2012 Linux Storage, Filesystem, and Memory Management Summit - Day 1

Posted Apr 5, 2012 14:55 UTC (Thu) by cladisch (✭ supporter ✭, #50193) [Link]

> doesn't any journaled filesystem need stable pages to store the journal itself?

Yes; but unstable pages are modified by applications after calling write() without notifying the kernel.
I'd hope that pages containing journal data are not exported to userspace, and that filesystem code knows what it's doing.

2012 Linux Storage, Filesystem, and Memory Management Summit - Day 1

Posted Apr 17, 2012 13:22 UTC (Tue) by intgr (subscriber, #39733) [Link]

Posting to an old comment, but: journalling filesystems generally only journal *metadata* operations, not file data; that is the default for ext3+ too. But ext3+ is an exception here: you can configure it to journal data as well with the data=journal flag. Not sure how it behaves wrt the application modifying a pending page underneath it, though.

2012 Linux Storage, Filesystem, and Memory Management Summit - Day 1

Posted Apr 5, 2012 22:14 UTC (Thu) by krakensden (guest, #72039) [Link]

> Christoph Lameter started by stating that each kernel upgrade resulted in slowdowns for his target applications

You know, Phoronix has access to a wide variety of hardware, does all sorts of tests regularly, and has occasionally done really detailed bisecting work. They also have this "openbenchmarking.org" thing that could almost be a public graph of performance over time for open source projects. It's a real shame the opportunity for something beautiful with regards to detecting performance regressions has been squandered by incoherent graphs, bad statistics, and cheeseball attempts at yellow journalism.

Copy Offload

Posted Apr 17, 2012 21:50 UTC (Tue) by feknight8 (guest, #84191) [Link]

> certainly anyone breaking a token effectively has access to all of your data

Tokens are a minimum of 512 bytes in length (4,096 bits). Tokens are typically short lived (typically seconds to a small number of minutes). To guess all possible combinations of a 4096 bit value (2^4096 possible values) within even 300 seconds (which would typically be a long lived token) isn't computationally possible.

Next, even if you did happen to randomly hit one during its valid life time, you do not gain access to "all of your data", you only get access to the subset of data that is represented by the token during that valid window.

Web browsers today typically use 128 or 256 bit public key exchange, and they live for much longer time periods. For tokens, there is no plain text associated with the secure portion, and only private keys are used to create the secure portion of the token (only the creator of the token ever has to perform a decode operation; therefore public keys are not needed).

Copy Offload

Posted Apr 18, 2012 7:01 UTC (Wed) by eternaleye (guest, #67051) [Link]

Brute-forcing 4096 bits is infeasible, true.

But you only need to use brute force if the algorithm is strong. If the algorithm is weak, or uses a predictable source where there should be a random one, breaking it can become orders of magnitude easier. Case in point: brute-forcing 128 bits is a hassle, even if it is feasible. This doesn't prevent MD5, a 128-bit hash function, from being so broken that it is feasible to create collisions in ~10 seconds on a 2.6GHz Pentium 4 ( http://www.win.tue.nl/hashclash/On%20Collisions%20for%20M... [see conclusion])

Copy Offload

Posted Apr 18, 2012 9:02 UTC (Wed) by ekj (guest, #1524) [Link]

That has to be the understatement of the year. There are on the order of 2^265 atoms in the universe. Even if every single one of them was a CPU capable of testing tokens at a rate of 1THz, thus giving you an aggregate rate of 2^300/s, you'd still need the age of the universe times 2^3738 to check them all.

Hitting one by accident won't happen, for the same reason. Now, weaknesses in the algorithm are an entirely different kettle of fish.

Copy Offload - Security

Posted Apr 19, 2012 14:57 UTC (Thu) by feknight8 (guest, #84191) [Link]

The Token == the data. Applications must treat the token in the same way they treat data (if you wouldn't give someone the data, then don't give them the token).

As for devices that build these tokens, yes, dumb designs are possible, as are dumb implementations or buggy implementations. All the standard can do is describe how it is supposed to work, and the standard makes the following statement about the contents of the token:

"The EXTENDED ROD TOKEN DATA field shall contain at least 256 bits of secure random number material (see 4.5) generated when the ROD token was created..."

Those "at least 256 bits" are contained within the 4096 bit structure - which also contains other information.

Sub-clause 4.5 states: "Secure Random numbers should be generated as specified by RFC 4086 (e.g., see FIPS 140-2 Annex C: Approved Random Number Generators)."

Therefore, the token contents are intended to be as secure as FIPS 140 can make it.

