Closed Bug 509052 Opened 15 years ago Closed 12 years ago

-moz-box-shadow applied to <html> tag in CSS stylesheet causes significant slowdown

Categories

(Core :: Graphics, defect)

Hardware: x86
OS: macOS
Priority: Not set
Severity: normal

Tracking

RESOLVED FIXED
mozilla19

People

(Reporter: smartell, Assigned: bas.schouten)

References

(Depends on 3 open bugs, Blocks 1 open bug)

Details

Attachments

(4 files, 3 obsolete files)

On my blog, http://blog.seanmartell.com/, I've created a vignette effect around the edge of the site using -moz-box-shadow with a 200px blur.

The resulting effect causes a major slowdown when scrolling and when resizing the window.

It may be an extreme use case of -moz-box-shadow, but I've filed this as a heads up.
Sure is slow to scroll!

Shark says 71.5% of the fun is in:

libthebes.dylib	gfxAlphaBoxBlur::Paint(gfxContext*, gfxPoint const&)	

Moving to GFX:Thebes
Component: General → GFX: Thebes
Product: Firefox → Core
QA Contact: general → thebes
Version: unspecified → Trunk
Blocks: 478721
Fixing this might not be easy. Basically, we currently redraw the shadow every time we scroll, and the shadow drawing process is very expensive. One way to fix this might be to somehow cache the shadow on a layer, though I'm not exactly sure how that would work out.

It would be useful if you could attach a demo and also compare the performance with Safari.
Blocks: 527728
Depends on: 469589
I'd like to see a simple testcase. The first thing is to check to make sure we're not drawing any more of the shadow than we really need to. Then there are a couple of obvious things we can do to speed up gfxAlphaBoxBlur:
1) SSE2 love
2) computing the blur of a large box is completely naive right now. We draw the box to a temporary surface and explicitly compute the blur of every pixel of the result. But when one or both of the box side lengths are significantly larger than the blur kernel diameter, you can optimize a lot. The blur result can be divided into three parts:
-- the interior: all points which are further than the blur radius away from any point on the outside of the box. These points are all solid.
-- edges: all points which are within the blur radius of a straight horizontal or vertical edge, but no other points outside the box. These consist of rows and columns of identical pixels.
-- corners: all other points. These need to be calculated explicitly.
3) symmetry: we can probably speed things up by nearly a factor of 4 in almost all cases just by observing that boxes almost always have 4-way symmetry, therefore we can just compute one quarter of the blur and mirror that.

It'd be slightly tricky to extend the gfxAlphaBoxBlur interface to express the information you need for optimizations 2 and 3. Perhaps for 3 we would pass in optional x and y values indicating axes of symmetry, and for 2 we would pass in optional (x1,x2) and (y1,y2) ranges indicating ranges of identical columns and rows of pixels in the source image.

This is a nice little problem actually. I think we could make common shadows super-fast this way.
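As a hedged illustration of optimization 3 (the name below is made up; nothing here is existing tree code): once one quadrant of a symmetric blurred box has been computed, the other three quadrants can be filled in by mirroring it.

#include <cstdint>

// Hypothetical sketch: aData already holds the computed top-left quadrant of a
// blurred box that is symmetric about its horizontal and vertical centre
// lines; fill the other three quadrants by mirroring it.
void MirrorQuadrant(uint8_t* aData, int32_t aStride, int32_t aWidth, int32_t aHeight)
{
  for (int32_t y = 0; y < (aHeight + 1) / 2; ++y) {
    uint8_t* row = aData + y * aStride;
    uint8_t* mirroredRow = aData + (aHeight - 1 - y) * aStride;
    for (int32_t x = 0; x < (aWidth + 1) / 2; ++x) {
      uint8_t value = row[x];               // computed quadrant
      row[aWidth - 1 - x] = value;          // mirror horizontally
      mirroredRow[x] = value;               // mirror vertically
      mirroredRow[aWidth - 1 - x] = value;  // mirror both ways
    }
  }
}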
Attached patch sse2 blurSplinter Review
Here's an sse2 version of the blur code I did a while ago. I haven't looked at it for probably a year so I'm sure it's rotted some and needs some cleanup.
Hi Jeff. Regarding Safari's performance when using box-shadow, it seems to perform fine. Apple uses that CSS declaration on a lot of the pages of apple.com. See http://www.apple.com/imac/the-new-imac/ as an example.
Another optimization:
4) for small radii (which are common), apply an explicit Gaussian kernel in one pass instead of the 6-pass box-blur approach. I suspect the explicit kernel will be faster at least for radii <= 2.
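For reference, a minimal sketch of what optimization 4 could look like on an alpha-only surface; the function name, parameters and sigma choice are illustrative assumptions, not code from any patch here.

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical sketch: blur an 8-bit alpha surface with an explicit, normalized
// Gaussian kernel in a single pass. Only plausible for very small radii, where
// the (2r+1)^2 kernel taps are cheaper than six separable box-blur passes.
void GaussianBlurSmallRadius(const uint8_t* aSrc, uint8_t* aDst,
                             int aWidth, int aHeight, int aStride, int aRadius)
{
  const int diameter = 2 * aRadius + 1;
  const float sigma = aRadius / 2.0f + 0.5f;  // illustrative sigma choice
  std::vector<float> kernel(diameter * diameter);
  float sum = 0.0f;
  for (int ky = -aRadius; ky <= aRadius; ++ky) {
    for (int kx = -aRadius; kx <= aRadius; ++kx) {
      float w = std::exp(-(kx * kx + ky * ky) / (2.0f * sigma * sigma));
      kernel[(ky + aRadius) * diameter + (kx + aRadius)] = w;
      sum += w;
    }
  }
  for (float& w : kernel) {
    w /= sum;  // normalize so a fully opaque input stays fully opaque
  }

  for (int y = 0; y < aHeight; ++y) {
    for (int x = 0; x < aWidth; ++x) {
      float acc = 0.0f;
      for (int ky = -aRadius; ky <= aRadius; ++ky) {
        for (int kx = -aRadius; kx <= aRadius; ++kx) {
          int sx = std::min(std::max(x + kx, 0), aWidth - 1);   // clamp at the edges
          int sy = std::min(std::max(y + ky, 0), aHeight - 1);
          acc += aSrc[sy * aStride + sx] *
                 kernel[(ky + aRadius) * diameter + (kx + aRadius)];
        }
      }
      aDst[y * aStride + x] = static_cast<uint8_t>(acc + 0.5f);
    }
  }
}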
Depends on: 544099
A friend of mine, Mike Herf, has a very fast box shadow algorithm: 

http://www.stereopsis.com/shadowrect/
That's nice, but it's a method for fast shadows *on rectangles*. We often deal with nonrectangular shapes due to border-radius, and when we draw shadows on form controls styled with native themes (where we have no idea what the shape is, we just get a pixmap with alpha values). (The latter capability is new in Gecko 2.0 FWIW.)

Since this bug was filed, we implemented another optimization in bug 544099 that stopped us from computing shadow values where we know they will be covered by the box itself. That solved most of our performance problems.

If we need to optimize more, I would focus on optimization #2 above. In particular I would let callers pass in a range of rows over which each row of the shape mask is known to be identical, and a similar range of columns. We can use that information to compute a range of shadow rows and columns that are known to be identical. Then after we've computed the first row or column of the shadow in that range, we can memcpy in the rest.
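As a rough illustration of that idea (a hypothetical helper, not from any attached patch): once the first shadow row in the known-identical range has been computed, the remaining rows are byte-for-byte copies of it.

#include <cstdint>
#include <cstring>

// Hypothetical helper: rows [aFirstRow, aLastRow] of the blurred alpha surface
// are known to be identical, so blur only aFirstRow and copy it into the rest
// instead of recomputing the blur for each row.
void FillIdenticalShadowRows(uint8_t* aData, int32_t aStride,
                             int32_t aFirstRow, int32_t aLastRow)
{
  const uint8_t* referenceRow = aData + aFirstRow * aStride;
  for (int32_t y = aFirstRow + 1; y <= aLastRow; ++y) {
    memcpy(aData + y * aStride, referenceRow, aStride);
  }
}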

But we need testcases to justify further optimizations. Sean's blog is fast to scroll for me, but maybe it's changed since this bug was filed.
Depends on: 633627
Depends on: 648449
Blocks: 511739
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #8)
> But we need testcases to justify further optimizations. 

That's easy enough: think about CSS animations or transitions involving shadows - I am using them, e.g., in the welcome screen of my extension (https://github.com/CAFxX/ImageTweak/blob/master/content/welcome.htm)
(In reply to Jeff Muizelaar [:jrmuizel] from comment #4)
> Created attachment 413324 [details] [diff] [review]
> sse2 blur

This looks like something we'd still want to have. Did this patch work? 

A version for ARM NEON intrinsics would be good to have as well :)
(In reply to Jet Villegas (:jet) from comment #10)
> (In reply to Jeff Muizelaar [:jrmuizel] from comment #4)
> > Created attachment 413324 [details] [diff] [review]
> > sse2 blur
> 
> This looks like something we'd still want to have. Did this patch work? 
> 
> A version for ARM NEON intrinsics would be good to have as well :)

I believe it did, but the patch has bitrotted substantially since I wrote it. Currently, though, our mobile shadow performance is being killed by bug 754985.
Depends on: 758825
I've rebased the attached patch, and it works. It's for SVG, though. I'll benchmark it to check that it's faster than the normal version. If it performs better, I'll try to port it over to gfx/2d/. However, since it tries to operate on columns, it needs to shuffle a lot and fetches far more cache lines than necessary, so I'm not sure how fast it is.

I think we can avoid all that complicated code and suboptimal memory access by transposing the image between the horizontal and vertical passes (meaning rewriting the whole thing):

blur -> blur -> blur -> transpose -> blur -> blur -> blur -> transpose

That way, we only do sequential data accesses, so:
- we are more cache-efficient
- we can use SIMD instruction without too much shuffling
- we avoid code quasi-duplication like in gfx/2d/Blur.cpp
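
A minimal sketch of that transpose-based pipeline, assuming a tightly packed alpha-only surface; all names here are made up for illustration and the row blur is deliberately naive, so this is not the code from the eventual patch.

#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical transpose: columns of the source become rows of the destination,
// so the "vertical" blur passes can run as cache-friendly row passes.
static void Transpose(const uint8_t* aSrc, int aSrcStride,
                      uint8_t* aDst, int aDstStride, int aWidth, int aHeight)
{
  for (int y = 0; y < aHeight; ++y) {
    for (int x = 0; x < aWidth; ++x) {
      aDst[x * aDstStride + y] = aSrc[y * aSrcStride + x];
    }
  }
}

// Deliberately naive single box-blur pass over one row; a real implementation
// would use a sliding sum and handle the edges more carefully.
static void BoxBlurRow(const uint8_t* aSrc, uint8_t* aDst, int aWidth, int aLobe)
{
  for (int x = 0; x < aWidth; ++x) {
    int lo = std::max(x - aLobe, 0);
    int hi = std::min(x + aLobe, aWidth - 1);
    int sum = 0;
    for (int i = lo; i <= hi; ++i) {
      sum += aSrc[i];
    }
    aDst[x] = static_cast<uint8_t>(sum / (hi - lo + 1));
  }
}

// Three horizontal passes, transpose, three more "horizontal" passes on the
// transposed image, transpose back: every pass walks memory sequentially.
void BlurWithTranspose(uint8_t* aData, int aWidth, int aHeight, int aLobe)
{
  std::vector<uint8_t> tmp(aWidth * aHeight);
  std::vector<uint8_t> transposed(aWidth * aHeight);
  for (int y = 0; y < aHeight; ++y) {
    BoxBlurRow(aData + y * aWidth, tmp.data() + y * aWidth, aWidth, aLobe);
    BoxBlurRow(tmp.data() + y * aWidth, aData + y * aWidth, aWidth, aLobe);
    BoxBlurRow(aData + y * aWidth, tmp.data() + y * aWidth, aWidth, aLobe);
  }
  Transpose(tmp.data(), aWidth, transposed.data(), aHeight, aWidth, aHeight);
  for (int x = 0; x < aWidth; ++x) {
    BoxBlurRow(transposed.data() + x * aHeight, tmp.data() + x * aHeight, aHeight, aLobe);
    BoxBlurRow(tmp.data() + x * aHeight, transposed.data() + x * aHeight, aHeight, aLobe);
    BoxBlurRow(transposed.data() + x * aHeight, tmp.data() + x * aHeight, aHeight, aLobe);
  }
  // tmp now holds the fully blurred image in transposed layout; put it back.
  Transpose(tmp.data(), aHeight, aData, aWidth, aHeight, aWidth);
}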

I'll also have a look at the points roc mentions in comments 3 and 6.
Quick update on this work.

Analyzed in cachegrind, the current algorithm shows a 95% cache miss rate on the vertical pass. That is alright for tiny blurs (for example, in the asteroids benchmark the surface to blur is smaller than 40x40px, and all the temporary surfaces we use fit in the L1 cache), but it totally kills performance on big images, since we need to pull a full cache line to get the next pixel (that is, pull 64 times the necessary amount of memory for a 64-byte cache line), and because that cache line will never be in L1 (which is on the order of 4k) and, if the stride is big enough, is unlikely to be in L2 (around 32k), because we use temporary buffers. On my machine I'm saved by a huge L3 cache, but there is no such thing on older machines.

On mobile, depending on whether the device is high-end or not, the situation can change a lot.

I've reworked the algorithm we use, using the approach of comment 12 (plus a few tricks), and I've got a speedup between 1.7x and 2.2x so far. It is currently out of tree so I can profile it properly, but I plan to put it back soon (the code needs a bit of cleanup).

An in-place version of the blur algorithm would be even faster, I guess, since it could cache even more, but it seems a bit trickier to write.

Before going the SSE2 and NEON route to go faster, I guess roc's suggestions would make more sense to implement anyway.
What that describes is almost exactly what we do, plus comment #12. Though he does have some SIMD code.
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #15)
> What that describes is almost exactly what we do, plus comment #12. Though
> he does have some SIMD code.

The code above has a much faster inner kernel than ours. Currently our inner loop is not very quick.
(In reply to Paul ADENOT (:padenot) [on vacation until around the 2012-10-21] from comment #13)
> Quick update on this work.
> 
> Analyzed in cachegrind, the current algorithm shows a 95% cache miss on the
> vertical pass, which is alright for tiny blurs (for example, in the
> asteroids benchmark, the surface to blur is smaller than 40x40px, and all
> the temporary surfaces we use fit in the L1 cache), but totally kill
> performance on big images, since we need to pull a full cache line to get
> the next pixel (that is, pull 64 times the necessary amount of memory for a
> 64 bytes cache line), and because this cache line will never be in L1, which
> is on the order of 4k, and is unlikely to be in L2 (around 32k), if the
> stride is big enough, because we use temporary buffer. On my machine, I've
> saved by the L3 cache that is huge, but there is no such thing on older
> machines.
> 
> On mobile, depending if we use an expensive device or not, the situation can
> change a lot.
> 
> I've reworked the algorithm we use, using the approach of comment 12 (plus a
> few tricks), and I've got a speedup between 1.7 and 2.2 so far. It is
> currently out of tree to be able to profile properly, but I plan to put it
> back soon (the code need a bit of cleaning).
> 
> An in place version for the blur algorithm would be even faster, I guess, to
> be able to cache even more, but it seems a bit trickier to write.
> 
> Before going the SSE2 and NEON route to go faster, I guess roc suggestions
> would make more sense to implement anyways.

Another approach to this could be doing the blur in 'blocks' whose size is at most the cache line size. This would probably improve performance a lot, as it would mean the vertical pass is much less likely to have cache misses.
I've played around here with simplifying our code significantly by using a single-pass blur where I compute an integral image and then do a box blur. This, even in a very naive implementation, already performs better than our current code, and consists of a lot less code.

I'm going to explore this approach a little further, since it should boost performance both on Mobile and Desktop.
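For readers unfamiliar with the technique, here is a naive, hedged sketch of an integral-image box blur on an alpha surface; it is illustrative only (the actual patch additionally handles inflation, edge clamping and the skip rect):

#include <cstdint>
#include <vector>

// Hypothetical sketch: build a summed-area table, then read each blurred pixel
// as the average of a (2r+1)x(2r+1) box with four table lookups. The table uses
// uint32_t, which is why very large surfaces (sum > 2^32 / 255) need a fallback.
void IntegralImageBoxBlur(const uint8_t* aSrc, uint8_t* aDst,
                          int aWidth, int aHeight, int aRadius)
{
  // integral[y][x] = sum of aSrc over the rectangle [0, x) x [0, y).
  std::vector<uint32_t> integral((aWidth + 1) * (aHeight + 1), 0);
  for (int y = 0; y < aHeight; ++y) {
    uint32_t rowSum = 0;
    for (int x = 0; x < aWidth; ++x) {
      rowSum += aSrc[y * aWidth + x];
      integral[(y + 1) * (aWidth + 1) + (x + 1)] =
          integral[y * (aWidth + 1) + (x + 1)] + rowSum;
    }
  }

  auto clamp = [](int v, int lo, int hi) { return v < lo ? lo : (v > hi ? hi : v); };
  for (int y = 0; y < aHeight; ++y) {
    for (int x = 0; x < aWidth; ++x) {
      int x0 = clamp(x - aRadius, 0, aWidth);
      int x1 = clamp(x + aRadius + 1, 0, aWidth);
      int y0 = clamp(y - aRadius, 0, aHeight);
      int y1 = clamp(y + aRadius + 1, 0, aHeight);
      uint32_t sum = integral[y1 * (aWidth + 1) + x1] - integral[y0 * (aWidth + 1) + x1] -
                     integral[y1 * (aWidth + 1) + x0] + integral[y0 * (aWidth + 1) + x0];
      aDst[y * aWidth + x] = static_cast<uint8_t>(sum / ((x1 - x0) * (y1 - y0)));
    }
  }
}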
Bas, I've got a patch ready for review (green on try) that does what you describe, and it has been extensively benchmarked against other techniques. It speeds up the blur by a factor of two. I'll post the patch as it is, and I've also got numbers to support some things. I was planning to finish this when I start again on the fifth of November.

Anyway, last time I checked, doing the vertical pass by blocks was slower than the approach I took, that is, transposing a few scanlines into a small buffer.
Attached patch PatchSplinter Review
The approach here is to remove as many tests as possible from the inner loop (in this patch, located in BoxBlurScanline). To do so, we implement the skip-box optimization outside of the loop (that implies a bit of ugliness at the call site, though, but I'm sure this can be improved). I've got a few numbers that back the choice of NUMBER_SCANLINE_TRANSPOSE = 32, and other graphs that show this code consistently gives a speedup between 1.8x and 2.3x.

This code is based on roc's blur in the SVG code (simplified to operate on the alpha channel only). I've benchmarked a few blurs (the original one, and another one that does in-place computation), and this version is the fastest. There are certainly some things I overlooked, though.

There is a bit of printf-ing and a few asserts that are not intended to stay.
This is the other idea I suggested: it uses the integral image to calculate the blur. It's not 100% correct yet, as the corners where we need to clamp are not implemented. And I didn't spend a lot of time optimizing this, but it already features a ~2x speedup and it has the advantage of being very simple code.

It should be noted this will only work for surfaces under 8 MPixels. It also uses a little more memory obviously.
(In reply to Bas Schouten (:bas.schouten) from comment #21)
> Created attachment 675944 [details] [diff] [review]
> Use integral image for blurring
> 
> This is the other idea I suggested, this uses the integral image to
> calculate the blur, it's not 100% correct let as the corners where we need
> to clamp are not implemented. And I didn't spend a lot of time optimizing
> this, but it already also features a ~2x speedup and it has the advantage of
> being very simple code.
> 
> It should be noted this will only work for surfaces under 8 MPixels. It also
> uses a little more memory obviously.

I've got a basic SSE2 version of this approach working. I'm not sure I'm doing everything the best way I can do it. But at this point it looks to be about a 4-5x speedup over the original blurring code.
(In reply to Bas Schouten (:bas.schouten) from comment #22)
> (In reply to Bas Schouten (:bas.schouten) from comment #21)
> > Created attachment 675944 [details] [diff] [review]
> > Use integral image for blurring
> > 
> > This is the other idea I suggested, this uses the integral image to
> > calculate the blur, it's not 100% correct let as the corners where we need
> > to clamp are not implemented. And I didn't spend a lot of time optimizing
> > this, but it already also features a ~2x speedup and it has the advantage of
> > being very simple code.
> > 
> > It should be noted this will only work for surfaces under 8 MPixels. It also
> > uses a little more memory obviously.
> 
> I've got a basic SSE2 version of this approach working. I'm not sure I'm
> doing everything the best way I can do it. But at this point it looks to be
> about a 4-5x speedup over the original blurring code.

Having further improved my code, my SSE2 version is now about 6x faster than the original code.
I've almost finished my patch and will attach a final, working patch here soon. Here's my test data:

SSE2:
Time taken to execute blur: 0,038924ms Size: 38x30
Time taken to execute blur: 0,023525ms Size: 38x30
Time taken to execute blur: 0,027803ms Size: 49x30
Time taken to execute blur: 0,024381ms Size: 38x30
Time taken to execute blur: 0,046623ms Size: 101x30
Time taken to execute blur: 0,029086ms Size: 44x22
Time taken to execute blur: 0,025664ms Size: 42x20
Time taken to execute blur: 5,564833ms Size: 2476x200
Time taken to execute blur: 8,142364ms Size: 2476x276
Time taken to execute blur: 8,065372ms Size: 2476x276
Time taken to execute blur: 7,992229ms Size: 2476x276
Time taken to execute blur: 7,819852ms Size: 2476x276
Time taken to execute blur: 8,060239ms Size: 2476x276
Blur radius 120px:
Time taken to execute blur: 15,016066ms Size: 2618x418
Time taken to execute blur: 14,333401ms Size: 2618x418
Time taken to execute blur: 14,517754ms Size: 2618x418
Time taken to execute blur: 15,402310ms Size: 2618x418
Time taken to execute blur: 14,791077ms Size: 2618x418

New C++:
Time taken to execute blur: 0,028231ms Size: 38x30
Time taken to execute blur: 0,026092ms Size: 38x30
Time taken to execute blur: 0,040207ms Size: 49x30
Time taken to execute blur: 0,026947ms Size: 38x30
Time taken to execute blur: 0,111211ms Size: 101x30
Time taken to execute blur: 0,032080ms Size: 44x22
Time taken to execute blur: 0,023098ms Size: 42x20
Time taken to execute blur: 16,822391ms Size: 2476x235
Time taken to execute blur: 19,097513ms Size: 2476x276
Time taken to execute blur: 19,320363ms Size: 2476x276
Time taken to execute blur: 18,412710ms Size: 2476x276
Time taken to execute blur: 18,315614ms Size: 2476x276
Time taken to execute blur: 18,123133ms Size: 2476x276
Blur radius 120px:
Time taken to execute blur: 35,863871ms Size: 2618x418
Time taken to execute blur: 34,409572ms Size: 2618x418
Time taken to execute blur: 34,192710ms Size: 2618x418
Time taken to execute blur: 34,634560ms Size: 2618x418
Time taken to execute blur: 35,317653ms Size: 2618x418

Old:
Time taken to execute blur: 0,055606ms Size: 38x30
Time taken to execute blur: 0,054322ms Size: 38x30
Time taken to execute blur: 0,062022ms Size: 49x30
Time taken to execute blur: 0,051328ms Size: 38x30
Time taken to execute blur: 0,157834ms Size: 101x30
Time taken to execute blur: 0,053467ms Size: 44x22
Time taken to execute blur: 0,032508ms Size: 42x20
Time taken to execute blur: 8,223206ms Size: 2476x77
Time taken to execute blur: 48,733990ms Size: 2476x276
Time taken to execute blur: 47,485431ms Size: 2476x276
Time taken to execute blur: 45,725729ms Size: 2476x276
Time taken to execute blur: 47,477732ms Size: 2476x276
Time taken to execute blur: 47,254027ms Size: 2476x276
Blur radius 120px:
Time taken to execute blur: 90,573747ms Size: 2618x418
Time taken to execute blur: 89,504409ms Size: 2618x418
Time taken to execute blur: 88,345675ms Size: 2618x418
Time taken to execute blur: 89,500132ms Size: 2618x418
Time taken to execute blur: 86,703600ms Size: 2618x418
Attached patch Fast C++ and SSE2 blur code (obsolete) — Splinter Review
Attachment #676930 - Flags: review?(tterribe)
(In reply to Bas Schouten (:bas.schouten) from comment #25)
> Created attachment 676930 [details] [diff] [review]
> Fast C++ and SSE2 blur code

It would be nice if we also added tests for the blur code.
Comment on attachment 676930 [details] [diff] [review]
Fast C++ and SSE2 blur code

Review of attachment 676930 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/2d/BlurSSE2.cpp
@@ +49,5 @@
> +// { 30, 80, 160, 260 }. This seems to be the fastest way to do this after
> +// much testing.
> +MOZ_ALWAYS_INLINE
> +__m128i AccumulatePixelSums(__m128i aPixels)
> +{

It might be worth adding a comment that _mm_slli_si128 shifts by bytes and not bits. That threw me off when reading this.
Attached patch Fast C++ and SSE2 blur code v2 (obsolete) — Splinter Review
Attachment #676930 - Attachment is obsolete: true
Attachment #676930 - Flags: review?(tterribe)
Attachment #677195 - Flags: review?(tterribe)
Attached patch Fast C++ and SSE2 blur code v3 (obsolete) — Splinter Review
Fixed some more small issues, all tests green now!
Attachment #677274 - Flags: review?(tterribe)
Attachment #677195 - Attachment is obsolete: true
Attachment #677195 - Flags: review?(tterribe)
Comment on attachment 677274 [details] [diff] [review]
Fast C++ and SSE2 blur code v3

Review of attachment 677274 [details] [diff] [review]:
-----------------------------------------------------------------

There's a bunch of minor issues, but a couple are worrying, so r- for now.

::: gfx/2d/Blur.cpp
@@ +402,5 @@
>    if (stride.isValid()) {
>      mStride = stride.value();
>  
>      CheckedInt<int32_t> size = CheckedInt<int32_t>(mStride) * mRect.height *
> +                               sizeof(unsigned char) + 20;

I don't understand why you're multiplying by sizeof(unsigned char) if you're calling new uint8_t[].

@@ +483,5 @@
>  
>      if (mSpreadRadius.width > 0 || mSpreadRadius.height > 0) {
> +      // No need to use CheckedInt here - we have validated it in the constructor.
> +      size_t szB = stride * size.height * sizeof(unsigned char);
> +      unsigned char* tmpData = new uint8_t[szB];

Again the types here are inconsistent. Please pick one.

@@ +503,5 @@
> +    int32_t maxLeftLobe = RoundUpToMultipleOf4(horizontalLobes[0][0] + 1).value();
> +
> +    IntSize integralImageSize(size.width + maxLeftLobe + horizontalLobes[1][1],
> +                              size.height + verticalLobes[0][0] + verticalLobes[1][1] + 1);
> +    if ((integralImageSize.width * integralImageSize.height) > (1 << 23)) {

Why is this 1<<23 instead of 1<<24? Isn't the actual maximum here UINT32_MAX/255 == 0x1010101?

@@ +505,5 @@
> +    IntSize integralImageSize(size.width + maxLeftLobe + horizontalLobes[1][1],
> +                              size.height + verticalLobes[0][0] + verticalLobes[1][1] + 1);
> +    if ((integralImageSize.width * integralImageSize.height) > (1 << 23)) {
> +      // Fallback to old blurring code when the surface is so large it may
> +      // overflow out integral image!

Grammar fail.

@@ +508,5 @@
> +      // Fallback to old blurring code when the surface is so large it may
> +      // overflow out integral image!
> +
> +      // No need to use CheckedInt here - we have validated it in the constructor.
> +      size_t szB = stride * GetSize().height * sizeof(unsigned char);

Don't you already have the GetSize() value in "size"? Also see above about the types.

But really, you should probably just factor this out into common code above the two if blocks. Then you would only need to allocate things once.

@@ +510,5 @@
> +
> +      // No need to use CheckedInt here - we have validated it in the constructor.
> +      size_t szB = stride * GetSize().height * sizeof(unsigned char);
> +      unsigned char* tmpData = new uint8_t[szB];
> +      if (!tmpData)

Huh? Is this supposed to be a fallible allocation? The one above certainly isn't, though the malloc() that used to be used here was.

@@ +517,5 @@
> +      memset(tmpData, 0, szB);
> +
> +      if (mBlurRadius.width > 0) {
> +        int32_t lobes[3][2];
> +        ComputeLobes(mBlurRadius.width, lobes);

You've already computed these. Don't compute them again.

@@ +522,5 @@
> +        BoxBlurHorizontal(mData, tmpData, lobes[0][0], lobes[0][1], stride, GetSize().height, mSkipRect);
> +        BoxBlurHorizontal(tmpData, mData, lobes[1][0], lobes[1][1], stride, GetSize().height, mSkipRect);
> +        BoxBlurHorizontal(mData, tmpData, lobes[2][0], lobes[2][1], stride, GetSize().height, mSkipRect);
> +      } else {
> +        memcpy(tmpData, mData, stride * GetSize().height);

Instead of a gigantic memcpy, you should just do some pointer swapping.

@@ +525,5 @@
> +      } else {
> +        memcpy(tmpData, mData, stride * GetSize().height);
> +      }
> +      if (mBlurRadius.height > 0) {
> +        int32_t lobes[3][2];

You've already computed these. Don't compute them again.

@@ +531,5 @@
> +        BoxBlurVertical(tmpData, mData, lobes[0][0], lobes[0][1], stride, GetSize().height, mSkipRect);
> +        BoxBlurVertical(mData, tmpData, lobes[1][0], lobes[1][1], stride, GetSize().height, mSkipRect);
> +        BoxBlurVertical(tmpData, mData, lobes[2][0], lobes[2][1], stride, GetSize().height, mSkipRect);
> +      } else {
> +        memcpy(mData, tmpData, stride * GetSize().height);

Instead of a gigantic memcpy, you should just do some pointer swapping.

@@ +539,2 @@
>      } else {
> +      int32_t integralImageStride = GetAlignedStride<16>(integralImageSize.width * 4);

Why int32_t instead of, say, size_t (which, being unsigned, would allow the compiler to optimize the "/ 4" below much better).

@@ +581,5 @@
> +    uint32_t *intPrevRow = aIntegralImage + (y - 1) * stride32bit;
> +    uint32_t *intFirstRow = aIntegralImage;
> +
> +    for (int x = 0; x < integralImageSize.width; x++) {
> +      intRow[x] = intFirstRow[x] + intPrevRow[x];

You're increasing your cache footprint by 33% here to save 1 add/pixel. I think you should just take the contents of the loop below and put it in a static function, and then call it repeatedly with a pointer to the first row of the source image (and do the same with the last row of the image below). You'll get less code (i.e., LoadIntegralRowFromRow goes away) and allow wider rows before you exhaust L1 cache.

@@ +597,5 @@
> +      currentRowSum += pixel;
> +      *intRow++ = currentRowSum + *intPrevRow++;
> +    }
> +    for (int x = aLeftInflation; x < (aSize.width + aLeftInflation); x += 4) {
> +        uint32_t pixel = *(uint32_t*)(sourceRow + (x - aLeftInflation));

1) This code is little-endian specific, but I don't see an endianess check/fallback anywhere.
2) This isn't a "pixel", it's the alpha channel values from 4 of them, so a better name would be nice.
3) It's currently shadowing the pixel variable you declared just above.

@@ +599,5 @@
> +    }
> +    for (int x = aLeftInflation; x < (aSize.width + aLeftInflation); x += 4) {
> +        uint32_t pixel = *(uint32_t*)(sourceRow + (x - aLeftInflation));
> +        currentRowSum += pixel & 0xff;
> +        *intRow++ = *intPrevRow++ + currentRowSum;

It's probably a small thing, but compilers are generally able to optimize intRow[x], intRow[x+1], etc. better than pointer arithmetic (or at least that's what the AMD software optimization guide tells me, and generally matches my experience).

@@ +659,5 @@
> +  aLeftLobe++;
> +  aTopLobe++;
> +  int32_t boxSize = (aLeftLobe + aRightLobe) * (aTopLobe + aBottomLobe);
> +
> +  if (boxSize == 1) {

Maybe "<= 1" for safety?

@@ +665,5 @@
> +  }
> +
> +  uint32_t stride32bit = aIntegralImageStride / 4;
> +
> +  int32_t leftInflation = RoundUpToMultipleOf4(aLeftLobe).value();

It bothers me that we compute this in two places. You already had to do it once to compute aIntegralImageStride, and you're relying on the fact that you're doing the same calculation here, which is dangerous if someone wants to modify the alignment later. You should just pass both aLeftLobe and aLeftInflation to this function.

@@ +671,5 @@
> +  GenerateIntegralImage_C(leftInflation, aRightLobe, aTopLobe, aBottomLobe,
> +                          aIntegralImage, aIntegralImageStride, mData,
> +                          mStride, size);
> +
> +  uint32_t *innerIntegral = aIntegralImage + (aTopLobe * stride32bit) + leftInflation;

See above about pointer arithmetic. I suspect that a single pointer and five offsets will optimize better than 5 pointers. As a bonus, the offsets are constant and can be hoisted out of the loop below.

@@ +683,5 @@
> +    uint32_t *bottomLeftBase = innerIntegral + ((y + aBottomLobe) * stride32bit - aLeftLobe);
> +
> +    for (int32_t x = 0; x < size.width; x++) {
> +      if (inSkipRectY && x > mSkipRect.x && x < mSkipRect.XMost()) {
> +        x = mSkipRect.XMost();

Is this right? The continue will increment x again, leaving it at mSkipRect.XMost()+1. Either that's wrong, or your test for inclusion in mSkipRect is wrong.

@@ +685,5 @@
> +    for (int32_t x = 0; x < size.width; x++) {
> +      if (inSkipRectY && x > mSkipRect.x && x < mSkipRect.XMost()) {
> +        x = mSkipRect.XMost();
> +        // Trigger early jump on coming loop iterations, this will be reset
> +        // next line anyway.

I had to read this comment three times before I could parse what you were saying.

@@ +697,5 @@
> +
> +      int32_t value = bottomRight - topRight - bottomLeft;
> +      value += topLeft;
> +
> +      mData[mStride * y + x] = value / boxSize;

You can (and should) replace this division with a multiplication, as I don't think the compiler will do it for you. It can be done without any approximation error.

::: gfx/2d/BlurSSE2.cpp
@@ +10,5 @@
> +
> +namespace mozilla {
> +namespace gfx {
> +
> +MOZ_ALWAYS_INLINE

Shouldn't this be static?

@@ +14,5 @@
> +MOZ_ALWAYS_INLINE
> +uint32_t DivideAndPack(__m128i aValues, __m128i aDivisor)
> +{
> +  __m128i multiplied1 = _mm_srli_epi64(_mm_mul_epu32(aValues, aDivisor), 32); // 00p300p1
> +  __m128i multiplied2 = _mm_srli_epi64(_mm_mul_epu32(_mm_srli_epi64(aValues, 32), aDivisor), 32); // 00p400p2

Instead of shifting multiplied2 down by 32, if you PAND with a constant 0xFFFFFFFF00000000FFFFFFFF00000000 mask, you can simply POR these back together. That replaces a shift, two shuffles, and an unpack with two simple bitwise ops that can issue from any port.

@@ +46,5 @@
> +
> +// This function calculates an integral of four pixels stored in the 4
> +// 32-bit integers on aPixels. i.e. for { 30, 50, 80, 100 } this returns
> +// { 30, 80, 160, 260 }. This seems to be the fastest way to do this after
> +// much testing.

I assume you tried

__m128i sumPixels = aPixels;
__m128i currentPixels = _mm_slli_si128(aPixels, 4);
sumPixels = _mm_add_epi32(sumPixels, currentPixels);
currentPixels = _mm_unpacklo_epi64(_mm_setzero_si128(), sumPixels);
return _mm_add_epi32(sumPixels, currentPixels);

Yes?

@@ +60,5 @@
> +
> +  return _mm_add_epi32(sumPixels, currentPixels);
> +}
> +
> +// NOTE: aLeftInflation needs to be a multiple of 4 here!

Instead of just adding a note, maybe you should ASSERT it?

@@ +72,5 @@
> +
> +  IntSize integralImageSize(aSize.width + aLeftInflation + aRightInflation,
> +                            aSize.height + aTopInflation + aBottomInflation);
> +
> +  LoadIntegralRowFromRow(aIntegralImage, aSource, aSize.width, aLeftInflation, aRightInflation);

I'm less sure about the applicability of the advice I gave for the C version here, since the main loop has to unpack the pixels and do a prefix sum instead of a simple add, but given the poor memory access/arithmetic balance of _this_ loop, it still might be worth trying.

@@ +80,5 @@
> +    __m128i *intPrevRow = (__m128i*)(aIntegralImage + (y - 1) * stride32bit);
> +    __m128i *intFirstRow = (__m128i*)aIntegralImage;
> +
> +    for (int x = 0; x < integralImageSize.width; x += 4) {
> +      __m128i firstRow = _mm_load_si128(intFirstRow + (x / 4));

"/" is expensive for signed integers. Use ">> 2" instead, or have x count in units of __m128i instead of units of uint32_t to begin with, or use uint32_t pointers instead of __m128i pointers like you do in the loop below.

@@ +94,5 @@
> +    uint8_t *sourceRow = aSource + aSourceStride * (y - aTopInflation);
> +
> +    uint32_t pixel = sourceRow[0];
> +    for (int x = 0; x < aLeftInflation; x += 4) {
> +      __m128i sumPixels = AccumulatePixelSums(_mm_shuffle_epi32(_mm_set1_epi32(pixel), _MM_SHUFFLE(0, 0, 0, 0)));

Doesn't _mm_set1_epi32() _already_ broadcast that value to all 4 slots, eliminating the need for the shuffle? It's _mm_cvtsi32_si128() that only sets the first slot and clears the rest to zero.

@@ +106,5 @@
> +    for (int x = aLeftInflation; x < (aSize.width + aLeftInflation); x += 4) {
> +      uint32_t pixels = *(uint32_t*)(sourceRow + (x - aLeftInflation));
> +
> +      // It's important to shuffle here, when we exit this loop currentRowSum
> +      // has to be set to sumPixels, to that the following loop can get the

s/to that/so that/
Also, run-on sentence.

@@ +121,5 @@
> +    }
> +
> +    pixel = sourceRow[aSize.width - 1];
> +    int x = (aSize.width + aLeftInflation);
> +    if ((aSize.width % 4)) {

"%" is expensive with signed integers. Use "& 3" instead.

@@ +127,5 @@
> +      // see explanation above.
> +      uint32_t intCurrentRowSum = ((uint32_t*)&currentRowSum)[(aSize.width % 4) - 1];
> +      for (; x < integralImageSize.width; x++) {
> +        // We could be unaligned here!
> +        if (!(x % 4)) {

See above.

@@ +129,5 @@
> +      for (; x < integralImageSize.width; x++) {
> +        // We could be unaligned here!
> +        if (!(x % 4)) {
> +          // aligned!
> +          currentRowSum = _mm_set1_epi32(intCurrentRowSum);

See above about _mm_set1_epi32()/_mm_shuffle_epi32().

@@ +194,5 @@
> +  aLeftLobe++;
> +  aTopLobe++;
> +  int32_t boxSize = (aLeftLobe + aRightLobe) * (aTopLobe + aBottomLobe);
> +
> +  if (boxSize == 1) {

See above.

@@ +210,5 @@
> +  GenerateIntegralImage_SSE2(leftInflation, aRightLobe, aTopLobe, aBottomLobe,
> +                             aIntegralImage, aIntegralImageStride, mData,
> +                             mStride, size);
> +
> +  __m128i divisor = _mm_setr_epi32(reciprocal, reciprocal, reciprocal, reciprocal);

_mm_set1_epi32() would be better, no?

@@ +214,5 @@
> +  __m128i divisor = _mm_setr_epi32(reciprocal, reciprocal, reciprocal, reciprocal);
> +
> +  // This points to the start of the rectangle within the IntegralImage that overlaps
> +  // the surface being blurred.
> +  uint32_t *innerIntegral = aIntegralImage + (aTopLobe * stride32bit) + leftInflation;

See comments in the C version.

@@ +225,5 @@
> +    uint32_t *bottomRightBase = innerIntegral + ((y + aBottomLobe) * ptrdiff_t(stride32bit) + aRightLobe);
> +    uint32_t *bottomLeftBase = innerIntegral + ((y + aBottomLobe) * ptrdiff_t(stride32bit) - aLeftLobe);
> +
> +    for (int32_t x = 0; x < size.width; x += 4) {
> +      if (inSkipRectY && x > mSkipRect.x && x < mSkipRect.XMost()) {

You should really put mSkipRect.x and mSkipRect.XMost() in local variables. The __m128i pointers are, as far as the compiler is concerned, allowed to alias anything, which means it can't prove a member variable hasn't been modified after every iteration, and it will be forced to reload/recompute them.

@@ +226,5 @@
> +    uint32_t *bottomLeftBase = innerIntegral + ((y + aBottomLobe) * ptrdiff_t(stride32bit) - aLeftLobe);
> +
> +    for (int32_t x = 0; x < size.width; x += 4) {
> +      if (inSkipRectY && x > mSkipRect.x && x < mSkipRect.XMost()) {
> +        x = mSkipRect.XMost();

See comment in C version. mSkipRect.XMost()+4 is even more worrying. Also, uh, what happens when mSkipRect.XMost() is not a multiple of four?

::: gfx/2d/Tools.h
@@ +109,5 @@
> +
> +  MOZ_ALWAYS_INLINE void Realloc(size_t aSize)
> +  {
> +    delete [] mStorage;
> +    mStorage = new T[aSize + 15];

15? Don't you mean alignment-1?

@@ +111,5 @@
> +  {
> +    delete [] mStorage;
> +    mStorage = new T[aSize + 15];
> +    if (uintptr_t(mStorage) % alignment) {
> +      // Our storage does not start at a 16-byte boundary. Make sure mData does!

16?
Attachment #677274 - Flags: review?(tterribe) → review-
Most of it's clear, just a few questions!

(In reply to Timothy B. Terriberry (:derf) from comment #30)
> Comment on attachment 677274 [details] [diff] [review]
> Fast C++ and SSE2 blur code v3
> 
> Review of attachment 677274 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> There's a bunch of minor issues, but a couple are worrying, so r- for now.
> 
> ::: gfx/2d/Blur.cpp
> @@ +402,5 @@
> >    if (stride.isValid()) {
> >      mStride = stride.value();
> >  
> >      CheckedInt<int32_t> size = CheckedInt<int32_t>(mStride) * mRect.height *
> > +                               sizeof(unsigned char) + 20;
> 
> I don't understand why you're multiplying by sizeof(unsigned char) if you're
> calling new uint8_t[].

Mixing with legacy stuff I replaced, my bad! (I removed the malloc and added new here.)

> 
> @@ +503,5 @@
> > +    int32_t maxLeftLobe = RoundUpToMultipleOf4(horizontalLobes[0][0] + 1).value();
> > +
> > +    IntSize integralImageSize(size.width + maxLeftLobe + horizontalLobes[1][1],
> > +                              size.height + verticalLobes[0][0] + verticalLobes[1][1] + 1);
> > +    if ((integralImageSize.width * integralImageSize.height) > (1 << 23)) {
> 
> Why is this 1<<23 instead of 1<<24? Isn't the actual maximum here
> UINT32_MAX/255 == 0x1010101?

I thought some of the steps in the blurring could have negative intermediate results. But I might've been mistaken (it might've only been true before some changes). Or it's quite possible for me to reorder them to make that impossible. I'll double check this.

> 
> @@ +505,5 @@
> > +    IntSize integralImageSize(size.width + maxLeftLobe + horizontalLobes[1][1],
> > +                              size.height + verticalLobes[0][0] + verticalLobes[1][1] + 1);

> @@ +510,5 @@
> > +
> > +      // No need to use CheckedInt here - we have validated it in the constructor.
> > +      size_t szB = stride * GetSize().height * sizeof(unsigned char);
> > +      unsigned char* tmpData = new uint8_t[szB];
> > +      if (!tmpData)
> 
> Huh? Is this supposed to be a fallible allocation? The one above certainly
> isn't, though the malloc() that used to be used here was.

No it's not, I replaced the malloc and forgot to remove this check.
> 
> @@ +522,5 @@
> > +        BoxBlurHorizontal(mData, tmpData, lobes[0][0], lobes[0][1], stride, GetSize().height, mSkipRect);
> > +        BoxBlurHorizontal(tmpData, mData, lobes[1][0], lobes[1][1], stride, GetSize().height, mSkipRect);
> > +        BoxBlurHorizontal(mData, tmpData, lobes[2][0], lobes[2][1], stride, GetSize().height, mSkipRect);
> > +      } else {
> > +        memcpy(tmpData, mData, stride * GetSize().height);
> 
> Instead of a gigantic memcpy, you should just do some pointer swapping.

I didn't want to change the logic here from the old code, since I'm technically 'not touching' this, just indenting it (and making the allocation infallible for consistency).

> @@ +531,5 @@
> > +        BoxBlurVertical(tmpData, mData, lobes[0][0], lobes[0][1], stride, GetSize().height, mSkipRect);
> > +        BoxBlurVertical(mData, tmpData, lobes[1][0], lobes[1][1], stride, GetSize().height, mSkipRect);
> > +        BoxBlurVertical(tmpData, mData, lobes[2][0], lobes[2][1], stride, GetSize().height, mSkipRect);
> > +      } else {
> > +        memcpy(mData, tmpData, stride * GetSize().height);
> 
> Instead of a gigantic memcpy, you should just do some pointer swapping.

Same as above, wanted to avoid additional risk.

> @@ +539,2 @@
> >      } else {
> > +      int32_t integralImageStride = GetAlignedStride<16>(integralImageSize.width * 4);
> 
> Why int32_t instead of, say, size_t (which, being unsigned, would allow the
> compiler to optimize the "/ 4" below much better).

I'm just used to strides being signed on principle, but I agree it's fine to use size_t here; will change.

> 
> @@ +581,5 @@
> > +    uint32_t *intPrevRow = aIntegralImage + (y - 1) * stride32bit;
> > +    uint32_t *intFirstRow = aIntegralImage;
> > +
> > +    for (int x = 0; x < integralImageSize.width; x++) {
> > +      intRow[x] = intFirstRow[x] + intPrevRow[x];
> 
> You're increasing your cache footprint by 33% here to save 1 add/pixel. I
> think you should just take the contents of the loop below and put it in a
> static function, and then call it repeatedly with a pointer to the first row
> of the source image (and do the same with the last row of the image below).
> You'll get less code (i.e., LoadIntegralRowFromRow goes away) and allow
> wider rows before you exhaust L1 cache.

I'll test this and see what the perf impact is on my machine at least.

> @@ +599,5 @@
> > +    }
> > +    for (int x = aLeftInflation; x < (aSize.width + aLeftInflation); x += 4) {
> > +        uint32_t pixel = *(uint32_t*)(sourceRow + (x - aLeftInflation));
> > +        currentRowSum += pixel & 0xff;
> > +        *intRow++ = *intPrevRow++ + currentRowSum;
> 
> It's probably a small thing, but compilers are generally able to optimize
> intRow[x], intRow[x+1], etc. better than pointer arithmetic (or at least
> that's what the AMD software optimization guide tells me, and generally
> matches my experience).

Yes, I agree, that's my experience too, not sure why I did this differently :).

> @@ +659,5 @@
> > +  aLeftLobe++;
> > +  aTopLobe++;
> > +  int32_t boxSize = (aLeftLobe + aRightLobe) * (aTopLobe + aBottomLobe);
> > +
> > +  if (boxSize == 1) {
> 
> Maybe "<= 1" for safety?

I'll just assert boxSize > 0.

> @@ +671,5 @@
> > +  GenerateIntegralImage_C(leftInflation, aRightLobe, aTopLobe, aBottomLobe,
> > +                          aIntegralImage, aIntegralImageStride, mData,
> > +                          mStride, size);
> > +
> > +  uint32_t *innerIntegral = aIntegralImage + (aTopLobe * stride32bit) + leftInflation;
> 
> See above about pointer arithmetic. I suspect that a single pointer and five
> offsets will optimize better than 5 pointers. As a bonus, the offsets are
> constant and can be hoisted out of the loop below.

Could you give a little example of what you mean? I'm not 100% sure it's clear to me.

> @@ +683,5 @@
> > +    uint32_t *bottomLeftBase = innerIntegral + ((y + aBottomLobe) * stride32bit - aLeftLobe);
> > +
> > +    for (int32_t x = 0; x < size.width; x++) {
> > +      if (inSkipRectY && x > mSkipRect.x && x < mSkipRect.XMost()) {
> > +        x = mSkipRect.XMost();
> 
> Is this right? The continue will increment x again, leaving it at
> mSkipRect.XMost()+1. Either that's wrong, or your test for inclusion in
> mSkipRect is wrong.

mSkipRect.XMost() should be the last pixel -in- the rect to be skipped. So mSkipRect.XMost()+1 would be the next pixel to process.. I think.

> @@ +697,5 @@
> > +
> > +      int32_t value = bottomRight - topRight - bottomLeft;
> > +      value += topLeft;
> > +
> > +      mData[mStride * y + x] = value / boxSize;
> 
> You can (and should) replace this division with a multiplication, as I don't
> think the compiler will do it for you. It can be done without any
> approximation error.

I tried to, but the 64-bit conversion to gain the required precision was slower than doing a simple divide. Or are you saying we could do it in 32-bit without a precision error?

> @@ +46,5 @@
> > +
> > +// This function calculates an integral of four pixels stored in the 4
> > +// 32-bit integers on aPixels. i.e. for { 30, 50, 80, 100 } this returns
> > +// { 30, 80, 160, 260 }. This seems to be the fastest way to do this after
> > +// much testing.
> 
> I assume you tried
> 
> __m128i sumPixels = aPixels;
> __m128i currentPixels = _mm_slli_si128(aPixels, 4);
> sumPixels = _mm_add_epi32(sumPixels, currentPixels);
> currentPixels = _mm_unpacklo_epi64(_mm_setzero_si128(), sumPixels);
> return _mm_add_epi32(sumPixels, currentPixels);
> 
> Yes?

I did not, I will try.

> @@ +94,5 @@
> > +    uint8_t *sourceRow = aSource + aSourceStride * (y - aTopInflation);
> > +
> > +    uint32_t pixel = sourceRow[0];
> > +    for (int x = 0; x < aLeftInflation; x += 4) {
> > +      __m128i sumPixels = AccumulatePixelSums(_mm_shuffle_epi32(_mm_set1_epi32(pixel), _MM_SHUFFLE(0, 0, 0, 0)));
> 
> Doesn't _mm_set1_epi32() _already_ broadcast that value to all 4 slots,
> eliminating the need for the shuffle? It's _mm_cvtsi32_si128() that only
> sets the first slot and clears the rest to zero.

You're right.

> @@ +121,5 @@
> > +    }
> > +
> > +    pixel = sourceRow[aSize.width - 1];
> > +    int x = (aSize.width + aLeftInflation);
> > +    if ((aSize.width % 4)) {
> 
> "%" is expensive with signed integers. Use "& 3" instead.

Won't the compiler do that? This seemed more readable.

> @@ +226,5 @@
> > +    uint32_t *bottomLeftBase = innerIntegral + ((y + aBottomLobe) * ptrdiff_t(stride32bit) - aLeftLobe);
> > +
> > +    for (int32_t x = 0; x < size.width; x += 4) {
> > +      if (inSkipRectY && x > mSkipRect.x && x < mSkipRect.XMost()) {
> > +        x = mSkipRect.XMost();
> 
> See comment in C version. mSkipRect.XMost()+4 is even more worrying. Also,
> uh, what happens when mSkipRect.XMost() is not a multiple of four?

This loop is not assuming aligned access. Other than that you're right in this case, at least. See above comment.

> ::: gfx/2d/Tools.h
> @@ +109,5 @@
> > +
> > +  MOZ_ALWAYS_INLINE void Realloc(size_t aSize)
> > +  {
> > +    delete [] mStorage;
> > +    mStorage = new T[aSize + 15];
> 
> 15? Don't you mean alignment-1?

I do!
> @@ +111,5 @@
> > +  {
> > +    delete [] mStorage;
> > +    mStorage = new T[aSize + 15];
> > +    if (uintptr_t(mStorage) % alignment) {
> > +      // Our storage does not start at a 16-byte boundary. Make sure mData does!
> 
> 16?

alignment-byte boundary! :)
(In reply to Timothy B. Terriberry (:derf) from comment #30)
> Comment on attachment 677274 [details] [diff] [review]
> Fast C++ and SSE2 blur code v3
> @@ +121,5 @@
> > +    }
> > +
> > +    pixel = sourceRow[aSize.width - 1];
> > +    int x = (aSize.width + aLeftInflation);
> > +    if ((aSize.width % 4)) {
> 
> "%" is expensive with signed integers. Use "& 3" instead.

Oh ugh, signed, you're right.
(In reply to Bas Schouten (:bas.schouten) from comment #31)
> I thought some of the steps in the blurring could have negative intermediate
> results. But I might've been mistaking (might've only been true before some
> changes). Or it's quite possible for me to reorder them to make it not
> possible. I'll double check this.

AFAICT, you're just computing the sum of a rectangular region of the image, right? That result cannot possibly be larger than the sum over the whole image, and it can't be negative. If intermediate results happen to wrap around, that's not a problem because overflow in unsigned arithmetic is well-defined in C. But yes, you can also re-order them so they never wrap around: (BR-TR)-(BL-TL)

> I didn't want to change the logic here from the old code. Since I'm
> technically 'not touching' this, just indenting it. (and making the
> allocation infallible for consistency)

That sounds like touching to me :). I'm personally a fan of the "While you're in there, could you fix..." approach instead of the "Don't touch anything you don't have to" approach, but it won't bother me too much if you leave this as-is.

> > Why int32_t instead of, say, size_t (which, being unsigned, would allow the
> > compiler to optimize the "/ 4" below much better).
> 
> I'm just used to strides being signed at a principle level, but I agree here
> it's fine to use size_t, will change.

I don't really have an issue with signed strides, but if you use them you stop using divisions on them.

> > You're increasing your cache footprint by 33% here to save 1 add/pixel. I
> I'll test this and see what the perf impact is on my machine at least.

I suggest looking at image sizes where the rows used here fit in L1 cache in my approach, and don't in your approach (along with your normal test cases, of course).

> > Maybe "<= 1" for safety?
> 
> I'll just assert boxSize > 0.

That works.

> > See above about pointer arithmetic. I suspect that a single pointer and five
> > offsets will optimize better than 5 pointers. As a bonus, the offsets are
> > constant and can be hoisted out of the loop below.
> 
> Could you give a little example of what you mean? I'm not 100% sure it's
> clear to me.

ptrdiff_t topLeftOffset = leftInflation - aLeftLobe;
ptrdiff_t topRightOffset = leftInflation + aRightLobe;
ptrdiff_t bottomLeftOffset = (aTopLobe + aBottomLobe) * stride32bit + topLeftOffset;
ptrdiff_t bottomRightOffset = (aTopLobe + aBottomLobe) * stride32bit + topRightOffset;

for (int32_t x = 0; x < size.width; x++) {
  // ...
  ptrdiff_t rowOffset = y*stride32bit + x;
  uint32_t topLeft = aIntegralImage[rowOffset + topLeftOffset];
  uint32_t topRight = aIntegralImage[rowOffset + topRightOffset];
  uint32_t bottomRight = aIntegralImage[rowOffset + bottomRightOffset];
  uint32_t bottomLeft = aIntegralImage[rowOffset + bottomLeftOffset];
  // ...
}

> > > +      if (inSkipRectY && x > mSkipRect.x && x < mSkipRect.XMost()) {
> > > +        x = mSkipRect.XMost();
> > 
> > Is this right? The continue will increment x again, leaving it at
> > mSkipRect.XMost()+1. Either that's wrong, or your test for inclusion in
> > mSkipRect is wrong.
> 
> mSkipRect.XMost() should be the last pixel -in- the rect to be skipped. So
> mSkipRect.XMost()+1 would be the next pixel to process.. I think.

The code you have won't skip the loop for the x==mSkipRect.XMost() case. You have a similar problem with the y test.

> > You can (and should) replace this division with a multiplication, as I don't
> > think the compiler will do it for you. It can be done without any
> > approximation error.
> 
> I tried to, but the 64-bit conversion to gain the required precision was
> slower than doing a simple divide. Or are you saying we could do it in
> 32-bit without a precision error?

You need a 32x32 -> high 32 bit multiply (you don't need the lower 32 bits of the result). See, e.g., <http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1467632>.
But even a 64-bit multiply should be substantially faster than a 32-bit divide. 

Basically, let d be the divisor and m = floor(log2(d)) (though obviously you would compute m with integer math). Then

uint32_t a;
uint32_t b;

if (d == 1<<m){
  a = 0xFFFFFFFFU;
  b = 1;
}
else {
  uint32_t t = (uint32_t)(((uint64_t)1 << (m+32)) / d) + 1;
  uint32_t r = t*d;
  b = r >= (1<<m);
  a = t - b;
}

Then, assuming I didn't screw any of that up,

x/d == (uint32_t)((x + b)*(uint64_t)a >> 32) >> m;

This reduces the maximum permissible value of x by 1 (so the maximum integral image size becomes 0x1010100), unless you want to compute (x+b) with 64-bit math or compute (x*(uint64_t)a + b*a), noting that b*a is a constant.
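
For what it's worth, a small standalone harness (purely illustrative, not part of any patch here) that plugs these constants into the multiply-and-shift form and compares it against a plain division over a sampled range of the valid inputs:

#include <cstdint>
#include <cstdio>

// Illustrative standalone check: derive (a, b, m) for each divisor d exactly as
// described above, then compare the multiply-and-shift result against a plain
// division over a sampled range of x values.
int main()
{
  for (uint32_t d = 2; d <= 1024; ++d) {
    uint32_t m = 0;
    while ((uint64_t(1) << (m + 1)) <= d) {
      ++m;  // m = floor(log2(d))
    }
    uint32_t a, b;
    if (d == (uint32_t(1) << m)) {
      a = 0xFFFFFFFFU;
      b = 1;
    } else {
      uint32_t t = uint32_t((uint64_t(1) << (m + 32)) / d) + 1;
      uint32_t r = t * d;  // wraps modulo 2^32, which is intentional
      b = r >= (uint32_t(1) << m);
      a = t - b;
    }
    for (uint32_t x = 0; x <= 0x1010100; x += 997) {  // sample the valid range
      uint32_t q = uint32_t(((x + b) * uint64_t(a)) >> 32) >> m;
      if (q != x / d) {
        printf("mismatch: x=%u d=%u got %u want %u\n", x, d, q, x / d);
        return 1;
      }
    }
  }
  printf("all sampled values matched\n");
  return 0;
}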

> > See comment in C version. mSkipRect.XMost()+4 is even more worrying. Also,
> > uh, what happens when mSkipRect.XMost() is not a multiple of four?
> 
> This loop is not assuming aligned access. Other than that you're right in
> this case, at least. See above comment.

I was thinking size.width was a multiple of 4 for some reason, but I guess you'll actually compute (and write!) extra values in any case. I hope that's okay. I also sure hope this function is idempotent in the skip region, because if mSkipRect.x isn't a multiple of 4, you'll compute it for some of the pixels inside the skip region, too.
Thanks for the quick and extensive review! I'll have a new patch up tomorrow.

(In reply to Timothy B. Terriberry (:derf) from comment #33)
> (In reply to Bas Schouten (:bas.schouten) from comment #31)
> > > See above about pointer arithmetic. I suspect that a single pointer and five
> > > offsets will optimize better than 5 pointers. As a bonus, the offsets are
> > > constant and can be hoisted out of the loop below.
> > 
> > Could you give a little example of what you mean? I'm not 100% sure it's
> > clear to me.
> 
> ptrdiff_t topLeftOffset = leftInflation - aLeftLobe;
> ptrdiff_t topRightOffset = leftInflation + aRightLobe;
> ptrdiff_t bottomLeftOffset = (aTopLobe + aBottomLobe) * stride32bit +
> topLeftOffset;
> ptrdiff_t bottomRightOffset = (aTopLobe + aBottomLobe) * stride32bit +
> topRightOffset;

I assume you mean including the innerIntegral stuff, but then I understood correctly.
> 
> for (int32_t x = 0; x < size.width; x++) {
>   // ...
>   ptrdiff_t rowOffset = y*stride32bit + x;
>   uint32_t topLeft = aIntegralImage[rowOffset + topLeftOffset];
>   uint32_t topRight = aIntegralImage[rowOffset + topRightOffset];
>   uint32_t bottomRight = aIntegralImage[rowOffset + bottomRightOffset];
>   uint32_t bottomLeft = aIntegralImage[rowOffset + bottomLeftOffset];
>   // ...
> }
> 
> > > > +      if (inSkipRectY && x > mSkipRect.x && x < mSkipRect.XMost()) {
> > > > +        x = mSkipRect.XMost();
> > > 
> > > Is this right? The continue will increment x again, leaving it at
> > > mSkipRect.XMost()+1. Either that's wrong, or your test for inclusion in
> > > mSkipRect is wrong.
> > 
> > mSkipRect.XMost() should be the last pixel -in- the rect to be skipped. So
> > mSkipRect.XMost()+1 would be the next pixel to process.. I think.
> 
> The code you have won't skip the loop for the x==mSkipRect.XMost() case. You
> have a similar problem with the y test.

Hrm, I thought about this again and I think you're right, I think this should be x = mSkipRect.XMost() - 1 (or -4 in the SSE2 case). And the comparison should do x >= mSkipRect.x (this is also what the old code did).

> > > You can (and should) replace this division with a multiplication, as I don't
> > > think the compiler will do it for you. It can be done without any
> > > approximation error.
> > 
> > I tried to, but the 64-bit conversion to gain the required precision was
> > slower than doing a simple divide. Or are you saying we could do it in
> > 32-bit without a precision error?
> 
> You need a 32x32 -> high 32 bit multiply (you don't need the lower 32 bits
> of the result). See, e.g.,
> <http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1467632>.
> But even a 64-bit multiply should be substantially faster than a 32-bit
> divide. 
> 
> Basically, let d be the divisor and m = floor(log2(d)) (though obviously you
> would compute m with integer math). Then
> 
> uint32_t a;
> uint32_t b;
> 
> if (d == 1<<m){
>   a = 0xFFFFFFFFU;
>   b = 1;
> }
> else {
>   uint32_t t = (uint32_t)(((uint64_t)1 << (m+32)) / d) + 1;
>   uint32_t r = t*d;
>   b = r >= (1<<m);
>   a = t - b;
> }
> 
> Then, assuming I didn't screw any of that up,
> 
> x/d == (uint32_t)((x + b)*(uint64_t)a >> 32) >> m;
> 
> This reduces the maximum permissible value of x by 1 (so the maximum
> integral image size becomes 0x1010100), unless you want to compute (x+b)
> with 64-bit math or compute (x*(uint64_t)a + b*a), noting that b*a is a
> constant.

Hrm, what I did was what the old code did, basically:

uint32_t reciprocal = (1 << 32) / d;
uint8_t result = (uint64_t(sum) * reciprocal) >> 32;

Which, to my surprise, seemed slower in every measurement I did, but I'll re-evaluate this, as well as try to understand the approach you suggested in more detail, so I can understand why it would be faster or more correct than the above approach.

> 
> > > See comment in C version. mSkipRect.XMost()+4 is even more worrying. Also,
> > > uh, what happens when mSkipRect.XMost() is not a multiple of four?
> > 
> > This loop is not assuming aligned access. Other than that you're right in
> > this case, at least. See above comment.
> 
> I was thinking size.width was a multiple of 4 for some reason, but I guess
> you'll actually compute (and write!) extra values in any case. I hope that's
> okay. I also sure hope this function is idempotent in the skip region,
> because if mSkipRect.x isn't a multiple of 4, you'll compute it for some of
> the pixels inside the skip region, too.

This should be the case; the SkipRect, as I understand it, is simply obscured or in some other way irrelevant.
Blocks: 718453
Comment on attachment 677274 [details] [diff] [review]
Fast C++ and SSE2 blur code v3

Review of attachment 677274 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/2d/BlurSSE2.cpp
@@ +10,5 @@
> +
> +namespace mozilla {
> +namespace gfx {
> +
> +MOZ_ALWAYS_INLINE

Since this is C++, inline implies locality to the translation unit, I believe. So static inline is only useful in code portable between C and C++, to the best of my knowledge.
Comment on attachment 677274 [details] [diff] [review]
Fast C++ and SSE2 blur code v3

Review of attachment 677274 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/2d/BlurSSE2.cpp
@@ +10,5 @@
> +
> +namespace mozilla {
> +namespace gfx {
> +
> +MOZ_ALWAYS_INLINE

Well, then you should fix the other MOZ_ALWAYS_INLINE calls (e.g., LoadIntegralRowFromRow) to not be static.
(In reply to Timothy B. Terriberry (:derf) from comment #30)
> Comment on attachment 677274 [details] [diff] [review]
> Fast C++ and SSE2 blur code v3
> 
> Review of attachment 677274 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> @@ +14,5 @@
> > +MOZ_ALWAYS_INLINE
> > +uint32_t DivideAndPack(__m128i aValues, __m128i aDivisor)
> > +{
> > +  __m128i multiplied1 = _mm_srli_epi64(_mm_mul_epu32(aValues, aDivisor), 32); // 00p300p1
> > +  __m128i multiplied2 = _mm_srli_epi64(_mm_mul_epu32(_mm_srli_epi64(aValues, 32), aDivisor), 32); // 00p400p2
> 
> Instead of shifting multiplied2 down by 32, if you PAND with a constant
> 0xFFFFFFFF00000000FFFFFFFF00000000 mask, you can simply POR these back
> together. That replaces a shift, two shuffles, and an unpack with two simple
> bitwise ops that can issue from any port.

This does not appear to have a performance impact sadly. But the code you suggested is nicer so I'll still include this change :).

> @@ +46,5 @@
> > +
> > +// This function calculates an integral of four pixels stored in the 4
> > +// 32-bit integers on aPixels. i.e. for { 30, 50, 80, 100 } this returns
> > +// { 30, 80, 160, 260 }. This seems to be the fastest way to do this after
> > +// much testing.
> 
> I assume you tried
> 
> __m128i sumPixels = aPixels;
> __m128i currentPixels = _mm_slli_si128(aPixels, 4);
> sumPixels = _mm_add_epi32(sumPixels, currentPixels);
> currentPixels = _mm_unpacklo_epi64(_mm_setzero_si128(), sumPixels);
> return _mm_add_epi32(sumPixels, currentPixels);
> 
> Yes?

Very clever! And it does seem to make a very small performance difference!
(In reply to Timothy B. Terriberry (:derf) from comment #33)
> (In reply to Bas Schouten (:bas.schouten) from comment #31)
> > > See above about pointer arithmetic. I suspect that a single pointer and five
> > > offsets will optimize better than 5 pointers. As a bonus, the offsets are
> > > constant and can be hoisted out of the loop below.
> > 
> > Could you give a little example of what you mean? I'm not 100% sure it's
> > clear to me.
> 
> ptrdiff_t topLeftOffset = leftInflation - aLeftLobe;
> ptrdiff_t topRightOffset = leftInflation + aRightLobe;
> ptrdiff_t bottomLeftOffset = (aTopLobe + aBottomLobe) * stride32bit +
> topLeftOffset;
> ptrdiff_t bottomRightOffset = (aTopLobe + aBottomLobe) * stride32bit +
> topRightOffset;
> 
> for (int32_t x = 0; x < size.width; x++) {
>   // ...
>   ptrdiff_t rowOffset = y*stride32bit + x;
>   uint32_t topLeft = aIntegralImage[rowOffset + topLeftOffset];
>   uint32_t topRight = aIntegralImage[rowOffset + topRightOffset];
>   uint32_t bottomRight = aIntegralImage[rowOffset + bottomRightOffset];
>   uint32_t bottomLeft = aIntegralImage[rowOffset + bottomLeftOffset];
>   // ...
> }
> 

This makes the code a little over 10% slower on VC11 where I tested. So unless you have another idea I'm going to leave this bit out.
(In reply to Bas Schouten (:bas.schouten) from comment #38)
> This makes the code a little over 10% slower on VC11 where I tested. So
> unless you have another idea I'm going to leave this bit out.

It was worth trying, I guess. I'd be curious to know why, but it's not that important.
There was a lot of churn in this review pass, so I probably missed a spot or two! But I think I got most of it.
Attachment #677274 - Attachment is obsolete: true
Attachment #678338 - Flags: review?(tterribe)
Comment on attachment 678338 [details] [diff] [review]
Fast C++ and SSE2 blur code v4

Review of attachment 678338 [details] [diff] [review]:
-----------------------------------------------------------------

A couple of comments left, but r=me with those addressed.

::: gfx/2d/Blur.cpp
@@ +381,5 @@
>    if (stride.isValid()) {
>      mStride = stride.value();
>  
>      CheckedInt<int32_t> size = CheckedInt<int32_t>(mStride) * mRect.height *
> +                               sizeof(uint8_t) + 20;

I mean, it's nice that this is the right type now, but this still only works if sizeof(uint8_t) is one, because new[] takes an item count, not a byte count. Reading my earlier comment I might not have been clear on this (sorry about that).

I guess technically it would still work if sizeof(uint8_t) is larger than one, but would allocate lots of wasted memory.

@@ +494,5 @@
> +      // Fallback to old blurring code when the surface is so large it may
> +      // overflow our integral image!
> +
> +      // No need to use CheckedInt here - we have validated it in the constructor.
> +      size_t szB = stride * size.height * sizeof(uint8_t);

Again, using sizeof() in a value passed to new[].
Attachment #678338 - Flags: review?(tterribe) → review+
Backed out for timeouts and crashes:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=e89f1fce980d
(Note: the xpcshell failures were due to another push)

https://hg.mozilla.org/integration/mozilla-inbound/rev/3a113808c160
Assignee: nobody → bas
Status: NEW → ASSIGNED
https://hg.mozilla.org/mozilla-central/rev/3169efab0148
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla19
Depends on: 815599
Depends on: 815489
Depends on: 829954
Depends on: 834256