Closed Bug 509052 Opened 15 years ago Closed 12 years ago

-moz-box-shadow applied to <html> tag in CSS stylesheet causes significant slowdown

Categories

(Core :: Graphics, defect)

Hardware: x86
OS: macOS
Priority: Not set
Severity: normal

Tracking

RESOLVED FIXED
mozilla19

People

(Reporter: smartell, Assigned: bas.schouten)

References

(Depends on 3 open bugs, Blocks 1 open bug)

Details

Attachments

(4 files, 3 obsolete files)

On my blog, http://blog.seanmartell.com/, I've created a vignette effect around the edge of the site using -moz-box-shadow with a 200px blur.

The resulting effect causes a major slowdown when scrolling and when resizing the window.

It may be an extreme use case of -moz-box-shadow, but I've filed this as a heads up.
Sure is slow to scroll!

Shark says 71.5% of the fun is in:

libthebes.dylib	gfxAlphaBoxBlur::Paint(gfxContext*, gfxPoint const&)	

Moving to GFX:Thebes
Component: General → GFX: Thebes
Product: Firefox → Core
QA Contact: general → thebes
Version: unspecified → Trunk
Blocks: 478721
Fixing this might not be easy. Basically, we currently redraw the shadow every time we scroll, and the shadow drawing process is very expensive. One way to fix this might be to somehow cache the shadow on a layer, though I'm not exactly sure how that would work out.

It would be useful if you could attach a demo and also compare the performance with Safari.
Blocks: 527728
Depends on: 469589
I'd like to see a simple testcase. The first thing is to check to make sure we're not drawing any more of the shadow than we really need to. Then there are a couple of obvious things we can do to speed up gfxAlphaBoxBlur:
1) SSE2 love
2) computing the blur of a large box is completely naive right now. We draw the box to a temporary surface and explicitly compute the blur of every pixel of the result. But when one or both of the box side lengths are significantly larger than the blur kernel diameter, you can optimize a lot. The blur result can be divided into three parts:
-- the interior: all points which are further than the blur radius away from any point on the outside of the box. These points are all solid.
-- edges: all points which are within the blur radius of a straight horizontal or vertical edge, but no other points outside the box. These consist of rows and columns of identical pixels.
-- corners: all other points. These need to be calculated explicitly.
3) symmetry: we can probably speed things up by nearly a factor of 4 in almost all cases just by observing that boxes almost always have 4-way symmetry, therefore we can just compute one quarter of the blur and mirror that.

It'd be slightly tricky to extend the gfxAlphaBoxBlur interface to express the information you need for optimizations 2 and 3. Perhaps for 3 we would pass in optional x and y values indicating axes of symmetry, and for 2 we would pass in optional (x1,x2) and (y1,y2) ranges indicating ranges of identical columns and rows of pixels in the source image.

This is a nice little problem actually. I think we could make common shadows super-fast this way.
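As a hedged illustration of optimization 3 (the name below is made up; nothing here is existing tree code): once one quadrant of a symmetric blurred box has been computed, the other three quadrants can be filled in by mirroring it.

#include <cstdint>

// Hypothetical sketch: aData already holds the computed top-left quadrant of a
// blurred box that is symmetric about its horizontal and vertical centre
// lines; fill the other three quadrants by mirroring it.
void MirrorQuadrant(uint8_t* aData, int32_t aStride, int32_t aWidth, int32_t aHeight)
{
  for (int32_t y = 0; y < (aHeight + 1) / 2; ++y) {
    uint8_t* row = aData + y * aStride;
    uint8_t* mirroredRow = aData + (aHeight - 1 - y) * aStride;
    for (int32_t x = 0; x < (aWidth + 1) / 2; ++x) {
      uint8_t value = row[x];               // computed quadrant
      row[aWidth - 1 - x] = value;          // mirror horizontally
      mirroredRow[x] = value;               // mirror vertically
      mirroredRow[aWidth - 1 - x] = value;  // mirror both ways
    }
  }
}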
Attached patch sse2 blurSplinter Review
Here's an sse2 version of the blur code I did a while ago. I haven't looked at it for probably a year so I'm sure it's rotted some and needs some cleanup.
Hi Jeff. Regarding Safari's performance when using box-shadow, it seems to perform fine. Apple uses that CSS declaration on a lot of the pages of apple.com. See http://www.apple.com/imac/the-new-imac/ as an example.
Another optimization:
4) for small radii (which are common), apply an explicit Gaussian kernel in one pass instead of the 6-pass box-blur approach. I suspect the explicit kernel will be faster at least for radii <= 2.
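For reference, a minimal sketch of what optimization 4 could look like on an alpha-only surface; the function name, parameters and sigma choice are illustrative assumptions, not code from any patch here.

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical sketch: blur an 8-bit alpha surface with an explicit, normalized
// Gaussian kernel in a single pass. Only plausible for very small radii, where
// the (2r+1)^2 kernel taps are cheaper than six separable box-blur passes.
void GaussianBlurSmallRadius(const uint8_t* aSrc, uint8_t* aDst,
                             int aWidth, int aHeight, int aStride, int aRadius)
{
  const int diameter = 2 * aRadius + 1;
  const float sigma = aRadius / 2.0f + 0.5f;  // illustrative sigma choice
  std::vector<float> kernel(diameter * diameter);
  float sum = 0.0f;
  for (int ky = -aRadius; ky <= aRadius; ++ky) {
    for (int kx = -aRadius; kx <= aRadius; ++kx) {
      float w = std::exp(-(kx * kx + ky * ky) / (2.0f * sigma * sigma));
      kernel[(ky + aRadius) * diameter + (kx + aRadius)] = w;
      sum += w;
    }
  }
  for (float& w : kernel) {
    w /= sum;  // normalize so a fully opaque input stays fully opaque
  }

  for (int y = 0; y < aHeight; ++y) {
    for (int x = 0; x < aWidth; ++x) {
      float acc = 0.0f;
      for (int ky = -aRadius; ky <= aRadius; ++ky) {
        for (int kx = -aRadius; kx <= aRadius; ++kx) {
          int sx = std::min(std::max(x + kx, 0), aWidth - 1);   // clamp at the edges
          int sy = std::min(std::max(y + ky, 0), aHeight - 1);
          acc += aSrc[sy * aStride + sx] *
                 kernel[(ky + aRadius) * diameter + (kx + aRadius)];
        }
      }
      aDst[y * aStride + x] = static_cast<uint8_t>(acc + 0.5f);
    }
  }
}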
Depends on: 544099
A friend of mine, Mike Herf, has a very fast box shadow algorithm: 

http://www.stereopsis.com/shadowrect/
That's nice, but it's a method for fast shadows *on rectangles*. We often deal with nonrectangular shapes due to border-radius, and when we draw shadows on form controls styled with native themes (where we have no idea what the shape is, we just get a pixmap with alpha values). (The latter capability is new in Gecko 2.0 FWIW.)

Since this bug was filed, we implemented another optimization in bug 544099 that stopped us from computing shadow values where we know they will be covered by the box itself. That solved most of our performance problems.

If we need to optimize more, I would focus on optimization #2 above. In particular I would let callers pass in a range of rows over which each row of the shape mask is known to be identical, and a similar range of columns. We can use that information to compute a range of shadow rows and columns that are known to be identical. Then after we've computed the first row or column of the shadow in that range, we can memcpy in the rest.
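As a rough illustration of that idea (a hypothetical helper, not from any attached patch): once the first shadow row in the known-identical range has been computed, the remaining rows are byte-for-byte copies of it.

#include <cstdint>
#include <cstring>

// Hypothetical helper: rows [aFirstRow, aLastRow] of the blurred alpha surface
// are known to be identical, so blur only aFirstRow and copy it into the rest
// instead of recomputing the blur for each row.
void FillIdenticalShadowRows(uint8_t* aData, int32_t aStride,
                             int32_t aFirstRow, int32_t aLastRow)
{
  const uint8_t* referenceRow = aData + aFirstRow * aStride;
  for (int32_t y = aFirstRow + 1; y <= aLastRow; ++y) {
    memcpy(aData + y * aStride, referenceRow, aStride);
  }
}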

But we need testcases to justify further optimizations. Sean's blog is fast to scroll for me, but maybe it's changed since this bug was filed.
Depends on: 633627
Depends on: 648449
Blocks: 511739
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #8)
> But we need testcases to justify further optimizations. 

That's easy enough: think about CSS animations or transitions involving shadows - I am using them, e.g., in the welcome screen of my extension (https://github.com/CAFxX/ImageTweak/blob/master/content/welcome.htm)
(In reply to Jeff Muizelaar [:jrmuizel] from comment #4)
> Created attachment 413324 [details] [diff] [review]
> sse2 blur

This looks like something we'd still want to have. Did this patch work? 

A version for ARM NEON intrinsics would be good to have as well :)
(In reply to Jet Villegas (:jet) from comment #10)
> (In reply to Jeff Muizelaar [:jrmuizel] from comment #4)
> > Created attachment 413324 [details] [diff] [review]
> > sse2 blur
> 
> This looks like something we'd still want to have. Did this patch work? 
> 
> A version for ARM NEON intrinsics would be good to have as well :)

I believe it did, but the patch has bitrotted substantially since I wrote it. Currently, though, our mobile shadow performance is being killed by bug 754985.
Depends on: 758825
I've rebased the attached patch, and it works. It's for SVG, though. I'll benchmark it to check that it's faster than the normal version. If it performs better, I'll try to port it over to gfx/2d/. However, since it tries to operate on columns, it needs to shuffle a lot and fetches far more cache lines than necessary, so I'm not sure how fast it is.

I think we can avoid all that complicated code and suboptimal memory access by transposing the image between the horizontal and vertical passes (meaning rewriting the whole thing):

blur -> blur -> blur -> transpose -> blur -> blur -> blur -> transpose

That way, we only do sequential data accesses, so:
- we are more cache-efficient
- we can use SIMD instruction without too much shuffling
- we avoid code quasi-duplication like in gfx/2d/Blur.cpp
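
A minimal sketch of that transpose-based pipeline, assuming a tightly packed alpha-only surface; all names here are made up for illustration and the row blur is deliberately naive, so this is not the code from the eventual patch.

#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical transpose: columns of the source become rows of the destination,
// so the "vertical" blur passes can run as cache-friendly row passes.
static void Transpose(const uint8_t* aSrc, int aSrcStride,
                      uint8_t* aDst, int aDstStride, int aWidth, int aHeight)
{
  for (int y = 0; y < aHeight; ++y) {
    for (int x = 0; x < aWidth; ++x) {
      aDst[x * aDstStride + y] = aSrc[y * aSrcStride + x];
    }
  }
}

// Deliberately naive single box-blur pass over one row; a real implementation
// would use a sliding sum and handle the edges more carefully.
static void BoxBlurRow(const uint8_t* aSrc, uint8_t* aDst, int aWidth, int aLobe)
{
  for (int x = 0; x < aWidth; ++x) {
    int lo = std::max(x - aLobe, 0);
    int hi = std::min(x + aLobe, aWidth - 1);
    int sum = 0;
    for (int i = lo; i <= hi; ++i) {
      sum += aSrc[i];
    }
    aDst[x] = static_cast<uint8_t>(sum / (hi - lo + 1));
  }
}

// Three horizontal passes, transpose, three more "horizontal" passes on the
// transposed image, transpose back: every pass walks memory sequentially.
void BlurWithTranspose(uint8_t* aData, int aWidth, int aHeight, int aLobe)
{
  std::vector<uint8_t> tmp(aWidth * aHeight);
  std::vector<uint8_t> transposed(aWidth * aHeight);
  for (int y = 0; y < aHeight; ++y) {
    BoxBlurRow(aData + y * aWidth, tmp.data() + y * aWidth, aWidth, aLobe);
    BoxBlurRow(tmp.data() + y * aWidth, aData + y * aWidth, aWidth, aLobe);
    BoxBlurRow(aData + y * aWidth, tmp.data() + y * aWidth, aWidth, aLobe);
  }
  Transpose(tmp.data(), aWidth, transposed.data(), aHeight, aWidth, aHeight);
  for (int x = 0; x < aWidth; ++x) {
    BoxBlurRow(transposed.data() + x * aHeight, tmp.data() + x * aHeight, aHeight, aLobe);
    BoxBlurRow(tmp.data() + x * aHeight, transposed.data() + x * aHeight, aHeight, aLobe);
    BoxBlurRow(transposed.data() + x * aHeight, tmp.data() + x * aHeight, aHeight, aLobe);
  }
  // tmp now holds the fully blurred image in transposed layout; put it back.
  Transpose(tmp.data(), aHeight, aData, aWidth, aHeight, aWidth);
}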

I'll also have a look at the points roc mentions in comments 3 and 6.
Quick update on this work.

Analyzed in cachegrind, the current algorithm shows a 95% cache miss rate on the vertical pass. That is alright for tiny blurs (for example, in the asteroids benchmark the surface to blur is smaller than 40x40px, and all the temporary surfaces we use fit in the L1 cache), but it totally kills performance on big images, since we need to pull a full cache line to get the next pixel (that is, pull 64 times the necessary amount of memory for a 64-byte cache line), and because that cache line will never be in L1 (which is on the order of 4k) and, if the stride is big enough, is unlikely to be in L2 (around 32k), because we use temporary buffers. On my machine I'm saved by a huge L3 cache, but there is no such thing on older machines.

On mobile, depending on whether the device is high-end or not, the situation can change a lot.

I've reworked the algorithm we use, using the approach of comment 12 (plus a few tricks), and I've got a speedup between 1.7x and 2.2x so far. It is currently out of tree so I can profile it properly, but I plan to put it back soon (the code needs a bit of cleanup).

An in-place version of the blur algorithm would be even faster, I guess, since it could cache even more, but it seems a bit trickier to write.

Before going the SSE2 and NEON route to go faster, I guess roc's suggestions would make more sense to implement anyway.
What that describes is almost exactly what we do, plus comment #12. Though he does have some SIMD code.
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #15)
> What that describes is almost exactly what we do, plus comment #12. Though
> he does have some SIMD code.

The code above has a much faster inner kernel than ours. Currently our inner loop is not very quick.
(In reply to Paul ADENOT (:padenot) [on vacation until around the 2012-10-21] from comment #13)
> Quick update on this work.
> 
> Analyzed in cachegrind, the current algorithm shows a 95% cache miss on the
> vertical pass, which is alright for tiny blurs (for example, in the
> asteroids benchmark, the surface to blur is smaller than 40x40px, and all
> the temporary surfaces we use fit in the L1 cache), but totally kill
> performance on big images, since we need to pull a full cache line to get
> the next pixel (that is, pull 64 times the necessary amount of memory for a
> 64 bytes cache line), and because this cache line will never be in L1, which
> is on the order of 4k, and is unlikely to be in L2 (around 32k), if the
> stride is big enough, because we use temporary buffer. On my machine, I've
> saved by the L3 cache that is huge, but there is no such thing on older
> machines.
> 
> On mobile, depending if we use an expensive device or not, the situation can
> change a lot.
> 
> I've reworked the algorithm we use, using the approach of comment 12 (plus a
> few tricks), and I've got a speedup between 1.7 and 2.2 so far. It is
> currently out of tree to be able to profile properly, but I plan to put it
> back soon (the code need a bit of cleaning).
> 
> An in place version for the blur algorithm would be even faster, I guess, to
> be able to cache even more, but it seems a bit trickier to write.
> 
> Before going the SSE2 and NEON route to go faster, I guess roc suggestions
> would make more sense to implement anyways.

Another approach to this could be doing the blur in 'blocks' whose size is at most the cache line size. This would probably improve performance a lot, as it would mean the vertical pass is much less likely to have cache misses.
I've played around here with simplifying our code significantly by using a single-pass blur where I compute an integral image and then do a box blur. This, even in a very naive implementation, already performs better than our current code, and consists of a lot less code.

I'm going to explore this approach a little further, since it should boost performance both on Mobile and Desktop.
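For readers unfamiliar with the technique, here is a naive, hedged sketch of an integral-image box blur on an alpha surface; it is illustrative only (the actual patch additionally handles inflation, edge clamping and the skip rect):

#include <cstdint>
#include <vector>

// Hypothetical sketch: build a summed-area table, then read each blurred pixel
// as the average of a (2r+1)x(2r+1) box with four table lookups. The table uses
// uint32_t, which is why very large surfaces (sum > 2^32 / 255) need a fallback.
void IntegralImageBoxBlur(const uint8_t* aSrc, uint8_t* aDst,
                          int aWidth, int aHeight, int aRadius)
{
  // integral[y][x] = sum of aSrc over the rectangle [0, x) x [0, y).
  std::vector<uint32_t> integral((aWidth + 1) * (aHeight + 1), 0);
  for (int y = 0; y < aHeight; ++y) {
    uint32_t rowSum = 0;
    for (int x = 0; x < aWidth; ++x) {
      rowSum += aSrc[y * aWidth + x];
      integral[(y + 1) * (aWidth + 1) + (x + 1)] =
          integral[y * (aWidth + 1) + (x + 1)] + rowSum;
    }
  }

  auto clamp = [](int v, int lo, int hi) { return v < lo ? lo : (v > hi ? hi : v); };
  for (int y = 0; y < aHeight; ++y) {
    for (int x = 0; x < aWidth; ++x) {
      int x0 = clamp(x - aRadius, 0, aWidth);
      int x1 = clamp(x + aRadius + 1, 0, aWidth);
      int y0 = clamp(y - aRadius, 0, aHeight);
      int y1 = clamp(y + aRadius + 1, 0, aHeight);
      uint32_t sum = integral[y1 * (aWidth + 1) + x1] - integral[y0 * (aWidth + 1) + x1] -
                     integral[y1 * (aWidth + 1) + x0] + integral[y0 * (aWidth + 1) + x0];
      aDst[y * aWidth + x] = static_cast<uint8_t>(sum / ((x1 - x0) * (y1 - y0)));
    }
  }
}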
Bas, I've got a patch ready for review (green on try) that does what you describe, and it has been extensively benchmarked against other techniques. It speeds up the blur by a factor of two. I'll post the patch as it is, and I've also got numbers to support some things. I was planning to finish this when I start again on the fifth of November.

Anyway, last time I checked, doing the vertical pass by blocks was slower than the approach I took, that is, transposing a few scanlines into a small buffer.
Attached patch PatchSplinter Review
The approach here is to remove as many tests as possible from the inner loop (in this patch, located in BoxBlurScanline). To do so, we implement the skip-box optimization outside of the loop (that implies a bit of ugliness at the call site, though, but I'm sure this can be improved). I've got a few numbers that back the choice of NUMBER_SCANLINE_TRANSPOSE = 32, and other graphs that show this code consistently gives a speedup between 1.8x and 2.3x.

This code is based on roc's blur in the SVG code (simplified to operate on the alpha channel only). I've benchmarked a few blurs (the original one, and another one that does in-place computation), and this version is the fastest. There are certainly some things I overlooked, though.

There is a bit of printf-ing and a few asserts that are not intended to stay.
This is the other idea I suggested: it uses the integral image to calculate the blur. It's not 100% correct yet, as the corners where we need to clamp are not implemented. And I didn't spend a lot of time optimizing this, but it already features a ~2x speedup and it has the advantage of being very simple code.

It should be noted this will only work for surfaces under 8 MPixels. It also uses a little more memory obviously.
(In reply to Bas Schouten (:bas.schouten) from comment #21)
> Created attachment 675944 [details] [diff] [review]
> Use integral image for blurring
> 
> This is the other idea I suggested, this uses the integral image to
> calculate the blur, it's not 100% correct let as the corners where we need
> to clamp are not implemented. And I didn't spend a lot of time optimizing
> this, but it already also features a ~2x speedup and it has the advantage of
> being very simple code.
> 
> It should be noted this will only work for surfaces under 8 MPixels. It also
> uses a little more memory obviously.

I've got a basic SSE2 version of this approach working. I'm not sure I'm doing everything the best way I can do it. But at this point it looks to be about a 4-5x speedup over the original blurring code.
(In reply to Bas Schouten (:bas.schouten) from comment #22)
> (In reply to Bas Schouten (:bas.schouten) from comment #21)
> > Created attachment 675944 [details] [diff] [review]
> > Use integral image for blurring
> > 
> > This is the other idea I suggested, this uses the integral image to
> > calculate the blur, it's not 100% correct let as the corners where we need
> > to clamp are not implemented. And I didn't spend a lot of time optimizing
> > this, but it already also features a ~2x speedup and it has the advantage of
> > being very simple code.
> > 
> > It should be noted this will only work for surfaces under 8 MPixels. It also
> > uses a little more memory obviously.
> 
> I've got a basic SSE2 version of this approach working. I'm not sure I'm
> doing everything the best way I can do it. But at this point it looks to be
> about a 4-5x speedup over the original blurring code.

Having further improved my code, my SSE2 version is now about 6x faster than the original code.
I've almost finished my patch and will attach a final, working patch here soon. Here's my test data:

SSE2:
Time taken to execute blur: 0,038924ms Size: 38x30
Time taken to execute blur: 0,023525ms Size: 38x30
Time taken to execute blur: 0,027803ms Size: 49x30
Time taken to execute blur: 0,024381ms Size: 38x30
Time taken to execute blur: 0,046623ms Size: 101x30
Time taken to execute blur: 0,029086ms Size: 44x22
Time taken to execute blur: 0,025664ms Size: 42x20
Time taken to execute blur: 5,564833ms Size: 2476x200
Time taken to execute blur: 8,142364ms Size: 2476x276
Time taken to execute blur: 8,065372ms Size: 2476x276
Time taken to execute blur: 7,992229ms Size: 2476x276
Time taken to execute blur: 7,819852ms Size: 2476x276
Time taken to execute blur: 8,060239ms Size: 2476x276
Blur radius 120px:
Time taken to execute blur: 15,016066ms Size: 2618x418
Time taken to execute blur: 14,333401ms Size: 2618x418
Time taken to execute blur: 14,517754ms Size: 2618x418
Time taken to execute blur: 15,402310ms Size: 2618x418
Time taken to execute blur: 14,791077ms Size: 2618x418

New C++:
Time taken to execute blur: 0,028231ms Size: 38x30
Time taken to execute blur: 0,026092ms Size: 38x30
Time taken to execute blur: 0,040207ms Size: 49x30
Time taken to execute blur: 0,026947ms Size: 38x30
Time taken to execute blur: 0,111211ms Size: 101x30
Time taken to execute blur: 0,032080ms Size: 44x22
Time taken to execute blur: 0,023098ms Size: 42x20
Time taken to execute blur: 16,822391ms Size: 2476x235
Time taken to execute blur: 19,097513ms Size: 2476x276
Time taken to execute blur: 19,320363ms Size: 2476x276
Time taken to execute blur: 18,412710ms Size: 2476x276
Time taken to execute blur: 18,315614ms Size: 2476x276
Time taken to execute blur: 18,123133ms Size: 2476x276
Blur radius 120px:
Time taken to execute blur: 35,863871ms Size: 2618x418
Time taken to execute blur: 34,409572ms Size: 2618x418
Time taken to execute blur: 34,192710ms Size: 2618x418
Time taken to execute blur: 34,634560ms Size: 2618x418
Time taken to execute blur: 35,317653ms Size: 2618x418

Old:
Time taken to execute blur: 0,055606ms Size: 38x30
Time taken to execute blur: 0,054322ms Size: 38x30
Time taken to execute blur: 0,062022ms Size: 49x30
Time taken to execute blur: 0,051328ms Size: 38x30
Time taken to execute blur: 0,157834ms Size: 101x30
Time taken to execute blur: 0,053467ms Size: 44x22
Time taken to execute blur: 0,032508ms Size: 42x20
Time taken to execute blur: 8,223206ms Size: 2476x77
Time taken to execute blur: 48,733990ms Size: 2476x276
Time taken to execute blur: 47,485431ms Size: 2476x276
Time taken to execute blur: 45,725729ms Size: 2476x276
Time taken to execute blur: 47,477732ms Size: 2476x276
Time taken to execute blur: 47,254027ms Size: 2476x276
Blur radius 120px:
Time taken to execute blur: 90,573747ms Size: 2618x418
Time taken to execute blur: 89,504409ms Size: 2618x418
Time taken to execute blur: 88,345675ms Size: 2618x418
Time taken to execute blur: 89,500132ms Size: 2618x418
Time taken to execute blur: 86,703600ms Size: 2618x418
Attached patch Fast C++ and SSE2 blur code (obsolete) — Splinter Review
Attachment #676930 - Flags: review?(tterribe)
(In reply to Bas Schouten (:bas.schouten) from comment #25)
> Created attachment 676930 [details] [diff] [review]
> Fast C++ and SSE2 blur code

It would be nice if we also added tests for the blur code.
Comment on attachment 676930 [details] [diff] [review]
Fast C++ and SSE2 blur code

Review of attachment 676930 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/2d/BlurSSE2.cpp
@@ +49,5 @@
> +// { 30, 80, 160, 260 }. This seems to be the fastest way to do this after
> +// much testing.
> +MOZ_ALWAYS_INLINE
> +__m128i AccumulatePixelSums(__m128i aPixels)
> +{

It might be worth adding a comment that _mm_slli_si128 shifts by bytes and not bits. That threw me off when reading this.
Attached patch Fast C++ and SSE2 blur code v2 (obsolete) — Splinter Review
Attachment #676930 - Attachment is obsolete: true
Attachment #676930 - Flags: review?(tterribe)
Attachment #677195 - Flags: review?(tterribe)
Attached patch Fast C++ and SSE2 blur code v3 (obsolete) — Splinter Review
Fixed some more small issues, all tests green now!
Attachment #677274 - Flags: review?(tterribe)
Attachment #677195 - Attachment is obsolete: true
Attachment #677195 - Flags: review?(tterribe)
Comment on attachment 677274 [details] [diff] [review]
Fast C++ and SSE2 blur code v3

Review of attachment 677274 [details] [diff] [review]:
-----------------------------------------------------------------

There's a bunch of minor issues, but a couple are worrying, so r- for now.

::: gfx/2d/Blur.cpp
@@ +402,5 @@
>    if (stride.isValid()) {
>      mStride = stride.value();
>  
>      CheckedInt<int32_t> size = CheckedInt<int32_t>(mStride) * mRect.height *
> +                               sizeof(unsigned char) + 20;

I don't understand why you're multiplying by sizeof(unsigned char) if you're calling new uint8_t[].

@@ +483,5 @@
>  
>      if (mSpreadRadius.width > 0 || mSpreadRadius.height > 0) {
> +      // No need to use CheckedInt here - we have validated it in the constructor.
> +      size_t szB = stride * size.height * sizeof(unsigned char);
> +      unsigned char* tmpData = new uint8_t[szB];

Again the types here are inconsistent. Please pick one.

@@ +503,5 @@
> +    int32_t maxLeftLobe = RoundUpToMultipleOf4(horizontalLobes[0][0] + 1).value();
> +
> +    IntSize integralImageSize(size.width + maxLeftLobe + horizontalLobes[1][1],
> +                              size.height + verticalLobes[0][0] + verticalLobes[1][1] + 1);
> +    if ((integralImageSize.width * integralImageSize.height) > (1 << 23)) {

Why is this 1<<23 instead of 1<<24? Isn't the actual maximum here UINT32_MAX/255 == 0x1010101?

@@ +505,5 @@
> +    IntSize integralImageSize(size.width + maxLeftLobe + horizontalLobes[1][1],
> +                              size.height + verticalLobes[0][0] + verticalLobes[1][1] + 1);
> +    if ((integralImageSize.width * integralImageSize.height) > (1 << 23)) {
> +      // Fallback to old blurring code when the surface is so large it may
> +      // overflow out integral image!

Grammar fail.

@@ +508,5 @@
> +      // Fallback to old blurring code when the surface is so large it may
> +      // overflow out integral image!
> +
> +      // No need to use CheckedInt here - we have validated it in the constructor.
> +      size_t szB = stride * GetSize().height * sizeof(unsigned char);

Don't you already have the GetSize() value in "size"? Also see above about the types.

But really, you should probably just factor this out into common code above the two if blocks. Then you would only need to allocate things once.

@@ +510,5 @@
> +
> +      // No need to use CheckedInt here - we have validated it in the constructor.
> +      size_t szB = stride * GetSize().height * sizeof(unsigned char);
> +      unsigned char* tmpData = new uint8_t[szB];
> +      if (!tmpData)

Huh? Is this supposed to be a fallible allocation? The one above certainly isn't, though the malloc() that used to be used here was.

@@ +517,5 @@
> +      memset(tmpData, 0, szB);
> +
> +      if (mBlurRadius.width > 0) {
> +        int32_t lobes[3][2];
> +        ComputeLobes(mBlurRadius.width, lobes);

You've already computed these. Don't compute them again.

@@ +522,5 @@
> +        BoxBlurHorizontal(mData, tmpData, lobes[0][0], lobes[0][1], stride, GetSize().height, mSkipRect);
> +        BoxBlurHorizontal(tmpData, mData, lobes[1][0], lobes[1][1], stride, GetSize().height, mSkipRect);
> +        BoxBlurHorizontal(mData, tmpData, lobes[2][0], lobes[2][1], stride, GetSize().height, mSkipRect);
> +      } else {
> +        memcpy(tmpData, mData, stride * GetSize().height);

Instead of a gigantic memcpy, you should just do some pointer swapping.

@@ +525,5 @@
> +      } else {
> +        memcpy(tmpData, mData, stride * GetSize().height);
> +      }
> +      if (mBlurRadius.height > 0) {
> +        int32_t lobes[3][2];

You've already computed these. Don't compute them again.

@@ +531,5 @@
> +        BoxBlurVertical(tmpData, mData, lobes[0][0], lobes[0][1], stride, GetSize().height, mSkipRect);
> +        BoxBlurVertical(mData, tmpData, lobes[1][0], lobes[1][1], stride, GetSize().height, mSkipRect);
> +        BoxBlurVertical(tmpData, mData, lobes[2][0], lobes[2][1], stride, GetSize().height, mSkipRect);
> +      } else {
> +        memcpy(mData, tmpData, stride * GetSize().height);

Instead of a gigantic memcpy, you should just do some pointer swapping.

@@ +539,2 @@
>      } else {
> +      int32_t integralImageStride = GetAlignedStride<16>(integralImageSize.width * 4);

Why int32_t instead of, say, size_t (which, being unsigned, would allow the compiler to optimize the "/ 4" below much better).

@@ +581,5 @@
> +    uint32_t *intPrevRow = aIntegralImage + (y - 1) * stride32bit;
> +    uint32_t *intFirstRow = aIntegralImage;
> +
> +    for (int x = 0; x < integralImageSize.width; x++) {
> +      intRow[x] = intFirstRow[x] + intPrevRow[x];

You're increasing your cache footprint by 33% here to save 1 add/pixel. I think you should just take the contents of the loop below and put it in a static function, and then call it repeatedly with a pointer to the first row of the source image (and do the same with the last row of the image below). You'll get less code (i.e., LoadIntegralRowFromRow goes away) and allow wider rows before you exhaust L1 cache.

@@ +597,5 @@
> +      currentRowSum += pixel;
> +      *intRow++ = currentRowSum + *intPrevRow++;
> +    }
> +    for (int x = aLeftInflation; x < (aSize.width + aLeftInflation); x += 4) {
> +        uint32_t pixel = *(uint32_t*)(sourceRow + (x - aLeftInflation));

1) This code is little-endian specific, but I don't see an endianess check/fallback anywhere.
2) This isn't a "pixel", it's the alpha channel values from 4 of them, so a better name would be nice.
3) It's currently shadowing the pixel variable you declared just above.

@@ +599,5 @@
> +    }
> +    for (int x = aLeftInflation; x < (aSize.width + aLeftInflation); x += 4) {
> +        uint32_t pixel = *(uint32_t*)(sourceRow + (x - aLeftInflation));
> +        currentRowSum += pixel & 0xff;
> +        *intRow++ = *intPrevRow++ + currentRowSum;

It's probably a small thing, but compilers are generally able to optimize intRow[x], intRow[x+1], etc. better than pointer arithmetic (or at least that's what the AMD software optimization guide tells me, and generally matches my experience).

@@ +659,5 @@
> +  aLeftLobe++;
> +  aTopLobe++;
> +  int32_t boxSize = (aLeftLobe + aRightLobe) * (aTopLobe + aBottomLobe);
> +
> +  if (boxSize == 1) {

Maybe "<= 1" for safety?

@@ +665,5 @@
> +  }
> +
> +  uint32_t stride32bit = aIntegralImageStride / 4;
> +
> +  int32_t leftInflation = RoundUpToMultipleOf4(aLeftLobe).value();

It bothers me that we compute this in two places. You already had to do it once to compute aIntegralImageStride, and you're relying on the fact that you're doing the same calculation here, which is dangerous if someone wants to modify the alignment later. You should just pass both aLeftLobe and aLeftInflation to this function.

@@ +671,5 @@
> +  GenerateIntegralImage_C(leftInflation, aRightLobe, aTopLobe, aBottomLobe,
> +                          aIntegralImage, aIntegralImageStride, mData,
> +                          mStride, size);
> +
> +  uint32_t *innerIntegral = aIntegralImage + (aTopLobe * stride32bit) + leftInflation;

See above about pointer arithmetic. I suspect that a single pointer and five offsets will optimize better than 5 pointers. As a bonus, the offsets are constant and can be hoisted out of the loop below.

@@ +683,5 @@
> +    uint32_t *bottomLeftBase = innerIntegral + ((y + aBottomLobe) * stride32bit - aLeftLobe);
> +
> +    for (int32_t x = 0; x < size.width; x++) {
> +      if (inSkipRectY && x > mSkipRect.x && x < mSkipRect.XMost()) {
> +        x = mSkipRect.XMost();

Is this right? The continue will increment x again, leaving it at mSkipRect.XMost()+1. Either that's wrong, or your test for inclusion in mSkipRect is wrong.

@@ +685,5 @@
> +    for (int32_t x = 0; x < size.width; x++) {
> +      if (inSkipRectY && x > mSkipRect.x && x < mSkipRect.XMost()) {
> +        x = mSkipRect.XMost();
> +        // Trigger early jump on coming loop iterations, this will be reset
> +        // next line anyway.

I had to read this comment three times before I could parse what you were saying.

@@ +697,5 @@
> +
> +      int32_t value = bottomRight - topRight - bottomLeft;
> +      value += topLeft;
> +
> +      mData[mStride * y + x] = value / boxSize;

You can (and should) replace this division with a multiplication, as I don't think the compiler will do it for you. It can be done without any approximation error.

::: gfx/2d/BlurSSE2.cpp
@@ +10,5 @@
> +
> +namespace mozilla {
> +namespace gfx {
> +
> +MOZ_ALWAYS_INLINE

Shouldn't this be static?

@@ +14,5 @@
> +MOZ_ALWAYS_INLINE
> +uint32_t DivideAndPack(__m128i aValues, __m128i aDivisor)
> +{
> +  __m128i multiplied1 = _mm_srli_epi64(_mm_mul_epu32(aValues, aDivisor), 32); // 00p300p1
> +  __m128i multiplied2 = _mm_srli_epi64(_mm_mul_epu32(_mm_srli_epi64(aValues, 32), aDivisor), 32); // 00p400p2

Instead of shifting multiplied2 down by 32, if you PAND with a constant 0xFFFFFFFF00000000FFFFFFFF00000000 mask, you can simply POR these back together. That replaces a shift, two shuffles, and an unpack with two simple bitwise ops that can issue from any port.

@@ +46,5 @@
> +
> +// This function calculates an integral of four pixels stored in the 4
> +// 32-bit integers on aPixels. i.e. for { 30, 50, 80, 100 } this returns
> +// { 30, 80, 160, 260 }. This seems to be the fastest way to do this after
> +// much testing.

I assume you tried

__m128i sumPixels = aPixels;
__m128i currentPixels = _mm_slli_si128(aPixels, 4);
sumPixels = _mm_add_epi32(sumPixels, currentPixels);
currentPixels = _mm_unpacklo_epi64(_mm_setzero_si128(), sumPixels);
return _mm_add_epi32(sumPixels, currentPixels);

Yes?

@@ +60,5 @@
> +
> +  return _mm_add_epi32(sumPixels, currentPixels);
> +}
> +
> +// NOTE: aLeftInflation needs to be a multiple of 4 here!

Instead of just adding a note, maybe you should ASSERT it?

@@ +72,5 @@
> +
> +  IntSize integralImageSize(aSize.width + aLeftInflation + aRightInflation,
> +                            aSize.height + aTopInflation + aBottomInflation);
> +
> +  LoadIntegralRowFromRow(aIntegralImage, aSource, aSize.width, aLeftInflation, aRightInflation);

I'm less sure about the applicability of the advice I gave for the C version here, since the main loop has to unpack the pixels and do a prefix sum instead of a simple add, but given the poor memory access/arithmetic balance of _this_ loop, it still might be worth trying.

@@ +80,5 @@
> +    __m128i *intPrevRow = (__m128i*)(aIntegralImage + (y - 1) * stride32bit);
> +    __m128i *intFirstRow = (__m128i*)aIntegralImage;
> +
> +    for (int x = 0; x < integralImageSize.width; x += 4) {
> +      __m128i firstRow = _mm_load_si128(intFirstRow + (x / 4));

"/" is expensive for signed integers. Use ">> 2" instead, or have x count in units of __m128i instead of units of uint32_t to begin with, or use uint32_t pointers instead of __m128i pointers like you do in the loop below.

@@ +94,5 @@
> +    uint8_t *sourceRow = aSource + aSourceStride * (y - aTopInflation);
> +
> +    uint32_t pixel = sourceRow[0];
> +    for (int x = 0; x < aLeftInflation; x += 4) {
> +      __m128i sumPixels = AccumulatePixelSums(_mm_shuffle_epi32(_mm_set1_epi32(pixel), _MM_SHUFFLE(0, 0, 0, 0)));

Doesn't _mm_set1_epi32() _already_ broadcast that value to all 4 slots, eliminating the need for the shuffle? It's _mm_cvtsi32_si128() that only sets the first slot and clears the rest to zero.

@@ +106,5 @@
> +    for (int x = aLeftInflation; x < (aSize.width + aLeftInflation); x += 4) {
> +      uint32_t pixels = *(uint32_t*)(sourceRow + (x - aLeftInflation));
> +
> +      // It's important to shuffle here, when we exit this loop currentRowSum
> +      // has to be set to sumPixels, to that the following loop can get the

s/to that/so that/
Also, run-on sentence.

@@ +121,5 @@
> +    }
> +
> +    pixel = sourceRow[aSize.width - 1];
> +    int x = (aSize.width + aLeftInflation);
> +    if ((aSize.width % 4)) {

"%" is expensive with signed integers. Use "& 3" instead.

@@ +127,5 @@
> +      // see explanation above.
> +      uint32_t intCurrentRowSum = ((uint32_t*)&currentRowSum)[(aSize.width % 4) - 1];
> +      for (; x < integralImageSize.width; x++) {
> +        // We could be unaligned here!
> +        if (!(x % 4)) {

See above.

@@ +129,5 @@
> +      for (; x < integralImageSize.width; x++) {
> +        // We could be unaligned here!
> +        if (!(x % 4)) {
> +          // aligned!
> +          currentRowSum = _mm_set1_epi32(intCurrentRowSum);

See above about _mm_set1_epi32()/_mm_shuffle_epi32().

@@ +194,5 @@
> +  aLeftLobe++;
> +  aTopLobe++;
> +  int32_t boxSize = (aLeftLobe + aRightLobe) * (aTopLobe + aBottomLobe);
> +
> +  if (boxSize == 1) {

See above.

@@ +210,5 @@
> +  GenerateIntegralImage_SSE2(leftInflation, aRightLobe, aTopLobe, aBottomLobe,
> +                             aIntegralImage, aIntegralImageStride, mData,
> +                             mStride, size);
> +
> +  __m128i divisor = _mm_setr_epi32(reciprocal, reciprocal, reciprocal, reciprocal);

_mm_set1_epi32() would be better, no?

@@ +214,5 @@
> +  __m128i divisor = _mm_setr_epi32(reciprocal, reciprocal, reciprocal, reciprocal);
> +
> +  // This points to the start of the rectangle within the IntegralImage that overlaps
> +  // the surface being blurred.
> +  uint32_t *innerIntegral = aIntegralImage + (aTopLobe * stride32bit) + leftInflation;

See comments in the C version.

@@ +225,5 @@
> +    uint32_t *bottomRightBase = innerIntegral + ((y + aBottomLobe) * ptrdiff_t(stride32bit) + aRightLobe);
> +    uint32_t *bottomLeftBase = innerIntegral + ((y + aBottomLobe) * ptrdiff_t(stride32bit) - aLeftLobe);
> +
> +    for (int32_t x = 0; x < size.width; x += 4) {
> +      if (inSkipRectY && x > mSkipRect.x && x < mSkipRect.XMost()) {

You should really put mSkipRect.x and mSkipRect.XMost() in local variables. The __m128i pointers are, as far as the compiler is concerned, allowed to alias anything, which means it can't prove a member variable hasn't been modified after every iteration, and it will be forced to reload/recompute them.

@@ +226,5 @@
> +    uint32_t *bottomLeftBase = innerIntegral + ((y + aBottomLobe) * ptrdiff_t(stride32bit) - aLeftLobe);
> +
> +    for (int32_t x = 0; x < size.width; x += 4) {
> +      if (inSkipRectY && x > mSkipRect.x && x < mSkipRect.XMost()) {
> +        x = mSkipRect.XMost();

See comment in C version. mSkipRect.XMost()+4 is even more worrying. Also, uh, what happens when mSkipRect.XMost() is not a multiple of four?

::: gfx/2d/Tools.h
@@ +109,5 @@
> +
> +  MOZ_ALWAYS_INLINE void Realloc(size_t aSize)
> +  {
> +    delete [] mStorage;
> +    mStorage = new T[aSize + 15];

15? Don't you mean alignment-1?

@@ +111,5 @@
> +  {
> +    delete [] mStorage;
> +    mStorage = new T[aSize + 15];
> +    if (uintptr_t(mStorage) % alignment) {
> +      // Our storage does not start at a 16-byte boundary. Make sure mData does!

16?
Attachment #677274 - Flags: review?(tterribe) → review-
Most of it's clear, just a few questions!

(In reply to Timothy B. Terriberry (:derf) from comment #30)
> Comment on attachment 677274 [details] [diff] [review]
> Fast C++ and SSE2 blur code v3
> 
> Review of attachment 677274 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> There's a bunch of minor issues, but a couple are worrying, so r- for now.
> 
> ::: gfx/2d/Blur.cpp
> @@ +402,5 @@
> >    if (stride.isValid()) {
> >      mStride = stride.value();
> >  
> >      CheckedInt<int32_t> size = CheckedInt<int32_t>(mStride) * mRect.height *
> > +                               sizeof(unsigned char) + 20;
> 
> I don't understand why you're multiplying by sizeof(unsigned char) if you're
> calling new uint8_t[].

Mixing with legacy stuff I replaced, my bad! (I removed the malloc and added new here.)

> 
> @@ +503,5 @@
> > +    int32_t maxLeftLobe = RoundUpToMultipleOf4(horizontalLobes[0][0] + 1).value();
> > +
> > +    IntSize integralImageSize(size.width + maxLeftLobe + horizontalLobes[1][1],
> > +                              size.height + verticalLobes[0][0] + verticalLobes[1][1] + 1);
> > +    if ((integralImageSize.width * integralImageSize.height) > (1 << 23)) {
> 
> Why is this 1<<23 instead of 1<<24? Isn't the actual maximum here
> UINT32_MAX/255 == 0x1010101?

I thought some of the steps in the blurring could have negative intermediate results. But I might've been mistaken (it might've only been true before some changes). Or it's quite possible for me to reorder them to make that impossible. I'll double check this.

> 
> @@ +505,5 @@
> > +    IntSize integralImageSize(size.width + maxLeftLobe + horizontalLobes[1][1],
> > +                              size.height + verticalLobes[0][0] + verticalLobes[1][1] + 1);

> @@ +510,5 @@
> > +
> > +      // No need to use CheckedInt here - we have validated it in the constructor.
> > +      size_t szB = stride * GetSize().height * sizeof(unsigned char);
> > +      unsigned char* tmpData = new uint8_t[szB];
> > +      if (!tmpData)
> 
> Huh? Is this supposed to be a fallible allocation? The one above certainly
> isn't, though the malloc() that used to be used here was.

No it's not, I replaced the malloc and forgot to remove this check.
> 
> @@ +522,5 @@
> > +        BoxBlurHorizontal(mData, tmpData, lobes[0][0], lobes[0][1], stride, GetSize().height, mSkipRect);
> > +        BoxBlurHorizontal(tmpData, mData, lobes[1][0], lobes[1][1], stride, GetSize().height, mSkipRect);
> > +        BoxBlurHorizontal(mData, tmpData, lobes[2][0], lobes[2][1], stride, GetSize().height, mSkipRect);
> > +      } else {
> > +        memcpy(tmpData, mData, stride * GetSize().height);
> 
> Instead of a gigantic memcpy, you should just do some pointer swapping.

I didn't want to change the logic here from the old code, since I'm technically 'not touching' this, just indenting it (and making the allocation infallible for consistency).

> @@ +531,5 @@
> > +        BoxBlurVertical(tmpData, mData, lobes[0][0], lobes[0][1], stride, GetSize().height, mSkipRect);
> > +        BoxBlurVertical(mData, tmpData, lobes[1][0], lobes[1][1], stride, GetSize().height, mSkipRect);
> > +        BoxBlurVertical(tmpData, mData, lobes[2][0], lobes[2][1], stride, GetSize().height, mSkipRect);
> > +      } else {
> > +        memcpy(mData, tmpData, stride * GetSize().height);
> 
> Instead of a gigantic memcpy, you should just do some pointer swapping.

Same as above, wanted to avoid additional risk.

> @@ +539,2 @@
> >      } else {
> > +      int32_t integralImageStride = GetAlignedStride<16>(integralImageSize.width * 4);
> 
> Why int32_t instead of, say, size_t (which, being unsigned, would allow the
> compiler to optimize the "/ 4" below much better).

I'm just used to strides being signed on principle, but I agree it's fine to use size_t here; will change.

> 
> @@ +581,5 @@
> > +    uint32_t *intPrevRow = aIntegralImage + (y - 1) * stride32bit;
> > +    uint32_t *intFirstRow = aIntegralImage;
> > +
> > +    for (int x = 0; x < integralImageSize.width; x++) {
> > +      intRow[x] = intFirstRow[x] + intPrevRow[x];
> 
> You're increasing your cache footprint by 33% here to save 1 add/pixel. I
> think you should just take the contents of the loop below and put it in a
> static function, and then call it repeatedly with a pointer to the first row
> of the source image (and do the same with the last row of the image below).
> You'll get less code (i.e., LoadIntegralRowFromRow goes away) and allow
> wider rows before you exhaust L1 cache.

I'll test this and see what the perf impact is on my machine at least.

> @@ +599,5 @@
> > +    }
> > +    for (int x = aLeftInflation; x < (aSize.width + aLeftInflation); x += 4) {
> > +        uint32_t pixel = *(uint32_t*)(sourceRow + (x - aLeftInflation));
> > +        currentRowSum += pixel & 0xff;
> > +        *intRow++ = *intPrevRow++ + currentRowSum;
> 
> It's probably a small thing, but compilers are generally able to optimize
> intRow[x], intRow[x+1], etc. better than pointer arithmetic (or at least
> that's what the AMD software optimization guide tells me, and generally
> matches my experience).

Yes, I agree, that's my experience too, not sure why I did this differently :).

> @@ +659,5 @@
> > +  aLeftLobe++;
> > +  aTopLobe++;
> > +  int32_t boxSize = (aLeftLobe + aRightLobe) * (aTopLobe + aBottomLobe);
> > +
> > +  if (boxSize == 1) {
> 
> Maybe "<= 1" for safety?

I'll just assert boxSize > 0.

> @@ +671,5 @@
> > +  GenerateIntegralImage_C(leftInflation, aRightLobe, aTopLobe, aBottomLobe,
> > +                          aIntegralImage, aIntegralImageStride, mData,
> > +                          mStride, size);
> > +
> > +  uint32_t *innerIntegral = aIntegralImage + (aTopLobe * stride32bit) + leftInflation;
> 
> See above about pointer arithmetic. I suspect that a single pointer and five
> offsets will optimize better than 5 pointers. As a bonus, the offsets are
> constant and can be hoisted out of the loop below.

Could you give a little example of what you mean? I'm not 100% sure it's clear to me.

> @@ +683,5 @@
> > +    uint32_t *bottomLeftBase = innerIntegral + ((y + aBottomLobe) * stride32bit - aLeftLobe);
> > +
> > +    for (int32_t x = 0; x < size.width; x++) {
> > +      if (inSkipRectY && x > mSkipRect.x && x < mSkipRect.XMost()) {
> > +        x = mSkipRect.XMost();
> 
> Is this right? The continue will increment x again, leaving it at
> mSkipRect.XMost()+1. Either that's wrong, or your test for inclusion in
> mSkipRect is wrong.

mSkipRect.XMost() should be the last pixel -in- the rect to be skipped. So mSkipRect.XMost()+1 would be the next pixel to process.. I think.

> @@ +697,5 @@
> > +
> > +      int32_t value = bottomRight - topRight - bottomLeft;
> > +      value += topLeft;
> > +
> > +      mData[mStride * y + x] = value / boxSize;
> 
> You can (and should) replace this division with a multiplication, as I don't
> think the compiler will do it for you. It can be done without any
> approximation error.

I tried to, but the 64-bit conversion to gain the required precision was slower than doing a simple divide. Or are you saying we could do it in 32-bit without a precision error?

> @@ +46,5 @@
> > +
> > +// This function calculates an integral of four pixels stored in the 4
> > +// 32-bit integers on aPixels. i.e. for { 30, 50, 80, 100 } this returns
> > +// { 30, 80, 160, 260 }. This seems to be the fastest way to do this after
> > +// much testing.
> 
> I assume you tried
> 
> __m128i sumPixels = aPixels;
> __m128i currentPixels = _mm_slli_si128(aPixels, 4);
> sumPixels = _mm_add_epi32(sumPixels, currentPixels);
> currentPixels = _mm_unpacklo_epi64(_mm_setzero_si128(), sumPixels);
> return _mm_add_epi32(sumPixels, currentPixels);
> 
> Yes?

I did not, I will try.

> @@ +94,5 @@
> > +    uint8_t *sourceRow = aSource + aSourceStride * (y - aTopInflation);
> > +
> > +    uint32_t pixel = sourceRow[0];
> > +    for (int x = 0; x < aLeftInflation; x += 4) {
> > +      __m128i sumPixels = AccumulatePixelSums(_mm_shuffle_epi32(_mm_set1_epi32(pixel), _MM_SHUFFLE(0, 0, 0, 0)));
> 
> Doesn't _mm_set1_epi32() _already_ broadcast that value to all 4 slots,
> eliminating the need for the shuffle? It's _mm_cvtsi32_si128() that only
> sets the first slot and clears the rest to zero.

You're right.

> @@ +121,5 @@
> > +    }
> > +
> > +    pixel = sourceRow[aSize.width - 1];
> > +    int x = (aSize.width + aLeftInflation);
> > +    if ((aSize.width % 4)) {
> 
> "%" is expensive with signed integers. Use "& 3" instead.

Won't the compiler do that? This seemed more readable.

> @@ +226,5 @@
> > +    uint32_t *bottomLeftBase = innerIntegral + ((y + aBottomLobe) * ptrdiff_t(stride32bit) - aLeftLobe);
> > +
> > +    for (int32_t x = 0; x < size.width; x += 4) {
> > +      if (inSkipRectY && x > mSkipRect.x && x < mSkipRect.XMost()) {
> > +        x = mSkipRect.XMost();
> 
> See comment in C version. mSkipRect.XMost()+4 is even more worrying. Also,
> uh, what happens when mSkipRect.XMost() is not a multiple of four?

This loop is not assuming aligned access. Other than that you're right in this case, at least. See above comment.

> ::: gfx/2d/Tools.h
> @@ +109,5 @@
> > +
> > +  MOZ_ALWAYS_INLINE void Realloc(size_t aSize)
> > +  {
> > +    delete [] mStorage;
> > +    mStorage = new T[aSize + 15];
> 
> 15? Don't you mean alignment-1?

I do!
> @@ +111,5 @@
> > +  {
> > +    delete [] mStorage;
> > +    mStorage = new T[aSize + 15];
> > +    if (uintptr_t(mStorage) % alignment) {
> > +      // Our storage does not start at a 16-byte boundary. Make sure mData does!
> 
> 16?

alignment-byte boundary! :)
(In reply to Timothy B. Terriberry (:derf) from comment #30)
> Comment on attachment 677274 [details] [diff] [review]
> Fast C++ and SSE2 blur code v3
> @@ +121,5 @@
> > +    }
> > +
> > +    pixel = sourceRow[aSize.width - 1];
> > +    int x = (aSize.width + aLeftInflation);
> > +    if ((aSize.width % 4)) {
> 
> "%" is expensive with signed integers. Use "& 3" instead.

Oh ugh, signed, you're right.
(In reply to Bas Schouten (:bas.schouten) from comment #31)
> I thought some of the steps in the blurring could have negative intermediate
> results. But I might've been mistaking (might've only been true before some
> changes). Or it's quite possible for me to reorder them to make it not
> possible. I'll double check this.

AFAICT, you're just computing the sum of a rectangular region of the image, right? That result cannot possibly be larger than the sum over the whole image, and it can't be negative. If intermediate results happen to wrap around, that's not a problem because overflow in unsigned arithmetic is well-defined in C. But yes, you can also re-order them so they never wrap around: (BR-TR)-(BL-TL)

> I didn't want to change the logic here from the old code. Since I'm
> technically 'not touching' this, just indenting it. (and making the
> allocation infallible for consistency)

That sounds like touching to me :). I'm personally a fan of the "While you're in there, could you fix..." approach instead of the "Don't touch anything you don't have to" approach, but it won't bother me too much if you leave this as-is.

> > Why int32_t instead of, say, size_t (which, being unsigned, would allow the
> > compiler to optimize the "/ 4" below much better).
> 
> I'm just used to strides being signed at a principle level, but I agree here
> it's fine to use size_t, will change.

I don't really have an issue with signed strides, but if you use them you stop using divisions on them.

> > You're increasing your cache footprint by 33% here to save 1 add/pixel. I
> I'll test this and see what the perf impact is on my machine at least.

I suggest looking at image sizes where the rows used here fit in L1 cache in my approach, and don't in your approach (along with your normal test cases, of course).

> > Maybe "<= 1" for safety?
> 
> I'll just assert boxSize > 0.

That works.

> > See above about pointer arithmetic. I suspect that a single pointer and five
> > offsets will optimize better than 5 pointers. As a bonus, the offsets are
> > constant and can be hoisted out of the loop below.
> 
> Could you give a little example of what you mean? I'm not 100% sure it's
> clear to me.

ptrdiff_t topLeftOffset = leftInflation - aLeftLobe;
ptrdiff_t topRightOffset = leftInflation + aRightLobe;
ptrdiff_t bottomLeftOffset = (aTopLobe + aBottomLobe) * stride32bit + topLeftOffset;
ptrdiff_t bottomRightOffset = (aTopLobe + aBottomLobe) * stride32bit + topRightOffset;

for (int32_t x = 0; x < size.width; x++) {
  // ...
  ptrdiff_t rowOffset = y*stride32bit + x;
  uint32_t topLeft = aIntegralImage[rowOffset + topLeftOffset];
  uint32_t topRight = aIntegralImage[rowOffset + topRightOffset];
  uint32_t bottomRight = aIntegralImage[rowOffset + bottomRightOffset];
  uint32_t bottomLeft = aIntegralImage[rowOffset + bottomLeftOffset];
  // ...
}

> > > +      if (inSkipRectY && x > mSkipRect.x && x < mSkipRect.XMost()) {
> > > +        x = mSkipRect.XMost();
> > 
> > Is this right? The continue will increment x again, leaving it at
> > mSkipRect.XMost()+1. Either that's wrong, or your test for inclusion in
> > mSkipRect is wrong.
> 
> mSkipRect.XMost() should be the last pixel -in- the rect to be skipped. So
> mSkipRect.XMost()+1 would be the next pixel to process.. I think.

The code you have won't skip the loop for the x==mSkipRect.XMost() case. You have a similar problem with the y test.

> > You can (and should) replace this division with a multiplication, as I don't
> > think the compiler will do it for you. It can be done without any
> > approximation error.
> 
> I tried to, but the 64-bit conversion to gain the required precision was
> slower than doing a simple divide. Or are you saying we could do it in
> 32-bit without a precision error?

You need a 32x32 -> high 32 bit multiply (you don't need the lower 32 bits of the result). See, e.g., <http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1467632>.
But even a 64-bit multiply should be substantially faster than a 32-bit divide. 

Basically, let d be the divisor and m = floor(log2(d)) (though obviously you would compute m with integer math). Then

uint32_t a;
uint32_t b;

if (d == 1<<m){
  a = 0xFFFFFFFFU;
  b = 1;
}
else {
  uint32_t t = (uint32_t)(((uint64_t)1 << (m+32)) / d) + 1;
  uint32_t r = t*d;
  b = r >= (1<<m);
  a = t - b;
}

Then, assuming I didn't screw any of that up,

x/d == (uint32_t)((x + b)*(uint64_t)a >> 32) >> m;

This reduces the maximum permissible value of x by 1 (so the maximum integral image size becomes 0x1010100), unless you want to compute (x+b) with 64-bit math or compute (x*(uint64_t)a + b*a), noting that b*a is a constant.
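
For what it's worth, a small standalone harness (purely illustrative, not part of any patch here) that plugs these constants into the multiply-and-shift form and compares it against a plain division over a sampled range of the valid inputs:

#include <cstdint>
#include <cstdio>

// Illustrative standalone check: derive (a, b, m) for each divisor d exactly as
// described above, then compare the multiply-and-shift result against a plain
// division over a sampled range of x values.
int main()
{
  for (uint32_t d = 2; d <= 1024; ++d) {
    uint32_t m = 0;
    while ((uint64_t(1) << (m + 1)) <= d) {
      ++m;  // m = floor(log2(d))
    }
    uint32_t a, b;
    if (d == (uint32_t(1) << m)) {
      a = 0xFFFFFFFFU;
      b = 1;
    } else {
      uint32_t t = uint32_t((uint64_t(1) << (m + 32)) / d) + 1;
      uint32_t r = t * d;  // wraps modulo 2^32, which is intentional
      b = r >= (uint32_t(1) << m);
      a = t - b;
    }
    for (uint32_t x = 0; x <= 0x1010100; x += 997) {  // sample the valid range
      uint32_t q = uint32_t(((x + b) * uint64_t(a)) >> 32) >> m;
      if (q != x / d) {
        printf("mismatch: x=%u d=%u got %u want %u\n", x, d, q, x / d);
        return 1;
      }
    }
  }
  printf("all sampled values matched\n");
  return 0;
}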

> > See comment in C version. mSkipRect.XMost()+4 is even more worrying. Also,
> > uh, what happens when mSkipRect.XMost() is not a multiple of four?
> 
> This loop is not assuming aligned access. Other than that you're right in
> this case, at least. See above comment.

I was thinking size.width was a multiple of 4 for some reason, but I guess you'll actually compute (and write!) extra values in any case. I hope that's okay. I also sure hope this function is idempotent in the skip region, because if mSkipRect.x isn't a multiple of 4, you'll compute it for some of the pixels inside the skip region, too.
Thanks for the quick and extensive review! I'll have a new patch up tomorrow.

(In reply to Timothy B. Terriberry (:derf) from comment #33)
> (In reply to Bas Schouten (:bas.schouten) from comment #31)
> > > See above about pointer arithmetic. I suspect that a single pointer and five
> > > offsets will optimize better than 5 pointers. As a bonus, the offsets are
> > > constant and can be hoisted out of the loop below.
> > 
> > Could you give a little example of what you mean? I'm not 100% sure it's
> > clear to me.
> 
> ptrdiff_t topLeftOffset = leftInflation - aLeftLobe;
> ptrdiff_t topRightOffset = leftInflation + aRightLobe;
> ptrdiff_t bottomLeftOffset = (aTopLobe + aBottomLobe) * stride32bit +
> topLeftOffset;
> ptrdiff_t bottomRightOffset = (aTopLobe + aBottomLobe) * stride32bit +
> topRightOffset;

I assume you mean including the innerIntegral stuff, but then I understood correctly.
> 
> for (int32_t x = 0; x < size.width; x++) {
>   // ...
>   ptrdiff_t rowOffset = y*stride32bit + x;
>   uint32_t topLeft = aIntegralImage[rowOffset + topLeftOffset];
>   uint32_t topRight = aIntegralImage[rowOffset + topRightOffset];
>   uint32_t bottomRight = aIntegralImage[rowOffset + bottomRightOffset];
>   uint32_t bottomLeft = aIntegralImage[rowOffset + bottomLeftOffset];
>   // ...
> }
> 
> > > > +      if (inSkipRectY && x > mSkipRect.x && x < mSkipRect.XMost()) {
> > > > +        x = mSkipRect.XMost();
> > > 
> > > Is this right? The continue will increment x again, leaving it at
> > > mSkipRect.XMost()+1. Either that's wrong, or your test for inclusion in
> > > mSkipRect is wrong.
> > 
> > mSkipRect.XMost() should be the last pixel -in- the rect to be skipped. So
> > mSkipRect.XMost()+1 would be the next pixel to process.. I think.
> 
> The code you have won't skip the loop for the x==mSkipRect.XMost() case. You
> have a similar problem with the y test.

Hrm, I thought about this again and I think you're right, I think this should be x = mSkipRect.XMost() - 1 (or -4 in the SSE2 case). And the comparison should do x >= mSkipRect.x (this is also what the old code did).

> > > You can (and should) replace this division with a multiplication, as I don't
> > > think the compiler will do it for you. It can be done without any
> > > approximation error.
> > 
> > I tried to, but the 64-bit conversion to gain the required precision was
> > slower than doing a simple divide. Or are you saying we could do it in
> > 32-bit without a precision error?
> 
> You need a 32x32 -> high 32 bit multiply (you don't need the lower 32 bits
> of the result). See, e.g.,
> <http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1467632>.
> But even a 64-bit multiply should be substantially faster than a 32-bit
> divide. 
> 
> Basically, let d be the divisor and m = floor(log2(d)) (though obviously you
> would compute m with integer math). Then
> 
> uint32_t a;
> uint32_t b;
> 
> if (d == 1<<m){
>   a = 0xFFFFFFFFU;
>   b = 1;
> }
> else {
>   uint32_t t = (uint32_t)(((uint64_t)1 << (m+32)) / d) + 1;
>   uint32_t r = t*d;
>   b = r >= (1<<m);
>   a = t - b;
> }
> 
> Then, assuming I didn't screw any of that up,
> 
> x/d == (uint32_t)((x + b)*(uint64_t)a >> 32) >> m;
> 
> This reduces the maximum permissible value of x by 1 (so the maximum
> integral image size becomes 0x1010100), unless you want to compute (x+b)
> with 64-bit math or compute (x*(uint64_t)a + b*a), noting that b*a is a
> constant.

Hrm, what I did was what the old code did, basically:

uint32_t reciprocal = (1 << 32) / d;
uint8_t result = (uint64_t(sum) * reciprocal) >> 32;

Which, to my surprise, seemed slower in every measurement I did, but I'll re-evaluate this, as well as try to understand the approach you suggested in more detail, so I can understand why it would be faster or more correct than the above approach.

> 
> > > See comment in C version. mSkipRect.XMost()+4 is even more worrying. Also,
> > > uh, what happens when mSkipRect.XMost() is not a multiple of four?
> > 
> > This loop is not assuming aligned access. Other than that you're right in
> > this case, at least. See above comment.
> 
> I was thinking size.width was a multiple of 4 for some reason, but I guess
> you'll actually compute (and write!) extra values in any case. I hope that's
> okay. I also sure hope this function is idempotent in the skip region,
> because if mSkipRect.x isn't a multiple of 4, you'll compute it for some of
> the pixels inside the skip region, too.

This should be the case; the SkipRect, as I understand it, is simply obscured or in some other way irrelevant.
Blocks: 718453
Comment on attachment 677274 [details] [diff] [review]
Fast C++ and SSE2 blur code v3

Review of attachment 677274 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/2d/BlurSSE2.cpp
@@ +10,5 @@
> +
> +namespace mozilla {
> +namespace gfx {
> +
> +MOZ_ALWAYS_INLINE

Since this is C++, inline implies locality to the translation unit, I believe. So static inline is only useful in code portable between C and C++, to the best of my knowledge.
Comment on attachment 677274 [details] [diff] [review]
Fast C++ and SSE2 blur code v3

Review of attachment 677274 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/2d/BlurSSE2.cpp
@@ +10,5 @@
> +
> +namespace mozilla {
> +namespace gfx {
> +
> +MOZ_ALWAYS_INLINE

Well, then you should fix the other MOZ_ALWAYS_INLINE calls (e.g., LoadIntegralRowFromRow) to not be static.
(In reply to Timothy B. Terriberry (:derf) from comment #30)
> Comment on attachment 677274 [details] [diff] [review]
> Fast C++ and SSE2 blur code v3
> 
> Review of attachment 677274 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> @@ +14,5 @@
> > +MOZ_ALWAYS_INLINE
> > +uint32_t DivideAndPack(__m128i aValues, __m128i aDivisor)
> > +{
> > +  __m128i multiplied1 = _mm_srli_epi64(_mm_mul_epu32(aValues, aDivisor), 32); // 00p300p1
> > +  __m128i multiplied2 = _mm_srli_epi64(_mm_mul_epu32(_mm_srli_epi64(aValues, 32), aDivisor), 32); // 00p400p2
> 
> Instead of shifting multiplied2 down by 32, if you PAND with a constant
> 0xFFFFFFFF00000000FFFFFFFF00000000 mask, you can simply POR these back
> together. That replaces a shift, two shuffles, and an unpack with two simple
> bitwise ops that can issue from any port.

This does not appear to have a performance impact sadly. But the code you suggested is nicer so I'll still include this change :).

> @@ +46,5 @@
> > +
> > +// This function calculates an integral of four pixels stored in the 4
> > +// 32-bit integers on aPixels. i.e. for { 30, 50, 80, 100 } this returns
> > +// { 30, 80, 160, 260 }. This seems to be the fastest way to do this after
> > +// much testing.
> 
> I assume you tried
> 
> __m128i sumPixels = aPixels;
> __m128i currentPixels = _mm_slli_si128(aPixels, 4);
> sumPixels = _mm_add_epi32(sumPixels, currentPixels);
> currentPixels = _mm_unpacklo_epi64(_mm_setzero_si128(), sumPixels);
> return _mm_add_epi32(sumPixels, currentPixels);
> 
> Yes?

Very clever! And it does seem to make a very small performance difference!
(In reply to Timothy B. Terriberry (:derf) from comment #33)
> (In reply to Bas Schouten (:bas.schouten) from comment #31)
> > > See above about pointer arithmetic. I suspect that a single pointer and five
> > > offsets will optimize better than 5 pointers. As a bonus, the offsets are
> > > constant and can be hoisted out of the loop below.
> > 
> > Could you give a little example of what you mean? I'm not 100% sure it's
> > clear to me.
> 
> ptrdiff_t topLeftOffset = leftInflation - aLeftLobe;
> ptrdiff_t topRightOffset = leftInflation + aRightLobe;
> ptrdiff_t bottomLeftOffset = (aTopLobe + aBottomLobe) * stride32bit +
> topLeftOffset;
> ptrdiff_t bottomRightOffset = (aTopLobe + aBottomLobe) * stride32bit +
> topRightOffset;
> 
> for (int32_t x = 0; x < size.width; x++) {
>   // ...
>   ptrdiff_t rowOffset = y*stride32bit + x;
>   uint32_t topLeft = aIntegralImage[rowOffset + topLeftOffset];
>   uint32_t topRight = aIntegralImage[rowOffset + topRightOffset];
>   uint32_t bottomRight = aIntegralImage[rowOffset + bottomRightOffset];
>   uint32_t bottomLeft = aIntegralImage[rowOffset + bottomLeftOffset];
>   // ...
> }
> 

This makes the code a little over 10% slower on VC11 where I tested. So unless you have another idea I'm going to leave this bit out.
(In reply to Bas Schouten (:bas.schouten) from comment #38)
> This makes the code a little over 10% slower on VC11 where I tested. So
> unless you have another idea I'm going to leave this bit out.

It was worth trying, I guess. I'd be curious to know why, but it's not that important.
There was a lot of churn in this review pass, so I probably missed a spot or two! But I think I got most of it.
Attachment #677274 - Attachment is obsolete: true
Attachment #678338 - Flags: review?(tterribe)
Comment on attachment 678338 [details] [diff] [review]
Fast C++ and SSE2 blur code v4

Review of attachment 678338 [details] [diff] [review]:
-----------------------------------------------------------------

A couple of comments left, but r=me with those addressed.

::: gfx/2d/Blur.cpp
@@ +381,5 @@
>    if (stride.isValid()) {
>      mStride = stride.value();
>  
>      CheckedInt<int32_t> size = CheckedInt<int32_t>(mStride) * mRect.height *
> +                               sizeof(uint8_t) + 20;

I mean, it's nice that this is the right type now, but this still only works if sizeof(uint8_t) is one, because new[] takes an item count, not a byte count. Reading my earlier comment I might not have been clear on this (sorry about that).

I guess technically it would still work if sizeof(uint8_t) is larger than one, but would allocate lots of wasted memory.

@@ +494,5 @@
> +      // Fallback to old blurring code when the surface is so large it may
> +      // overflow our integral image!
> +
> +      // No need to use CheckedInt here - we have validated it in the constructor.
> +      size_t szB = stride * size.height * sizeof(uint8_t);

Again, using sizeof() in a value passed to new[].
Attachment #678338 - Flags: review?(tterribe) → review+
Backed out for timeouts and crashes:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=e89f1fce980d
(Note: the xpcshell failures were due to another push)

https://hg.mozilla.org/integration/mozilla-inbound/rev/3a113808c160
Assignee: nobody → bas
Status: NEW → ASSIGNED
https://hg.mozilla.org/mozilla-central/rev/3169efab0148
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla19
Depends on: 815599
Depends on: 815489
Depends on: 829954
Depends on: 834256