
April 14, 2014

Dart and performance (a test journey in game land)

I have been playing with Dart and StageXL for 10 days now, and I feel there are some thoughts worth sharing. Part of this post is also an update to a thread in the StageXL group.

The game is a really simple clone of Flappy Bird. The main idea is that the gameplay should be easy to implement and to understand, so that I can concentrate on the internals of the game and the rendering engine instead of toying too much with the game itself.

At one point I noticed that the way the original game detects the collision of the bird with the trees is a naive one and often gives false negatives. I went on to read about what is currently possible in Dart and StageXL in this regard. It turns out only a very basic approach is available out of the box, as StageXL is geared toward being an animation library rather than a game development framework. Nevertheless, this allowed me to play a bit more with the language itself.

For collision detection I used the following approach:

  1. Using DisplayObject.hitTestObject, find the element(s) that are potentially colliding (in my test game it could be only one out of ~10).
  2. Determine the rectangle into which the transformed protagonist will fit.
  3. Using a detached canvas element, clear it and draw the transformed image of the protagonist.
  4. Using a second detached canvas element, clear it and draw the potentially colliding element.
  5. The two canvases used are just big enough to contain the protagonist image.
  6. Compare the pixels of the protagonist canvas to those of the colliding-element canvas; if there is a pixel where the alpha is greater than 0 on both, return true (i.e. there is a collision).
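
Step 6 can be sketched roughly like this - a minimal illustration, assuming both canvases have identical dimensions. The function name and inputs are hypothetical, not the actual game code; the byte lists stand for what `CanvasRenderingContext2D.getImageData(...).data` returns:

```dart
// Minimal sketch of step 6. `a` and `b` are the RGBA byte lists of the
// two hidden canvases (same size). A collision is reported as soon as
// one pixel position is opaque (alpha > 0) on BOTH canvases.
bool pixelsCollide(List<int> a, List<int> b) {
  // RGBA layout: the alpha byte of pixel i lives at index i * 4 + 3.
  for (var i = 3; i < a.length; i += 4) {
    if (a[i] > 0 && b[i] > 0) return true;
  }
  return false;
}
```

The early return matters: in the common no-collision frame the loop still scans every pixel, but as soon as one overlapping opaque pixel is found the rest of the work is skipped.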

Here are the results of this little experiment.

On a PC with the DartVM this runs so smoothly one wonders why people are not using it all the time. The average time to run the per-frame code with fully pixel-accurate collision is ~0.35ms (with a budget of ~16ms per frame this is pretty good, I think).

Compiled to JS it runs about twice as slowly, a frame taking about 0.78ms - still pretty good. This is somewhat strange, since the code mainly touches WebGL, canvas and arithmetic. Canvas and WebGL are DOM interfaces, so there should be no difference there; the only possible place for the speed difference is the arithmetic (at least that is what I initially thought). I am not an expert in VMs, but having a VM that can run the calculations at least twice as fast is awesome!

However, the problems with the compile-to-JS approach start to show as early as here: a look at the memory debug panel reveals a very different pattern in the JS world. The game has two modes: an off-game mode, where a very simple animation is performed, and an on-game mode, where in addition to the simple sprite animation of the protagonist, obstacles are animated and collision detection is performed on every frame.

Here are the images.
DartVM

V8 - Chrome desktop

Notice a difference? Again - the code only uses a single canvas element with a 3D context (WebGL) to draw the animation, and internally StageXL uses detached canvases and lots of calculations and canvas transformations. First of all, before each snapshot is taken the memory is force-cleaned. Then, for the test, I let the off-game animation run for a while, then play a little bit, and then leave it alone again.

On top of using around 4 times more memory (JS VM versus DartVM), JS also shows a significantly different memory allocation pattern when more code paths are executed per frame. In the first image it is impossible to distinguish when I am playing the game and when I am not, while in the second image, taken in JS land, it is quite clear - the climbs are much steeper when the more complex code path is executed. Notice the change in steepness around second 33 in the second picture - this is where I killed my protagonist and the complex animation path was discarded. Notice also the enormous difference in allocation rate: going from 7.2MB up to 10MB in 40 seconds, versus from 1.8MB to 2.9MB in 5 minutes in the DartVM. I actually had to wait for a garbage collection event to happen in the DartVM just to see whether it would ever happen - that is why the DartVM screenshot covers such a long time period. For a moment there I thought it was broken - how come there was no GC event...

One negative thing about Dartium: when in debug mode (console open and recording) the on-screen performance was worse and noticeably janky. Turn off the dev console and you get perfectly smooth animations.

All the performance and memory measurements above were taken on a Core i3 @ 3.3GHz with 4GB of memory and a dedicated Nvidia video card. This is all well and nice, but the target platform is actually phones and tablets, so it was time to test on those devices.

I started with the Nexus 7, as it is easier to approach (no need to find a Mac just to debug some web stuff, ha!). The results are pretty sad.

The memory consumption is pretty much identical to what we get in desktop Chrome (not surprising).

V8 - Chrome mobile

You can again easily see when the game is on and when it is not, and the memory used is in the same range.
A very significant difference shows up in the frame timing, however. While on the i3 CPU it takes under a millisecond to complete the code paths in the rAF callback, here it takes about 5ms on average, peaking at ~9ms!! That is far more than on the PC. The composite time is also very different (even though the canvas size is kept the same for measurement purposes - 480x640px): ~2.5ms versus 0.5ms on the PC. Strangely, the compositing time reported in Dartium is 0.2ms... (which doesn't really make sense here).

Even with those measurements, the game should have been pretty good on the Nexus 7. But it is not. Remember the memory allocation patterns? Too much garbage is the worst enemy of any game. Guess how long a garbage collection event takes on the Nexus 7... ~80ms per GC cycle!!!! This is killing my game (and will possibly kill yours)!

Before taking the logical next step of altering my code to lower the memory allocation and collection, I wanted to know where most of the CPU cycles were going. From the tests above I already knew that compositing the scene on the Nexus 7 took ~3ms, leaving me ~13ms to handle all the game logic. I also tested with an LG phone that claims to have the same hardware specs as the Nexus 7. It turns out the JS time is almost the same, but the compositing time is 4.2ms. The screen is smaller and I expected compositing to be faster, but instead it is slower, leaving the game in a pretty bad state: even when no GC events occur, the frame rate is below 60.

Back on my mission to understand the performance implications of dart2js code, I used the Nexus 7 to measure where the CPU time is spent.

Surprise, surprise! drawImage was taking 50% of the CPU. This is strange - I would have expected it to be a fast operation, considering that the canvas is detached from the document. What surprised me more was that 14.30% was spent comparing the pixel data of my two hidden canvases, yet the actual comparison was only 7%; the other 7.3% was spent in 'convertNativeToDart_ImageData' - guessing from the name, it converts the native ImageData object into a Dart list of integers. Transforming the canvas is another 7.14% of the total time, plus the clearing of the canvas. All in all, the slowest part is drawing the image data. I believe I followed the best practices (draw at whole-pixel coordinates, do not attach the canvas to the DOM), but the combination of those actions still takes ~10ms; with compositing taking ~3ms, the game is on the verge of not being playable.

Clearly, using a different approach to detect collisions would reduce this time, since we could avoid the most expensive operations - clearing the canvas and drawing on it.
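
One such cheaper approach (my assumption here, not something this post ends up implementing) is a plain axis-aligned bounding-box test used as an early-out, so the expensive clear/draw/compare path only runs when the boxes actually overlap:

```dart
// Hypothetical early-out: rectangles given as x, y, width, height.
// Only if this returns true is the pixel-accurate (canvas-based) path
// worth running at all.
bool boxesOverlap(num ax, num ay, num aw, num ah,
                  num bx, num by, num bw, num bh) {
  return ax < bx + bw && bx < ax + aw &&
         ay < by + bh && by < ay + ah;
}
```

The test is a handful of comparisons with no allocation, so it costs essentially nothing per frame compared to redrawing two canvases.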

More interestingly, profiling from Dartium was not possible: the window.console.profile() call did nothing there. I am not sure if this is a bug or if it simply does not work this way with the DartVM, but it would have been interesting to see.

Going back to the actual search for a solution, I decided to rework the code and, instead of idiomatic Dart, write what is known to work best in JS land. For more details on how one is supposed to write code in Dart, please refer to this article.

One of the things I liked most is the lexical scoping; combined with not having to constantly write 'this', it leads to lots of closures and to throwaway instances in methods - the code becomes really terse and easy to understand and follow. Now, judging from the benchmarks, the DartVM performs around 2 times better than V8. I have no idea how those benchmarks work and what exactly they compare, but the facts are these: if you write your code in the idiomatic Dart way, it works perfectly fine and really fast in the DartVM. This includes creating and trashing lots and lots of instances of small classes in a single stack (i.e. going deeper in one tick), using lots of closures (think list.forEach() and list.forEach((_) => [_.a, _.b].map()) etc.) and dumb objects, no local variables or lots of local variables, deep object nesting (o.o.o.o.o.o.o for example) and so on - things that for years have been condemned and considered a no-no in JavaScript when performance is the number one consideration. And let me tell you - it feels GOOD! Not having to write 'this', creating instances all over the place to make your code more readable and understandable (as opposed to creating cache properties all over the place and accessing them in bizarre ways just to avoid allocations that the GC would later have to clean). Almost like a dream...
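
To make the style concrete, here is the kind of 'Dart idiomatic' code the paragraph describes - closures and throwaway objects everywhere. The `Point` class and `pathLength` function are illustrative, not from the game:

```dart
import 'dart:math' show sqrt;

// A dumb data class whose instances get created and trashed freely.
class Point {
  final num x, y;
  Point(this.x, this.y);
  Point operator -(Point o) => new Point(x - o.x, y - o.y);
  num get length => sqrt(x * x + y * y);
}

// Terse and readable: a closure per call and a fresh Point per segment.
// Cheap in the DartVM, allocation-heavy once compiled to JS.
num pathLength(List<Point> pts) {
  num total = 0;
  pts.asMap().forEach((i, p) {
    if (i > 0) total += (p - pts[i - 1]).length;
  });
  return total;
}
```

Every call allocates a closure, a map view, and one temporary Point per segment - exactly the kind of per-frame garbage that the DartVM swallows happily and a JS VM does not.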

But it comes at a price. The dart2js compiler aims to produce code that is as close as possible to the original Dart code and its idioms, so if you use closures, forEach and so on, they end up in your output. Of course, tree shaking is performed, but I do not see much code rewriting being done, and even less code optimization. It is a grey area - it is not clear who should be optimizing the code in this case, the VM or the compiler. We know that inside Google both approaches are employed in different projects: GWT produces code that is highly optimized per browser, while the Closure compiler produces code optimized for size, which can actually lead to less effective code when executed in V8 (there are several bugs filed about this - function inlining leading to calls that are de-optimized or cannot be optimized at all).

When we write JavaScript, it is our responsibility to know all the gotchas, tweaks and quirks of the underlying VM in order to make the most of the hardware and software. This is also true for transpiled languages (like TypeScript and CoffeeScript), but how should we handle it in Dart? Dart has its own VM, and from what I have seen, it optimizes all those 'Dart idioms' very well - even with them it outperforms V8. But then the code needs to be compiled to JS, and this is where I feel the authors of the compiler fail us: yes, the produced code works in all browsers, but the speed is not what the benchmarks show; the performance is 4 and more times worse than that of the DartVM. So what I said before - that Dart is more capable than we expect - was wrong: it is what it is, around twice as fast as V8, but the code dart2js produces for V8 is far from excellent in terms of memory usage and raw CPU performance. Seasoned JS developers know how to write code that is both memory-efficient and performant (often the same thing, because of those GC pauses), but then again, those developers are hard to find and even harder to get to do dull projects.

Because of the structure of Dart, I assumed it would be much easier to produce more 'static' JavaScript code than to analyse a whole application and try to optimize it (à la Closure compiler). I had imagined some liberties being taken when rewriting the Dart code to JS, such as turning forEach into for loops, or creating bound and cached instances for frequently run closures - things we know speed up large applications and lower their memory variance. Instead, the code is preserved as close to the original as possible, and thus it is again the developer's responsibility to write the same ugly but high-performance code if the target platform is known to be JS.

So here is what I did to mitigate things. First: remove all closures (so no forEach, no map, etc.). Second: get rid of all object instances created inside methods (mainly Matrix and Rectangle instances used for calculations); instead, create 'cache' instances and tie them to the main object (the one whose methods are executed). So now, instead of creating several matrices, only one instance is used, mutated several times and reused; the same goes for the rectangles. Third: get rid of local variables; instead, use a List instance as a cache and put every number needed in there.
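
The second step can be sketched like this (the names are illustrative - this is not StageXL's actual Matrix class, just a stand-in showing the pattern):

```dart
// A stand-in for the kind of matrix used for canvas transforms.
class Matrix {
  num a = 1, b = 0, c = 0, d = 1, tx = 0, ty = 0;
  void setTo(num a2, num b2, num c2, num d2, num tx2, num ty2) {
    a = a2; b = b2; c = c2; d = d2; tx = tx2; ty = ty2;
  }
}

class Protagonist {
  // Allocated once and tied to the owning object; every frame it is
  // mutated in place instead of constructing a fresh Matrix.
  final Matrix _cachedMatrix = new Matrix();

  Matrix transformFor(num x, num y) {
    _cachedMatrix.setTo(1, 0, 0, 1, x, y);
    return _cachedMatrix; // same instance on every call - no garbage
  }
}
```

The trade-off is obvious: the returned matrix is only valid until the next call, so the code that consumes it must not hold on to it - exactly the kind of implicit contract that makes 'hand-optimized' code fragile.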

Result: the code got as ugly as any 'hand-optimized' JavaScript - you can see it in my repository. In the frequently called methods there is no object creation or freeing whatsoever, nor local variables (well, I am not sure about the numbers: in JS those are primitive values, but in Dart they are objects, so I am not sure this particular optimization is really worth it). Memory-wise I got this: 7.8MB going up to 10.2MB and back down. The logically identical code now executes with much less variance in heap allocation, which means the GC pauses are shorter and less frequent. Indeed, the gameplay experience improved, and GC time went down to ~30ms in the worst cases. This, however, covers only my code and not the library code (StageXL). While the library is great (I can honestly say I would not have done this test if it were not for StageXL, so kudos to Bernhard for all his answers to my stupid questions), there are some (well, a lot of) places where the code is written directly in Dart idioms (so it runs great in the DartVM but not so great in V8). Those I do not intend to try to optimize away, and there is no point in it: the point was to see whether applying JavaScript idioms for fast code would benefit an application built in Dart and run as JavaScript.

Well, the answer is (sadly) - YES.

My assumption is that this would not matter in more static web apps, or in apps using other means to animate, but for games - while providing a great library and excellent tooling as a whole - Dart hides too many rocks under the water. You have to already be a JS ninja to write Dart code that will perform well when compiled to JS, which basically nullifies the benefits of Dart, IMO. Dart promised to rid us of JS, and yet we are going right back there the moment we need a bit more performance.

I should finish on a positive note: all of this will be gone once the DartVM ships built into stable Chrome. At the penetration rate of new stable versions, within a week of that release everyone would have it. It is another question how we would deploy pure Dart code to users (my project, for example, uses ~500 Dart files; imagine downloading those to the client...). But it could happen, and then one would be truly liberated from JS. What about the other browsers, though? Mozilla's ex-CTO/CEO directly opposes it, while Microsoft and especially Apple have a financial interest in hindering the adoption of Dart. So it is not a purely technical problem after all. Anyway, with Dart support in Chrome one would have a large user base (Android + ChromeOS devices + all Chrome installs), and in the beginning I think that is enough to incline developers to explore Dart more as a platform and less as a mid-stage compilation target for JS. Just imagine what you could write with those extra CPU cycles...

Ah, the dream...