zondag 19 april 2015

Grant Application

Hi everybody, I just wanted to point you to my Grant Application at the Perl Foundation to develop a better JIT compiler. Feedback is highly appreciated.

dinsdag 24 maart 2015

Advancing the JIT compiler

It is customary to reintroduce oneself after a long blogging hiatus like the one I've had here for some six months. So here I am, ready to talk about JIT compilers again. Or about compilers in general, if you wish. Today I want to discuss possible ways to improve the MoarVM JIT compiler for great performance wins! Or so I hope.

It is no secret (at least not to me) that the MoarVM JIT compiler isn't as good as, say, any of {v8, LuaJIT, HotSpot, Spidermonkey, PyPy}. There are many reasons why, and to understand them it is helpful to consider the structure of the Rakudo-MoarVM 'stack'.
So to help you do that I've drawn this picture:
A picture may be worth more than a thousand words, but this is just a diagram, and I can summarize the important points as:
  1. MoarVM is the 'end of the pipe'. Many optimizations may be applied to your program before your code ever reaches this layer. Many more optimizations cannot be proven safe due to perl6 semantics, however, which is why spesh is a speculative optimizer.
  2. The JIT only kicks in after spesh has applied several optimizations and in the current system only after spesh has generated optimized bytecode. (This is a flaw that we are looking to correct). This is an important point that is often missed, possibly because in many other systems the optimization and JIT compilation steps are not so clearly delineated.
  3. Within MoarVM, 6model plays a central role. 6model is the name for Rakudo's implementation of the perl6 (meta-) object system. Many operations such as array indexes and hash lookups are implemented as 6model operations. In general, these ultimately resolve to 6model 'representation' objects, which are implemented in C. Many steps in the execution of a perl6 program are really (virtually) dispatched to these 'REPR' objects.
So what benefit can an improved JIT compiler bring? The JIT has an advantage over all other components: on the one hand it has all the information gathered by the layers above, and on the other it has the full flexibility of the underlying machine. Many tricks are easy and possible at the level of machine code which are awkward (and compiler-specific) to express at higher levels. Generating code at runtime is more-or-less a superpower, which is probably what LISP fans will tell you too.

The current JIT cannot really take advantage of this power because it is really very simple. All operations in a given piece of code are compiled independently and stitched together to form the entire subroutine. You can think of it as a 'cut-and-paste' compiler. Because all pieces are independent, each piece has to load and store values directly to memory, causing unnecessary memory traffic. There is also no way to reorder operations or coalesce several operations into a smaller number of instructions. Indeed most forms of machine-level optimization are virtually impossible.
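
To make this concrete, here is a sketch in C - with invented names, since the real templates are DynASM assembly - of what a single 'cut-and-paste' piece for an integer addition effectively does. Every operand is loaded from the MoarVM register file in memory and the result is stored right back, even if the next piece reads it again:

#include <stdint.h>

typedef union { int64_t i64; double n64; void *o; } Register;

/* hypothetical stand-in for the compiled snippet of 'add_i a, b, c' */
static void template_add_i(Register *work, uint16_t a, uint16_t b, uint16_t c) {
    int64_t left  = work[b].i64;    /* load from the register file in memory */
    int64_t right = work[c].i64;    /* load again */
    work[a].i64   = left + right;   /* store the result straight back */
    /* a following snippet that uses work[a] must reload it from memory */
}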

I propose to replace - step-by-step - our current 'cut and paste JIT' with a more advanced 'expression JIT'. The idea is very simple - in order to generate good code for a machine it is first necessary to reify the operations on that machine. Compilation is then the process of ordering these operations in a (directed acyclic) graph and selecting instructions to perform those operations. There is nothing new about that idea, 'real' compilers have been written that way for ages, and there is very little preventing us from doing the same. (Currently, DynASM doesn't support dynamic selection of registers other than those present in x86, which are rather limited. This is something that can be fixed, though).
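
As a rough illustration of what 'reifying the operations' might look like - this is only a sketch with made-up names, not the eventual design - an expression node records an operator and its operands, so that ordering and instruction selection can work on a graph instead of on bytecode templates:

#include <stdint.h>

/* sketch of an expression (DAG) node; all names are invented */
typedef enum { EXPR_CONST, EXPR_LOAD, EXPR_STORE, EXPR_ADD, EXPR_CALL } ExprOp;

typedef struct ExprNode {
    ExprOp           op;
    struct ExprNode *operands[2];  /* children in the directed acyclic graph */
    int64_t          value;        /* for EXPR_CONST */
    int              reg;          /* machine register, assigned during selection */
} ExprNode;

/* instruction selection is then a walk over the graph, e.g.
 *   ADD(LOAD(x), CONST(1))  ->  mov reg, [x]; add reg, 1  */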


To be sure, I think we should leave large parts of the current JIT infrastructure intact, with maybe an improvement here or there. And I think this 'expression JIT' can be introduced piecemeal, with the current 'cut-and-paste' code snippets serving as a template. The expression JIT will function as another node type for the current JIT, although probably most code will eventually be converted to it. When this is done it opens up a lot of possibilities:
  • Load and store elision, and in general 'optimal' code generation by register selection and allocation. We are aided in a way by the fact that x86-64 has become more 'RISC-y' over the years, which makes instruction selection simpler.
  • JIT compilation of REPR object methods. The 'expression tree' on which the JIT operates is architecture-independent, so it's feasible to have REPR objects generate a tree for specific operations, effectively 'inlining' such operations into the compilation frame. In real terms, this may translate a high-level array access into a simple pointer reference.
  • NativeCall calls may be more easily converted into JIT-level C calls, which is ultimately the same thing, just much more efficient.
  • Many optimizations become much simpler, even where they were possible before. For example the 'small bigint' optimization, which lets us execute operations on small integers with big-integer semantics, courtesy of cheap overflow checking at the machine level (see the sketch right after this list). Or possibly transforming tight operation loops into equivalent SIMD instructions, although that one is in fact rather more involved.
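
To illustrate the overflow-checking idea - a minimal sketch using the GCC/Clang __builtin_add_overflow intrinsic, not MoarVM code - the machine reports overflow almost for free, and only the rare overflowing case needs to fall back to a real big-integer routine:

#include <stdint.h>

/* hypothetical slow path that promotes the operands to a real bigint */
static void add_as_bigint(int64_t a, int64_t b) { (void)a; (void)b; }

/* add two 'small' integers while preserving big-integer semantics */
static void add_smallint(int64_t a, int64_t b, int64_t *result) {
    if (__builtin_add_overflow(a, b, result))  /* overflow check is nearly free */
        add_as_bigint(a, b);                   /* rare slow path */
}
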
To conclude, I think the current MoarVM JIT compiler can be radically improved. I also think that this is nothing groundbreaking intellectually speaking and that there are few barriers to implementation. And finally, that with these improvements MoarVM will have a strong base for future development. I'd love to hear your thoughts.

maandag 18 augustus 2014

Some Lessons

It's wrap-up time! This will be my final post in the GSoC 2014 program, because the GSoC 2014 program ends at 19:00 UTC today, and that's early enough that I won't be writing another post after this one. It's hard enough to write a post every week as it is. So what I'd like to do today is wrap up what we've reached and what I've learned in the past three months.

Let's start with a summary of what has been reached. MoarVM has an optional JIT compiler based on spesh, which is the optimization framework I've mentioned often by now. The JIT-compiled code cannot handle all instructions yet, but it can handle loops, invocations, perl6 container ops, some regular expressions, expressions and deoptimisation. Reported benefits are anywhere between nothing and 50% faster code, which is awesome. I assume very few people have a lot of truly computation-intensive code running on rakudo perl6 yet, but I suspect that is where it would help most.

The code generator that we currently have is not awesome. It is in fact pretty naive. Values that are computed in a CPU register must be stored in the MoarVM register space before we can use them in another instruction. We make no attempt to detect loops and allocate registers accordingly, we don't do common sub-expression elimination, nor do we make any attempt to dynamically select different instructions.

Well, one aspect of it is awesome, and that is that by using DynASM, there is a direct correspondence between the code we write and the code that is executed. I cannot over-emphasize how user-friendly this really is. Writing a compiler as if I'm writing regular x64 assembly is a great benefit and allows for much faster development times.

Anyway, this discussion of the code generator brings me to another topic I'd like to discuss, and that's why we wrote our own code generator rather than borrowing the one from LuaJIT (or LLVM, or v8 for that matter). I think this is a very fair question and so it deserves some discussion. The first point I'd like to call out is the idea that LLVM (or any of the alternatives) is a magic black box into which you enter intermediate-level code, wait, and out rolls totally optimal machine code that will run faster than C. I'm obviously totally biased, but based on my limited experience I'd say that's not how this works.

First of all, code generation is hardly the most difficult part of doing a JIT compiler. By far most of the complexity in this project has come from integration with the existing MoarVM infrastructure, and that's where the most complex bugs have come from, too. (I've written at length about such bugs already). For example, MoarVM call frame mechanics involve a lot of bookkeeping that simply doesn't exist for C call frames. Likewise, the JIT compiler must make objects available and understandable for the garbage collector just as the interpreter does (or change the GC to understand how to read a JITted frame, not an easy task by itself). Exception throwing and handling form a similar challenge, one which I've talked about at length in another post.

Second, using another project's code generator comes with specific costs of its own. A great example of this is the FTL JIT that has recently been added to JavaScriptCore (the javascript interpreter that is embedded in WebKit). As described here, the JavaScriptCore team essentially use LLVM as a plug-in code generation backend after converting their intermediate representation into LLVM IR. And to be fair to them, this seems to cause quite a speedup. However, this also comes at a cost. For one thing, LLVM optimization and compilation is sufficiently slow for it to be moved into a separate thread (so that it doesn't slow down execution of the unoptimised code). For another example, the garbage collector used by JavaScriptCore is semi-conservative (i.e. the stack is scanned conservatively, whereas heap objects are scanned precisely), and to avoid the retention of lots of dead objects unused portions of the stack have to be zeroed explicitly. For a third example, they apparently had to go to great lengths to deal with on-stack replacement, something we handle more simply. In short, using LLVM is costly.


As for using LuaJIT or v8, I'd argue that there is no truly language-independent compiler. Both v8 and LuaJIT heavily use properties of their respective languages to optimise code, properties which do not hold for perl6. And similarly these languages may have quirks or unusual properties which perl6 does not. An example of this is the truth-value of NaN: for perl6 NaN is true, because it is not equal to 0.0, while for javascript it is false.

I would also note that impressive as the results of the aforementioned projects may be, what they do isn't magic. Algorithms to do code generation are really quite well-known and researched, and there is no reason why we couldn't - in time - implement them. More important than code generation however seems to be optimizing memory access, as it is my suspicion that this is where most of the time is actually spent (Timotimo is actually looking into this as I'm writing this). Because none of these backends know anything about perl6, none of them can use any of the special properties of perl6 to optimize code, which we can (and do).

In conclusion, the choice of a code generation backend is a trade-off like any other. There is no magical silver bullet that will make your code run fast. You could argue against writing our own (even using the help of DynASM) and such an argument could be fair. I would argue as I have above that the trade-off was such that using DynASM was right, especially given the ease of use that it provides.

So what are the next steps for the MoarVM JIT? The first thing is to keep adding instructions while trying to remain bug-free. This isn't as easy as it looks because it seems like every third instruction we add enables the execution of poorly-tested branches that contain new bugs :-). After that, I'd still like to add full register selection for x64 by modifying DynASM. This won't be a very simple task for me since I know very little of x64 instruction encoding or of lua. With that, we can add proper instruction selection and low-level optimization, which should result in a speed-up. However, the most important work is probably to be done in spesh by transforming polymorphic code into simpler, monomorphic, efficient code. For instance, the instruction to test the truth-value of an iterator could be transformed into an iterator-specific instruction that would be much simpler, possibly eliminating a function call. Such transformations should also benefit users who do not use the JIT compiler yet. It seems to me spesh is the place where MoarVM can gain the most.

maandag 11 augustus 2014

Names and Labels

I compensate for the infrequency of my blog posts with their length. Or the other way around. Anyway, I have some good news to report, so let's do that first. The JIT branch of MoarVM (moar-jit), which has been my work for the last few months, has been merged into the main branch just this morning, after we found that it doesn't crash and burn on building nqp and rakudo and on running the spectests. This means that, with some fiddling, you too can now run JIT for MoarVM, NQP, and Perl 6. World domination for camelia!

Well, unless we find new bugs, of course. We've had plenty of those in the last few weeks, and most of them had the same cause, which can be summarized simply: semantic use of the interpreter's program counter is not really compatible with JIT compilation. But maybe that requires some elaboration.

Simply put, during interpretation MoarVM sometimes needs to know exactly where in the program the interpreter is. This happens, for example, when using handlers, which is the mechanism MoarVM uses for try-catch constructs. In the following frame, for example, lines 1 and 4 would be the start and end of the handler, and the block within CATCH would be invoked if do-something-dangerous() actually threw something. On the other hand, if another-dangerous-thing() were to throw something, the CATCH block should clearly not catch it.

0: sub a-handler-frame() {
1:     try {
2:         do-something-dangerous();
3:         CATCH { say($_); }
4:     } 
5:     another-dangerous-thing();
6: }

To determine what should be done in the event that either of these dangerous functions raises an error, MoarVM inspects the current program counter of the frame, and determines whether or not the try block applies. And this works very well in practice, so long as MoarVM is actually interpreting code. When the same code is compiled - as in, JIT compiled - the interpreter is merely used for entering into the JIT code, and its instruction pointer never changes as we move through the frame. So the interpreter instruction pointer can no longer be used to tell where we are. As a result, exception handling didn't quite go smoothly.
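
For illustration, handler selection during interpretation boils down to something like the sketch below (hypothetical structures; the real code also deals with handler categories and nesting): compare the current bytecode position against the recorded start and end offsets of each handler.

#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t start_offset;  /* first instruction covered by the handler */
    uint32_t end_offset;    /* first instruction no longer covered      */
    uint32_t goto_offset;   /* where to continue when the handler fires */
} Handler;

/* find a handler covering the current position, or NULL if none applies */
static Handler * find_handler(Handler *handlers, size_t num, uint32_t current_offset) {
    for (size_t i = 0; i < num; i++)
        if (current_offset >= handlers[i].start_offset &&
            current_offset <  handlers[i].end_offset)
            return &handlers[i];
    return NULL;
}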

A similar problem existed with dynamic variable caches. These are used to make lexical lookup of dynamic variables cheaper by caching them locally in the frame. It is not normally necessary to know where the interpreter is in the frame, except when we're dealing with inlined frames. Put shortly, before inlining, the inlined frames may have had different ideas of the contents of the dynamic lexical cache. Because of this, MoarVM needs to know in which of the inlined frames we're actually working to figure out which cache to use. (NB: I'm not totally sure this explanation is 100% correct. Please correct me if not). So again, when running the JIT, MoarVM couldn't figure this out and would use the wrong cache. By the way, I should note that jnthn has been extremely helpful in finding the cause of this and several other bugs.

Because the third time is a charm (as I think the saying goes), another, very similar, version of the same bug appeared just a bit earlier with deoptimization. As I had never implemented any operation that caused a 'global deoptimization', I naively thought I wouldn't have to deal with it yet. After all, global deoptimization means that all frames in the call stack have to be deoptimized. And you may have guessed it, but to do that correctly, you'll have to know precisely where you are in the deoptimising frame. This one was not only found, but also very helpfully fixed by jnthn.

All this implied that it became necessary for me to just solve this problem - where are we in the JIT code - once and for all. And in fact, there already existed parts of a solution to it. After all, the JIT already used a special label to store the place we should return to after we'd invoked another frame. So to determine where we are in the program, all we need to do is map those pointers back to the original structures that refer to them - that is to say, the inlined frames, the handlers, and the deoptimization structures. So it was done, just this week. I'd be lying if I said that this went without a hitch, because exception handling especially presented some challenges, but I think this morning I've ironed out the last issue. And because today is - according to the GSoC 2014 timeline - the 'soft pencils-down date' - in other words, the deadline - we felt it was time to merge moar-jit into master and let you enjoy my work.
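
The mapping itself can be as simple as the following sketch (again with invented names): record, for every interesting point, the machine-code label the JIT emitted there, and search that table whenever the current position is needed.

#include <stddef.h>
#include <stdint.h>

typedef struct {
    void     *label;      /* address emitted into the JIT-compiled code         */
    uint32_t  deopt_idx;  /* handler / inline / deopt entry this corresponds to */
} LabelMapping;

/* translate a position inside JIT code back to its bytecode-level meaning */
static int32_t find_mapping(LabelMapping *map, size_t num, void *return_address) {
    for (size_t i = 0; i < num; i++)
        if (map[i].label == return_address)
            return (int32_t)map[i].deopt_idx;
    return -1;  /* not at a recorded position */
}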

And people have! This gist shows the relative speedup caused by spesh and JIT compilation in an admittedly overly simple example. As a counterexample, the compilation of CORE.setting - the most time-intensive part of building rakudo - seems to take slightly longer with the JIT than without it. Still, tight and simple loops such as these do seem to occur, so I hope that in real-world programs the MoarVM JIT will give better performance. Quite possibly not as good as the JVM or other systems, certainly not as good as it could be, but better than it used to be.

There is still quite a lot to be done, of course. Many instructions are not readily compiled by the JIT. Fortunately, many of these can be compiled into function calls, because this is exactly what they are for the interpreter, too. Many people, including timotimo and jnthn, have already added instructions this way. Some instructions may have to be refactored a bit and I'm sure we'll encounter new bugs, but I do hope that my work can be a starting point.

vrijdag 25 juli 2014

Bugs, Updates, and ABIs

Once upon a time, way too long ago, I blogged, and notified you of my progress. There's been plenty of progress since then, so it was time to write again. Since I last wrote, I've added support for invocations, 'invokish' instructions - including decontainerization and object conditionals, which appear really frequently - and OSR. I've also had to fix bugs which crept into the code but which were never properly tested before, due to the fact that we typically need to implement a whole set of ops before any particular frame is compiled, and then if those frames break it is unclear which change caused it. I'll talk about these bugs a bit first and then about my next challenges.

The first bug that I fixed seemed to have something to do with smart numification specifically. This is an example of a so-called 'invokish' instruction, in which an object is coerced into a primitive such as a number or a string. Some types of objects override the default methods of coercion and as such will need to run code in a method. Because the JIT doesn't know beforehand if this is so - many objects are coerced without invoking code at all - a special guard was placed to ensure that the JIT code falls out into the interpreter to deal with an 'unexpected' method invocation. Freakishly, this seemed to work about half of the time, depending on the placement of variables in the frame.

Naturally, I suspected the guard to be wrong. But (very) close inspection in gdb assured me that this was not the case, and that the guard in fact worked exactly as intended. What is more, usually JIT bugs cause unapologetic crashes, segmentation faults, bus errors and the like, but this bug didn't. The code ran perfectly, just printing the wrong number consistently. Ultimately I tracked it down to the differences in parameter passing between POSIX and Windows. On both platforms, the first few parameters to a function are passed in registers. These registers differ between platforms, but that's easy enough to deal with using macro definitions. On both platforms, floating-point arguments are passed via the 'SSE' registers as opposed to the general-purpose registers. However, the relative positions are assigned differently. On Windows, they are assigned in the order of the declaration. In other words, the following function declaration


void foo(int i, double d, int j, double f);


assigns i to the first general-purpose register (GPR), d to the second SSE register, j to the third GPR, and f to the fourth SSE register. On POSIX platforms (Mac OS X, Linux, and the rest), arguments are first classified by type - integer, memory, or floating point - and then assigned to consecutive registers of that class. In other words, i and j are passed in the first and second GPR, and d and f are passed in the first and second SSE register. Now my code implemented the Windows behavior for both platforms, so on POSIX, functions expecting their first floating-point argument in the first SSE register would typically find nothing there. However, because the same register is often used for computations, there typically would be a valid value there, and often the value we wanted to print. Not so in smart numification, so these functions failed visibly.
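
For reference - this is just my summary of the two conventions, not code from MoarVM - the declaration above ends up with the following register assignment:

/* void foo(int i, double d, int j, double f);
 *
 * Windows x64 (positional):      i -> RCX,  d -> XMM1,  j -> R8,   f -> XMM3
 * POSIX / System V (classified): i -> RDI,  j -> RSI,   d -> XMM0, f -> XMM1
 */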

The second bug had (ultimately) to do with write barriers. I had written the object accessors a long time ago and had tested the code frequently since then, so I had not expected anything to be wrong with them. However, because I had implemented only very few string instructions, I had never noticed that string slots require write barriers just as object slots do. (I should have known; this was clear from the code in the interpreter). Adding new string instructions thus uncovered a previously unused code path. After comparing the frames that were compiled with the new string instructions with those without, and testing the new string instructions in isolation, I figured that the accessors had something to do with it. And as it turned out, they had.
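
For reference, the essence of such a generational write barrier is sketched below (hypothetical names; MoarVM's real barrier is a macro with more detail): when an old-generation object starts referencing a nursery object - whether that reference is an object or a string - the GC must be told about it.

#include <stdint.h>

typedef struct { uint16_t flags; } Collectable;  /* sketch of a GC header */
#define IS_SECOND_GEN 0x01

/* hypothetical: record 'container' in the inter-generational root set */
static void remember_container(Collectable *container) { (void)container; }

static void write_barrier(Collectable *container, Collectable *value) {
    /* a gen2 object now points to a nursery object: remember the container,
     * or a nursery-only collection would miss 'value' */
    if ((container->flags & IS_SECOND_GEN) && value && !(value->flags & IS_SECOND_GEN))
        remember_container(container);
}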

The third bug, which puzzled me for over a week, really shouldn't have, but involved the other type of object accessors - REPR accessors. These accessors are hidden behind functions; however, those functions did not take into account the proper behavior on type objects. Long story short, type objects (classes and the like) don't have any attributes to look up, so they should return NULL when asked for any. Not returning NULL will cause a subsequent check for nullity to pass when it shouldn't. Funnily enough, this one didn't actually cause a crash, just a (MoarVM) exception.
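
In sketch form (made-up names, not the actual REPR API), the fix amounts to checking concreteness before touching attribute storage:

#include <stddef.h>

typedef struct { int is_concrete; void *attributes[8]; } Object;  /* sketch only */

static void * get_attribute(Object *obj, size_t slot) {
    if (!obj->is_concrete)  /* a type object: there are no attributes to look up */
        return NULL;        /* callers must then check for NULL                  */
    return obj->attributes[slot];
}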

I suppose that's enough about bugs and new features though, so let's talk about the next steps. One thing that would help rakudo perl 6 performance - and what jnthn has been bugging me about for the last few weeks :-) - is implementing 'extops'. In short, extops are a way to dynamically load new instructions into the interpreter. For the interpreter, they are just function calls, but for the JIT they pose special challenges. For example, within the interpreter an extop can just branch to another location in the bytecode, because this is ultimately just a pointer update. But such a jump would be lost on the JIT code, which after all doesn't know about the updated pointer. Of course, extops may also invoke a routine, and do all sorts of interesting stuff. So for the JIT, the challenge will not be so much executing the extops as figuring out what to do afterwards. My hope is that the information about the meaning of the operands - that is, whether they are literal integers, registers, or coderefs - will be sufficient to compile correct code, probably using guards. A similar approach is probably necessary for instructions that (may) throw or catch exceptions.

What's more directly relevant is that moar-jit tends to fail - crash and burn - on Windows platforms. Now as I've already mentioned, there are only a few differences between Windows and POSIX at the assembly level. These differences are register usage and calling conventions. For the most part, I've been able to abstract these away, and life was good (except for the floating point functions, but I've already explained that at length). However, there are only so many arguments that can fit in registers, and the rest of them typically go on the stack. I tacitly assumed that all arguments that are pushed on the stack would be 64 bits wide (i.e. as wide as a register). But that's not true; smaller arguments take only as many bytes as they need. The ubiquitous MVMint32 type - an integer 32 bits wide - takes only 4 bytes. Which means that a function expecting two 32-bit numbers on the stack would receive the value of only one, and simply miss the other. As POSIX has 6 GPRs available for arguments, and Win64 only 4, it is clear this problem only occurs on Windows, because there aren't any functions with more than 7 arguments.

Seems like a decent explanation, doesn't it? Unfortunately it is also wrong, because the size of the argument only counts on POSIX platforms. On Windows, stack arguments are indeed all 64 bits wide, presumably for alignment (and performance) reasons. So what is the problem, then? I haven't implemented the solution yet, so I'm not 100% sure that what I'm about to write is true, but I figure the problem is that after pushing a bunch of stack arguments, we never pop them. In other words, every time we call a function that takes more than 4 parameters, the stack top grows a few bytes, and never shrinks. Even that wouldn't be a problem on its own - we'd still need to take care of alignment issues, but that's no big deal.

However, the JIT happens to use so-called non-volatile or callee-save registers extensively. As their name implies, the callee function is responsible for restoring these registers to their 'original' values upon exit, either by saving and restoring them on the stack or by not using them at all. Contrary to popular opinion, this mechanism works quite well; moreover, many C compilers preferentially do not use these registers, so using them is quite cheap in comparison to stack usage. And simple as well. But I store and restore them using push and pop operations, respectively. It is vital that the stack top pointer (the rsp register) is in the right place, otherwise the wrong values end up in these registers. But when the stack keeps on growing, on Windows as well as on POSIX systems, the stack pointer ends up in the wrong place, and I overwrite the callee-save registers with gibberish. Thus, explosions.

From my review of the available literature - and unfortunately, there is less literature available than one might think - and the behavior of C compilers, it seems the proper solution is to allocate sufficient stack space on JIT code entry, and store both the callee-save registers and the stack parameters within that space. That way, there's no need to worry about stack alignment issues, and it's always clear just where the values of the callee-save registers are. But as may be clear from this discussion, that will be quite a bit of work, and complex too. Testing might also be challenging, as I myself work on linux. But that's ultimately what VMs are for :-). Well, I hope to write again soon with some good news.

zondag 6 juli 2014

Moar JIT progress

So, it seems I haven't blogged in 3 weeks - or in other words, far too long. It seems time to blog again. Obviously, timotimo++ has helpfully blogged my and others' progress in the meantime. But to recap, since my last blog the following abilities have been added to the JIT compiler:

  • Conditionals and looping
  • Fast argument list access
  • Floating point and integer arithmetic
  • Reading and writing lexicals, and accessing 'world values'.
  • Fast reading and writing of object attributes
  • Improved logging and bytecode dumping.
  • Specialization guards and deoptimisation
The last of these points was done just this week, and the problem it caused and the solution it involves are relevant to what I want to discuss today, namely invocation.

The basic idea of speculative optimization - that is, what spesh does - is to assume that if all objects in the variable $foo have been of class FooBar before, they'll continue to be FooBar in the future. If that is true, it is often possible to generate optimized code, because if you know the type of an object you typically know its layout too. Sometimes this assumption doesn't hold, and then the interpreter must undo the optimization - basically, return the state of the interpreter to where it would've been if no optimization had taken place at all.
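
In sketch form (hypothetical names), such a guard compares the object's actual type against the speculated one and triggers deoptimization when the assumption fails:

#include <stddef.h>

typedef struct { const void *type; } Object;  /* stand-in for an object header */

/* hypothetical: restore the interpreter state recorded for this deopt point */
static void deoptimize(size_t deopt_index) { (void)deopt_index; }

static void guarded_fast_path(Object *value, const void *speculated_type, size_t deopt_index) {
    if (value->type != speculated_type) {  /* the speculation did not hold  */
        deoptimize(deopt_index);           /* fall back to unoptimized code */
        return;
    }
    /* speculation held: continue with the type-specific, optimized code */
}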

All the necessary calculations have already been done by the time spesh hands the code graph over to the JIT compiler, so compiling the guards ought to be simple (and it is). However, an important assumption broke because of it. The MoarVM term for a piece of executable code is a 'frame', and the JIT compiler compiles whole frames at a time. Sometimes frames can be inlined to create bigger frames, but the resulting code always represents a single new frame. So when I wrote the code responsible for entering JIT-ted code from the interpreter, I assumed that the JIT-ted code represented an entire frame, at the end of which the interpreter should return control to its caller.

During deoptimization, however, the interpreter jumps from optimized, type-specific code, to safe, unoptimized 'duck-typing' code. And so it must jump out of the JIT into the interpreter, because the JIT only deals with the optimized code. However, when doing so, the JIT 'driver' code assumed that control had reached the end of the frame and it ought to return to the caller frame. But the frame hadn't completed yet, so where the caller had expected a return value there was none.

The solution was - of course - to make the return from the current frame optional. But in true perl style, there is more than one way to do that. My current solution is to rely on the return value of the JIT code. Another solution is to return control to the caller frame - which is, after all, just a bit of pointer updating, and encapsulated in a function call, too - from the JIT code itself. Either choice is good, but both have their drawbacks. Obviously, having the driver do it means that you might return inappropriately (as in the bug), and having the JIT code do it might mean that you'd forget to do it when it is appropriate. (Also, it makes the JIT code bigger). Moreover, the invoked frame might be the toplevel frame, in which case we shouldn't return to the interpreter at all - the program has completed, is finished, done. So this has to be communicated to the interpreter somehow if the JIT code is considered responsible for returning from the frame itself.

The issues surrounding a JIT-to-interpreter call are much the same. Because MoarVM doesn't 'nest runloops', the JIT code must actually return to the interpreter to execute the called code. Afterwards the interpreter must return control back to the JIT code. Obviously, the JIT-ted frame hasn't completed when we return to the interpreter during a callout, so it can't return to its caller for the same reason. What is more, when calling out to the interpreter, the caller (which is JIT code) must store a return address somewhere, so the JIT driver knows where to continue executing after the callee returns.

I think by now it is too late to try and spare you the boring details, but the summary of it is this: who or what should be responsible for returning control from the JIT frame to the caller frame is ultimately an issue of API design, specifically with regard to the 'meaning' of the return value of the JIT code. If the 'driver' is responsible, the return value must indicate whether the JIT code has 'finished'. If the JIT code is responsible, the return value must indicate whether the whole program has finished, instead. I'm strongly leaning towards the first of these, as the question 'is my own frame finished' seems a more 'local' question than 'is the entire program finished'.
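
In other words, the API I'm leaning towards looks roughly like this sketch (names invented for the example): the JIT code reports whether its own frame has finished, and the driver acts on that.

/* the JIT-compiled frame returns nonzero when the frame has completed,
 * zero when it merely needs the interpreter to run a callee first */
typedef int (*JitCode)(void *tc, void *frame);

/* hypothetical: pop the frame and hand control back to the caller frame */
static void return_to_caller(void *tc, void *frame) { (void)tc; (void)frame; }

static void jit_driver(void *tc, void *frame, JitCode code) {
    if (code(tc, frame))              /* 'is my own frame finished?'       */
        return_to_caller(tc, frame);  /* yes: return control to the caller */
    /* no: the interpreter runs the callee; the JIT code is re-entered
     * later at a stored label */
}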

With that said, what can you expect of me in the coming week? With object access and specialization guards complete, the next step is indeed calling interpreted code from the JIT, which I started on yesterday. I should also get to argument passing, object creation, decontainerization, 'special conditionals', and many other features of MoarVM. The goal is to find 'compilation blockers', i.e., operations which can't be compiled yet but are common, and work through them to support ever greater segments of compiled code.

In the long run, there are other interesting things I want to do. As I mentioned a few posts earlier, I'd like to evolve the 'JIT graph' - which is a linked list, for now - into a 'real' graph, ultimately in order to compile better code. An important part of that is determining, for any point in the code, which variables are 'live' and used, and which are not. This will allow us to generate code that loads important values - e.g., the pointer to the input arguments buffer - into a register temporarily, so that further instructions won't have to load it again. It will also allow us to avoid storing a computed value in a local if we know that it will be overwritten by the next instruction anyway (i.e., it is temporary). Because copy instructions are both very frequent and potentially very costly (because they access memory), eliminating them as much as possible should result in great speed advantages. Ultimately, this will also allow us to move more logic out of the architecture-specific parts and into the generic graph-manipulating parts, which should make the architecture-dependent parts simpler. I won't promise all this will be done in a single summer, but I do hope to be able to start on it.
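
As a very rough sketch of the liveness part (invented structures, nothing like the eventual implementation): a single pass over the instruction list can record the last position at which each value is read, after which a store whose value is never read again can be skipped.

#include <stddef.h>

#define MAX_VALUES 256

typedef struct Ins {
    struct Ins *next;
    int         reads[2];  /* values (locals) this instruction reads */
    int         num_reads;
} Ins;

/* last_use[v] = position of the last instruction that reads value v (0 = never read) */
static void compute_last_use(Ins *first, size_t last_use[MAX_VALUES]) {
    size_t pos = 1;
    for (Ins *ins = first; ins != NULL; ins = ins->next, pos++)
        for (int i = 0; i < ins->num_reads; i++)
            last_use[ins->reads[i]] = pos;  /* later reads overwrite earlier ones */
}

/* a value written at position p with last_use[value] <= p is never read afterwards,
 * so the copy back to the MoarVM register file can be elided */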




zaterdag 14 juni 2014

Progress

Those of you who have followed #moarvm or github closely may already know, but this week I've finally checked in code that calculates 2 + 2 = 4 and returns that value to its caller. To be very specific, I can make a frame that does the following operations:


const_i64_16 r0, 2
const_i64_16 r1, 2
add_i r2, r1, r0
return_i r2

As a proof of concept, this is a breakthrough, and it shows that the strategy we've chosen can pay off. I didn't quite succeed without the help of FROGGS, jnthn, nwc10, timotimo and others, but we're finally there. I hope. (I'll have to see about windows x64 support). The next thing to do is cleanup and extension. Some objectives for the following week are:
  • Cleanup. The JIT compiler still dumps stuff to stderr for my debugging purposes, but we shouldn't really have that. I've tried moving all output to the spesh log, but I can hardly find the data in there, so I think I'll make a separate JIT log file instead. Similarly, the file for the JIT compiler's machine code dump - if any - should be configurable. And I should add padding to the dump, so that more than one block can be dumped.
  • Adding operations to compile. MoarVM supports no fewer than 638 opcodes, and I support only 4 so far. That is about 0.6% of all opcodes :-). Obviously in those terms, I have a long way to go. jnthn suggested that the specialized sp_getarg opcodes are a good way to progress, and I agree - they'll allow us to pass actual arguments to a compiled routine.
  • Translate the spesh graph out of SSA form into the linear form that we use for the JIT 'graph' (which is really a labeled linked list so far).
  • Compile more basic blocks and add support for branching. This is probably the trickiest thing of the bunch.
  • Fix myself a proper windows-x64 virtual machine, and do the windows testing myself.
  • Bring the moar-jit branch up-to-date with moarvm master, so that testers don't have such a hard time.
As for longer-term goals, we've had some constructive contact with Mike Pall (of LuaJIT / DynASM fame), and he suggested ways to extend DynASM to support dynamic registers. As I tried to explain last week, this is important for 'good' instruction selection. On further reflection, it will probably do just fine to introduce expression trees - and the specialized compiler backend for them, which would need register selection - gradually, i.e. per supported instruction rather than all at once.

However, the following features are more important still:
  • Support for deoptimisation. Up until now (and for the foreseeable future) we keep the memory layout exactly the same.
  • JIT-to-interpreter calls. This is a bit tricky - MoarVM doesn't support nesting interpreters. What we'll have to do instead is return to the interpreter with a label that stores our continuation, and continue at that continuation when we return.
  • At some point, JIT-to-JIT calls. Much the same problems apply - in theory, this doesn't have to differ from JIT-to-interpreter calls, although obviously we'd rather optimise the interpreter out of this loop.
  • Support for exceptions, obviously, which - I hope - won't be as tricky as it seems, as it ultimately depends on jumping to the right place in the bytecode.
  • Support for simple optimisations, such as merging various MoarVM opcodes into a single opcode if that is more suitable.
So that is it for now. See you next week!