Hey everyone!

Through this post, I will be sharing my progress towards the task - https://github.com/parrot/parrot/issues/1083.

I have completed the part of the task that required me to inline the CallContext ATTR accessors to omit the obj check. For now, I have manually edited include/pmc/pmc_callcontext.h to fix the macro definitions.
However, after completing the rest of the task and confirming a speed improvement, I plan to improve pmc2c to generate this fix automatically.

parrot.org | Parrot VM | 2014-07-23 18:13:00

Hey everyone!

Let me share this week's progress.

Earlier this week, I finished my tests with Parrot for its releases 2.7 - 3.0. Including rurban's profiling, we now have data (reliable to some extent) for the commits in this range. The main objective for the profiling was to determine the highest overhead to be targeted next for refactor.

(For the list of all the identified overheads, please take a look at - http://wiki.enlightenedperl.org/gsoc2014/ideas/improve_performance_of_me...)

parrot.org | Parrot VM | 2014-07-16 18:16:14

As a dreamer of dreams and a travelin' man,
I have chalked up many a mile.
Read dozens of books about heroes and crooks,
And I've learned much from both of their styles.
    -- Heard playing in Margaritaville bar,
       in Orlando after YAPC::NA::2014.

On behalf of the Parrot team, I'm proud to announce Parrot 6.6.0, also known as "Parrothead". Parrot is a virtual machine aimed at running all dynamic languages.

parrot.org | Parrot VM | 2014-07-15 23:59:59

Hey everyone!

This week's work involves testing. We are trying to identify the commits that slowed Parrot down between releases 2.7 and 3.0. To do this, I am running the bench.sh tool provided in parrot-bench. rurban is helping me with these tests, both to save time (since I have a slow machine) and to cross-check results.

parrot.org | Parrot VM | 2014-07-09 17:39:23

Hey everyone!

I am happy to announce that my task #2 (https://github.com/parrot/parrot/issues/1080) is now complete and the issue has been closed.

To give a gist of what has been done -

The goal was to optimize the pmc2c compiler, more specifically the PCCMETHODs, by avoiding the run-time overhead of having to call two costly C functions per method call. These C functions were:

Parrot_pcc_fill_params_from_c_args(interp, _call_object, sig, &_self, args...);
Parrot_pcc_set_call_from_c_args(interp, _call_object, rettype, result);

parrot.org | Parrot VM | 2014-07-02 16:28:30

Hey everyone!

I will catch you up on my work this week. As I had mentioned in my last post (http://www.parrot.org/zyroz4), I have already started working on a new task (https://github.com/parrot/parrot/issues/1080).

Since this work requires me to make changes to the Pmc2c compiler, I need to code in Perl. However, I am new to Perl, so I spent quite some time this week getting used to basic Perl coding.

parrot.org | Parrot VM | 2014-06-25 17:50:42

Hey everyone!

Let me share the progress for this week.

I have successfully finished my task #1 that required me to add write barriers to the PMC methods (https://github.com/parrot/parrot/issues/1069).

For this task, I had finished most of the work last week itself (http://www.parrot.org/zyroz3).

This week I mostly verified all the WB annotations one last time, which helped me fix some bugs and incorrect WBs. All this work is now part of our latest release, Parrot 6.5.0, and we have achieved a speed improvement of around 2.5% - 5% through this task!

parrot.org | Parrot VM | 2014-06-18 18:29:26

Parrot 6.5.0 is available on Parrot's FTP site, or by following the download instructions.
For those who want to hack on Parrot or languages that run on top of Parrot, we recommend our organization page on GitHub, or you can go directly to the official Parrot Git repo on GitHub.

parrot.org | Parrot VM | 2014-06-17 15:13:53

Hey everyone!

There is a lot of good news to share this week! Our GSoC task #1 (https://github.com/parrot/parrot/issues/1069) is a success!! :D

So, I have been spending a lot of time looking into and fixing 95 pmc files with a calculated number of 2230 methods for core PMCs alone...

parrot.org | Parrot VM | 2014-06-10 20:38:57

Hey everyone!

I will be filling you in with what we have achieved in the last week.

So, rurban has fixed up the pmc2c compiler and it now looks ready to handle the write barriers. However, the rules that we had followed earlier to place these WBs in the PMCs have been changed to some extent for better performance. You can have a look at them at https://github.com/parrot/parrot/issues/1069.

My major work has involved verifying and correcting the annotations for the WBs in the methods for each PMC. There are about 95 .pmc files and I am about to reach the halfway mark.

parrot.org | Parrot VM | 2014-06-04 17:08:11

It's been quiet.  Too quiet.

Interest in Parrot has waned over the past 18 months.  The most recent flurry of activity happened when Allison Randal brought up the fact that The Parrot Foundation was in shambles and suggested shutting it down.  This naturally brought up the state of Parrot itself and what the future holds for it, if anything.  The situation is perhaps less than ideal.  The short answer is that Parrot's immediate prospects are iffy at best, but there is at least one niche where Parrot still has a chance to shine.

The surface problem with Parrot is that there’s a lack of people who can find the tuits to hack on it these days.  Different people have their own analyses as to why this is happening.  My best answer is that Parrot doesn’t have a compelling value proposition.  Hosting every dynamic language was pretty revolutionary around the time Parrot was started more than a decade ago.  Today that’s no longer the case and the bigger language runtimes like the JVM, CLR and JavaScript (not a VM but a very popular compilation target) can run circles around Parrot on most of the axes that matter.

Those of us who care about Parrot need to find a way to make it matter and to do so quickly.

Rakudo is the current most complete and active language implementation that runs on Parrot, and even *it* is moving toward running on many backends.  Parrot’s best bet is to focus exclusively on supporting Rakudo and give it a reason to stick around.  If supporting all dynamic languages was ever a good idea for Parrot, that’s no longer the case.  The reality of Parrot’s effective niche has become much harder to ignore.  The best move is to adapt accordingly.

Parrot has been inactive (among many reasons) because its developers can see that the goal of hosting all dynamic languages isn’t realistically attainable given Parrot's current resources.  With a new and more tightly defined plan, Parrot has a fighting chance to find a useful niche.

Parrot's new niche and reason for existence needs to be to support Rakudo and nqp until those languages either fail, succeed, or have no further use for Parrot.

This will be a liberating shift for Parrot.  The official policy is now “make nqp and Rakudo better”.  Within that constraint, any change is welcome.  In a bit more detail, the two goals by which any potential change should be judged are:

1) Does it provide a benefit to Rakudo, especially a *measurable* *non-theoretical* benefit?

If a change makes Rakudo happy, sold!  This includes requested features, optimizations, bug fixes and the like.  This is *the* primary concern and the best way to provide value to nqp and Rakudo.

2) Does it make Parrot’s code simpler without increasing complexity elsewhere?

Simplifying Parrot is valuable, but only in a much more indirect way.  This goal is a distant second in importance to performance improvements.  That said, simplifying Parrot is still helpful.  Some of Parrot’s problems come from the decade of accumulated cruft.  A simpler Parrot is more approachable and easier to profile, maintain and debug.  Simplicity should be pursued as long as that simplicity doesn't mean shuffling complexity elsewhere and *especially* if the simplification comes with a performance bump.

That’s all there is to it.  With simple and immediate rules rather than a slow and deliberate deprecation policy, half-done features that were kept around for years “just in case” can safely be removed.

Another implication of all this is that our deprecation and support policies are going away.  They were well-intentioned but better suited to a project in a much more mature and stable state.  Our new support policy is “we’ll try to fix bugs and keep nqp running”.  We’ll continue to make monthly releases but they will not be labelled as “supported” or “developer” as in the past.

Observers of Parrot will note by now that this isn’t the first time that Parrot has tried something radical.  This isn’t even the first time that *I’ve* tried something radical.  What's different this time is that we’re no longer trying to be all things to all languages; we’re trying to be one thing to one language that’s already our customer.  This will still involve a ton of work, but the scope reduction shrinks the task from Herculean to merely daunting.

So here’s where you, the reader, come in.  Whether you’ve hacked on Parrot in the past or came for the lulz and accidentally got interested, you can help.  The big goals are to make Parrot (and by extension nqp and Rakudo) smaller and faster.  Below are a few specific ways you can help.  Whatever you do though, don't make any changes that will be detrimental to nqp and Rakudo, and coordinate any backwards-incompatible changes before they get merged into Parrot master.

Grab a clone of Parrot and nqp.  Build and install them.  Play with the sixparrot branch, where some initial work is already in progress.  Already there?  Great!  The next steps are a little harder.

Remove code paths that nqp doesn’t exercise.  This can be single if statements or it can be whole sections of the source tree.  Tests are the same as code; if nqp’s and Rakudo’s tests don’t exercise them, out they go.  Tests exist to increase inertia, but are only useful to the degree that they test useful features.  When in doubt, either ask in #parrot or just rip it out and see what happens.

Relatedly, profile and optimize for nqp.  If you like C, break out valgrind, build out a useful benchmark and see how fast you can make it run.  If you find some code that doesn’t seem to be doing anything, you’ve just found an optimization!

Learn nqp and Perl 6.  There’s been a lack of tribal knowledge about nqp’s inner workings ever since Parrot started distancing itself from Rakudo.  We need to reverse that tendency so that nqp is regarded as an extension of Parrot.

Overall, the next few months will be interesting.  I don't know if they'll result in success for Parrot, but I'm willing to give it one more shot.

cotto | reparrot | 2013-02-15 23:51:22

I might not be too bright. Either that or I might not have a great memory, or maybe I’m just a glutton for punishment. Remember the big IO system rewrite I completed only a few weeks ago? Remember how much of a huge hassle that turned into and how burnt-out I got because of it? Apparently I don’t because I’m back at it again.

Parrot hacker brrt came to me with a problem: After the io_cleanup merge he noticed that his mod_parrot project doesn’t build and pass tests anymore. This was sort of expected, he was relying on lots of specialized IO functionality and I broke a lot of specialized IO functionality. Mea culpa. I had a few potential fixes in mind, so I tossed around a few ideas with brrt, put together a few small branches and think I’ve got the solution.

The problem, in a nutshell is this: In mod_parrot brrt was using a custom Winxed object as an IO handle. By hijacking the standard input and output handles he could convert requests on those handles into NCI calls to Apache and all would just work as expected. However with the IO system rewrite, IO API calls no longer redirect to method calls. Instead, they are dispatched to new IO VTABLE function calls which handle the logic for individual types.

First question: How do we recreate brrt’s custom functionality, by allowing custom bytecode-level methods to implement core IO functionality for custom user types?

My Answer: We add a new IO VTABLE, for “User” objects, which can redirect low-level requests to PMC method calls.

Second Question: Okay, so how do we associate this new User IO VTABLE with custom objects? Currently the get_pointer_keyed_int VTABLE is used to get access to the handle’s IO_VTABLE* structure, but bytecode-level objects cannot use get_pointer_keyed_int.

My Answer: For most IO-related PMC types, the kind of IO_VTABLE* to use is statically associated with that type. Socket PMCs always use the Socket IO VTABLE. StringHandle PMCs always use the StringHandle IO VTABLE, etc. So, we can use a simple map to associate PMC types with specific IO VTABLEs. Any PMC type not in this map can default to the User IO VTABLE, making everything “just work”.
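To make the lookup concrete, here is a minimal Python model of the idea. It is only a sketch, not the C code in any of the branches: the names `IO_VTABLE_MAP`, `USER_IO_VTABLE` and `lookup_io_vtable` are made up for illustration, and VTABLEs are represented by plain strings.

```python
# Illustrative model only; the real lookup would live in Parrot's C IO code.
# All names below are hypothetical.

USER_IO_VTABLE = "User"

# Static association between built-in PMC type names and IO VTABLE names.
IO_VTABLE_MAP = {
    "Socket": "Socket",
    "StringHandle": "StringHandle",
}


def lookup_io_vtable(pmc_type_name):
    """Return the IO VTABLE for a PMC type, defaulting to the User VTABLE."""
    return IO_VTABLE_MAP.get(pmc_type_name, USER_IO_VTABLE)
```

With this shape, `lookup_io_vtable("Socket")` picks the Socket VTABLE, while any type missing from the table (say, a custom Winxed class) falls through to the User VTABLE without needing get_pointer_keyed_int.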

Third Question: Hold your horses, what do you mean “most” IO-related PMC types have a static IO VTABLE? Which ones don’t and how do we fix it?

My Answer: The big problem is the FileHandle PMC. Due to some legacy issues the FileHandle PMC has two modes of operation: normal File IO and Pipe IO. I guess these two ideas were conflated together long ago because internally the details are kind of similar: Both files and pipes use file descriptors at the OS level, and many of the library calls to use them are the same, so it makes sense not to duplicate a lot of code. However, there are some nonsensical issues that arise because pipes and files are not the same: Files don’t have a notion of a “process ID” or an “exit status”. Pipes don’t have a notion of a “file position” and cannot do methods like seek or tell. Parrot uses the "p" mode specifier to tell a FileHandle to be in Pipe mode, which causes the IO system to select either the File or the Pipe IO VTABLE for each call. Instead of this terrible system, I suggest we separate out this logic into two PMC types: FileHandle (which, as its name suggests, operates on Files) and Pipe. By breaking up this one type into two, we can statically map individual IO VTABLEs to individual PMC types, and the system just works.

Fourth Question: Once we have these maps in place, how do we do IO with user-defined objects?

My Answer: The User IO VTABLE will redirect low-level IO requests into method calls on these PMCs. I’ll break IO_BUFFER* pointers out into a new PMC type of their own (IOBuffer) and users will be able to access and manipulate these things from any level. We’ll attach buffers to arbitrary PMCs using named properties, which means we can attach buffers to any PMC that needs them.
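A rough Python model of that redirection, under loud assumptions: `UserIOVTable`, `CapturingHandle`, the `properties` dict and the `"read_buffer"` property name are all invented here for illustration, and the real thing would be C code dispatching to PMC methods.

```python
class IOBuffer:
    """Stand-in for the proposed IOBuffer PMC type."""

    def __init__(self):
        self.data = bytearray()


class UserIOVTable:
    """Forwards low-level IO requests to methods on the handle object."""

    def write(self, handle, data):
        return handle.write(data)

    def read(self, handle, nbytes):
        return handle.read(nbytes)


class CapturingHandle:
    """A user-defined handle; it records writes instead of doing real IO."""

    def __init__(self):
        self.sent = []
        self.properties = {}  # models named properties on a PMC

    def write(self, data):
        self.sent.append(data)
        return len(data)

    def read(self, nbytes):
        return ""


vtable = UserIOVTable()
handle = CapturingHandle()
# Attach a buffer through a named property (the name is hypothetical).
handle.properties["read_buffer"] = IOBuffer()
written = vtable.write(handle, "hello")
```

The IO system only ever talks to the VTABLE, so a handle like brrt's can do whatever it wants inside `write` (for example, forward the data to Apache) without the IO core knowing about it.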

So that’s my chain of thought on how to solve this problem. I’ve put together three branches to start working on this issue, but I don’t want to get too involved in this code until I get some buy-in from other developers. The FileHandle/Pipe change is going to break some existing code, so I want to make sure we’re cool with this idea before we make breaking changes and need to patch things like NQP and Rakudo. Here are the three branches I’ve started for this:

  • whiteknight/pipe_pmc: This branch creates the new Pipe PMC type, separate from FileHandle. This is the breaking change that we need to make up front.
  • whiteknight/io_vtable_lookup: This branch adds the new IOBuffer PMC type, implements the new IO VTABLE map, and implements the new properties-based logic for attaching buffers to PMCs.
  • whiteknight/io_userhandle: This branch implements the new User IO VTABLE, which redirects IO requests to methods on PMC objects.

Like I said, these are all very rough drafts so far. All these three branches build, but they don’t necessarily pass all tests or look very pretty. If people like what I’m doing and agree it’s a good direction to go in, I’ll continue work in earnest and see where it takes us.

Andrew Whitworth | Whiteknight's Parrot Blog | 2012-11-21 00:00:00

First, some personal status:

Personal Status

I haven’t blogged in a little while, and there’s a few reasons for that. I’ll list them quickly:

  1. Work has been…tedious lately and when I come home I find that I want to spend much less time looking at a computer, especially any computer that brings more stress into my life. Also,
  2. My computer at home generates a huge amount of stress. In addition to several physical problems with it, and the fact that I effectively do not have a working mouse (the built-in trackpad is extremely faulty, and the external USB mouse I had been using is now broken and the computer won’t even boot if it’s plugged into the port), I’ve been having some software problems with lightdm and xserver crashing and needing to be restarted much more frequently than I think should be needed. We are planning to buy me a new one, but the budget won’t allow that until closer to xmas.
  3. The io_cleanup1 work took much longer than I had anticipated. I wrote a lot more posts about that branch than I ever published, and the ones I did publish were extremely repetitive (“It’s almost finished, any day now!”). Posting less means I got out of the habit of posting, which is a hard habit to be in and does require some effort.

I’m going to do what I can to post something of a general Parrot update here, and hopefully I can get back in the habit of posting a little bit more regularly again.

io_cleanup1 Status

io_cleanup1 did indeed merge with almost no problems reported at all. I’m very happy about that work, and am looking forward to pushing the IO subsystem to the next level. Before I started io_cleanup1, I had some plans in mind for new features and capabilities I wanted to add to the VM. However, I quickly realized that the house had some structural problems to deal with before I could slap a new coat of paint on the walls. The structure is, I now believe, much better. I’ve still got that paint in the closet and eventually I’m going to throw it on the walls.

The io_cleanup branch did take a lot of time and energy, much more than I initially expected. But, it’s over now and I’m happy with the results so now I can start looking on to the next project on my list.

Threads Status

Threads is very very close to being mergeable. I’ve said that before and I’m sure I’ll have occasion to say it again. However there’s one remaining problem pointed out by tadzik, and if my diagnosis is correct it’s a doozie.

The basic threads system, which I outlined in a series of blog posts ages ago goes like this: We cut out the need to have (most) locks, and therefore we cut out many possibilities of deadlock, by making objects writable only from the thread that owns them. Other threads can have nearly unfettered read access, but writes require sending a message to the owner thread to perform the update in a synchronized, orderly manner. By limiting cross-thread writes, we cut out many expensive mechanisms that would need to be used for writing data, like Software Transactional Memory (STM) and locks (and, therefore, associated deadlocks). It’s a system inspired closely by things like Erlang and some functional languages, although I’m not sure there’s any real prior art for the specifics of it. Maybe that’s because other people know it won’t work right. The only thing we can do is see how it works.

The way nine implemented this system is to set up a Proxy type which intercepts and dispatches read/write requests as appropriate. When we pass a PMC from one thread to another, we instead create and pass a Proxy to it. Every read on the proxy redirects immediately to a read on the original target PMC. Every write causes a task to dispatch to the owner thread of the target PMC with update logic.
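The read/write split can be modelled in a few lines of Python. This is a sketch of the idea only, not nine's implementation: `IntegerBox`, `Proxy`, `get_integer`, `set_integer` and the use of a `queue.Queue` as the owner thread's task list are all assumptions made for the example.

```python
import queue
import threading


class IntegerBox:
    """Stand-in for an Integer PMC owned by the main thread."""

    def __init__(self, value):
        self.value = value


class Proxy:
    """Reads go straight to the target; writes are queued for the owner."""

    def __init__(self, target, owner_queue):
        self._target = target
        self._owner_queue = owner_queue

    def get_integer(self):
        return self._target.value

    def set_integer(self, value):
        target = self._target

        def update():
            target.value = value

        # Do not touch the target here; hand the update to the owner thread.
        self._owner_queue.put(update)


owner_queue = queue.Queue()
x = IntegerBox(1)
proxy = Proxy(x, owner_queue)


def worker():
    proxy.set_integer(proxy.get_integer() + 1)


thread = threading.Thread(target=worker)
thread.start()
thread.join()

value_before_drain = x.value      # the write has only been queued so far
while not owner_queue.empty():    # the owner applies pending updates in order
    owner_queue.get()()
value_after_drain = x.value
```

The worker never writes to `x` directly, so no lock is needed around the data itself; the only synchronized structure is the owner's task queue.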

Here’s some example code, adapted from the example tadzik had, which fails on the threads branch:

function main[main](var args) {
    var x = 1;
    var t = new 'Task'(function() { x++; say(x); });
    ${ schedule t };
    ${ wait t };
}

Running this code on the threads branch creates anything from an assertion failure to a segfault. Why?

This example creates a closure and schedules that closure as a task. The task scheduler assigns that task to the next open thread in the pool. Since it’s dispatching the Task on a new thread, all the data is proxied. Instead of passing a reference to Integer PMC x, we’re passing a Proxy PMC, which points to x. This part works as expected.

When we invoke a closure, we update the context to point to the “outer” context, so that lexical variables (“x”, in this case) can be looked up correctly. However, instead of having an outer which is a CallContext PMC, we have a Proxy to a CallContext.

An overarching problem with CallContexts is that they get used, a lot. Every single register access (and almost all opcodes access at least one register) goes through the CallContext. Lexical information is looked up through the CallContext. Backtrace information is looked up in the CallContext. A few other things are looked up there as well. In short, CallContexts are accessed quite a lot.

Because they are accessed so much, CallContexts ARE NOT dealt with through the normal VTABLE mechanism. Adding in an indirect function call for every single register access would be a huge performance burden. So, instead of doing that, we poke into the data directly and use the raw data pointers to get (and to cache) the things we need.

And there’s the rub. For performance we need to be able to poke into a CallContext directly, but for threads we need to pass a Proxy instead of a CallContext. And the pointers for Proxy are not the same as the pointers for CallContext. See the problem?

I identified this issue earlier in the week and have been thinking it over for a few days. I’m not sure I’ve found a workable solution yet. At least, I haven’t found a solution that wouldn’t impose some limitations on semantics.

For instance, in the code example above, the implicit expectation is that the x variable lives on the main thread, but is updated on the second thread. And those updates should be reflected back on main after the wait opcode.

The solution I think I have is to create a new dummy CallContext that would pass requests off to the Proxied LexPad. I’m not sure about some of the individual details, but overall I think this solution should solve our biggest problem. I’ll probably play with that this weekend and see if I can finally get this branch ready to merge.

Other Status

rurban has been doing some great cleanup work with native PBC, something that he’s been working on (and fighting to work on) for a long time. I’d really love to see more work done in this area in the future, because there are so many more opportunities for compatibility and interoperability at the bytecode level that we aren’t exploiting yet.

Things have otherwise been a little bit slow lately, but between io_cleanup1, threads and rurban’s pbc work, we’re still making some pretty decent progress on some pretty important areas. If we can get threads fixed and merged soon, I’ll be on to the next project in the list.

Andrew Whitworth | Whiteknight's Parrot Blog | 2012-09-14 00:00:00

FINALLY! The big day has come. I’ve just merged whiteknight/io_cleanup1 to master. Let us rejoice!

When I started the project, months ago, I had intended to work on the branch for maybe a week or two at the most. Get in, clean what I could, get out. Wash, rinse, repeat. That’s exactly why I named the branch “io_cleanup1”, because I intended it to just be the first of what would be a large series of small branches. Unfortunately as I started cleaning I was led to other things that needed to go. And those things led elsewhere. Before I knew it I had deleted just about all the code in all the files in src/io/* and started rewriting from the ground up.

Sometimes sticking with a plan and breaking up projects into small milestones is a good thing. Other times when you know what the final goal is and you’re willing to put in the effort, it’s good to just go there directly. That’s what I ended up doing.

To give you an idea of what my schedule was originally, I had intended to get this first branch wrapped up and merged before GSOC started, so that I could keep my promise of implementing 6model concurrently with that program. With GSOC over last week (I’ll write a post-mortem blog entry about it soon), I’ve clearly failed at that. I’m extremely happy with the results so far and given the choice I would not go back and do things any differently. The IO system was in terrible condition and it desperately needed this overhaul. I wish it hadn’t taken me so long, but with a system that’s so central and important, it was worthwhile taking the extra time to make sure things were correct.

Where to go from here? My TODO list for the near future is very short:

  1. Threads
  2. 6model
  3. More IO work

The Threads branch, the magnum opus of Parrot hacker nine, is 99.9% of the way there. If we can just push it up over the cliff, we should be able to merge soon and open up a whole new world of functionality and cool features for Parrot. I’m already planning out all the cool additions to Rosella I’m going to make once threads are merged: a parallel test harness, asynchronous network requests, an IRC client library. The addition of a real, sane threading system opens up so many avenues to us that really haven’t been available before. Sure there are going to be plenty of hiccups and speedbumps to deal with as we really get down and start to use this system for real things, but the merge of the threads branch represents a huge step forward and a great foundation to build upon.

I’m going to be putting forward as much effort as I can to getting this branch wrapped up and merged. Some of the remaining problems only manifest on hard-to-test platforms, which is where things start to get tricky. As I mentioned in an email to parrot-dev a while ago, test reports on rare platforms are great, but if we can’t take action on the reported failures we can get ourselves into something of a bind. The capability to find problems on those platforms and the capability to fix problems on those platforms are two very different capabilities. But, most of the time that’s a small issue and we’re going to just have to find a way to muscle through and get this branch merged one way or the other. If we can merge it without purposefully excluding any platforms, that would be great.

Before anybody thinks that I’m done with IO and that system is now complete, think again. There is still plenty of work to be done on the IO subsystem, and all sorts of cool new features that become possible with the new architecture and unified type semantics. I want to separate out Pipe logic from FileHandle into a new dedicated PMC type. Opening FileHandles in “p” mode for pipes is clumsy at best, and I want a more sane system. And while I’m at it, 2-way and 3-way pipes would make for a great feature addition (we can’t currently do these in any reliable way).

The one thing that has changed most dramatically in the new IO system is buffers. The buffering subsystem has not only been rewritten but completely redesigned. Instead of being type-specific they are now unified and type independent. Buffers are their own struct with their own API. Instead of having a single buffer that is used for both read and write, handles now have separate read and write buffers that can be created and managed independently. I want to create a new PMC type to wrap these buffers and give the necessary management interface so they can be used effectively from the PIR level and above.
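As a rough picture of what “separate read and write buffers” means, here is a small Python model. It is a sketch under assumptions, not the struct or API from the branch: `IOBuffer`, `BufferedHandle`, `capacity` and the `sink` attribute (standing in for the underlying file or socket) are names invented for the example.

```python
class IOBuffer:
    """A type-independent buffer: just a capacity and some bytes."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = bytearray()


class BufferedHandle:
    """A handle whose read and write buffers are created independently."""

    def __init__(self):
        self.read_buffer = None
        self.write_buffer = None
        self.sink = bytearray()  # stands in for the underlying file or socket

    def write(self, data):
        buf = self.write_buffer
        if buf is None or len(data) > buf.capacity:
            # Unbuffered, or too large to buffer: write straight through.
            self.flush()
            self.sink += data
            return
        if len(buf.data) + len(data) > buf.capacity:
            self.flush()
        buf.data += data

    def flush(self):
        buf = self.write_buffer
        if buf is not None and buf.data:
            self.sink += buf.data
            buf.data.clear()


handle = BufferedHandle()
handle.write_buffer = IOBuffer(capacity=8)  # only the write side is buffered
handle.write(b"abc")
handle.write(b"defgh")
buffered_only = bytes(handle.sink)   # both writes still sit in the buffer
handle.write(b"i")                   # would overflow, so the buffer is flushed
after_overflow = bytes(handle.sink)
handle.flush()
final = bytes(handle.sink)
```

Because the buffer knows nothing about the handle type, the same `IOBuffer` could sit in front of a file, a socket or a string handle, and a handle can have a write buffer without ever allocating a read buffer.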

Finally, the whiteknight/io_cleanup1 branch tried to stay as backwards compatible as possible, so many breaking changes I wanted to make had to wait until later. In the future expect to see many smaller branches to remove old broken features, old crufty interfaces, and old bad semantics. We’ll make these kinds of disruptive changes in much smaller batches, with more space between them.

Andrew Whitworth | Whiteknight's Parrot Blog | 2012-08-27 00:00:00

On behalf of the Parrot team, I’m proud to announce Parrot 4.7.0, also known as “Hispaniolan”. Parrot is a virtual machine aimed at running all dynamic languages.

Parrot 4.7.0 is available on Parrot’s FTP site, or by following the download instructions at http://parrot.org/download. For those who would like to develop on Parrot, or help develop Parrot itself, we recommend using Git to retrieve the source code to get the latest and best Parrot code.

Parrot 4.7.0 News:

- Core
    + Added .all_tags() and .all_tagged_pmcs() methods to PackfileView PMC
    + Several build and coding standards fixes

The SHA256 message digests for the downloadable tarballs are:

4360ac3dffafffaa00bce561c1329df8ad134019f76930cf24e7a875a4422a90 parrot-4.7.0.tar.bz2
c0bffd371dea653b9881ab2cc9ae5a57dc9f531dfcda0a604ea693c9d2165619 parrot-4.7.0.tar.gz

Many thanks to all our contributors for making this possible, and our sponsors for supporting this project. Our next scheduled release is 18 September 2012.

The release is indeed out a day late. It’s not that I forgot about it, it’s just that I can’t read a calendar and HOLY CRAP, IT’S WEDNESDAY ALREADY? When did that happen? So, and I can’t stress this enough, Mea Culpa.

Andrew Whitworth | Whiteknight's Parrot Blog | 2012-08-22 00:00:00

This morning I made a few last commits on my whiteknight/io_cleanup1 branch, and I’m cautiously optimistic that the branch is now ready to merge. The last remaining issue, which has taken the last few days to resolve, has been fixing readline semantics to match some old behavior.

A few days ago I wrote a post about how complicated readline is. At the time, I thought I had the whole issue under control. But then Moritz pointed out a problem with a particular feature unique to Socket that was missing in the new branch.

In master, you could pass in a custom delimiter sequence as a string to the .readline() method. Rakudo was using this feature like this:

str = s.readline("\r\n")

Of course, as I’ve pointed out in the post about readline and elsewhere, there was no consistency between the three major builtin types: FileHandle, Socket and StringHandle. The closest thing we could do with FileHandle is this:

f.record_separator("\n");
str = f.readline();

Notice two big differences between FileHandle and Socket here: First, FileHandle has a separate record_separator method that must be called separately, and the record separator is stored as state on the FileHandle between .readline() calls. Second, FileHandle’s record separator sequence may only be a single character. Internally, it’s stored as an INTVAL for a single codepoint instead of as a STRING*, even though the .record_separator() method takes a STRING* argument (and extracts the first codepoint from it).

Initially in the io_cleanup1 branch I used the FileHandle semantics to unify the code because I wasn’t aware that Socket didn’t have the same restrictions that FileHandle did, even if the interface was a little bit different. I also didn’t think that the Socket version would be so much more flexible despite the much smaller size of the code to implement it. In short, I really just didn’t look at it closely enough and assumed the two were more similar than they actually were. Why would I ever assume that this subsystem ever had “consistency” as a driving design motivation?

So I rewrote readline. From scratch.

The new system follows the more flexible Socket semantics for all types. Now you can use almost any arbitrary string as the record separator for .readline() on FileHandle, StringHandle and Socket. In the whiteknight/io_cleanup1 branch, as of this morning, you can now do this:

var f = new 'FileHandle';
f.open('foo.txt', 'r');
f.record_separator("TEST");
string s = f.readline();

…And you can also do this, which is functionally equivalent:

var f = new 'FileHandle';
f.open('foo.txt', 'r');
string s = f.readline("TEST");

The same two code snippets should work the same for all built-in handle types. For all types, if you don’t specify a record separator by either method, it defaults to “\n”.

Above I mentioned that almost any arbitrary string should work. I use the word “almost” because there are some restrictions. First and foremost, the delimiter string cannot be larger than half the size of the buffer. Since buffers are sized in bytes, this is a byte-length restriction, not a character-length restriction. In practice we know that delimiters are typically things like “\n”, “\r\n”, “,”, etc. So if the buffer is a few kilobytes this isn’t a meaningful limitation. Also, the delimiter must be in the same encoding that the handle uses, or it must be convertible to that encoding. So if your handle uses ascii, but you pass in a delimiter which is utf16, you may see some exceptions raised.
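
To make the size restriction concrete, here is a minimal sketch in plain C. It is not the code from the branch: the function name delimiter_fits and the choice to reject an empty delimiter are assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical helper, not part of Parrot's API.
   Returns 1 if a delimiter occupying delim_bytes bytes may be used
   with a buffer of buffer_bytes bytes, 0 otherwise. Both sizes are
   byte counts, not character counts. */
int delimiter_fits(size_t delim_bytes, size_t buffer_bytes)
{
    /* Assumption of this sketch: an empty delimiter is rejected,
       since it could never mark the end of a record. */
    if (delim_bytes == 0)
        return 0;

    /* The delimiter may be at most half the size of the buffer. */
    return delim_bytes <= buffer_bytes / 2;
}
```

Because the check is in bytes, a three-character delimiter in a two-byte encoding counts as six bytes here, which is why the encoding of the delimiter matters as well as its length.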

I think that the work on this branch, save for a few small tweaks, is done. I’ve done some testing myself and have asked for help to get it tested by a wider audience. Hopefully we can get this branch merged this month, if no other problems are found.

Andrew Whitworth | Whiteknight's Parrot Blog | 2012-07-22 00:00:00

I had made a round of fixes with regards to encodings in the whiteknight/io_cleanup1 branch a few days ago. Rakudo hacker Moritz was able to take a look at Rakudo’s spectests and verify that more tests were indeed passing because of it. The remaining test failures represent the changing semantics for the read method and what appear to be two genuine regressions or bugs.

Hopefully I will be able to get all these things sorted out this week before I go away on a mini vacation next weekend. Otherwise I can’t imagine this branch getting merged before the 4.6 release this month.

A few days ago I wrote a post about readline and some of the intricacies involved in that, and some of the weird semantics that I was attempting to unify. It turns out that some of these semantics are a major cause of one of the last bugs in the branch. Let’s look at some code in master to see where the hangup is. First, readline on a Socket:

METHOD readline(STRING *delimiter    :optional,
                INTVAL has_delimiter :opt_flag) {
    INTVAL idx;
    STRING *result;
    STRING *buf;
    GET_ATTR_buf(INTERP, SELF, buf);

    if (!has_delimiter)
        delimiter = CONST_STRING(INTERP, "\n");

    if (Parrot_io_socket_is_closed(INTERP, SELF))
        RETURN(STRING *STRINGNULL);

    if (buf == STRINGNULL)
        buf = Parrot_io_reads(INTERP, SELF, CHUNK_SIZE);

    while ((idx = Parrot_str_find_index(INTERP, buf, delimiter, 0)) < 0) {
        STRING * const more = Parrot_io_reads(INTERP, SELF, CHUNK_SIZE);
        if (Parrot_str_length(INTERP, more) == 0) {
            RETURN(STRING *buf);
        }
        buf = Parrot_str_concat(INTERP, buf, more);
    }

    idx += Parrot_str_length(INTERP, delimiter);
    result = Parrot_str_substr(INTERP, buf, 0, idx);
    buf = Parrot_str_substr(INTERP, buf, idx, Parrot_str_length(INTERP, buf) - idx);
    SET_ATTR_buf(INTERP, SELF, buf);
    RETURN(STRING *result);
}

We can ignore the fact that this implementation of readline doesn’t call Parrot_io_readline like every other PMC does. Or that if we did call that function the program would throw an exception because Parrot_io_readline doesn’t support sockets anyway. Whatever. Moving on…

For comparison, let’s look at the version from the Handle PMC (which is inherited by FileHandle):

METHOD readline() {
    STRING * const string_result = Parrot_io_readline(INTERP, SELF);
    RETURN(STRING *string_result);
}

The Socket version takes a delimiter parameter which is a STRING. When doing readline on a Socket, you can pass in any arbitrary string which is used as the token for end of line. With FileHandle, you don’t seem to have that. You can definitely use custom delimiters with FileHandle, but we clearly don’t take a delimiter here and we aren’t passing one in as an argument to Parrot_io_readline like we do in the branch. Let’s see how it’s done instead. Here’s a snippet from Handle PMC:

    ATTR INTVAL    record_separator;  /* Record separator (only single char supported) */

We don’t need to look at any other code. This is the smoking gun. Socket.readline() can take any arbitrary STRING to use as a record separator, but FileHandle.readline() can only use a single codepoint, which it doesn’t take as an argument.

So that’s the problem right there. When I standardized the readline mechanics between types, I picked the FileHandle semantics. This was probably the wrong decision, because not only could Sockets use a more general mechanism but Rakudo relies on that behavior in its spectests. This does raise a question about why nobody ever expected this same behavior from FileHandle, or why the difference was not considered some kind of bug. It really goes to show how immature our IO system has been for all these years, and how we had all just grown accustomed to the arbitrary, inconsistent, nonsensical behaviors. It just works for some basic usages, so nobody ever complains about it. That time is, thankfully, coming quickly to an end.

Fixing this issue is actually going to take some serious work. Several function signatures are going to need updating to take a STRING delimiter instead of an INTVAL codepoint, and a major chunk of buffering logic is going to need to be rewritten to work on substrings instead of on individual codepoints. This, in turn, is going to require a heck of a lot more testing.
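
To show what searching for a substring terminator (instead of a single codepoint) looks like at the byte level, here is a rough sketch in plain C. It is not the buffering code from the branch; the function name find_line_end is hypothetical, and the sketch deliberately ignores the harder case where the delimiter is split across two buffer fills.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not Parrot's API.
   Scans buf for the first occurrence of the delimiter and returns the
   offset just past it, i.e. the number of bytes that make up the line
   including its terminator. Returns 0 if the delimiter is not present
   in this chunk of data. */
size_t find_line_end(const char *buf, size_t buf_len,
                     const char *delim, size_t delim_len)
{
    size_t i;

    if (delim_len == 0 || delim_len > buf_len)
        return 0;

    /* Compare the delimiter against every position where it could
       still fit entirely inside the buffer. */
    for (i = 0; i + delim_len <= buf_len; i++) {
        if (memcmp(buf + i, delim, delim_len) == 0)
            return i + delim_len;
    }
    return 0;
}
```

With a single-codepoint separator the inner comparison collapses to one equality test, which is part of why the old FileHandle logic could get away with an INTVAL.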

Last night I started putting in some of the changes necessary to use a substring terminator instead of a single codepoint. Most of what I’ve already done has been modifying function signatures. The real changes need to occur deep within the buffering logic and will require a little bit more time.

I’m looking forward to getting this branch fixed up and merged back to master so I can get to work on my next project. I think 6model is going to be the next thing I dig into, before I find something else that annoys me enough to put in a huge amount of effort to rewrite it. I’ll post more updates about my future projects and plans as I go.

Andrew Whitworth | Whiteknight's Parrot Blog | 2012-07-10 00:00:00

HTML is a derivative of SGML, just like XML is. Sure, they look pretty much the same for the most part, but there are a few key differences that prevent HTML from being parsed exactly like XML. Part of the reason why I like XHTML so much is that it’s more usable with more parsers, including many of the simpler and full-featured XML parsers. Simplicity in parsing was one of the original motivations of the XML design, at least in comparison to a full SGML parser or even something like a full HTML parser.

But that’s all beside the point.

I’ve been on something of a backyard gardening kick lately. We bought our house only a few short months ago, and are only half way through the first summer growing season in my modest little garden. My plans for next year are much more expansive. I’ve finally talked my wife into letting me buy some cherry trees to plant. She was also pretty willing to get a few grape vines planted (especially when I sketched out the beautiful wooden arbor they would be growing on). She put her foot down when I started talking about blueberries, apples and pears, however. And another garden bed or two for more vegetables. For some reason she’s convinced that we need some measure of open space in our little plot so the kid has somewhere to run and play. Some people have weird priorities.

This is all sort of beside the point too.

Getting the things I need for all this gardening work I’ve talked myself into is not cheap. Cherry tree seeds actually do grow on trees so that’s not a big deal, but other things like fertilizers, soil amendments, tools, materials for building a grape trellis and raised garden beds, not to mention a longer hose to reach all the new things that are going to require regular watering, all cost money. And maybe a sprinkler, like one of those fancy ones on an electronic timer. I can avoid some of that cost by getting things used and at discount on sites like Craigslist. So I’ve been going there. Every day.

And it’s tedious. I have to sort through hundreds of listings for things I don’t want, in categories that seem far too coarse. Sometimes, because things often get incorrectly categorized, I have to look in other related categories too, sorting through things that are even less relevant on average to try and find the occasional gem. This is all on top of the hardware-related problem I have of being unable to use the trackpad on my laptop, so web navigation on sites without keyboard shortcuts is an extreme pain. I start to think to myself: I can do better, I’m a programmer! For some values of “better” and “programmer”.

Enter Rosella. Now with Parrot, Winxed and Rosella I can use the Net library to fetch the text of the HTML code of the page. After some hacking in the last few days, I can parse that code with my Xml library (set in a new lenient mode) and start to work with it in a meaningful way:

function main[main]() {
    var rosella = load_packfile("rosella/core.pbc");
    Rosella.initialize_rosella("xml", "net", "string");

    var ua = new Rosella.Net.UserAgent.SimpleHttp();
    var response = ua.get("http://philadelphia.craigslist.org/w4m/");
    var doc = Rosella.Xml.read_string(response.content, false);

        .get_children_named("p", "row":[named("class")])
        .map(function(node) {
            return {
                "title": node.first_child("a").get_inner_xml(),
                "link":  node.first_child("a").attributes["href"],
                "price": node.first_child("span", "itempp":[named("class")]).get_inner_xml(),
                "has_pic": !Rosella.String.null_or_empty(
                    node.first_child("span", "itempx":[named("class")]).get_inner_xml()
                )
            };
        })
        .filter(function(obj) {
            return indexof(obj["title"], "compost") >= 0;
        })
        .map(function(obj) {
            return Rosella.String.format_obj("<a href='{link}'>{title} for {price}</a>", obj);
        })
        .foreach(function(string s) { say(s); });
}

That second argument to Rosella.Xml.read_string tells the parser to go into “non-strict” mode, which is basically my attempt to fudge the XML parsing rules to allow for the SGML nonsense in HTML. Without that, the parser will blow up pretty early in the parse because of unbalanced tags. The XML parser by default does not handle tags which are not balanced and which do not have the trailing slash to indicate a standalone tag, and the Craigslist source is filled with those kinds of things.

All I need to do is set this scraper up on a timer, and have it send me results somehow. If I set up a small server with mod_parrot and some kind of tool for generating RSS feeds, I could have this output neatly delivered to me on a regular basis. Considering that mod_parrot is moving along so smoothly and RSS is just another XML format, I think this is a pretty reasonable idea.

So, I started working on that. As of last night, I’ve sketched out two small libraries, one for RSS feeds and one for the competing standard, Atom. These libraries are thin wrappers around the XML library to deal with the specifics of RSS and Atom. Here’s an example of consuming an RSS feed:

var rss = Rosella.Rss.read_url("http://www.parrot.org/rss.xml");
    .foreach(function(i) {
        say(Rosella.String.format_obj("{title} (by {creator}) : {description}", i));
    });

You can do almost exactly the same thing with an Atom feed too, if you’ve got one of those instead. Right now RSS and Atom are implemented in two separate libraries, but I may combine them together for simplicity and to avoid unnecessary code duplication.

I’m working on an interface to write and publish feeds as well, though that’s not quite ready yet. You can bet that when I’ve got that working, I’ll be setting up a copy of mod_parrot to use it with.

I’ve been sort of kicking around the idea of a specialized HTML parsing library, which would more or less be an SGML parser with some schema information. I’m not sure I want to get into that hassle because HTML is a pretty messy thing and it will take a huge amount of effort to get something that works most of the time. But, if you’re willing to put up with a little bit of oddity, the Xml library works well enough for many cases.

Andrew Whitworth | Whiteknight's Parrot Blog | 2012-06-30 00:00:00

The io_cleanup1 branch is nearing completion, though as always the last few details are what hold everything up. In the past few days all the remaining tests in the parrot repo were passing. The coding standards tests were, as usual, the last to be resolved. Then I started building and testing other things on the branch: Winxed builds and tests fine. So does Rosella. Then I looked at NQP and Rakudo. Both built fine, but Rakudo was failing two socket-related spectests.

That’s not entirely unexpected. Even though my intention was to make this branch as painless as possible there were still some unavoidable changes to interfaces and semantics. There are a few places where older semantics are surrounded by large /* HACK! */ comments, but for the most part I’ve tried to make everything sane. That’s why I wasn’t surprised to see Rakudo failing a few tests. I was much more surprised that Rakudo built without any problems the first time I tried it. I figured the test failures represented some kind of semantic mismatch, and getting Rakudo passing again would have been as easy as getting the old semantics returned, with a note about a future update path.

It turns out this wasn’t exactly the case. For one test it was the simple difference in the way we read on streams with multibyte encodings. This was expected and we can fix it to use the old behavior if that’s what Rakudo prefers. For the second failing test, it’s not that there’s a semantic difference per se, but instead there is a glaring and serious bug in master that was corrected in the new branch. Here, I’m going to explain what’s going on.

Look at this code:

STRING *
Parrot_io_recv_handle(PARROT_INTERP, ARGMOD(PMC *pmc), size_t len)
{
    Parrot_Socket_attributes * const io = PARROT_SOCKET(pmc);

    /* This must stay ASCII to make Rakudo and UTF-8 work for now */
    STRING * res    = Parrot_str_new_noinit(interp, len);
    INTVAL received = Parrot_io_recv(interp, io->os_handle,
                                     res->strstart, len);

    res->bufused = received;
    res->strlen  = received;

    return res;
}

This is a pared-down version of the code behind the recv method on Socket. It creates a new string with the specified length pre-allocated, then passes the buffer to the low-level recv C API (which has been abstracted a little to account for platform differences).

Notice the comment there in the middle which says the string uses the ASCII encoding, for use by Rakudo. This is what I saw, and this is the semantic I followed in the new system: When you read from a socket by default in the new system, the string is encoded as ASCII unless you specify differently.

Just for my own peace of mind, I had to look at the Parrot_str_new_noinit function to verify that the string was, in fact, being set to ASCII:

STRING *
Parrot_str_new_noinit(PARROT_INTERP, UINTVAL capacity)
{
    STRING * const s = Parrot_gc_new_string_header(interp, 0);
    s->encoding = Parrot_default_encoding_ptr;

    Parrot_gc_allocate_string_storage(interp, s,
        (size_t)string_max_bytes(interp, s, capacity));

    return s;
}

Elsewhere in the system, we have this:

Parrot_default_encoding_ptr = Parrot_ascii_encoding_ptr;

So yes, the string returned by the Socket does indeed use the ASCII encoding in master. And, after double-checking, the version in the io_cleanup1 branch was using ASCII also. However, in the new branch Rakudo’s test fails because of an exception about a lossy conversion of non-ascii data into the lower bit-width format. A quick check shows that both systems create an ASCII string buffer and both systems call the same recv function to fill it. So where’s the problem? What the hell?

For comparison, here’s the snippet of code from the new branch that reads data into a STRING, possibly using a buffer:

bytes_read = Parrot_io_buffer_read_b(interp, buffer, handle, vtable,
                                   s->strstart + s->bufused, byte_length);
s->bufused += bytes_read;
STRING_scan(interp, s);

We’re reading out a number of bytes, appending them into the string’s pre-allocated storage and updating the number of bytes actually used. That’s all the same as in master. However, the last line, STRING_scan, does not appear in master. What is it?

STRING_scan() loops through the data in the string to verify that it correctly matches the string’s encoding. For instance, if the string is encoded as ASCII, STRING_scan will loop through to make sure all character values are lower than 128. If the string is UTF-16, STRING_scan verifies that we have an even number of bytes and that each value is an acceptable codepoint.
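
As a rough illustration of the kind of check being described, here is a minimal sketch in plain C. These are not Parrot’s STRING_scan implementations; the function names are hypothetical, and the UTF-16 check covers only the “even number of bytes” part, not codepoint validity.

```c
#include <stddef.h>

/* Hypothetical stand-in for an ASCII scan: returns 1 if every byte
   is a 7-bit value (lower than 128), 0 as soon as one is not. */
int is_valid_ascii(const unsigned char *bytes, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (bytes[i] >= 128)
            return 0;
    }
    return 1;
}

/* Hypothetical first step of a UTF-16 scan: the data must consist of
   whole 16-bit units, so the byte length must be even. */
int has_whole_utf16_units(size_t byte_len)
{
    return byte_len % 2 == 0;
}
```

Feeding UTF-8 bytes for a non-ASCII character into is_valid_ascii fails on the first byte with the high bit set, which is the same kind of mismatch the Rakudo test runs into.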

master doesn’t do this, which means there is a bug. In master, we don’t scan the string after recv but before we return it to the user, which means we can have non-ASCII data in a string marked with the ASCII encoding. The Rakudo test puts UTF-8 data into the socket on the server side, and then reads out a string and encodes that to UTF-8 to verify that it comes out correctly. However in the new branch we actually check that the string is valid before giving it out to user code, and it isn’t, so we throw an exception.

Combine that with the fact that the Socket PMC has no way to change the encoding it uses in master, and all Sockets used in Parrot master are potential sources of bugs.

Two nights ago I added methods to Socket to get/set the encoding to use, and everybody’s favorite Moritz created a branch for Rakudo to use it. Last night I did some playing with default encodings. Tonight and into the weekend I’m hoping to wrap up the last few details to get the Rakudo spectest passing like normal again. Hopefully, if all goes well, we can start talking about a merger within the next week or two.

Andrew Whitworth | Whiteknight's Parrot Blog | 2012-06-28 00:00:00

In terms of usage, there aren’t too many IO-related features in Parrot’s user interface more straight-forward than the readline method. It does exactly what you tell it to do: read a line of text from the given file and return that line of text as a Parrot string. Easy.

Tonight I was looking at some of the old code to get an idea about expected semantics for some tests that need fixing. Let’s look at some code:

.sub read_a_line
    .param string type
    $P0 = new [type]
    $S0 = $P0.'readline'()
    .return($S0)
.end

.sub test_readline
    $S0 = 'read_a_line'('FileHandle')
    say $S0
    $S0 = 'read_a_line'('Socket')
    say $S0
    $S0 = 'read_a_line'('StringHandle')
    say $S0
.end

The valid types for this are, as usual, "FileHandle", "Socket" and "StringHandle". Notice that we’re reading a line from the object of the given type before we’ve opened, connected or initialized it. Pretend, in order to save myself some typing, that I’ve set up exception handlers and the like above. So, what happens?

  • For FileHandle we throw an exception. You can’t read from a closed handle.
  • For StringHandle, we throw an exception for the same reason.
  • For Socket we return null because…whatever. (In the test suite we test that, when converted to a floating-point number, it’s 0.0. Again, whatever.)

So that’s a little bit weird that socket does something different from the other two, but fundamentally it’s a pretty different type so I suppose some differences can be allowed.

Now, let’s try something slightly different:

.sub read_a_line
    .param string type
    $P0 = new [type]
    $P0.'open'("foo.txt", "r")
    $P0.'print'("This is \n test text")
    $P0.'close'()
    $S0 = $P0.'readline'()
    .return($S0)
.end

With this example we can only operate on FileHandle and StringHandle because Socket doesn’t have an .open() method like those two do. What does this do for those two types?

  • For FileHandle we throw the same exception, you still can’t read from a closed handle.
  • For StringHandle you can read like normal without any indication that the handle is closed!

So it’s weird, to say the least, that StringHandle has two different behaviors. Socket has yet another problem, in a slightly different way. The method Socket.readline() returns null when not open, but if you pass a Socket to the Parrot_io_readline method, it always throws an exception because apparently readline on a Socket isn’t supported! And because readline on a Socket uses a completely different code path from FileHandle, the two types use completely different buffering mechanisms with subtly different semantics (StringHandle, because it uses the in-memory string buffer, does it in a third way).

To recap: What is conceptually a simple operation, read in some text until we find a delimiter, is done in three completely different ways by three different types, each with different error-handling semantics depending on history, state, and the interface used. If anybody was wondering why I wanted to rewrite this subsystem, here’s part of the reason.

Actually, I kind of lied. It’s really not a simple operation which is all the more reason we should share common code. It’s a clear case of an algorithm where the hard parts should be encapsulated inside a clean interface so that different types can avoid needing to reimplement it over and over again (with differences, bugs and complications). That’s the way it really should be, but some of the complications in the code are a little hard to live with. Here’s the general algorithm for readline on a FileHandle, as it’s implemented in Parrot master:

  1. The filehandle requires a buffer for this, so create (and fill) a buffer if one isn’t configured.
  2. Create a new, empty STRING header.
  3. Treating the buffer like an encoded STRING, scan the buffer looking for the end of the delimiter or the end of the buffer, whichever comes first.
  4. Allocate/reallocate enough space in the STRING header to hold all the data we’ve found in the buffer.
  5. Append all the characters we’ve found to the STRING.
  6. If we’ve found the delimiter, we’re done. Return it to the user.
  7. Otherwise, check if we are at the end of file for the input. If so, go to 8. If not end of file, go to 9.
  8. Check that the last codepoint is complete and has all its bytes. If so, return the STRING to the user. If not, throw an exception about a malformed string.
  9. Check that the last codepoint is complete and has all its bytes. If so, go to 10. Otherwise, go to 11.
  10. Refill the buffer and go to 3.
  11. Determine how many more bytes we need to read to complete the last codepoint.
  12. Refill the buffer, and check that we have at least that many bytes available to read. If so, go to 13. Otherwise, throw an exception about a malformed string input.
  13. Read in the necessary number of bytes (1, 2 or 3 at most) from the buffer and go to 3.
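
Steps 8, 9 and 11 above all hinge on one question: is the last codepoint in the data complete, and if not, how many bytes are missing? Here is a minimal sketch of that check in plain C, assuming UTF-8 data. It is not the code from master; the function names are hypothetical, and malformed input is simply reported as “nothing missing” rather than diagnosed.

```c
#include <stddef.h>

/* Total length of a UTF-8 sequence, judged from its lead byte.
   Returns 0 if the byte cannot start a sequence. */
static size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80)           return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}

/* Hypothetical helper: returns how many more bytes (0 to 3) are
   needed to complete the final codepoint in buf. A result of 0 means
   the data already ends on a codepoint boundary. */
size_t utf8_missing_bytes(const unsigned char *buf, size_t len)
{
    size_t back = 0;
    size_t have, need;

    /* Walk back over trailing continuation bytes (10xxxxxx); a valid
       sequence has at most three of them. */
    while (back < len && back < 3 && (buf[len - 1 - back] & 0xC0) == 0x80)
        back++;

    /* Empty input, or nothing but continuation bytes in view: this
       sketch does not try to diagnose that and reports 0. */
    if (back == len)
        return 0;

    have = back + 1;   /* lead byte plus the continuation bytes seen */
    need = utf8_sequence_length(buf[len - 1 - back]);
    return (need > have) ? need - have : 0;
}
```

For fixed-width single-byte data the function always returns 0, which matches the observation that the logic collapses for ASCII.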

If you’re reading an ASCII or fixed8 string the logic obviously collapses down to something a little bit more manageable. Also, this same logic, almost line for line, is repeated in the routine to read a given number of characters from the handle, where characters in a non-fixed-width encoding (like utf8) may need multiple reads to get if we don’t get all the bytes for the character into the buffer in a single go. Notice that the versions provided by StringHandle and Socket are both much simpler and not safe for multi-byte encodings like utf8 or utf16.

In my io_cleanup1 branch, the logic has been simplified substantially, and a single codepath is now used for all three of the major types:

  1. Make sure the handle has a read buffer set up and filled.
  2. Create a new, empty STRING header.
  3. Ask the buffer to find the given end-of-line character. The buffer will return a number of bytes to read in order to get a whole number of codepoints, and a flag that says whether we’ve found the delimiter or not.
  4. Append those bytes to the string header.
  5. If the delimiter is found or if we are at EOF, return the string.
  6. Fill the buffer and go to #3.

By simply coding the buffer logic to refuse to return incomplete codepoints in response to a STRING read request, the whole algorithm becomes hugely simplified. The readline routine in master takes up 185 lines of C code. In my new branch, the same routine takes up only 47 lines. Of course, this isn’t comparing apples to apples, because I did break up some of the repeated logic into helper routines, and the buffers in my system are obviously a little bit smarter about STRINGs and codepoints, but that’s not exactly the point. The real point is that three large, complicated, hard-to-read functions in master are now a single, much smaller, easier-to-read routine that relies on clear abstraction boundaries to do a difficult job in a much more conceptually simple way.
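
Here is a rough sketch of the “buffer refuses to return incomplete codepoints” idea in plain C. It is not the IO_BUFFER code from the branch: the names scan_result and scan_buffer are hypothetical, the data is assumed to be UTF-8, and the delimiter is assumed to be a single ASCII byte (which can never appear inside a multi-byte UTF-8 sequence, so a plain byte search is safe).

```c
#include <stddef.h>

/* What the buffer reports back to the readline loop. */
typedef struct {
    size_t bytes;   /* bytes that are safe to append to the string */
    int    found;   /* 1 if those bytes end with the delimiter */
} scan_result;

/* Length of the longest prefix of buf that ends on a UTF-8
   codepoint boundary. */
static size_t whole_codepoint_prefix(const unsigned char *buf, size_t len)
{
    size_t i = len;
    size_t seq;
    unsigned char lead;

    /* Step back over trailing continuation bytes (10xxxxxx). */
    while (i > 0 && (buf[i - 1] & 0xC0) == 0x80)
        i--;
    if (i == 0)
        return len;     /* no lead byte in view; left for later validation */

    lead = buf[i - 1];
    if (lead < 0x80)                seq = 1;
    else if ((lead & 0xE0) == 0xC0) seq = 2;
    else if ((lead & 0xF0) == 0xE0) seq = 3;
    else if ((lead & 0xF8) == 0xF0) seq = 4;
    else                            return len;   /* invalid lead byte; not diagnosed here */

    /* If the last sequence is cut short, stop just before its lead byte. */
    return ((i - 1) + seq <= len) ? len : i - 1;
}

/* Hypothetical helper: look for delim in the buffered bytes. */
scan_result scan_buffer(const unsigned char *buf, size_t len, unsigned char delim)
{
    scan_result r;
    size_t i;

    for (i = 0; i < len; i++) {
        if (buf[i] == delim) {
            r.bytes = i + 1;    /* include the delimiter itself */
            r.found = 1;
            return r;
        }
    }
    r.bytes = whole_codepoint_prefix(buf, len);
    r.found = 0;
    return r;
}
```

The caller appends r.bytes bytes to the string and, if r.found is 0 and the handle is not at EOF, refills the buffer and asks again. Any incomplete trailing bytes simply stay in the buffer until more data arrives.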

I’ve also updated the STRING read routine (now called Parrot_io_read_s) to use a similar algorithm and actually share some of the new helper methods. That sharing itself also helps to decrease total lines of code and has other benefits as well.

Notice that there is one small change in these two algorithms, which may or may not need to be worked around if it causes problems. Notice that we don’t read out of the buffer an incomplete codepoint. If we have an incomplete one at the end of the file, the first algorithm will read it in and throw an exception about a malformed string. The second algorithm will ignore those final bytes and successfully return all the rest of the valid-looking data from the buffer instead. In the first algorithm, it then becomes impossible to read the partial data out and make a best effort, while in the second algorithm you can easily get to the data, even if the last codepoint is corrupted and cannot be read. I’d really love to hear what people think about this change, and whether it’s worth keeping or needs to change. I suspect it is better this way but only the users can really say for sure.

Andrew Whitworth | Whiteknight's Parrot Blog | 2012-06-13 00:00:00

I was going to call this post “IO Cleanup Status”, but let’s face facts: This is a complete rewrite of the entire subsystem. I have hardly left a single line of code untouched. It is a full rewrite of the system hiding behind a mostly-similar (though not quite the same) API. I didn’t intend to completely rewrite the whole subsystem when I started the branch, hence the benign-sounding branch name. Following along with our cultural norms, I could have called it whiteknight/io_massacre or something similarly upbeat. Whatever. I’ve known people stuck with un-liked names for their entire lives, so this branch can be misnamed for a few weeks.

So what is the status of this branch, exactly?

At the time of this writing the branch is mostly complete. The major architectural work has all been done, with per-type logic separated out into new IO_VTABLE structures, and buffering logic divorced from FileHandle into a new IO_BUFFER structure. Now you can do things that have never been possible before, like buffering socket input and output, or doing readline with custom line-end characters on all handle types, and a whole bunch of other, increasingly-obscure operations. A lot of the new capabilities are things you didn’t even know we didn’t support before. Now, we do.

We aren’t quite there yet, but the stage is set for some other awesome changes in the future too, which I’ll talk about in more depth when we get there.

The current status of the branch is good. Parrot builds without any huge amount of new warnings and with no errors on my platform. Some platform-specific code needs to be updated for Windows, I’m sure. The one big thing standing in the way is keeping track of file positions through operations like seek and tell. These things are made a little bit more difficult when you have read buffers reading ahead, because the position of the next character to read according to the user may be far different than the position of the file descriptor according to the operating system. Then consider the case when you have a file opened for read and write, with buffers in both directions. The old system had a single buffer per FileHandle which needed to be flushed if you tried to read when the buffer was in write mode, or you tried to write when it was in read mode. If you’re switching back and forth between reading and writing often enough, buffering actually decreases performance when it’s supposed to be a performance enhancer.

The FileHandle has an attribute to keep a pointer to the current cursor location, but I’m not always updating it as often as I should and not always reading it when I should. If you have a file opened for read and write, when you write 5 characters at the current file position you need to increment the read buffer by 5 characters also. When you go to read in 5 characters from the current position, you either need to flush the write buffer first or you can try to read those characters right out of the write buffer. There’s nothing complicated about it, just a lot of bookkeeping to get right and lots of little interactions that need to be tested. It’s helpful that we don’t do seek or tell on some things like Sockets, and we don’t really buffer StringHandles.
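
The bookkeeping can be pictured with a small sketch in plain C. This is not the FileHandle code from the branch: the struct and field names are made up for illustration, and the sketch assumes that only one of the two buffers holds data at any given time.

```c
#include <stddef.h>

/* Illustrative state only; these names are assumptions, not Parrot's. */
typedef struct {
    long   os_pos;           /* offset of the file descriptor according to the OS */
    size_t read_unconsumed;  /* bytes read ahead but not yet handed to the user */
    size_t write_pending;    /* bytes the user has written but we have not flushed */
} handle_pos;

/* Position as the user sees it, i.e. what tell should report.
   Read-ahead puts the OS ahead of the user; unflushed writes put the
   user ahead of the OS. Under the one-buffer-at-a-time assumption,
   at most one of the two corrections is non-zero. */
long logical_tell(const handle_pos *h)
{
    return h->os_pos - (long)h->read_unconsumed + (long)h->write_pending;
}
```

For example, after the buffer reads 100 bytes ahead and the user consumes 70 of them, the OS position is 100 but the user-visible position is 70.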

The branch is moving along well and if I can find the time to actually sit down and work on it for a dedicated period of time I might be able to get it closer to being done. I’m shooting for being mergeable sometime after the coming release.

Andrew Whitworth | Whiteknight's Parrot Blog | 2012-06-08 00:00:00

Tonight I’ve hit something of a milestone with my branch to rewrite the IO subsystem. As of tonight the parrot binary, parrot-nqp and winxed all build in my branch and coretest runs (though fails some tests). The entire build does not complete because of some failures related to dynops, but it does get most of the way through. This means that most of the main-path IO APIs and FileHandle operations are working correctly, which is a relatively small portion of everything that has changed.

With Parrot building, I’m now able to more closely keep track of progress and regressions, and do more live testing as I make new changes. Until this point all my changes have just been mental exercises, so I’m happy to have a little bit more feedback and even some validation.

Of course, just saying that it builds doesn’t really mean anything. Several things are still not implemented or completely wired up. Some operations on files such as seek, peek and tell are not implemented yet. Several methods on the various PMCs (FileHandle, Socket and StringHandle) have not been updated to use the new system. There are a few regressions I need to address with regards to buffering. Specifically, “line buffering” has been removed from the system during the rewrite and hasn’t been added back. Line buffering in Parrot has never really done much, but it’s just hacky and obscure enough that I’m sure somebody is relying on it.

Some things, like files opened for dual read/write modes or append modes haven’t been completely dealt with in code either. I don’t think there’s a lot of work to do for this, but since the buffering architecture has changed so much from what it used to be and since these modes are relatively rare and not as thoroughly tested I want to spend a little bit of extra time making sure there are no regressions.

Also there are several coding standards tests (especially for function-level annotations and documentation) which fail spectacularly in the branch, and it’s going to take time to update all the old documentation and add docs for all the new functions. I also need to update PDD 22 to reflect the new architecture of the IO system.

I’ve been working on this branch pretty aggressively for the last two weeks and I think I’m about 50% of the way done. That’s not too bad considering the magnitude of the change and the amount of time I’ve had to hack. Within a week or two more, if all goes well, I think the branch might be ready for wider testing and eventually merging.

As usual when we’re talking about changes this big, merges are not something to be rushed. Assuming all goes well and other people like what I’ve been doing, expect to see a brand new IO system in Parrot sometime later this summer.

Andrew Whitworth | Whiteknight's Parrot Blog | 2012-06-01 00:00:00

The IO subsystem is a lot like the garbage collector: So long as it just works we can ignore its faults for quite a long time. The garbage collector had performance and other issues for years before everybody’s favorite bacek went through and finally rewrote it. His effort there saves the rest of us mere mortals from having to touch the GC again for another couple years.

The IO system works reasonably well. It’s got a decent set of features more or less, it implements most of the important operations that our users have needed in the past, it’s not spectacularly slow (and disk or network operation performance almost always outweighs any issues in the code that leads to those things), and we haven’t been getting a lot of error reports or feature requests for it. In short, if it ain’t broke, don’t fix it.

A few days ago I was working on a ticket for moritz to add better integration between our various IO vector PMCs (FileHandle, Socket, etc) and the ByteBuffer PMC. ByteBuffer is what its name implies: It’s an array-like type for working with individual bytes in a chunk of memory. It’s like a binary encoded STRING, but it’s not immutable and has a handful of additional features that a raw STRING (or the String PMC) doesn’t. ByteBuffer can be populated from and exported to a STRING, and it is useful for certain types of operations that need to operate on a sequence of bytes without having to worry about strings and encodings and all that other nonsense. Moritz’s request was a reasonable one so I sat down and made it happen. A few nights ago I merged that work into master with an “experimental” tag on it.

However while I was in the IO subsystem code making this happen something did break. Not in the code, instead something broke inside my poor little head. The snapping sound you hear is the poor camel’s back under the load of that last piece of straw. I’ve had enough of that system and its inside-out organization and collection of half-ideas and botched refactors. I’ve had my fill of the nonsense and finally decided it was time to make things right.

And before anybody says to me, “hey Mr Whiteknight, you shouldn’t be so mean, somebody probably worked really hard to make this code do what it does”, let me just say two things: First, “Mr Whiteknight” is my father’s handle and Second, I was one of the people who helped put IO where it is today. I don’t feel particularly bad insulting myself or my own work, and my contributions, though well-intentioned at the time, are a big part of why the system is in the condition it’s in now. First, a brief history lesson.

When I joined Parrot, it sported an IO system based on layers. Layers were arranged in a structure something like a vtable, and IO requests would be fed through the layers, each layer getting the output of the one before it until the bottom layer actually spat the data out (or read it in, depending on which way you were moving). This worked pretty well when you were trying to do File IO on a file with a particular encoding, with buffering, through an asynchrony mechanism, etc. Actually I say it worked well but it was sort of overkill: It was just too much infrastructure for the possible benefits and despite the theory of allowing better code reuse there really weren’t too many different layering combinations that could be set up. Plus, layers start to interdepend and violate encapsulation, then optimization starts prompting a few “short cuts” where layers were flattened together. One of the earlier things I did on Parrot, post-GSOC, was to remove some of the last vestiges of the then-unused layering system from Parrot’s IO.

The IO subsystem has something of a problem where it has a few masters and has to be performance conscious. Many of our programs are still the kind that shuffle data about (very much under the influence of Perl) and IO operation performance matters when your compiler is reading in HLL code and outputting PIR code, then you’re reading PIR code in and trying to compile it again. Too much nonsense and everybody feels it.

In Parrot at the user level you can do IO in two ways: Through the IO PMCs (FileHandle, mostly) and through opcodes (say, print, etc). The problem, put succinctly, is this: We want to encapsulate logic for writing to files inside the FileHandle PMC, but we don’t want to add new IO-specific VTABLES and we don’t want to incur the costs of method calls on every single IO request. In other words, we didn’t want the print opcode to just be a thin wrapper around the print method on FileHandle. Such a thing, especially if implemented naively, would have killed performance by creating nested runloops and a whole host of other problems.

The way the system is set up is that FileHandle.print() and the print opcode are both thin wrappers around the real routine Parrot_io_putps, which does all the hard work. And, more importantly, that routine is expected to act transparently (like the print opcode does) on any IO PMC type like Socket or StringHandle. The only real way to do this, if you can’t call a method on the FileHandle and Socket PMC, is to use a large switch-statement:

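switch (handle->vtable->base_type) {
    case enum_class_FileHandle:
        /* ... low-level write for this built-in type ... */
    case enum_class_Socket:
        /* ... */
    case enum_class_StringHandle:
        /* ... */
    default:
        /* anything else, e.g. a PIR-level Object: fall back to the method */
        Parrot_pcc_invoke_method_from_c_args(..., handle, "print", ...);
}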

I’ve obviously glossed over all the details, but this is the general form of that routine and several other similar routines in the IO API. You’ll notice several things from even a quick glance at this example:

  1. If we want to add a new IO type to Parrot core we need to add a new entry to the switch statement in every IO API routine that needs to care about PMC type (this is a major part of the reason we don’t yet have a sane, separate Pipe type).
  2. If the user passes in an Object, something defined at the PIR level, we do fall back to calling the method, because we can’t do anything else intelligently.
  3. We can’t really subclass FileHandle or Socket from the user level, because it would fail the base_type test, and wouldn’t be able to handle the low-level structure accesses from that point forward anyway.

Point number 2 is particularly interesting because the FileHandle.print() method calls Parrot_io_putps, which may turn around and call the .print() method. This is a big part of the reason why FileHandle cannot be subclassed in user code. It’s clearly an example of poorly separated concerns and poor encapsulation. Either the method should call the IO API or the IO API should call the method but we can’t be doing both. Actually, I’d far prefer the former, if we can do it in a good, general way.

There are a few other issues worth mentioning, which I’ll just dump rapid-fire without much explanation:

  • We don’t have a separate Pipe type. Instead, FileHandle can be opened in “pipe mode” to write to a separate process or read output from a separate process.
  • We have limited buffering, but only on FileHandle and we cannot configure buffers for input and output separately, or use separate buffers.
  • We don’t really have encodings set up in any consistent way, so it’s very possible, though I haven’t worked out all the details, to write strings with different encodings to a file. This is especially true if we’re using buffers and performing writes through different API routines.
  • FileHandle logic is considered to be the default and is given deference in the code. Pipe logic is unified with file logic at a very low level. Socket and StringHandle are treated as bolted-on spare parts and benefit from hardly any code sharing or unified architecture. They also don’t have all the same useful features as FileHandle has.
  • Several functions in the IO subsystem are poorly or inconsistently named and implemented, not to mention the often-times confusing documentation and absurd architectural arrangements.

So that’s the system we’ve got. What do I want to do to fix these issues?

The first thing I’ve suggested is to break up IO functionality into an IO_VTABLE of function pointers, similar to how the STR_VTABLE, the sprintf dispatch mechanism, the packfile segment dispatch table and other similar mechanisms in Parrot work. Each IO request would go through the API routines, which dispatch to a vtable routine (possibly with some intermediate buffering logic). Here’s what it looks like in the branch to do a basic write:

IO_VTABLE * const vtable = IO_GET_VTABLE(interp, handle);
vtable->write_s(interp, handle, str->strstart, str->bufused);

And here’s how to do it with write buffering:

IO_VTABLE * const vtable = IO_GET_VTABLE(interp, handle);
IO_BUFFER * const write_buffer = IO_GET_WRITE_BUFFER(interp, handle);
Parrot_io_buffer_write_s(interp, handle, vtable, write_buffer, str);

Internally, the buffer does its magic and flushes data out to the vtable if necessary.
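
To make the shape of that idea concrete, here is a minimal, self-contained sketch in plain C. It is not the code from the branch: io_vtable, io_buffer, mem_sink, buffer_write and buffer_flush are hypothetical stand-ins I made up for illustration, and the 8-byte buffer is deliberately tiny. The point is only that the buffer knows nothing about the handle type; it just hands bytes to whatever write routine the vtable supplies.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-type table of IO operations (a stand-in for IO_VTABLE). */
typedef struct io_vtable {
    size_t (*write_b)(void *handle, const char *data, size_t len);
} io_vtable;

/* Toy handle type: an in-memory sink that counts low-level writes. */
typedef struct mem_sink {
    char   data[64];
    size_t used;
    int    write_calls;
} mem_sink;

static size_t mem_sink_write(void *handle, const char *data, size_t len)
{
    mem_sink *sink = (mem_sink *)handle;
    assert(sink->used + len <= sizeof sink->data);
    memcpy(sink->data + sink->used, data, len);
    sink->used += len;
    sink->write_calls++;
    return len;
}

static const io_vtable mem_sink_vtable = { mem_sink_write };

/* A buffer that lives outside the handle and only talks to the vtable. */
typedef struct io_buffer {
    char   bytes[8];
    size_t used;
} io_buffer;

/* Push any pending bytes down to the handle's low-level write routine. */
static void buffer_flush(io_buffer *buf, const io_vtable *vt, void *handle)
{
    if (buf->used > 0) {
        vt->write_b(handle, buf->bytes, buf->used);
        buf->used = 0;
    }
}

/* Buffered write: flush first if the new data would not fit. */
static void buffer_write(io_buffer *buf, const io_vtable *vt, void *handle,
                         const char *data, size_t len)
{
    if (buf->used + len > sizeof buf->bytes)
        buffer_flush(buf, vt, handle);
    if (len > sizeof buf->bytes) {
        /* Too big to buffer at all: pass it straight through. */
        vt->write_b(handle, data, len);
        return;
    }
    memcpy(buf->bytes + buf->used, data, len);
    buf->used += len;
}
```

Notice that until buffer_flush runs, the last few bytes exist only in the buffer. Anything that tears down the handle without flushing loses them.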

The second thing I want to do is break out buffering so that instead of being a detail of the FileHandle PMC a buffer is a separate struct which can be attached to any IO type as desired. And, even better, we can attach multiple buffers to an IO stream, at least one each for input and output, configured separately. The buffering API, which will be cleaned up and properly encapsulated, will take a pointer to the IO_VTABLE for the handle and will pass data through transparently as required. A thin wrapper PMC type, IO_BUFFER, would allow references to buffers to be accessed and configured directly, which would be very useful in some cases.

Imagine, if I may go off on a short tangent, a threaded system where one worker task had a reference to a buffer and continuously made sure it was filled in the background while another worker task read bits and pieces from the buffer very quickly. It would be possible, through careful choice of algorithm, to do such a thing lock-free. Feel free to replace “file” with “socket” or “pipe” in the example above too. Imagine also a system where we can transparently use mmap (or its Windows equivalent) to map a file to memory as part of the buffer, and keep working with it that way.

The third thing I want to do is start teasing apart the logic for Pipes from the file logic. I’ll create a separate io_vtable for pipe operations, and use that inside FileHandle when we’re in pipe mode. Eventually we’ll be able to create a separate type, divide out all the logic completely, and get to work on really interesting stuff like feasible 2-way and 3-way pipes.

The fourth thing I want to do is start setting up interfaces so that IO operations including buffering, low-level IO, file descriptor manipulation and other things become more accessible at the PIR level so users can make better use of these tools, both in subclasses of the in-built handle PMCs and in custom types which neither derive from nor hold instances of those types.

I’ve started sketching out many of these ideas in the whiteknight/io_cleanup1 branch. cotto seems to agree with the general direction and I haven’t heard any complaints so far, so I’ve had my head down and been working hard on making these ideas reality. As of this writing, I’ve modified just about every single line of code in the subsystem, gotten most of the new architecture and logic into place and set up the vtables for the most important built-in types. I have a few details to finish up before I try to build (and inevitably debug) this new beast. Ultimately I would like this first round of cleanups to produce no user-visible changes, so the old PMC methods and exported API functions are going to continue doing what they’ve always done. Later rounds of cleanups will add new interfaces and eventually deprecate and remove some of the crufty older ones. I’ll post more updates as this work progresses.

Andrew Whitworth | Whiteknight's Parrot Blog | 2012-05-27 00:00:00

In my last post I mentioned some work involving the GC, finalization and destructors. Today I’m going to expand on some of those ideas, talk about what the current state of destruction and finalization is in Parrot, some of the problems we have with coming up with better solutions, and some of the things I and others are working on to get this all working as our users expect us to. I apologize in advance for such a long post, there’s a lot of information to share, and hopefully a much larger architectural discussion to be started.


Destructors are hard. The idea behind a destructor is a simple one: We want to have a piece of code that is guaranteed to execute when the associated object is freed. Memory allocated on the heap is going to get reclaimed en masse by the operating system when the process exits. However, things such as handles, connections, tokens, mutexes, and other remote resources might not necessarily get freed or handled correctly if the process just exits, or if the object is destroyed without some sort of finalization logic performed on it. Here’s a sort of example that’s been bandied about a lot recently:

function main () {
    var f = new 'FileHandle';
    f.open("foo.txt", "w");
    f.print("hello world");
}

In this example we would expect that the text "hello world" would be written to the foo.txt file. However, because the text to be written may be buffered (both in Parrot and by the OS), there’s a very real chance that the data won’t get written if we do not call the finalizer for the FileHandle PMC.

Obviously, the brain-dead solution to this particular problem is to manually close or flush the file handle:

function main () {
    var f = new 'FileHandle';
    f.open("foo.txt", "w");
    f.print("hello world");
    f.close();
}

However, the whole point of having things like finalizers (“destructors”) and GC is to make it so that the programmer does not need to worry about little details like these. The program should be smart enough to find dead objects in a timely manner and free their resources. Beyond that, many programming languages (with special emphasis on Perl6) require the availability of reliable and sane destructors.

In the remainder of this post I would like to talk about why destructors are hard to implement correctly, why Parrot does not currently (really) have them, and some of the ideas we’ve been kicking around about how to add them.

First, let’s cover where we currently stand. Parrot does have destructors, of a sort, in the form of the destroy vtable. That routine is called by the GC when the object is being reclaimed, during the sweep pass. A side-effect of this implementation is that if PMC A refers to PMC B and both are being collected, it’s very possible that A’s destructor tries to access some information in B after B has already been reclaimed. Think about a database connection object that maintains a socket on one side, and a hash of connection information on the other. The socket probably cannot perform a simple disconnect, but instead should send some sort of sign-off message first to alert the server that it can proceed with its own cleanup. The socket PMC would need information from the connection information hash to send this final message, but if the hash had already been reclaimed the access would fail with undefined results.

This situation has led to more than a few calls for ordered destruction. In one of the most common and severe cases, Parrot’s Scheduler PMC was being relied upon by various managed PMCs. When a Task PMC was destroyed, at least in earlier iterations of the system, it would attempt to send a message to the Scheduler that it was no longer available to be scheduled. Ignore for a moment the fact that the Task could not possibly have been reclaimed in the first place if the Scheduler had a live reference to it, and if the Scheduler was still alive itself.

Because of some of these order-of-destruction bugs, GC finalization (a final, all-encompassing GC sweep pass guaranteed to execute all remaining destructors prior to program exit) had been turned off. That and performance reasons. Turning off GC finalization leads to the problem above where data written to the FileHandle is not flushed before program exit. You are probably now starting to understand the bigger picture here.

Having ordered destruction means essentially that we should be able to have an acyclic dependency graph of all objects in the system with destructors. However, maintaining this in the general case is impossible and attempting to approximate it would be very expensive in terms of performance. In any case, this is just a way to work around the problem of our naive sweep algorithm, which destroys and frees dead objects in a single pass, and not a real solution to the larger problems. A far better idea, recently suggested by hacker Moritz, is a 2-pass GC sweep.

In the 2-pass case the GC sweep phase would have two loops: the first to identify all PMCs to be swept (from a linear scan of the entire memory pool), execute destructors on them and add them all to a list, and the second to iterate over that list (after all destructors had been called) and reclaim the memory. Because of the linked-list setup of the GC, this second pass could, conceivably, be almost free because we could simply append this list of swept items to the end of the free list for an O(1) operation, and the first pass would be no less friendly on the processor data cache than our current sweep would be. This, in theory, solves our problem with ordered destruction, and should allow us to re-enable GC finalization globally without having to worry about these kinds of bugs causing segfaults in the final milliseconds of a program.
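
Here is a rough sketch of the difference, in plain C. Everything in it is a toy I invented for illustration (the pmc struct, the reclaimed flag standing in for "returned to the free list", the array standing in for the memory pool); none of it is Parrot's actual GC code. The mark phase is omitted and simply assumed to have set marked on live objects.

```c
#include <assert.h>
#include <stddef.h>

typedef struct pmc pmc;
struct pmc {
    int    marked;               /* set by the (omitted) mark phase        */
    int    reclaimed;            /* stand-in for "back on the free list"   */
    pmc   *ref;                  /* another object this one depends on     */
    void (*destroy)(pmc *self);  /* optional destructor                    */
    int    saw_live_ref;         /* what the destructor observed about ref */
    pmc   *next_dead;            /* link for the list of swept objects     */
};

/* Toy destructor: records whether the object it needs is still intact. */
static void record_ref_state(pmc *self)
{
    self->saw_live_ref = (self->ref != NULL && !self->ref->reclaimed);
}

/* Naive sweep: destroy and reclaim each dead object as it is found. */
static size_t sweep_one_pass(pmc **pool, size_t n)
{
    size_t i, freed = 0;
    for (i = 0; i < n; i++) {
        pmc *o = pool[i];
        if (o->marked || o->reclaimed)
            continue;
        if (o->destroy)
            o->destroy(o);
        o->reclaimed = 1;
        freed++;
    }
    return freed;
}

/* Two-pass sweep: run every destructor first, reclaim afterwards. */
static size_t sweep_two_pass(pmc **pool, size_t n)
{
    pmc   *dead = NULL;
    size_t i, freed = 0;

    /* Pass 1: linear scan; destructors run while nothing is reclaimed. */
    for (i = 0; i < n; i++) {
        pmc *o = pool[i];
        if (o->marked || o->reclaimed)
            continue;
        if (o->destroy)
            o->destroy(o);
        o->next_dead = dead;
        dead = o;
    }

    /* Pass 2: reclaim the collected list. A real collector could splice
     * the whole list onto its free list instead of walking it. */
    while (dead != NULL) {
        pmc *o = dead;
        dead = o->next_dead;
        o->reclaimed = 1;
        freed++;
    }
    return freed;
}
```

If the pool happens to hold the connection-info hash before the socket that refers to it, the one-pass sweep runs the socket's destructor after the hash is already gone, while the two-pass sweep lets the destructor see an intact hash regardless of order.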

So that’s the basics of our current system and our problem with GC finalization, and shows us how we would proceed to make sure destructors were always called as a guarantee of the VM. However, this doesn’t begin to address any of the problems with destructors that will plague their implementation and improvement. I’ll talk about that second subject now.

Destructors, as I said earlier, are hard. In the case of GC finalization, after the user program has executed and exited, it’s relatively easy to loop over all objects and call destructors. It is those destructors which happen during normal program execution that cause problems.

In the C++ language, destructors have certain caveats and limitations. For instance, we can’t really throw exceptions from destructors, because that may crash the program. Not just an “oops, here’s an exception for you to handle”, but instead a full-on crash. In Parrot we can probably be smarter about avoiding a crash but not by much. It’s a limitation of the entire destructors paradigm. Let me demonstrate what I’m talking about.

Let’s say I have this program, which opens up a filehandle to write a message and then starts doing something unrelated to the filehandle but expensive with GC:

function foo() {
    var f = new 'FileHandle';
    f.open("foo.txt", "w");
    f.write("hello world!");
    f = null;       // No more references to f!

    for (int j = 0; j < 1000000; j++) {
        var x = new MyObject(j);
    }
}

Somewhere along the line, when the GC runs out of space, it’s going to attempt a sweep and that means that f is going to be identified as unreferenced, finalized and reclaimed. The question is, where? The thing is that we don’t know where GC is going to run for a few reasons:

  1. We don’t know how many free headers GC has left in the free list before it has to sweep to find more.
  2. We don’t know how many PMCs are being allocated per loop iteration, because the various methods on x could be allocating them internally, and all PCC calls currently generate at least one PMC, and this is a lot of pressure.

So at any point in that loop, any point at all, GC could execute and reclaim the FileHandle f. That calls the destructor, flushes the output, and frees the handle back to the operating system. Good, right? What if there is a problem closing that handle, and the destructor for FileHandle tries to throw an exception? (Or, if this example isn’t stoking your imagination, imagine that f is an object of type MyFileHandleSubclass with an error-prone finalizer.)

There are a few options for what to do here. The first option is that we throw the exception like normal. This means that the loop code with the MyObject variables, which is running perfectly fine and has no reason to throw an exception by itself, is interrupted in mid loop. The backtrace, if we provide one at all, probably points to MyObject but with an exception type and exception message indicative of a failed FileHandle closing. Initial review by the poor developer doing the debugging will show that there are no filehandles trying to close inside this loop and then we get a bug report because a snippet of code which is running just fine exits abruptly with an error condition which it did not cause. The solution for this, wrapping every single line of code you ever write in exception handlers to catch the various possible exceptions thrown from GC finalizers, is untenable from a developer perspective.

A second option is that we somehow disallow things like exceptions from being thrown from destructors, because there’s no real way to catch them rationally. This seems reasonable, until we start digging into details. How do we disallow these, by technical or cultural means? And if we’re relying on cultural means (a line in a .html document somewhere that says “don’t do that, and we won’t be responsible if you do!”), what happens if a hapless young programmer does it anyway without having first read all million pages of our hypothetical documentation? Does Parrot just crash? Does it enter into some kind of crazy undefined state? Obviously we would need some kind of technical mechanism to prevent bad things from happening in a destructor, though the list of potentially bad things is quite large indeed (throwing exceptions, allocating new PMCs, installing references to dead about-to-be-swept objects into living global PMCs, etc) and filtering these out by technical means would be both difficult and taxing on performance. When you consider that even basic error reporting operations at an HLL level, depending on syntax and object model used, may cause a string to be boxed into a PMC, or a method to be called requiring allocation of a PMC for the PCC logic, or whatever, we end up with finalizers which are effectively useless.

A third option is that we could just ignore certain requests in finalizers, such as throwing exceptions. If an exception is thrown at any point we just pack up shop, exit the finalizer and pretend it never happened. This works fine for exceptions, but does nothing for the problem of a finalizer attempting to store a reference to the dying object into a living object. I don’t know why a programmer would ever want to do that, but if it’s possible you can be damned sure it will happen eventually. Also, when I say “pack up shop”, we’re probably talking about a setjmp/longjmp call sequence, which isn’t free to do.
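
For the curious, the “pack up shop” mechanism might look roughly like the following in plain C. This is only a sketch under my own assumptions: throw_error, run_finalizer and the in_finalizer flag are hypothetical names, not Parrot functions, and a real VM would have to deal with nested finalizers and per-interpreter state rather than a single global jump buffer.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

static jmp_buf finalizer_escape;  /* where a throw inside a finalizer lands */
static int     in_finalizer = 0;

/* Hypothetical stand-in for "throw an exception". */
static void throw_error(void)
{
    if (in_finalizer)
        longjmp(finalizer_escape, 1);  /* swallow it: abandon the finalizer */
    /* Outside a finalizer a real VM would dispatch to handlers here. */
}

typedef void (*finalizer_fn)(int *progress);

/* Runs one finalizer. Returns 1 if it completed, 0 if it was abandoned
 * because it tried to throw. */
static int run_finalizer(finalizer_fn fn, int *progress)
{
    in_finalizer = 1;
    if (setjmp(finalizer_escape) == 0) {
        fn(progress);
        in_finalizer = 0;
        return 1;
    }
    /* We arrive here via longjmp: pretend the throw never happened. */
    in_finalizer = 0;
    return 0;
}

/* A well-behaved finalizer. */
static void good_finalizer(int *progress)
{
    *progress = 2;
}

/* A finalizer that fails halfway through. */
static void bad_finalizer(int *progress)
{
    *progress = 1;
    throw_error();
    *progress = 2;  /* never reached */
}
```

The surrounding program never sees the error, which is exactly the attraction and exactly the problem: the half-finished cleanup in bad_finalizer is silently left as it was.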

The general consensus among developers is that errors caused by programs running on top of Parrot should never segfault. If you’re running bytecode in a managed environment, the worst that you should ever be able to get is an exception. Segmentation faults should be impossible to get from a pure-pbc example program.

However, as soon as you introduce destructors, suddenly these things become possible. And not just from specifically malicious code, even moderately naive code will be able to segfault by storing a reference to a dying PMC in a place accessible from live PMCs. Unless, that is, we try to do something expensive like filtering or sandboxing, which would absolutely kill performance.

And this point I keep bringing up about dead objects installing references to themselves in living objects is not trivial. Our whole system is built around the premise that objects which are referenced are alive and objects which are no longer referenced can be reclaimed by GC. Throughout most of the system we dereference pointers as if they point to valid memory locations or to live PMCs. If we turn that assumption around and say that dead objects may still be referenced by the system, then we lose almost all of the benefits that our mark and sweep GC has to offer. Specifically, we would either have to install tests for “liveness” around every single PMC pointer access, which would bring performance to a standstill, or we would need a policy that says the user at the PIR level is able to create segfaults without restriction, though officially we declare it to be a bad idea. It’s not just a matter of having to test PMCs to make sure they are alive, the memory could be reclaimed and used for some other purpose entirely! Merely accessing a reclaimed PMC could cause problems (segfaults, etc) or, if the PMC has already been recycled into something like a transparent proxy for a network resource, send network requests to do things that you don’t want to have happen! The security implications are troubling at best.

The only real solution I can come up with to this problem, and it’s not a very good one, is to add a “purgatory” section to the GC, where we put PMCs during GC sweep but we do not actually free them. The next time GC runs, anything which is still in purgatory is clearly not referenced and can be freed immediately. Anything that is no longer in purgatory has been “resurrected” by some shenanigans and has to be treated as still being alive even though its destructor has already been called. In other words, we take a performance hit and enable zombification in order to prevent segfaults. I don’t know what we want to do here, this is probably the kind of decision best left to the architect (or tech-savvy clergy) but I just want to point out that none of our options are great.
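
As a sketch of how purgatory could behave, here is a toy state machine in plain C. The names (purg_obj, purgatory_sweep, the three states) are mine, the mark phase is again assumed to have happened elsewhere, and the rule that a resurrected object never has its destructor run a second time is my own assumption, not something we have decided.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { OBJ_LIVE, OBJ_PURGATORY, OBJ_FREED } obj_state;

typedef struct purg_obj {
    obj_state state;
    int       marked;           /* reachable according to the last mark phase */
    int       destructor_runs;  /* how many times the destructor has run      */
} purg_obj;

/* One sweep. Dead objects are finalized and parked in purgatory; objects
 * already in purgatory are freed if still unreachable, or resurrected if
 * something picked up a reference to them in the meantime. */
static void purgatory_sweep(purg_obj *objs, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        purg_obj *o = &objs[i];
        switch (o->state) {
        case OBJ_LIVE:
            if (!o->marked) {
                /* Assumption: a destructor runs at most once per object. */
                if (o->destructor_runs == 0)
                    o->destructor_runs++;
                o->state = OBJ_PURGATORY;  /* finalized, but not freed yet */
            }
            break;
        case OBJ_PURGATORY:
            o->state = o->marked ? OBJ_LIVE    /* zombie: keep it around  */
                                 : OBJ_FREED;  /* really dead: reclaim it */
            break;
        case OBJ_FREED:
            break;
        }
        o->marked = 0;  /* reset for the next mark phase */
    }
}
```

The cost is visible even in the toy: every dead object survives one extra GC cycle, and a zombie carries on living in whatever state its destructor left it.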

I’ve also brought up the problem with allocating new objects during a finalizer. Why is this a problem? Keep in mind that GC tends to execute when we’ve attempted to allocate an object and have none in the free list. If we have no available headers on the free list, are already in the middle of a GC sweep and ask to allocate a new header, what do we do? Maybe we say that we invoke GC when we have only 10 items left (instead of 0) on the free list, guaranteeing that we always have a small number of headers available for finalization, though no matter what we set this limit at it’s possible we could exhaust the supply if we have many objects to finalize with complex finalizers. Every time a finalizer calls a method or boxes a string, or does any of a million other benign-sounding things PMCs get allocated. If we try to allocate a PMC when there are no PMCs on the free list and we’re already in the middle of GC sweep, the system may trigger another recursive GC run.
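
The “keep 10 headers in reserve” idea can be sketched in a few lines of C. This is a toy counter model, not an allocator: header_pool, pool_alloc and pool_gc are hypothetical names, and the number of headers that finalizers request and that a sweep gives back are just fields I set by hand to play out different scenarios.

```c
#include <assert.h>
#include <stddef.h>

#define RESERVE_HEADERS 10  /* trigger GC at this many free headers, not 0 */

typedef struct header_pool {
    int free_headers;      /* headers currently on the free list          */
    int in_sweep;          /* nonzero while a GC sweep is running         */
    int gc_runs;           /* how many sweeps have happened               */
    int finalizer_allocs;  /* headers the finalizers request per sweep    */
    int reclaimed_per_gc;  /* headers a sweep returns to the free list    */
} header_pool;

static int pool_alloc(header_pool *p);

/* Toy sweep: finalizers allocate from the reserve, then memory comes back. */
static void pool_gc(header_pool *p)
{
    int i;
    p->in_sweep = 1;
    p->gc_runs++;
    for (i = 0; i < p->finalizer_allocs; i++)
        (void)pool_alloc(p);  /* may fail if the reserve runs dry */
    p->free_headers += p->reclaimed_per_gc;
    p->in_sweep = 0;
}

/* Returns 1 on success, 0 if no header is available. Never starts a
 * second GC while one is already in progress. */
static int pool_alloc(header_pool *p)
{
    if (!p->in_sweep && p->free_headers <= RESERVE_HEADERS)
        pool_gc(p);
    if (p->free_headers == 0)
        return 0;
    p->free_headers--;
    return 1;
}
```

With modest finalizers the reserve absorbs their allocations. Set finalizer_allocs above RESERVE_HEADERS and the reserve is exhausted mid-sweep, which is the failure case described above: the threshold only moves the cliff, it doesn't remove it.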

Another option is that we could maintain multiple pools and only sweep one at a time. If one pool is being swept we could allocate PMCs from the next pool (possibly triggering a GC sweep in that second pool and needing to recurse into a second pool, etc). Maybe we allocate headers directly from malloc while we’re in a finalizer, keep them in a list and free them immediately after the finalizer exits. We have some options here, but this is still a very real problem that requires very careful consideration. Something like a semi-space GC algorithm might help here, because we could allocate from the “dead space” before that space was freed.

Or we could try to immediately free some PMCs during the first sweep pass, and use those headers as the free list from which to allocate during destructors. This raises some problems because it would be very difficult to identify PMCs which could be freed during the first pass without negating any references which are going to be accessed during the destructors. Also, we run into the (rare) occurrence where all the PMCs swept during a particular GC run have destructors, and there are no “unused” headers to immediately free and recycle for destructors.

Again, I don’t have an answer here, just a long list of terrible options that need to be sorted according to the “lesser of all evils” algorithm.


Let’s look at destructors from another angle. Obviously a garbage-collected system is supposed to free the programmer up from having to manually manage memory (at least) and possibly other resources as well. You make a mess and don’t want to clean it yourself, the GC comes along after you and takes care of the things you don’t want to do yourself. On one hand the argument can be made that if you really care about a resource being cleaned in a responsible, timely manner, you should call an explicit finalizer yourself, and that leaving those kinds of tasks to an automatic finalizer is akin to saying “I don’t care about that object and whatever happens, happens.” After all, if you can’t throw an exception from a destructor and if the destructor is called outside normal program flow with no opportunity to report back even the simplest of success/failure conditions, it really doesn’t matter from the standpoint of the programmer whether it succeeded or silently failed. Further, if the resource is sensitive, you don’t clean it explicitly and Parrot later crashes and segfaults because some uninformed user created a zombie PMC reference, your destructor cannot and will not get called no matter what. If all sorts of things at multiple levels can go wrong and prevent your destructor from running, does it really matter if the destructor gets called at all?

Another viewpoint is that destructors don’t need to be black-boxes, and we don’t care if they have problems so long as they’ve given a best effort to free the resources, those efforts have a decent expected chance of success, and they have an opportunity to log problems in case somebody has a few moments to spare reading through log files. After all, if a FileHandle fails to close in an automatically-invoked destructor, it also would have failed to close in a manually-invoked one and what are you going to do about it? If the thing won’t close, it won’t close. You can either log the failure and keep going with your program (like our destructor would have done automatically) or you can raise hell and possibly terminate the program (like what could happen if an exception is thrown from a destructor). In other words, when you’re talking about failures related to basic resources at the OS level, there aren’t many good options when you’re writing programs at the Parrot level. If you’re not so hot at OS administration, there might not be anything you can do no matter what.

I suspect that what we are going to end up with is a system where we allocate a temporary managed pool of PMCs to be available, and allocate all PMCs during a destructor from that pool. After GC, we clear the emergency pool at once. This solution adds a certain amount of complexity to the GC and also does nothing to deal with the zombie references problem I’ve mentioned several times. We’d have to make a stipulation that PMCs allocated during a destructor may not themselves have automatic destructors.
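
A bare-bones version of such a temporary pool, sketched in plain C, could be a fixed array that is handed out slot by slot during destructors and reset in one step once GC is done. The names (emergency_pool, emergency_alloc, emergency_clear, pmc_header) and the size of four slots are invented for the example, and the has_destructor flag is only there to show the stipulation that these PMCs get no automatic destructor.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define EMERGENCY_SLOTS 4  /* arbitrary size for the sketch */

typedef struct pmc_header {
    int in_use;
    int has_destructor;
} pmc_header;

typedef struct emergency_pool {
    pmc_header slots[EMERGENCY_SLOTS];
    size_t     next;  /* index of the next unused slot */
} emergency_pool;

/* Allocate a header for use inside a destructor. Returns NULL when the
 * emergency pool is exhausted; it never triggers a GC run. */
static pmc_header *emergency_alloc(emergency_pool *ep)
{
    pmc_header *h;
    if (ep->next == EMERGENCY_SLOTS)
        return NULL;
    h = &ep->slots[ep->next++];
    h->in_use = 1;
    h->has_destructor = 0;  /* stipulation: no automatic destructor here */
    return h;
}

/* After GC finishes, throw the whole pool away at once. */
static void emergency_clear(emergency_pool *ep)
{
    memset(ep->slots, 0, sizeof ep->slots);
    ep->next = 0;
}
```

Clearing the pool wholesale is what keeps it cheap, and it is also why the zombie problem remains: any header from this pool that a destructor stashes somewhere long-lived is invalid the moment emergency_clear runs.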

Things start to get a little bit complicated no matter what path we choose. This is the kind of issue where we’re going to need lots more input, especially from our users.


In Parrot we really want to enable PMC destruction and GC finalization. As things stand now you can run destroy vtables written in C, usually without issue. However when we expose this functionality to the user we are talking about executing PBC, in a nested runloop (at least one!), with fresh allocations and all the capabilities of PBC at your disposal. As soon as you open that can of worms, the many problems and problematic possibilities become manifest. The security concerns become real. The performance implications become real. I’m not saying that these are problems we can’t solve, I’m only pointing out that they haven’t been solved already because they are hard problems with real trade-offs and some tough (and unpopular) decisions to be made.


Andrew Whitworth | Whiteknight's Parrot Blog | 2012-05-23 00:00:00

As I promised in my last post, I have several branches up in the air that need to be worked on. Some branches merged last week after the release. Others are pending to merge soon and some are still in development. In this post I’m going to give a short summary of these things, since I haven’t been posting regular updates like normal.

Already Merged

After the release last week I merged three small branches that brought small changes and appeared to test cleanly with NQP and Rakudo. In short, these were uncontroversial.

  • whiteknight/gh_675 Named after the GitHub issue of the same number, this branch removed the can vtable. In all cases in core and in external projects where I looked, the can vtable was simply a redirect to the find_method vtable and a check for null. There’s no need for this added indirection; we can call the find_method VTABLE directly from the can opcode.
  • whiteknight/imcc_file_line This branch removed some very old, long-deprecated IMCC directives. The .line and .file directives were not poorly implemented (as far as IMCC goes) but they weren’t used and weren’t introspectable. The setline and setfile directives (yes, they are directives even though they looked like opcodes!) weren’t used anywhere and weren’t implemented well. I’ve removed all four. Now, we can use the .annotate directive to replace all of these and add other metadata besides in a way that is easy to introspect from within running bytecode.
  • whiteknight/remove_cmd_ops removed a few command-line arguments from the parrot executable which were non-functional. These arguments have been disconnected since the time of the IMCC API cleanups months ago, and nobody had even noticed. Now they’re gone.

Those things out of the way, here’s a list of some of the branches that are currently unmerged but may be merging soon.


This is one of the most disruptive branches I’ve got going, which is why I’m in no hurry to merge it. Before I can merge it I need to patch both NQP and Rakudo. I submitted patches for these but they weren’t ready to apply and I have to go back and re-do them.

This branch removes the deprecated Eval PMC. The IMCCompiler PMC has already been updated to use a PDD31-compliant interface, which returns a PackfileView PMC instead of an Eval. NQP and Rakudo need to be updated to use this new interface instead of the older VTABLE_invoke one. This update will work in the Parrot master branch just fine, so we can make those updates to NQP and Rakudo and test them thoroughly before we merge the eval_pmc branch in.


This is a much bigger and much more disruptive branch. However, because of the fact that NQP and Rakudo don’t really use subroutine flags for their control flow, those two projects won’t really be affected as much as everybody else will be.

The remove_sub_flags branch removes the :load and :init flags from the PIR syntax and replaces them with :tag. The only real way to work with :tag is through the PackfileView PMC, so we need to merge the eval_pmc branch into Parrot first before we can make any further progress on this one. This is a back-burner task and will probably not be touched before the end of the summer.


We’ve received some requests from Rakudo folks that we need to start getting serious about GC finalization. This involves two changes: First is setting the GC to perform a finalization sweep at interp exit, which it currently is not doing. The second is to fix some sweep-related behaviors so the destroy VTABLE can be much more sane and useful.

The whiteknight/gc_finalize branch does both of these things. First, it re-enables GC finalization which had been turned off for so long that the code for it no longer works in master. Second, it moves to a two-stage sweep algorithm, so that we execute all destroy vtables first before we start freeing any resources.
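Here is a small sketch of why the two-stage ordering matters. It is a Python toy model, not the code in the branch: `Obj`, `two_stage_sweep` and `one_stage_sweep` are names invented for this example, and "freeing" is just a flag.

```python
class Obj:
    def __init__(self, name, peer=None):
        self.name = name
        self.peer = peer      # another object the destructor wants to use
        self.freed = False

    def destroy(self, log):
        # A destructor that touches a peer; this is only safe if the
        # peer's storage has not been freed yet.
        peer_ok = self.peer is None or not self.peer.freed
        log.append((self.name, peer_ok))


def two_stage_sweep(dead):
    log = []
    # Stage 1: run every destructor while all dead objects are intact.
    for obj in dead:
        obj.destroy(log)
    # Stage 2: only now release the storage.
    for obj in dead:
        obj.freed = True
    return log


def one_stage_sweep(dead):
    # Destroy and free each object in turn, for comparison.
    log = []
    for obj in dead:
        obj.destroy(log)
        obj.freed = True
    return log
```

With a dead socket swept before the dead DB connection that uses it, the one-stage version hands the DB destructor an already-freed peer, while the two-stage version does not.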

There are still going to be problems with destroy vtables however, and I’m searching for solutions to these. Let me illustrate with a short example. We call GC to sweep typically in response to a request for a new PMC when we have none on the free list. If we have an item on the freelist, we return that immediately and very quickly. If not, we invoke GC to try and free up some headers (or allocate new ones from the OS).

Let’s say we’re programming in Rakudo Perl6 and we have an object with a destructor. For the purposes of our example, it’s a DB connection object. That destructor needs to call a method on a Socket object connecting the client program to the server. As everybody should be aware by now, calling a method in Parrot itself is going to allocate a CallContext PMC.

However, we run into a small problem: we’re in GC precisely because we’re out of PMCs to allocate. If we try to allocate a new PMC at this point, I don’t know exactly what will happen, but I can only imagine that the results would not be good. In the worst case, we recursively call into GC which goes back to sweeping, which re-executes finalizers, and we get into an infinite loop.

I won’t go into all the details here, I’ve got another (long) post drafted that discusses these and some other issues related to finalization. This whiteknight/gc_finalize branch solves some of the first few problems but there will be more to come after that.


The singleton designator for C-level PMCs has been deprecated for some time now, and the whiteknight/gh_663 branch intends to remove them.

Here’s how singletons work in Parrot: The get_pointer and set_pointer vtables are used to manage a single reference to an existing singleton PMC if any. To get the PMC, we invoke the get_pointer vtable without an invocant PMC (the only such occurrence of a vtable invoked without an existing PMC reference in the whole codebase that I am aware of). If it returns NULL, a new header is created. If the new header is created, the set_pointer vtable is called on the new object with itself as an argument.

This all happens inside Parrot_pmc_new and is mostly transparent, except for the few bits of code throughout the system which violate this (rather flimsy) encapsulation boundary.

The get_pointer and set_pointer vtables operate on void* pointers, so we even lose type safety. Plus, we don’t expose get_pointer or set_pointer vtables to PIR code, so there’s absolutely no way to create a singleton class at the user-level using this mechanism. You can do what users of all other languages do and create an accessor and restricted constructor and implement singletons that way. In fact, I think that’s better.
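For anybody who hasn't seen the accessor-plus-restricted-constructor approach, a minimal sketch looks something like this. It's in Python for illustration only; `Registry` and its members are hypothetical names, not anything from Parrot.

```python
class Registry:
    """Singleton built from an accessor plus a guarded constructor."""
    _instance = None
    _allow_construct = False

    def __init__(self):
        # The constructor refuses to run unless the accessor opened the gate.
        if not Registry._allow_construct:
            raise RuntimeError("use Registry.instance() instead")
        self.entries = {}

    @classmethod
    def instance(cls):
        # Create the single instance lazily, then always return the same one.
        if cls._instance is None:
            cls._allow_construct = True
            try:
                cls._instance = cls()
            finally:
                cls._allow_construct = False
        return cls._instance
```

Everything here is ordinary user-level code with ordinary types, which is the point: no special vtable or untyped pointer is needed.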

The majority of offending code has been ripped out of this branch, though I’m still seeing some segfaults during the build as a result of bad, unchecked pointer accesses in places where encapsulation has been violated. I’ve got to spend a little bit more time tracking down some of these failures. Then, assuming NQP and Rakudo aren’t relying on this mechanism, the merge should be relatively painless.


A while ago, moritz suggested that we improve integration of our ByteBuffer PMC type, especially with our FileHandle and Socket types. We should be able to read a sequence of raw bytes from either of those PMCs into a ByteBuffer and we should be able to write raw bytes from a ByteBuffer into either of those destinations too.

The whiteknight/gh_610 branch aims to make this a reality. Already I’ve done most of the code work to get this in place, though I haven’t added all the necessary tests and documentation. Plus, a few coding standards tests are failing too.

While looking at this code, I am reminded that the IO subsystem is kind of messy. I’ve tried to clean it up in the past, and made a few small improvements over time. However, without a larger guiding vision to follow, I never really had a great idea of what kind of larger architectural changes to make to really bring this subsystem up out of the mud. After working on this branch, I finally had something like a flash of insight, and think I have a good idea about how to clean things up. This leads me to…


My idea is a relatively simple one: All our IO operations are controlled by the various PMC types (FileHandle, Socket, StringHandle, etc), but all our IO API functions are currently implemented as ugly (and brittle) switch statements to pick between execution pathways for these different types. A far better idea would be to separate out the different logic behind a virtual function dispatch table (vtable).

I’ve written up some proposed changes in the whiteknight/io_cleanup1 branch, and will start work if other people think it’s a decent idea.

The key points are as follows:

  1. Move all FileHandle-specific logic into src/io/filehandle.c. Do the same for Pipe, Socket and StringHandle types.
  2. Implement a new io_vtable type, which will contain a dispatch table for common operations. Each one of the files created in #1 above will implement the routines for one io_vtable and supporting logic.
  3. Buffering will be refactored. Instead of the FileHandle PMC containing several attributes for buffering, we’ll instead use an io_buffer object to hold buffering details. An encapsulated buffering API will take this buffer structure and the relevant vtable and automatically perform buffering if necessary.
  4. I am going to start separating out Pipe logic from FileHandle, though I’m not planning to create a separate type for it quite yet.
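The dispatch-table idea from the list above can be sketched roughly as follows. This is a Python model, not the proposed C code: `IOVTable`, its fields, and the dictionary-based handle are all assumptions made for the example, and only a StringHandle-like type is filled in.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class IOVTable:
    """Hypothetical per-type dispatch table; field names are illustrative."""
    name: str
    read: Callable[[dict, int], bytes]
    write: Callable[[dict, bytes], int]


def _string_read(handle, n):
    # Read up to n bytes from the in-memory buffer and advance the cursor.
    data = handle["buf"][handle["pos"]:handle["pos"] + n]
    handle["pos"] += len(data)
    return data


def _string_write(handle, data):
    handle["buf"] += data
    return len(data)


STRINGHANDLE_VTABLE = IOVTable("StringHandle", _string_read, _string_write)


def io_read(handle, n):
    # One generic entry point; no switch on the handle type.
    return handle["vtable"].read(handle, n)


def io_write(handle, data):
    return handle["vtable"].write(handle, data)
```

Adding a new handle type then means writing one more table of functions, instead of touching every switch statement in the API.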

Once these things are done, I think the IO system will be much cleaner and much more hackable. This is lower priority right now until some of my ideas are vetted, but I’m glad I finally have a plan in mind after so many years of staring helplessly at this code.


The engine for our sprintf implementation is sort of old and messy. It’s some very functional and very stable code, but it needs to be brought up to date with our modern coding and organizational standards.

In the whiteknight/sprintf_cleanup branch I make several changes, most of which are entirely internal and should not affect users at all:

  1. I move the files from src/misc.c and src/spf_.c to src/string/sprintf.c and src/string/spf_.c respectively.
  2. I’ve cleaned up some header-file nonsense and created a new src/string/spf_private.h header file to hold private data.
  3. I’ve changed the code to use a StringBuilder instead of the older (and now-incorrect) repeated string concatenations. With immutable strings, each concat operation creates a new STRING instead of appending to the pre-allocated buffer, which is extremely wasteful. I haven’t benchmarked this change, but I suspect it has higher performance on longer, more complicated formats.
  4. I’ve fixed a sub-optimal error message at request of benabik in ticket #759.
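To show why repeated concatenation of immutable strings is wasteful, here is a simple cost model in Python (whose strings are also immutable). The function names and the `copied` counter are invented for this sketch; the counter models how many characters get written into new strings and is not a measurement of Parrot or of any real runtime.

```python
def format_by_concat(pieces):
    """Repeated concatenation: each step builds a brand-new string."""
    out = ""
    copied = 0
    for piece in pieces:
        out = out + piece
        copied += len(out)     # characters written into the new string
    return out, copied


def format_by_builder(pieces):
    """Builder style: collect the pieces, then join once at the end."""
    parts = []
    for piece in pieces:
        parts.append(piece)
    out = "".join(parts)
    return out, len(out)       # in this model every character is copied once
```

In this model the concatenation cost grows quadratically with the number of pieces while the builder cost stays linear, which is why the difference should show up mostly on longer, more complicated formats.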

This branch is almost complete and I’ll probably merge it this weekend. Besides the text of the exception message, there are no visible user changes so it shouldn’t be controversial at all.

Andrew Whitworth | Whiteknight's Parrot Blog | 2012-05-20 00:00:00

Its existence guarantees nothing in itself, and the catalytic or Promethean moment only occurs when one individual is prepared to cease being the passive listener to such a voice and to become instead its spokesman, or representative.

But it’s important to remember the many dreary years when the prospect of victory appeared quite unattainable. On every day of those years, the “as if” pose had to be kept up, until its cumulative effect could be felt.

– Christopher Hitchens, Letters to a Young Contrarian

On behalf of the Parrot team, I’m proud to announce the 4.4.0 release of Parrot “Banana Fanna Fo Ferret”. Parrot is a virtual machine aimed at running all dynamic languages.

Parrot 4.4.0 is available on Parrot’s FTP site, or by following the download instructions. For those who want to hack on Parrot or languages that run on top of Parrot, we recommend our organization page on GitHub, or you can go directly to the official Parrot Git repo on Github.

Parrot 4.4.0 News:

- Core
    + Most internal calls to libc exit(x) have been replaced with
      Parrot_x_* API calls or PARROT_FORCE_EXIT
- Documentation
    + 'pdd31_hll.pod' made stable in 'docs/pdds/'.
    + Updated main 'README' to 'README.pod'
    + Updated various dependencies, e.g., 'lib/Parrot/Distribution.pm'.
    + Updated all 'README' files to 'README.pod' files.
    + Added 'README.pod' files to top-level directories.
- Tests
    + Update various tests to pull from new 'README.pod'
    + Updated 't/tools/install/02-install_files.t' to pull from new
- Community
- Platforms
- Tools
    + pbc_merge has been fixed to deduplicate constant strings and
      merge annotations segments

Alvis Yardley (or a delegate) will release Parrot 4.5.0, the next scheduled monthly release, on June 16th 2012. Subsequent release managers are to be announced. A special thanks to our donors, contributors and volunteers for making this release possible.


I haven’t been doing enough blogging lately! On Tuesday I put out the 4.4.0 release of Parrot, “Banana Fanna Fo Ferret”. I figured it was a fun play on words. I added a little quote from a favorite writer of mine, Christopher Hitchens. Much of his writing can be pretty inflammatory, but I picked two quotes that related to historical struggles for social progress, and which when read in a certain light (and dramatically out of context) make sense for Parrot too.

The release went off without a problem, and I’ve got a few branches waiting in the environs to be merged. I’m sure I’ll talk about some of those projects if I can get back into a normal blogging rhythm again.

Andrew Whitworth | Whiteknight's Parrot Blog | 2012-05-17 00:00:00

Last week I promoted the Parse and Json libraries in Rosella to stable status. For both those libraries I wrapped up a few outstanding TODO issues, wrote up some website documentation and added a bunch of unit tests. I figured I would do the same thing for the XML library too. After all I had done the hard part: the first 90% of the library was the recursive descent parser which I had most of.

So today I got to work on that library, trying to put together the last few bits so I could make the library stable. Like I said, I had about 90% of it done already. I spent the time today doing another 90%. I figure I only have about 90% left to go before I have a “real”, usable XML library. Somewhere a mathematician is reading this post and inventing new curse words, but nobody can hear him, because he has no friends.

It turns out that XML is hard.

Anybody can put together a little parser for XML-like tag syntax with attributes, text, and nested tags. That part is dirt simple, and I had that done in an hour or two. It’s once you start getting into DTD declarations and schema validation that things get messy. Honestly, I don’t think I can seriously call Rosella’s XML library “complete” without those things. Or, not without most of them. I can probably get away with only the first 90% or so.

So, what can Rosella’s Xml library do today? Here is a sample of XML text that I can parse into a document object tree without problems:

<?xml version="1.0"?>
<!DOCTYPE foo [
    <!ELEMENT foo (bar, baz)>
    <!ELEMENT bar ANY>
    <!ELEMENT baz (fie)>
    <!ELEMENT fie EMPTY>
    <!ATTLIST fie
                lol CDATA #REQUIRED
                wat CDATA #IMPLIED
                sux CDATA #FIXED "hello!">
]>
<foo>
    <bar/>
    <baz>
        <fie lol="laughing out loud" wat="you talkin bout?" sux="hello!"/>
    </baz>
</foo>

Or, if I want, I can jam all that schema nonsense into a separate file, and load it separately:

<!DOCTYPE foo SYSTEM "foo.dtd">

Although I haven’t integrated Rosella Net yet, to allow loading schemas from a URL. In code, I can do a few things:

var dx = new Rosella.Xml.Document();
if (!dx.is_valid()) {
    for (string err in dx.errors)
        say(err);
}

var dtd = new Rosella.Xml.DtdDocument();
var errors = dtd.validate_xml(dx);
if (elements(errors) > 0) {
    for (string err in errors)
        say(err);
}

That example shows us loading an XML document from a file and validating it with its built-in rules from the !DOCTYPE header. The second part shows us loading a separate DTD definition from a standalone file, and using that to validate the XML document too. In both cases, the validator runs through the document object and returns a whole list of error messages, not just a simple yes/no flag. In both cases, we can also re-serialize the XML and DTD documents back to string and then to file.

So what is left to do? Well, for starters there’s a bunch of syntax in the !ELEMENT tag that I don’t quite handle yet, such as quantifiers and alternations:

<!ELEMENT foo (bar*, (baz|bar), fie?)>

Parsing all that in a way that doesn’t suck is not something I’m looking forward to doing.

Then in attribute lists, there’s some syntax I don’t deal with, such as enumerated values again:

<!ATTLIST foo bar (yes|no)>

The validator I’ve implemented is pretty naive so far, and isn’t set up to do quantifiers anyway. That’s all going to take a while to do. We’re doing some basic validation now, but nowhere near as much as we would expect from a full implementation.
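One well-known way to validate content models with quantifiers and alternations is to translate the model into a regular expression over the sequence of child element names. To be clear, this is not how Rosella's validator works today; it's a sketch of a possible approach, written in Python, with hypothetical function names. It only handles element names, sequences, alternations and the `*`, `?`, `+` quantifiers (no `#PCDATA`, `EMPTY` or `ANY`).

```python
import re


def content_model_to_regex(model):
    """Translate a DTD element content model into a regex over a
    space-terminated sequence of child element names."""
    out = []
    for token in re.findall(r"[A-Za-z_][\w.-]*|[()|,*?+]", model):
        if token == ",":
            continue                      # a sequence is plain concatenation
        elif token == "(":
            out.append("(?:")
        elif token in ")|*?+":
            out.append(token)             # these mean the same thing in a regex
        else:
            # An element name followed by its terminating space.
            out.append("(?:" + re.escape(token) + " )")
    return re.compile("".join(out))


def children_match(model, children):
    """Check a list of child element names against a content model."""
    subject = "".join(name + " " for name in children)
    return content_model_to_regex(model).fullmatch(subject) is not None
```

For the model `(bar*, (baz|bar), fie?)`, `children_match` accepts `["bar", "bar", "baz"]` and rejects `["fie"]`. A real validator would also want to report which child was unexpected, which a bare regex match can't tell you.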

And keep in mind, even when I’m done implementing (mostly) proper XML and DTD parsing, I could still go on to parse other schema languages like XSD which some applications might expect and even prefer. Maybe I could do something like XPath too, which would be very nice. I probably won’t try to do XSLT though: I’m still young and I would like to keep some of my sanity in reserve for my twilight years.

My Json library is about 1300 lines of winxed code long, including whitespace. My Xml library is about 2400 lines of code long and still growing. Json is pretty easy (by design!), but XML is very hard. I’m not going to push the Xml library to become stable any time soon, there’s a hell of a lot of work left on it and I’m not going to rush anything.

Andrew Whitworth | Whiteknight's Parrot Blog | 2012-04-28 00:00:00

Here are some updates on various projects I’ve been working on or been planning to work on:


In my post introducing ParrotStore, I mentioned that I only had support for MySQL, Memcached, and a little bit of stuff working for MongoDB. In the past few days I’ve also added SQLite3 support. Now you can do this, after installing the prerequisites, building and installing:

var sqlite3_lib = loadlib('sqlite3_group');
var sqlite3 = new 'SQLite3DbContext';
sqlite3.query("INSERT INTO tbl1 (name, number) VALUES ('Andrew', 100)");
var result = sqlite3.query("SELECT * FROM tbl1");
for (var row in result) {
    for (string colname in row)
        print(colname + "=" + string(row[colname]) + " ");
}

SQLite3 offers a bunch of features that I don’t tap into yet, but we have a good start and can do some basic work with it already.

Also, I mentioned that we didn’t support queries with multiple result sets in the MySQL bindings. Well, now we do (and we do in SQLite3 too):

var result1 = mysql.query("CALL my_stored_proc");
var result2 = sqlite3.query("SELECT * FROM tbl1 ; SELECT * from tbl2");

If the query returns one result set, a DataTable object is returned. If it has multiple result sets, an array of DataTables is returned instead.

Eval PMC

I went digging through my backlog of old branches last night and found my incomplete branch for removing the deprecated Eval PMC. After updating to current master I gave it a spin and most things looked good. I fixed all the core parrot tests and then moved on to the rest of the ecosystem.

Winxed works fine with the PackfileView PMC instead of the Eval PMC. I made a few of those updates in the past, so it mostly worked out of the gate. Rosella compiled and ran like a charm too.

NQP-rx works fine because it mostly relies on the PCT libraries that ship with Parrot, and which I had already fixed.

The new NQP is a little bit more of a hassle. It took me a little bit of effort to figure out the bootstrapping mechanism, but after a few hours of hacking I had NQP building on the new Parrot using PackfileView instead of Eval. However, one of the regex tests hangs indefinitely now and I’m having trouble tracking that down. This project may get bumped down to a lower priority level until I can either figure out what the problem with NQP is, or until I can enlist some help to fix it.

I would like to merge this branch as soon as NQP is fixed and I can prove that I can build it and Rakudo on the branch.

Sub Flags Cleanup

My remove_sub_flags branch, tasked with removing the old :load and :init flags from Parrot and replacing them with the new :tag() syntax is right where I left it a few weeks ago. I’m down to a relatively small list of test failures, the solution to most of which is to update the syntax in the tests themselves. A handful of tests such as those using the parrot-nqp and winxed compilers are failing because I need to update those compilers first to generate the correct code so the tests can run correctly.

After fixing NQP-rx and Winxed, I need to get started testing out the new NQP and Rakudo. I suspect both of those two things will be made to work without too much effort.

It turns out that the Eval PMC deprecation work overlaps with this slightly, so the things I change for that branch should help reduce failures in this branch too. After I get Eval deprecated and removed, I’ll come back to this branch and see where things stand.

This is such a large and disruptive change that I can’t imagine we would want a merge before the 4.4 release, even if I got all the bugs ironed out. We could be a month or more away from a merge, so I’m not listing this work as high priority.


Bacek has been doing a lot of refactoring in PCC land, trying to fix some slow and infelicitous aspects of it. I’ve gotten a set of new PCC-related opcodes added to core and have a few more that I want to add, including new variants of set_args, get_params and friends to take explicit context arguments instead of using magical behavior to try and find them automatically. A few patches to IMCC and the new behavior might go in without anybody noticing. I’ve talked more about this in past posts, and I’m sure I’ll have more to say when I start making changes.


Rosella is mostly where I want it to be right now. I’m planning to change around the development cycle to stick to supported releases of Parrot and Winxed instead of tracking HEAD for both of them. I’m going to promote one or two more libraries to “stable” status and then put out a release sometime after Parrot 4.4 hits the news stands next month. I’ve already promoted the Parse and Json libraries to stable status. I will probably promote Xml and Net too, since I am pretty happy with both of those two libraries and feel that they are almost ready for general use.

After that, I suspect Rosella is going to take a back seat for a while, so I can focus on some other projects.

Google Summer of Code

GSOC is keeping me pretty busy so far. We accepted 4 projects this summer. The fifth project, which was to do some work on the Jaesop Stage 1 compiler, was lost because the student was accepted to a different organization instead. The four remaining projects are:

  1. Security Sandbox by Justin
  2. Mod_Parrot 2.0 by brrt
  3. LAPACK Bindings by jashwanth
  4. PACT Assembly by benabik

I think these projects will be very cool, and I am looking forward to see what kinds of great code they can produce this summer.

Green Threads

nine has been doing some amazing work on his threading branch. Yesterday he informed me that he had a solution to make green threads work on Windows, and had already implemented part of it. That’s awesome, because I was planning to work on porting the green threads to Windows next, but if he’s doing it then I don’t have to.

Some of the performance numbers he’s been getting are pretty impressive for certain tasks. Some benchmarks he has are even showing a significant threading performance improvement over a similar benchmark written in perl5.

I’ve been doing some testing on his branch and things are looking mostly good except for one or two remaining GC-related bugs that need to be ironed out. After that, if we can get some consensus, I would love to start talking about a merge shortly after 4.4.


With Green Threads possibly off my TO-DO list, Eval PMC Deprecation mostly wrapped up and remove_sub_flags on the back burner, I can start moving towards my next project: 6model. And I can do it much earlier than I was expecting. I’m going to mine benabik’s rejected 6model project proposal for some ideas, then I’m going to jump in and try to get things working. I suspect things could get moving pretty quickly, if I can keep my level of free time relatively high.

Andrew Whitworth | Whiteknight's Parrot Blog | 2012-04-25 00:00:00

I created a new repo for a new project: ParrotStore. ParrotStore intends to provide some storage and persistence (and caching and database) solutions for Parrot. At the time of writing this post we have three in development: Memcached, MySQL and MongoDB.


The first thing I wrote is a rudimentary pure-parrot interface to Memcached for high speed caching. The interface looks like this:

var memcached = new ParrotStore.Memcached(["", ""]);
memcached.set("foo", "hello world!");
:(int have, string content) = memcached.get("foo");

Or, if you want a simpler interface, you can do something like this:

string content = memcached.autoget("foo",
    function() { return "hello world!"; }
);

The autoget method will try to read from Memcached if the item exists, and will invoke the callback to get the value otherwise (and save it to Memcached for later use). Of course, for this to be practical the callback to generate the content should be more expensive than a return of a constant string.
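The behavior described for autoget is the usual cache-aside pattern. Here is a minimal sketch of it in Python against an in-memory stand-in; `FakeCache` is a hypothetical class for illustration and is not the ParrotStore implementation or a real Memcached client.

```python
class FakeCache:
    """In-memory stand-in for a Memcached client (hypothetical)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        # Mirror the (have, content) pair shown above.
        if key in self._data:
            return True, self._data[key]
        return False, None

    def set(self, key, value):
        self._data[key] = value

    def autoget(self, key, producer):
        have, content = self.get(key)
        if have:
            return content
        content = producer()      # expensive work happens only on a miss
        self.set(key, content)
        return content
```

Calling `autoget` twice with the same key runs the producer callback once; the second call is served from the cache.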

I haven’t tested with multiple memcached servers yet, and I haven’t implemented several of the methods memcached supports. It’s a start, however, and I can already think of several potential uses for it.


MySQL is popular and extremely common, so I figured I should work on that next. Plus, if we ever want to have a snowball’s chance in hell of hosting a decent PHP compiler, we’re going to want easy and available bindings for MySQL. Now, after a little bit of hacking today, we have it.

Here’s what we can do in Parrot today:

var lib = loadlib("mysql_group");
var mysql = new 'MySQLDbContext';
mysql.connect("localhost", "username", "password", "database", 0, 0);
var result = mysql.query("DROP DATABASE foo;");
say(result, " rows affected");      // "1 rows affected", if you had one

result = mysql.query("SELECT * FROM bar");
say(typeof(result));                // "MySqlDataTable"
for (var row in result) {           // Iterate over all rows
    int idx = int(row);
    say("row " + string(idx));
    for (string column in row) {    // Iterate over all columns
        say(column + ": " + string(row[column]));
    }
}

One thing I don’t handle quite yet is multiple result sets. So if you have a stored proc which returns multiple sets of data, you won’t get any but the first back into your program. I’ll try to get that implemented as quickly as I can.


We’re starting to use MongoDB at work, and I figured a great way to become more familiar with this piece of software was to write bindings for it for Parrot. Despite several unnecessary problems with linking to the Mongo C Driver libraries, I’ve managed to produce a few results.

Mongo uses a storage format called BSON (similar to JSON), and stores BSON documents as atomic units. ParrotStore implements a BsonDocument and a MongoDbContext PMC type. As of this morning, you can create a BSON document and insert it into the DB:

var lib = loadlib("mongodb_group");
var bsondoc = new 'BsonDocument';
bsondoc.append_string("first", "Andrew");
bsondoc.append_string("nick", "Whiteknight");

var mongo = new 'MongoDbContext';
mongo.connect("", 27017);
mongo.insert("local.foo", bsondoc);

The document is indeed written to the database, although I don’t have any methods yet to read it back out. The documentation for the C Driver for MongoDB is lacking, but I have the source code handy and it is pretty readable. I hope to have basic querying implemented by the end of the day.

Here are a few things I plan to add, either today or in the next few days:

  1. Support simple queries and commands
  2. Support introspecting and iterating over BSON documents
  3. Implement a JSON->BSON translator (I have most of this written already).
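To give a feel for what a JSON->BSON translator has to do, here is a deliberately tiny sketch in Python. It is not the ParrotStore translator: `bson_from_json` is a hypothetical name, and it only handles a flat JSON object whose values are all strings (BSON element type 0x02).

```python
import json
import struct


def bson_from_json(text):
    """Encode a flat JSON object with string values as a BSON document.
    Only the string type (0x02) is handled in this sketch."""
    obj = json.loads(text)
    body = b""
    for key, value in obj.items():
        if not isinstance(value, str):
            raise TypeError("sketch only supports string values")
        raw = value.encode("utf-8") + b"\x00"
        body += b"\x02"                        # element type: UTF-8 string
        body += key.encode("utf-8") + b"\x00"  # element name as a C string
        body += struct.pack("<i", len(raw))    # length includes the trailing NUL
        body += raw
    body += b"\x00"                            # document terminator
    # The leading little-endian int32 counts the whole document, itself included.
    return struct.pack("<i", len(body) + 4) + body
```

A real translator also needs numbers, booleans, nulls, arrays and nested documents, each with its own type byte and layout.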

There are several other features that I need to implement, although many of them aren’t necessary to say I have a minimally functional set: support for replicated sets, support for atomic find/replace updates, support for cursors and bson iterators, etc. There’s a lot of work here, but I’m off to a pretty good start already.

Build System and Project Setup

ParrotStore contains a bunch of sub-projects which are really only related by theme. They’re all solutions for storing stuff, but they don’t really relate to each other besides that. So, the build system is set up to easily build these projects individually. At the terminal, if you have make, you can build them like this:

make memcached
make install_memcached
make mysql
make install_mysql
make mongodb
make install_mongodb
make            # attempts to build them all
make install    # attempts to build and install them all

This is handy if you don’t have the mysql or mongodb development packages installed but you want to get the memcached library (or any other combination).

Internally, the makefile calls a distutils-based setup.winxed program for building the various components, but you shouldn’t use setup.winxed directly.

Like Rosella, which is a prerequisite for this project, ParrotStore will be a collection of things not one big monolithic system. It will provide a Memcached interface in one standalone library, a MySQL interface in one, a MongoDB interface in one, and other interfaces separately too. Some of them (like Memcached) will be pure parrot. Other things like MongoDB will have C-level components too. Where Rosella has always promised to be pure Parrot, ParrotStore cannot and should not follow such a rule. Some things may turn out to be implementable with NCI, but that’s an experiment for later. Maybe, much later.

Also, expect a lot of synergy between Rosella and ParrotStore. ParrotStore will use Rosella internally, provide many of the interfaces that other Rosella-based projects expect, and add several extensions to make Rosella features even more cool and powerful.

Future Projects

The goal of ParrotStore is simple persistence. In a sense it might become something like an ORM, or contain an ORM, mapping Parrot data to and from various persistence mechanisms. This project does not intend to do any embedding, whether Parrot embedded in a database or a database embedded in Parrot, or whatever else. The database (or cache or whatever) is separate, and ParrotStore just provides a client interface to it. For instance, the PL/Parrot project embeds Parrot into the Postgres DB. ParrotStore would provide an external interface for querying it instead.

I do not yet have a runnable test suite. I’ve been doing ad hoc tests because this is all so new and experimental. I need to add a test suite.

I also want to add a custom caching mechanism for storing frozen PMCs to file and fetching them again. Multiple backends to a PMC mechanism would allow us to store PMCs to various persistence systems for later use. This is another thing that I’ve wanted for a while, but I haven’t quite nailed down a design yet.

I would like to add a client interface for Postgres. I suspect there are some people floating around who could help make that a reality.

I think this project will probably grow organically, adding new storage backends and cool interfaces for various purposes, and then adding some tools and utilities that use these things. As with all my projects, feedback, requests, suggestions, and questions about my basic competency are always welcome.

Andrew Whitworth | Whiteknight's Parrot Blog | 2012-04-15 00:00:00

The GSOC 2012 proposal deadline has come and gone. We’ve received several project proposals, although only half a dozen are serious, honest, plausible proposals. We have a few days now to rank them, comment on them, and assign potential mentors to them. When we find out how many slots Google is able to assign to us, we’ll be able to pick out which ones will be worked on this summer.

Here’s a list of the decent-looking proposals we’ve received:

  • Jaesop Stage 1 Compiler and Runtime by mayank. mayank intends to fix up the last remaining bits of stage 0, then get started on a Javascript-compiler-in-Javascript stage 1. By the end of the summer I think it’s very plausible that he could have a self-hosting compiler for most of the JavaScript language and at least a start on the basic runtime. If he keeps the abstraction boundaries nice and clean, after the summer is over we should be well primed to start upgrading bits of the internals to use 6model and PACT, when those two projects are ready to be used there.
  • LAPACK Bindings for Parrot-Linear-Algebra by jashwanth. This is something I’ve wanted to add to PLA since the beginning of that project, and an absolute necessity if I ever want to get back to my dream of writing an M language compiler for Parrot. jashwanth has proposed adding LAPACK bindings to PLA (via NCI) and implementing a nice interface for some of the most important transformations, decompositions and operations provided by that library. He also intends to provide a few pure-parrot backup implementations for cases when LAPACK isn’t available but we still need to get work done. It’s an open-ended project that can be done in small, discrete chunks.
  • 6model Integration by benabik. We know we want 6model, and we know benabik has the chops to pull it off. He is still working on his thesis AND is expecting a baby this summer, but somehow I still don’t feel like it’s an undoable project. His proposal is to integrate 6model into Parrot’s core and start transitioning our existing PMCs to use 6model instead (and abandon most of our current object model).
  • PACT Assembly by benabik. Yes, benabik has submitted two proposals. This one is the start of PACT; something that, like 6model, we know we want. benabik, considering PACT was his brain child and he’s getting his PhD in compiler-related topics, is uniquely qualified to pull this one off and make it shine. The real question is which one of his two proposals we as a community want him to work on more (I’m already personally signed up to do whichever one he doesn’t pick, so it shouldn’t be a loss either way). PACT, as I’ve talked about before, is intended to be a large modular library of compiler tools and building blocks, so there is ample room to expand the project if things are going unusually well.
  • Security Sandbox by Justin. Security sandboxing is something that we’ve wanted, to varying degrees, for years. Justin has proposed to at least get us started with proper security and implement as many permissions and restrictions as he has time for. It’s a project that we can consider to be a “success” if even half of what gets proposed actually gets completed, and there is plenty of room to build on if his momentum gets up.
  • Mod_Parrot 2.0 by brrt. ModParrot, the Parrot module for the Apache webserver, hasn’t been actively maintained in some time, and has fallen into disrepair following many of the internals changes to Parrot in the past few years. brrt has proposed an update to ModParrot to use the new and more stable embedding API. This is another modular project that can grow, if his development speed stays high, to include all sorts of helper libraries, driver programs, plugins for HLLs (Rakudo in particular) and other things. Most valuable of all may be his plans for implementing an automated test suite, which will help ensure ModParrot never falls by the wayside again.

So we’ve gotten 6 decent proposals from 5 students, and if even half of these go on to succeed in reaching their goals Parrot will be much better off by the end of the summer. And this list doesn’t even include the calling-conventions work that bacek is working on, the threading work that nine is working on, the M0 work that several other developers are doing, or the packfile, IMCC and other work that I’m planning for myself. This could be a very eventful summer indeed.

If you’re signed up to be a mentor this summer, or if you would like to be, please head over to the GSOC website, sign up, and take a look at the proposals.

Andrew Whitworth | Whiteknight's Parrot Blog | 2012-04-09 00:00:00

I’ve received a few emails from prospective GSOC students interested in doing Jaesop-related work this summer for GSOC. So, to make sure it was a platform that’s fit to be worked on, I fired it up and ran through the tests. Everything passed, which is pretty awesome, except there weren’t a whole heck of a lot of tests to begin with. Saying that 100% of tests covering about 10% of the code passed isn’t saying much with any kind of certainty.

Having a student work on Jaesop, if such a proposal is submitted and accepted, would be quite a boon for that project. In summers past we’ve had students working on compiler-related projects, usually starting with nothing or almost nothing. For instance last year Rohit was working on a JavaScript compiler starting from the ground up and didn’t have a huge amount of luck. Lucian was working on a Python compiler project, disregarding much of the older Pynie project work that had been done. He did better, but would have been able to achieve a lot more if he were starting from a stronger foundation (especially, an improved Python-ready object model). Asking another student to work on a new compiler project this year would not be a great thing for us to do, especially knowing that some of the fundamental issues (i.e. object model) are still not resolved at the lowest level.

Jaesop is slightly different because a student would be starting with a working foundation. It’s not perfect by any stretch, but it is something. It is a working piece of code with some of the complexities of the JavaScript object model and library logic already sorted out. To make it an even better platform for launching a summer project, I decided that a few more pieces needed to be added. I feel like it’s the difference between a student who can hit the ground running and one who has to crawl around a few big rocks first.

First thing, I added require() and exports. This way we can do the Common.js analog of loading bytecode files. Here’s a small example that I wrote out, called sys.js:

exports.puts = function(s) {
    // ...
};

And here is how we would call that from our program:

var sys = require("sys");
sys.puts("Hello world!");

Doing this kind of stuff is the kind of unnecessary complexity that a student shouldn’t have to worry about. It’s tangential to any problems a student would be solving and so having the student waste time on infrastructure like this would just be taking time away from the actual project.

I’ve also improved logic relating to prototypes. It’s not perfect or compliant by any standards, but it’s much much closer to the norm and likely gets us close enough to parsing and running the kinds of non-trivial programs that are going to form the basis of a compiler.

var sys = require("sys");
function Foo() {
    this.a = "foo a";
}
Foo.prototype.b = "foo_b";

function Bar() {
    this.a = "bar a";
}
Bar.prototype.b = "bar b";

var f = new Foo();

var b = new Bar();

If I didn’t tell you otherwise, you might think this was normal JavaScript code running on Node.js or some other real JavaScript compiler! This exact code example is running on my machine right now. Prior to my hacking this morning, the code above would either have thrown an error or have printed out the same thing twice because both Foo and Bar function objects would have shared a prototype.
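
For comparison, here is a self-contained sketch in standard JavaScript of the behaviour described above: two constructors with separate prototypes, followed by the shared-prototype situation that matches the old bug. It uses console.log where Jaesop code would use sys.puts, and the BrokenFoo and BrokenBar names are invented for the illustration.

```javascript
// Each constructor function gets its own prototype object, so properties
// set on Foo.prototype must not leak into instances of Bar.
function Foo() {
    this.a = "foo a";
}
Foo.prototype.b = "foo b";

function Bar() {
    this.a = "bar a";
}
Bar.prototype.b = "bar b";

const f = new Foo();
const b = new Bar();

console.log(f.a, "/", f.b);
console.log(b.a, "/", b.b);

// The bug described above corresponds to both constructors sharing one
// prototype object, so the second assignment overwrites the first.
const shared = {};
function BrokenFoo() {}
function BrokenBar() {}
BrokenFoo.prototype = shared;
BrokenBar.prototype = shared;
BrokenFoo.prototype.b = "foo b";
BrokenBar.prototype.b = "bar b";
console.log(new BrokenFoo().b === new BrokenBar().b);
```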

Now that I’ve done this work I feel much better about a Jaesop-based GSOC project happening this summer. Now the responsibility lies with the students to submit acceptable proposals and get to work!

Andrew Whitworth | Whiteknight's Parrot Blog | 2012-04-01 00:00:00

M0 has been coming down the pipeline for several months. It's still pretty raw and has a number of known functionality holes, but it's getting better by the week. I'd like to make the next few stages of M0 part of our official roadmap, so this post spells out the overall plan and what I think we can accomplish in the next three months.

M0 currently exists as a fairly hacky Perl 5 prototype. This is of necessity because Perl isn't generally intended to operate at the level that M0 requires. Perl is still serviceable as a prototype implementation language, but the form that will be integrated into Parrot will be written in C. There will be many stages between now and when the M0 migration is complete, but the goal I'll focus on is noop integration. I'll explain what I mean by that below.

I see Parrot's migration to M0 falling into 7 stages:

M0 Prototype

We're working out bugs in the Perl 5 M0 interpreter and making certain that M0 will be a sufficient foundation for Parrot. M0 may change significantly but we're making an effort to stabilize it.

C89 Implementation

We're happy with M0 and have a reasonably efficient compiler-agnostic implementation of M0, written in C89, which passes all tests. Separate compiler-specific implementations are fine, but not a priority.

Noop Integration

C/M0 is linked into libparrot and exposes an interface that C code can use to call into M0 code. At this point no subsystems have been reimplemented in M0.


We specify and implement Mole, which will be a C-family language that compiles directly to M0. Writing M0 is painful (this was an explicit design goal), so Mole is what a large chunk of the M0 that implements Parrot will be written in. M0 bytecode is what will be run from Parrot, so other code generation possibilities exist.

Early Integration

We've started moving subsystems over to M0. The order of which systems hasn't been determined yet, but producing a complete list and making sure we're aware of the dependencies will prove important.

C/6model in Core

Having a solid implementation of 6model in core will eventually be a blocker. Implementing our current object semantics in M0, only to switch to 6model later isn't a wise use of our hackers' tuits.

Pervasive Integration

At this point, everyone can jump in. We have a couple major subsystems converted and have worked most of the kinks out of the process of translating C into M0. We'll be converting every subsystem that we can find to M0 and will have plenty of example code and documentation to lower the barrier to entry.

Complete Integration

Parrot has a fairly small core of C code consisting of little more than the M0 VM and the GC.

Committing to a timeline can be tricky. It's much more important to have an M0 that's thoroughly well thought-out than one that's usable by a certain date. That said, the M0 spec and prototype are coming along nicely. Completing the "Noop Integration" stage and possibly getting a solid Mole compiler by the 3.9 release are reasonable goals, depending on how many interested parties make themselves known. I'm happy to see that whiteknight has made C/6model one of his roadmap goals. C/6Model in Core is largely orthogonal to M0 except that it needs to be integrated and solid before we start translating Parrot's object-related C code into Mole.

cotto | reparrot | 2011-08-01 20:04:14

Note: this post is about implementing an M0 interpreter in Perl and is more a lightly edited braindump than a polished presentation of a concept.

Recently some test failures in M0's test suite revealed that the prototype Perl interpreter had been sneaking some of its perl-nature into the implementation.  The M0 assembler had been storing all values as strings and the interpreter had been secretly using its perlishness to convert the number-like values into ints at runtime.  This doesn't work well for an M0 implementation because M0 needs to be very specific about the low-level behavior of an implementation and the way it treats registers.

Perl is not C, and the basic problem I'm running into is that Perl is not designed to operate at the low level that M0 (as it currently stands) requires.  M0 is all about bytes and assigning meaning to the value in a register by using a certain class of ops on it.  Perl is much higher-level and doesn't even have a particularly strong distinction between strings and integer values.  If I want Perl to have string byte-oriented C-like semantics, it means that I'll be widely (ab)using the bytes pragma and pack/unpack.  This is doable, but it's also torturing Perl into implementing something even further from its intended use case than the current (and subtly-incorrect) M0 implementation already is.  sorear rightly freaked out when he looked at the M0 interp code, because it's doing something that Perl wasn't intended to do and something that Perl isn't particularly well-suited to.

Still, javascript has been used to emulate at least x86, 6502, Z80 and 5A22, and with surprisingly reasonable performance.  Arguably that's also pretty far from javascript's intended use case, and still it works.  This may just be an issue of finding the least hacky way to do something inherently very hacky.
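
To make the string-versus-integer problem concrete, here is a small JavaScript sketch. JavaScript has the same loose typing as Perl, and one common way to get byte-level behaviour in it is to back the registers with raw bytes through a DataView. The register layout (8 registers of 4 bytes each, little-endian) and the helper names are assumptions made for this sketch, not something taken from the M0 spec or the Perl prototype.

```javascript
// A register file backed by raw bytes. The same 4 bytes can be read as a
// signed or an unsigned integer, depending on which accessor is used.
// The layout (8 registers of 4 bytes each, little-endian) is an assumption
// made for this sketch, not something taken from the M0 spec.
const REG_SIZE = 4;
const registers = new DataView(new ArrayBuffer(8 * REG_SIZE));

function setInt(reg, value) { registers.setInt32(reg * REG_SIZE, value, true); }
function getInt(reg)        { return registers.getInt32(reg * REG_SIZE, true); }
function getUint(reg)       { return registers.getUint32(reg * REG_SIZE, true); }

// Integer add with explicit 32-bit two's-complement wraparound.
function addInt(dst, a, b) {
    setInt(dst, (getInt(a) + getInt(b)) | 0);
}

// Loose typing: values that arrive as strings concatenate instead of adding.
const loose = "2" + "3";

setInt(0, 2);
setInt(1, 3);
addInt(2, 0, 1);   // a real integer add

setInt(3, 0x7fffffff);
setInt(4, 1);
addInt(5, 3, 4);   // overflows and wraps to the most negative int32
```

The point is only that the meaning of a register comes from the op applied to it, which is the behaviour the prototype has to imitate with pack/unpack.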

The alternative is to specify M0 to have flexible underlying semantics, but I don't know that it'd be either practical or advisable to go too far down this road.  It's worth giving some thought to making the M0 spec be minimally unnatural to implement in a high-level language, but M0 is by its nature a low-level beast.  Implementations are bound to reflect that to some extent.

In the end, the best way forward will probably be to plow through the craziness of implementing a simplified CPU in Perl and look forward to building on chromatic's C implementation, where the intent of the implementation language is much closer to the aim of the project.

cotto | reparrot | 2011-07-20 00:06:08

Really TLDR: The Parrot has landed.

It brings me great joy to announce that I have completed all milestones for my TPF grant regarding the Parrot Embed/Extend subsystems! Not only that, but all of my grant work was included in the most recent release of Parrot, 3.5.0 "Menelaus".

The actual TLDR of this update is "many tests were written, code coverage is above 95% for all systems described in the grant, docs were improved, many Parrot Trac tickets were created and many a blarg toast was cooked."

For those of you that have a thirst for knowledge unquenched (I know who you are), you are welcome to pondiferously peruse the Impending Technical Details.

The Deets

The last portion of this grant definitely challenged me to think in new ways about testing and I am only now beginning to reap the benefits. I was charged with adding code coverage for a few rarely-if-ever-used C functions in Parrot's embed/extend subsystem, which allows you to embed Parrot into other applications and other funky stuff.

Whiteknight++ greatly helped me write a test for Parrot_sub_new_from_c_func, which takes a C function and a string that describes the function signature of the C function and returns an NCI PMC, which can be invoked.

I also learned many lessons about code coverage during the final stage of this grant, even though I thought I was at such a level of expertness that it would be hard to learn drastically new and important perspectives on testing. This pattern of thinking is always wrong.

Lesson 1

Sometimes you are the underdog and you have to interpret the rules in a new way in order to have a chance at winning. You need to be Ender Wiggin from Ender's Game: continually inventing new tactics to keep a winning edge over the competition.

I noticed that a large portion (about 80%) of the uncovered code in one file was a macro that was copy-and-pasted into two places. I refactored this into a single macro called POP_CONTEXT, which reduced the total number of lines in the file by roughly 10 while simultaneously decreasing the number of uncovered lines in the file by ~20, which had the combined effect of pushing the code coverage over the necessary 95% mark.

This change definitely increases the maintainability and modularity of the code, but it feels a bit like gaming the system. Nonetheless, it saved the day.

Lesson 2

The simplest useful test that you are avoiding is the most valuable next test to write, because it has the best ROI (Return on Investment), where the investment is the time it takes to write the test, and the return is having an automated way of verifying that the feature works.

Lesson 3

Software developers are very optimistic about time estimates. We forget about all the possible things that could go wrong and often quote estimates on something approaching a "best case scenario". As a rule of thumb, I think all software developers should think long and hard about a time estimate for a given project, write down the estimate, then multiply that time estimate by pi for a REAL estimate.

I theorize that pi is the factor of time it takes to debug and write tests for behavior of hard-to-recreate edge cases.

I originally thought my grant would take about three months, but it ended up taking about nine or ten. QED.

Finally, I would like to thank my grant manager Makoto Nozaki for providing lots of feedback, support and encouragement, as well as everyone else at The Perl Foundation for funding this grant.

Jonathan Leto | Jonathan Leto | 2011-07-11 02:14:24

Welcome to the first edition of PWN.  At YAPC::NA, long-time developer chromatic expressed frustration at the fact that Parrot as a community hasn't been effective in communicating the knowledge of its members.  IRC, while great for immediate communication, doesn't lend itself to transparency for those who don't have time to hang out on #parrot 24/7 or to follow our IRC logs.  My hope for this newsletter is to make Parrot's development more transparent, even for those who only have an hour or two per week to keep up with Parrot.  I also hope that this will serve as a common channel of communication for all Parrot developers in order to provide a basic understanding of what's been happening in Parrot and what's needed.


The past week contained YAPC::NA, a grassroots Perl conference organized by the Perl community for the Perl community.  There were three Parrot-related talks given by kid51, dukeleto and me, and one Perl 6 talk given by colomon.  There was also a well-attended Parrot/Perl6 BoF session on Tuesday and a hackathon on Thursday.  The hackathon was largely focused on coding and didn't generate significant directed discussion.

kid51's 10 Questions

kid51 had a short talk in which he raised a number of important questions about OSS projects in general.  He then proceeded to apply those questions to Parrot, with less than stellar results.  He had some good points, particularly that Parrot needs to become production-ready before it can be considered a true success, that Parrot needs to have a better-defined purpose and focus, and that the project needs to "get to the point".  Asking tough questions isn't usually fun, but kid51 did Parrot a great service by honestly and directly pointing out some of the flaws of our community.  I hope his feedback will lead to positive changes in the way we look at ourselves and the products we're producing.

kid51's slides and a recording of his talk are here.

dukeleto's Visual Introduction to Parrot

dukeleto presented an introduction to the world of Parrot.  His intent was to give Parrot newbies a high-level overview of Parrot, its community and its ecosystem.  It was lighter in content due to being targeted toward less experienced audiences.  Nevertheless, it was an entertaining talk for people who already knew Parrot and provided a novel metaphor for understanding VTABLEs.  Once we're based on 6model, I look forward to seeing what kind of metaphor he comes up with.

dukeleto's slides are here.

cotto's State of Parrot

I presented a talk on the state of Parrot just after dukeleto's talk.  I covered developments in Parrot over the past year, some of the issues we need to deal with and what we expect the future to hold.  The short version is that there are a number of problems that are keeping Parrot from realizing its potential, but I think we have it within ourselves to overcome them and to produce an exciting production-ready virtual machine with some novel and useful properties.

My slides are here.

colomon's Numerics in Perl 6

colomon gave a worthwhile talk about performing numerical calculations in Perl 6, both in Rakudo and Niecza (pronounced "niecha").  The talk was a good display of how people are using code that's built on top of Parrot and Rakudo.  As with all beta software, there were places where colomon ran into holes in the implementations of both Niecza and Rakudo, but the talk was hopeful and made me proud to be a Parrot hacker.

His slides are here.

Parrot/Perl6 BoF

The Perl 6 and Parrot BoF session was considerably more organization-focused than most attendees were expecting.  Although the majority of attendees were from Parrot, Perl 6 (Larry Wall) and Rakudo (colomon) were also represented.  A primary point was that Parrot needs to get better at communicating communal knowledge among its members and users.

Someone also suggested an intriguing way of reframing participation in Parrot.  Many of us developers work to scratch our own itches, but the question "What would you be doing if the Parrot Foundation were paying you a salary?" provided a new way to look at how we manage Parrot and spawned a couple of threads on parrot-dev.  For my part, this question provided the motivation for putting together this newsletter.  I hope it will also provide a motivation for all developers to take a more complete view of Parrot.

Room For Improvement

In this section of the newsletter, I will highlight areas of Parrot that are ripe for optimization.  Due to YAPC::NA this newsletter is already filling up quickly, so I'll highlight just one area.

config_lib.pir creates a hash that contains all data picked up by Configure.pl during configuration.  It has more than 250 entries, the majority of which don't provide any useful information.  Figuring out which entries in the hash are necessary and removing all the rest will help trim Parrot's startup time and make parrot_config a bit easier to sort through.  If you're interested in this, drop by #parrot or parrot-dev and chances are good that someone will be able to put you to work.

Other possible areas for optimization are listed on the following pages on our wiki.


If you see an interesting conversation on #parrot, parrot-dev or #perl6, please mark it by saying "PWN".  When preparing this newsletter, I'll search through irclog (moritz++) for any mentions of "PWN" and add a summary of the conversation to the next edition of PWN.

cotto | reparrot | 2011-07-05 13:23:09

I met with fellow Parrot hackers allison++, cotto++ and chromatic++ recently in Portland, OR (it was jokingly called YAPC::OR on IRC) to talk about what we call M0. M0 stands for "magic level 0" and it is a refactoring of Parrot internals in a fundamental way.

cotto++ and I have been hacking on a detailed spec (over 35 pages now!) and a "final prototype" in Perl 5 in the last few weeks. M0 is "magic level 0", which means it consists of the most basic building blocks of a virtual machine, from which the rest of the VM can be built. The term "magic" means high-level constructs and conveniences, such as objects, lexical variables, classes and their associated syntax sugar. M0 is not meant to be written by humans, except during bootstrapping. In the future, M0 will probably be generated from Parrot Intermediate Representation (PIR), Not Quite Perl 6 (NQP) or other High Level Languages (HLLs).

The most important reason for M0 is to correct the fact that too much of Parrot's internals is written in C. Parrot's internals are constantly switching between code written in PIR, in other HLLs such as NQP, and in C. Many types of optimizations go right out the window when you cross a language boundary. It is best for a virtual machine to minimize crossing language boundaries if an efficient JIT compiler is wanted, which we definitely desire. Since many hotpaths in Parrot internals cross between PIR and C, they can't be inlined or optimized as much as we would like.

A few years back, Parrot had a JIT compiler, from which many lessons were learned. I am sure some people were frustrated when we removed it in 1.7.0 but sometimes, it is best to start from a clean slate with many more lessons learned under your belt. Our old JIT did support multiple architectures but required maintaining a "JIT version" of every opcode on each architecture supported. Clearly, this method was not going to scale or be maintainable.

I will venture to say that M0 is the culmination of the lessons learned from our failed JIT. I should note that "failure" does not have a negative connotation in my mind. Indeed, only through failure are we truly learning. If you do something absolutely perfectly, you aren't learning.

We are at an exciting time in Parrot's history, in that for a long time, we wanted an elegant JIT, using all the latest spiffy techniques, but it was always an abstract idea, "just over there", but not enough to grab a-hold of. A new JIT that meets these goals absolutely requires something like M0, and is the driving force for its design. M0 will pave the way for an efficient JIT to be implemented on Parrot.

M0 currently consists of under 40 opcodes from which (we wager) all the rest of Parrot can be built. This is radically different from how Parrot currently works, where all of the deepest internals of Parrot are written in heavily macroized ANSI C89.
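
To give a feel for what a very small, fixed-shape instruction set looks like, here is a toy register interpreter in JavaScript. The op names, their numbering and the register count are invented for this sketch and are not M0's. The only detail borrowed from these notes is that every op takes three arguments, with unused ones ignored.

```javascript
// Toy register machine: every instruction is [op, a1, a2, a3].
// Op names and numbering are invented for this sketch and are not M0's.
const OP = { SET_IMM: 0, ADD: 1, SUB: 2, GOTO_IF: 3, HALT: 4 };

function run(program, registerCount = 8) {
    const reg = new Int32Array(registerCount);
    let pc = 0;
    while (pc < program.length) {
        const [op, a1, a2, a3] = program[pc];
        pc += 1;
        switch (op) {
            case OP.SET_IMM: reg[a1] = a2; break;                // a3 unused
            case OP.ADD:     reg[a1] = reg[a2] + reg[a3]; break;
            case OP.SUB:     reg[a1] = reg[a2] - reg[a3]; break;
            case OP.GOTO_IF: if (reg[a2] !== 0) pc = a1; break;  // a3 unused
            case OP.HALT:    return reg;
            default: throw new Error("unknown op " + op);
        }
    }
    return reg;
}

// Sum 5 + 4 + 3 + 2 + 1 into register 0.
const program = [
    [OP.SET_IMM, 0, 0, 0],   // r0 = 0   (accumulator)
    [OP.SET_IMM, 1, 5, 0],   // r1 = 5   (counter)
    [OP.SET_IMM, 2, 1, 0],   // r2 = 1   (constant)
    [OP.ADD,     0, 0, 1],   // r0 = r0 + r1
    [OP.SUB,     1, 1, 2],   // r1 = r1 - r2
    [OP.GOTO_IF, 3, 1, 0],   // if r1 != 0, jump back to instruction 3
    [OP.HALT,    0, 0, 0],
];
const result = run(program);
```

Everything above this level (objects, lexicals, classes) would have to be composed out of ops about this primitive, which is why nobody is expected to write M0 by hand.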

M0 has a source code, i.e. textual form and a bytecode form. chromatic++ brought up a good point at the beginning of the meeting about the bytecode file containing a cryptographic hash of the bytecode. This will allow one to distribute bytecode which can then be cryptographically verified by whoever eventually runs the bytecode. This is a very "fun" application of cryptography that I will be looking into further.
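
As a sketch of the idea, the snippet below appends a digest to a bytecode buffer and checks it again on load, using Node.js's built-in crypto module. The choice of SHA-256, the "digest as a 32-byte trailer" layout, and the seal/unseal names are all assumptions for illustration. The notes above do not say which hash or file layout M0 will use.

```javascript
// Sketch: append a SHA-256 digest to a bytecode buffer and check it on load.
// The choice of SHA-256 and the "digest as a 32-byte trailer" layout are
// assumptions for illustration; the M0 file format may differ.
const nodeCrypto = require("crypto");

const DIGEST_BYTES = 32;

function seal(bytecode) {
    const digest = nodeCrypto.createHash("sha256").update(bytecode).digest();
    return Buffer.concat([Buffer.from(bytecode), digest]);
}

function unseal(sealed) {
    if (sealed.length < DIGEST_BYTES) throw new Error("file too short");
    const body = sealed.subarray(0, sealed.length - DIGEST_BYTES);
    const stored = sealed.subarray(sealed.length - DIGEST_BYTES);
    const actual = nodeCrypto.createHash("sha256").update(body).digest();
    if (!nodeCrypto.timingSafeEqual(stored, actual)) {
        throw new Error("bytecode digest mismatch");
    }
    return body;
}
```

A digest stored next to the data only catches corruption. To catch deliberate tampering, the expected digest would have to come from somewhere the attacker cannot rewrite, or be signed.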

allison++ brought up some good questions about how merging bytecode files would be done. We hadn't really thought about that, so it led to some fruitful conversation about how Parrot Bytecode (PBC) is currently merged, what it does wrong, and how M0 can do it less wronger.

We then talked about what exactly a "Continuation" in M0 means, and tried to clear up some definitions between what is actually meant by Context, State and Continuation.

chromatic++ also mentioned that an optional optimization for the garbage collector (GC) would be for it to create a memory pool solely to store Continuations, since they will be heavily used and many of them will be short-lived and reference each other, so having them in a small confined memory region will reduce cache misses. We are filing this under "good to know and we will do that when we get there."

Next we turned to concurrency, including how we would emulate the various concurrency models of the languages we want to support, such as Python's Global Interpreter Lock (GIL). We decided that M0 will be totally ignorant of concurrency concepts, since it is a "magical" concept that will be implemented at a higher level. We have started to refer to the level above M0 as M1 and everything above M0 as M1+.

allison++ also mentioned that many innovations and optimizations are possible in storing isolated register sets for each Continuation (a.k.a call frame). This area of Parrot may yield some interesting surprises and perhaps some publishable findings.

We all agreed that M0 should be as ignorant about the GC as possible, but the GC will most likely learn about M0 as optimizations are implemented. The pluggability of our GCs was also discussed. allison++ raised the question "Are pluggable GCs easier to maintain/implement if they are only pluggable at compile-time?" Indeed, they probably are, but then we run into the issue that our current "make fulltest" runs our test suite under different GCs, which would require multiple compiles for a single test suite run. chromatic++ suggested that we could instead make GCs pluggable at link-time (which would require a decent amount of reorganization), which would still allow developers to easily test different GCs without recompiling all of Parrot. chromatic++'s estimate is that removing runtime pluggability of GCs would result in an across-the-board speed improvement of 5%.

This conversation then turned toward the fact that M0 bytecode might depend on what GC was used when it was generated, i.e. the same M0 source code run under two different GC's would generate two different bytecode representations. This would happen if the M0 alloc() opcode assumes C calling conventions. This was generally deemed distasteful, so our alloc() opcode will not "bake in C assumptions", which is a good general principle, as well. This will be a fun test to write.

allison++ brought up the fact that we may need a way to tell the GC "this is allocated but uninitialized memory", a.k.a solve the "infant mortality" problem. chromatic++ suggested that we could add some kind of lifespan flag to our alloc opcode (which currently has an arbitrary/unused argument, since all M0 opcodes take 3 arguments for symmetry and performance reasons). This could be as simple as hints that a variable is local or global, or a more detailed delineation using bit flags.

It was also decided that we didn't need an invoke opcode and that invoke properly belongs as a VTABLE method on invokables.

We also talked about the fact that register machines greatly benefit from concentrating VM operations on either the caller or the callee side. I am still looking for more references about this. It seems that the callee side is what we will try for, but I am not quite sure why.

We finally talked about calling conventions and decided that goto_chunk should roughly be equivalent to a jmp (assembly unconditional jump to address) and the invoke VTABLE would set up a return continuation (i.e. make a copy of the program counter), do a goto_chunk, and let the callee handle the rest, such as looking up a return continuation and invoking it.
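
The calling convention described above can be sketched as a toy interpreter in C. This is only an illustration under my own assumptions: the opcode names, the machine struct, and the fixed-size stack of return continuations are all invented here and are not how M0 stores continuations. The point is the division of labor: invoke saves a copy of the program counter and jumps, and the callee is the one that looks the continuation up and resumes it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes for a toy machine. */
enum { OP_INC, OP_INVOKE, OP_RETURN, OP_HALT };

typedef struct { int op; int arg; } instr;

/* A continuation here is just "which chunk, which offset". */
typedef struct { const instr *code; size_t pc; } continuation;

#define MAX_DEPTH 8

typedef struct {
    continuation cur;             /* current program counter          */
    continuation ret[MAX_DEPTH];  /* saved return continuations       */
    int depth;
    int acc;                      /* single accumulator for the demo  */
} machine;

/* goto_chunk: an unconditional jump to the start of another chunk. */
void goto_chunk(machine *m, const instr *target) {
    m->cur.code = target;
    m->cur.pc   = 0;
}

/* invoke: copy the program counter, then goto_chunk. Nothing else. */
int invoke(machine *m, const instr *target) {
    if (m->depth == MAX_DEPTH)
        return -1;
    m->ret[m->depth++] = m->cur;  /* pc already points past the invoke */
    goto_chunk(m, target);
    return 0;
}

/* Callee side: look up the return continuation and invoke it. */
int return_to_caller(machine *m) {
    if (m->depth == 0)
        return -1;
    m->cur = m->ret[--m->depth];
    return 0;
}

/* Run from chunks[entry] until OP_HALT; returns the accumulator, or -1 on error. */
int run(machine *m, const instr *const *chunks, int entry) {
    assert(m != NULL && chunks != NULL);
    m->depth = 0;
    m->acc   = 0;
    goto_chunk(m, chunks[entry]);
    for (;;) {
        instr in = m->cur.code[m->cur.pc++];
        switch (in.op) {
        case OP_INC:    m->acc += in.arg; break;
        case OP_INVOKE: if (invoke(m, chunks[in.arg]) != 0) return -1; break;
        case OP_RETURN: if (return_to_caller(m) != 0) return -1; break;
        case OP_HALT:   return m->acc;
        default:        return -1;
        }
    }
}
```

A two-chunk program where the first chunk increments, invokes the second, and increments again after it returns exercises both sides of the convention.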

After the main M0 meeting, cotto++, allison++ and I sat down at a coffee shop and came up with a list of next actions for M0:

  • Write a recursive version of 'calculate the n-th Fibonacci number' in M0
  • Write a simple checksum algorithm in M0 (suggestions?)
  • Create a working PMC in M0
  • M0 disassembler
  • Create a "glossy brochure for GitHub cruisers"
  • Implement function calls and returns
  • Make sure each M0 opcode is tested via Devel::Cover
  • Convert the M0 assembler to C
  • Convert the M0 interpreter to C
  • Link M0 into libparrot (no-op integration)

I have been talking to cotto++ on IRC while typing up these notes and we have come to the conclusion that a "bytecode verifier" should also be put on that list. A verifier is a utility that detects invalid bytecode and prevents attacks via malicious bytecode. This is something that happens at runtime, whereas a bytecode checksum happens before runtime, or at the end of compile time. They provide different kinds of insurance. The bytecode checksum will be an intrinsic feature that is not optional, since it prevents Parrot from running known-bad bytecode. But a bytecode verifier adds a significant amount of overhead. This overhead is reasonable if you are running untrusted code, but it is unreasonable when you are running trusted bytecode (i.e. bytecode that you created), so the verifier will have an option to be turned off.
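
Here is a rough sketch in C of how the two kinds of insurance differ. Everything in it is an assumption made for illustration: the 4-byte instruction layout (one opcode plus three operands), the opcode numbers, the use of Fletcher-16 as the checksum, and the function names. For brevity both checks run in one loader function, whereas the real verifier would do its work at runtime.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcode numbers; OP_COUNT marks the first invalid value. */
enum { OP_NOOP = 0, OP_ADD = 1, OP_GOTO = 2, OP_COUNT = 3 };

/* Checksum (Fletcher-16 here): catches corrupted bytecode,
 * but says nothing about whether the code is well formed. */
uint16_t bytecode_checksum(const uint8_t *code, size_t len) {
    uint16_t a = 0, b = 0;
    for (size_t i = 0; i < len; i++) {
        a = (uint16_t)((a + code[i]) % 255);
        b = (uint16_t)((b + a) % 255);
    }
    return (uint16_t)((b << 8) | a);
}

/* Verifier: walks every instruction and rejects unknown opcodes
 * and jumps that leave the code. This is the expensive part. */
int bytecode_verify(const uint8_t *code, size_t len) {
    if (len % 4 != 0)
        return 0;
    size_t count = len / 4;
    for (size_t i = 0; i < count; i++) {
        const uint8_t *ins = code + 4 * i;
        if (ins[0] >= OP_COUNT)
            return 0;
        if (ins[0] == OP_GOTO && ins[1] >= count)
            return 0;
    }
    return 1;
}

/* The checksum is mandatory; the verifier can be switched off for trusted code.
 * Returns 0 on success, -1 on checksum mismatch, -2 on verifier rejection. */
int load_bytecode(const uint8_t *code, size_t len,
                  uint16_t expected, int run_verifier) {
    assert(code != NULL);
    if (bytecode_checksum(code, len) != expected)
        return -1;
    if (run_verifier && !bytecode_verify(code, len))
        return -2;
    return 0;
}
```

Note that bytecode with a valid checksum but an out-of-range jump passes when the verifier is off and fails when it is on, which is exactly the trade-off between trusted and untrusted code.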

We obviously have a lot of fun stuff to work on, so if any of it sounds fun, come ask cotto++ or me (dukeleto) on #parrot on irc://irc.parrot.org for some M0 stuff to do. We especially need help with writing tests and documentation.

There will be a Parrot hackathon at YAPC::NA this year, where I am sure some M0-related hacking will be happening. If you have never been to a hackathon before, I highly recommend them as a way to join a project and/or community. Meatspace is still the best medium for some things :)

(UPDATE: Some factual errors about our old JIT were pointed out by rafl++ and corrected)

Jonathan Leto | Jonathan Leto | 2011-06-02 17:10:54

I am excited to announce that I have completed my next grant milestone! I recently increased test coverage of extend_vtable.c to over 95% (95.5% to be exact), achieving the milestone with a half percent buffer. It definitely wasn't easy, but I changed the way I was approaching writing tests and it resulted in a huge burst of productivity.

I went through a test coverage report and wrote down, on an actual piece of paper, every function that had no test coverage. This allowed me to circle the functions that I thought would be easiest to write tests for and quickly get those out of the way. I then went for uncovered functions that were similar to already covered functions, and then finally I got to the hard functions.

This was a fruitful exercise, because it was decided by Parrot developers that some VTABLE functions escaped accidentally and that they should be removed from the public API. Whiteknight++ removed Parrot_PMC_destroy (extra points for humor), which I was using incorrectly in the extend_vtable tests and which was actually coredumping Parrot, but only on certain platforms. I then removed Parrot_PMC_mark and Parrot_PMC_invoke, the first being an implementation detail of the garbage collector, and Parrot_PMC_invoke because it was the only function that returned a Parrot_Opcode_t* and was basically not fit for public consumption.

I also created a ticket (TT#2126) for a bug in the Parrot_PMC_morph function, which has some possibly buggy but definitely unspecified behavior.

The remaining, untested functions in extend_vtable are clone_pmc, cmp_pmc, get_pointer_keyed_int, get_pointer_keyed_str, remove_vtable_override, set_pointer_keyed and set_pointer_keyed_str. I leave the testing of these functions as an exercise to the interested reader :)

Grant Refactoring

This reminds me of a saying, I can't remember it exactly, but it is something about the best laid plans of camels and butterflies often taste like onions. Anyway, since I wrote my grant, the Parrot Embed API was deprecated and replaced with a shinier and better documented system. After talking with cotto++ and whiteknight++ on IRC, it was decided that working on test coverage for the new embed API was a better use of resources than writing tests for the old embed API that my original grant referred to, which will most likely be removed from Parrot soon.

The new embed API lives in src/embed/api.c and the plan is to replace my grant milestone of 95% coverage of embed.c with 95% coverage of embed/api.c, which is currently at 72% coverage.

To summarize, I have two grant milestones left: increasing extend.c (currently at 61%) and embed/api.c to 95% coverage.

Given the lessons learned from testing extend_vtable and based on the fact that I have already made some headway, my new estimate for these milestones is three weeks each. To make this more definite, I plan to be done with this grant work by July 15th.

This is the home stretch! I can feel it in my bones.

Jonathan Leto | Jonathan Leto | 2011-06-01 06:58:49

School's out for Summer!

I'm not just talking about Alice Cooper, it finally is!

At long last, I have finished the last of my schoolwork. Couchpotato kingdom, here I come! Oh wait...I have that totally kickass debugger to write.

This means that I can finally focus on my GSoC work. Things are going to start picking up around here and you're going to see a lot more activity.

One of the first things that I will be considering is breakpoints. Tomorrow, this is what I will be focusing on and will post about what I plan on doing and what I still need to figure out.

Kevin Polulak | soh-cah-toa | Google Summer of Code 2011 | 2011-05-16 00:00:00

A number of useful conclusions and targets came from the Q2 2011 Parrot Developers Summit that happened yesterday.  This post will contain a summary of the event and my take on what we'll be doing as a result.  Props go out to kid51 for organizing an agenda for the meeting and keeping us more-or-less in line.  Strict organization isn't vital for an IRC meeting, but he did a good job of making sure that our limited time was used effectively.

We started out reviewing the state of our previous roadmap goals.

The Deprecations-as-Data goal was substantially met.  I love this goal because it has potential to make life easier for our users (especially Rakudo) by expressly delineating what features are going to need upgrading.  A recent issue with nci and the 't' type demonstrates that we still have room for improvement.  (pmichaud and whiteknight discussed a proposed solution after the meeting, but it needs a little experimentation first.)  My hope for data-based deprecations is that we end up with a better early warning system that alerts Parrot's users and gets discussions started before things break horribly.  pmichaud's concern was that the web tends toward passivity and that what's needed is active notification of pending and actual removals.  I think this will be a boon.

whiteknight's IMCC Isolation goal is making excellent progress.  pmichaud commented that it's had no negative impact on Rakudo's development, which is impressive given its scope and invasiveness.  IMCC isn't yet an optional component, but it's quite possible to run libparrot without initializing IMCC at all.   Excising it completely is quickly becoming a possibility.  whiteknight has been doing a bang-up job and isn't showing any signs of slowing down.

The third goal is one that dukeleto and I have been working on, of getting M0 prototyped.  dukeleto's working on the assembler and I've got the interpreter, both being written in Perl 5 with the binary M0 format (".m0b") being the only interaction between them.  The punchline is that the interpreter is fully-implemented with stubs for all ops and the assembler is a couple weeks from being usable, depending on duke's tuits.  On the one hand I'm a little disappointed that we don't have a fully usable prototype, but it is what it is.  Even once both prototypes are "complete", there are several questions we need to get together with allison and/or chromatic to answer.  Our M0 plan is to get the prototypes as complete as we know how and to have another meeting where we get all our questions answered, possibly even hacking the last few needed bits into the prototypes as we meet.

Once we moved away from the retrospective, pmichaud quickly asked what Parrot's plans were concerning Rakudo.  He specifically asked if Rakudo should consider itself officially blessed in developing against master rather than a release (we said "yes"), and if we planned to use Rakudo for regular benchmarking.  This second concern is especially important because Rakudo has seen some significant performance regressions in the last couple months, in spite of the introduction of the new generational mark & sweep GC.  The expectation is that regular performance testing would have brought this to light sooner and that once it's in place, we'll be more conscious of how our changes affect Rakudo's performance.  We've had a distinct lack of benchmarking in the last few months.  I hope this is the first of many attempts to revitalize our efforts to improve performance.

On the same note, Codespeed (which runs speed.pypy.org) was mentioned as a possibility.  I remember mentioning this in the past without effect, but hopefully the time was right at PDS.  We didn't formally ask for someone to investigate it though.  I hope it doesn't get dropped on the floor again.

The next PDS was scheduled for July 30th or 31st, which seems comfortably far away from any known conferences.  whiteknight volunteered to set up a Doodle, which is proving to be a very handy tool for scheduling these things.

The next topic to come up was profiling.  While working on Rakudo, pmichaud hacked out a very quick-and-dirty sub-level profiler that immediately pointed out an important hotspot.  This indicated to me that we need to up the game of the profiling tools that we provide as part of Parrot.  whiteknight and I were on the same page, so one of our new roadmap goals is to dig into the current profiling runcore, find out what's keeping it from being useful and fix it.  It currently depends on IMCC to get its information about the currently running code, so there's potential for much yak-shaving.  On paper the goal is only to investigate.  I hope we can get much more done.  I love providing useful tools to people, so I'm glad to have a chance to redeem the profiling runcore.  Unfortunately having whiteknight work on profiling will mean that he won't be spending as much time figuring out how to apply 6model to Parrot, but that's what it means to have priorities.

A third concern was raised by pmichaud, who said that it's difficult to gauge what Parrot's leadership thinks about certain issues.  One of the triggers in this case was my rather foolish removal of the initialization of Parrot's PRNG (pseudo-random number generator) using the system clock.  At the time Peter Lobsinger made the reasonable-sounding argument that there's no single way to correctly do PRNG that will satisfy the needs of every possible use case.  After too little thought, I decided to interpret that as meaning that it didn't matter that I'd changed Parrot's PRNG behavior because Rakudo should be doing what makes sense for them.  This ended up being a bad idea that caused some pain for Rakudo, and while I eventually reinstated PRNG initialization from the system clock and later from the system entropy pool, it showed the need for a better-delineated interface to gather opinions from Parrot's developers as a whole.  To this end, whiteknight and I will serve as a sort of ombudsmen for when technical decisions end up harming users and need to be appealed.  I don't think we'll need to put on our ombudsmen hats often, but we'll be glad to have them when we do.

Breaks in compatibility are inevitable, but what whiteknight and I hope to achieve as ombudsmen is to make sure that users have a respectful ear and will get fair consideration for their problems.  A disconnect between the needs of our users and our goals is very unhealthy and can only harm both parties.

Overall, it felt like a very productive and well-organized discussion.  pmichaud did a great job of representing Rakudo's concerns and I think that the coming months will see several improvements in Parrot's process and tools to make it a better platform for Rakudo to build on.

cotto | reparrot | 2011-05-15 19:29:00

Which Will It Be? Winxed or NQP?

As most of you know, my latest dilemma has been deciding on which language to use for the debugger: nqp-rx or Winxed. I'm sure most of you are thinking, "Really? We're still on this?" Sadly, yes. The end of the semester is this Friday which has been making it quite hard to dedicate time to this. I spent this past weekend setting up this blog so that I could still stay connected with my mentor and the rest of the community even on those days where the homework just never seems to end. I can't believe that I was actually considering taking a summer class during GSoC. Clearly, that would be a bad, bad idea.

Anyway, I digress.

I'll try to keep things brief here. You can read my full message to parrot-dev here.

Winxed has the advantage of being clean, fast, and stable. However, it is not included with Parrot by default and I want to keep dependencies to a minimum. Dukeleto made the suggestion that I include the PIR-generated code with Parrot, thus eliminating any dependencies. This is something to consider. I'm also not quite sure how I would be able to integrate parrot-instrument if I use Winxed.

Conversely, nqp-rx has the advantage of regular expressions, inline PIR, and having a Perl-like syntax. However, it's quite slow, still a little quirky, and poorly documented. Coming from a Perl background, the Perl-like syntax is definitely a plus. nqp-rx was designed for Rakudo Perl 6 development and as such, one of its greatest features is grammars. However, I'm building a debugger, not a language parser. So after taking away one of its most powerful features, I must consider, "is there anything left worthy of excitement?"

This has taken much longer than I would have liked. As such, I'd like to reach a decision within the next few days. Definitely before the Parrot Developer Summit this weekend.

In the next few days, I plan on doing some nqp-rx and Winxed hacking to get a feel for each language. I'm going to be manipulating bytecode quite a bit with this project so that's going to be one of the things I'd like to do with both languages. You can be sure to hear about my findings throughout this week.

Kevin Polulak | soh-cah-toa | Google Summer of Code 2011 | 2011-05-09 00:00:00


A couple of weeks ago I put Parrot's JIT prototype on hold. One of the major issues was the C macro preprocessor. Now it's time to take it off hold.
Read more »

bacek | Bacek's blog | 2011-05-01 20:36:37

dukeleto and I shared a hotel room at LinuxFestNorthwest and had a great opportunity to talk about M0 after our respective talks.  We went over the state of the spec and what the best way forward might be.  We also tried to look at what the future M0-based Parrot workflow will look like and how we can get there, though we got distracted before the crystal ball was delivered.

First, dukeleto mentioned that M0 is less discoverable than it needs to be, especially for a project that we expect to become Parrot's new foundation.  He suggested that we write a document that someone can read to get a clear 10,000 foot view of M0 and how its pieces fit together, a glossy brochure of sorts.  This could be either an introductory section in the M0 spec or a separate document.  The important thing is to have something we can point people at so that dukeleto and I aren't the only ones who can readily articulate what M0 is and where M0 is headed.

We also made some updates to the spec to make getting values from the variables table less confusing.  This is fairly minor in the scheme of things, but so is Perl's "say".

Last of all, we hammered out a plan for how to get a working M0 prototype assembler and interpreter.

atrodo has been very valuable in providing his prototype Lorito implementation, both in his documentation and in the way he's had to bring assumptions to the surface to get a runnable interpreter.  His implementation differs from the spec in a number of ways (many of which are because it predates the spec), but it's been helpful in those places because it shows us what we want by counterexample.  The next (brief) stage was a set of prototype PIR dynops of M0 I hacked together.  This was great to get some runnable code that was close to the spec, but it very quickly ran into the impedance mismatch between the high level of PIR and the low level of M0.  The effort on the m0 prototype dynops wasn't wasted, but they've reached the limit of their usefulness.

The next step we've decided to take is to implement a separate prototype M0 assembler and interpreter.  dukeleto will be working on the assembler and I'll do the interpreter, both based on the M0 spec in the m0-spec branch on GitHub.  The only interface between the two will be M0's binary representation, so we can easily change one without needing to modify the other.  We're trying to converge on the structure of both the interpreter and assembler, but we expect this to be the last prototype rather than a final implementation.   We'll also be writing tests against both the interpreter and assembler which we can later use against any future implementations.

dukeleto has started hacking in the m0-prototype branch in src/m0 and managed to get some very basic tests passing before he went to sleep.  We'll both be using Perl 5.10 as an expedient, since we don't expect these projects to serve as more than prototypes.  As a temporary measure one of us will need to hand-generate a couple simple bytecode files to verify that the assembler is working correctly.  These files will live in t/m0 in the branch.  The test code will be a minimal hello world program and a slightly more complex multi-chunk M0 program to help iron out inter-chunk interaction.  We haven't decided on what the complex example will be yet.  This is a part of the spec we'll need to work on as we come to understand what implementation makes the most sense.

Overall, rooming together at LinuxFestNorthwest has been very helpful in moving M0 forward.  Both of us have used the opportunity to bounce ideas off each other and to get the M0 train out of the station.  We're still a couple stages (and probably one more face-to-face meeting with allison and/or chromatic) away from a final implementation, but we can see the light at the end of the igloo, and it's looking pretty good.

There are a couple things that still need to get done.  In the interest of trying to keep them from getting dropped on the floor, they are:

  • Map out what a future M0 workflow will look like and what we need to do now to make it possible.
  • Make M0's roadmap and status more discoverable by making a glossy brochure that will communicate the idea effectively to someone who hasn't heard of M0 before.

cotto | reparrot | 2011-05-01 10:44:08

I am still on the path of increasing test coverage in src/extend_vtable.c. It is much like a zen study, where you methodically concentrate on many tiny little pebbles, one at a time, moving them in the sand, to just the right place. According to the latest code coverage statistics, we are now at 72% code coverage, which is an increase of about 8% since my last report.

Many, many more tests involving Key PMCs were added. For an intro to what they are, take a look at my previous grant update. Many of the tests are clusters of related tests, because most VTABLEs have many similar forms which take integer, string or PMC-flavored keys. I ran into some platform-specific bugs which only manifest on Darwin machines, which were reported by Jim Keenan in TT #2098 and which I then fixed by querying with a non-empty Key, which is more prudent.

I also ran into some actual bugs which I reported as Trac Tickets. First is that the cmp_pmc VTABLE does not seem to be working correctly from extend_vtable, which was reported in TT #2103. Then I fell into a "hole" in the VTABLE API, where ResizablePMCArray does not respond to defined_keyed(), which it should. This is described in TT #2094.

In retrospect, this was one of the most productive periods of my grant work. I estimate that I will be very close to the 95% milestone by my next grant update at this pace, which is very exciting.

Jonathan Leto | Jonathan Leto | 2011-04-28 05:51:04


“Crazy JIT prototype” is on hold. I found 2 big problems with the current approach, which will require quite a big effort from me (or anyone else) to “fix”.

  • Parsing of a larger subset of C, mostly struct definitions and the preprocessor.
  • Type analysis.
Read more »

bacek | Bacek's blog | 2011-04-18 22:15:15

This wayward son is still on his treacherous journey to increase test coverage in src/extend_vtable.c. When we last left off our traveler, he explained what the mythical VTABLE beast looked like, and we shall continue with the study of this chimerical fauna.

According to the latest code coverage statistics, we are now at 64% code coverage, which is an increase of about 10% since my last report. Most of this grant work concentrated on vtables that required Key PMCs. A Key PMC is an object that can be used to look something up in a Hash PMC or other aggregate object that supports "keyed access". It is very similar to a "hash key" that can be used to look up the appropriate value.

One of the lessons that I have learned in working on these tests is that it is very easy to write tests that pass on gcc, but which absolutely explode with g++. This has to do with gcc not being as strict when some questionable type casting is done. I have learned my lesson and I promise not to break the test suite anymore. I will use g++ in my testing from now on, promise!
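
A common instance of this gcc-versus-g++ difference (not necessarily the exact cast that bit me in these tests) is the implicit conversion from void*. C allows it silently, while C++ demands an explicit cast. The helper below, dup_string, is a made-up example and is not part of the Parrot test suite:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns a heap-allocated copy of src, or NULL if allocation fails.
 * The caller is responsible for calling free() on the result. */
char *dup_string(const char *src) {
    assert(src != NULL);
    size_t n = strlen(src) + 1;

    /* char *copy = malloc(n);      accepted by gcc, rejected by g++ */
    char *copy = (char *)malloc(n);  /* accepted by both compilers    */

    if (copy != NULL)
        memcpy(copy, src, n);
    return copy;
}
```

Writing the cast explicitly costs nothing in C and keeps the same source compiling under a C++ compiler.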

My productivity was definitely hampered by moving to a new house and having a two week business trip in the last month, but my new home office is finally set up, so I expect productivity to approach previous levels of adding a few dozen tests per week.

Jonathan Leto | Jonathan Leto | 2011-04-06 05:49:34


I'm a little bit tired and don't have the energy to write a lengthy post about “Crazy JIT prototype”, and I haven't made a lot of progress since the last post. Still, there is some good news and some roadblocks.

Read more »

bacek | Bacek's blog | 2011-04-05 22:53:08

Hi there

Since the last post about “Crazy JIT Prototype” and the progress of the opsc_llvm branch, I have moved a little bit further. Major achievements:
  • Skeleton for generating JITted Subs is done.
  • JITting of simple ops with function calls and constants.
  • Emulator of Parrot's runcore from within Parrot itself.
Read more »

bacek | Bacek's blog | 2011-03-28 21:24:32

Current results of a few days of reading LLVM docs/tutorials and one day of hacking.

Basically we can create something like:

#include <stdio.h>

int foo() {
    printf("Hello World\n");
    return 42;
}

At run time. With LLVM. JIT in Parrot is getting closer :)

Read more »

bacek | Bacek's blog | 2011-03-22 23:50:43

Currently the Parrot VM consists of about 1000 ops. Each op is the smallest operation available, for example add $I0, $I1, $I2 or goto label. Ops are implemented in a kind of C with quite a few macro substitutions for accessing registers and flow control. Let's take the simple add op.

Read more »

bacek | Bacek's blog | 2011-03-22 23:50:27

Hi there.

In the previous post I briefly described Parrot's ops. For the past 10 years we have "parsed" ops as just chunks of text with almost no semantics behind them. This approach worked for those 10 years, but life has changed and now we need more than this.

Read more »

bacek | Bacek's blog | 2011-03-22 23:50:15