I first learned to write Perl in something like 1997, and in 2001 or so became a CPAN author with my first release, Class::Tangram. Much of my programming career has been with Perl, though I also end up in C a lot of the time, and of course SQL. This section of my blog will contain only posts of interest to Perl programmers - not my general rants, git-related posts, etc.


SamV back-catalog now available

Just realised that a bunch of my projects are not online any more. Well, now they are back up via git's dumb HTTP protocol at http://git.utsl.gen.nz/ - gitweb to follow.

Posted Wednesday evening, June 15th, 2011

making Perl command-line scripts faster with pperl

So, you have a script which is slow, perhaps because you are using a whole collection of modern Perl features which aren't necessarily terribly fast yet. You can't wait for the runtime to implement those features natively and hence run quickly, but there is another solution.

For instance, the XML::SRS distribution on CPAN makes use of some fairly advanced features of Moose, such as meta-attribute meta-roles. These are a win from a coding and maintenance point of view, as they allow a single attribute declaration to give you a Perl class which has XML marshalling as well as type constraints. However, they do come with a high startup penalty.
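To give a rough idea of why a single has declaration can buy you so much, here's a loose sketch of the attribute-trait mechanism involved; the trait and class names below are invented for illustration - the real XML::SRS machinery is considerably more involved.

package MyApp::Trait::XMLAttr;
use Moose::Role;
# the trait is a role applied to the attribute's *meta* object, so it can
# carry extra metadata alongside the usual Moose options
has xml_name => ( is => 'ro', isa => 'Str' );

package MyApp::Domain;
use Moose;
has name => (
    is       => 'ro',
    isa      => 'Str',                      # ordinary type constraint
    traits   => ['MyApp::Trait::XMLAttr'],
    xml_name => 'DomainName',               # consumed by the trait
);
# a generic to_xml() can then walk $self->meta->get_all_attributes and ask
# each attribute that does the trait for its xml_name - one declaration,
# both validation and marshalling
1;

XML::SRS drives its serialisation with this sort of attribute metadata (plus a good deal more), and building all of that metaclass structure at load time is much of where the startup penalty comes from.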

How high? Let's find out by taking a script which reads a JSON document on input, passes it to a Moose constructor, and outputs the XML.

#!/usr/bin/perl
# slurp a JSON document from stdin (or a file named on the command line),
# hand it to the Moose constructor, and print the resulting XML
use XML::SRS;
use JSON::XS;
my $json = join "", <>;
print XML::SRS::Domain::Create->new(
    %{decode_json($json)}
)->to_xml(1);

Fairly simple, right? Now, let's convert the data structure from the SYNOPSIS on the man page to JSON, pass it in, and see how quickly it runs:

$ json='{"domain_name":"kaihoro.co.nz","contact_registrant":{"email":
"kaihoro.takeaways@gmail.com","name":"Lord Crumb","address":{
"city":"Kaihoro","cc":"NZ","region":"Nelson","address1":
"57 Mount Pleasant St","address2":"Burbia"},"phone":{"subscriber":
"499 2267","ndc":"4","cc":"64"}},"delegate":1,"nameservers":[
"ns1.registrar.net.nz","ns2.registrar.net.nz"],"action_id":
"kaihoro.co.nz-create-1298944261","term":12}'
$ echo $json | time ./test.pl
<?xml version="1.0" encoding="ISO-8859-1"?>
<DomainCreate Delegate="1" ActionId="kaihoro.co.nz-create-1298944261" DomainName="kaihoro.co.nz" Term="12">
  <RegistrantContact Name="Lord Crumb" Email="kaihoro.takeaways@gmail.com">
    <PostalAddress Address2="Burbia" Address1="57 Mount Pleasant St" Province="Nelson" City="Kaihoro" CountryCode="NZ"/>
    <Phone LocalNumber="499 2267" AreaCode="4" CountryCode="64"/>
  </RegistrantContact>
  <NameServers>
    <Server FQDN="ns1.registrar.net.nz"/>
    <Server FQDN="ns2.registrar.net.nz"/>
  </NameServers>
</DomainCreate>
1.14user 0.03system 0:01.19elapsed 98%CPU (0avgtext+0avgdata 113152maxresident)k
0inputs+0outputs (0major+7174minor)pagefaults 0swaps

Ok, so the script ran in 1.14s that time. Not exactly a speed demon!

But if we change one line in the script:

#!/usr/bin/perl

to:

#!/usr/bin/pperl

Then we get a very different time the second time the script is run (the first run still pays the full startup cost, since that is when the persistent interpreter gets loaded):

$ echo $json | time ./test.pl
<?xml version="1.0" encoding="ISO-8859-1"?>
<DomainCreate Delegate="1" ActionId="kaihoro.co.nz-create-1298944261" DomainName="kaihoro.co.nz" Term="12">
  <RegistrantContact Name="Lord Crumb" Email="kaihoro.takeaways@gmail.com">
    <PostalAddress Address2="Burbia" Address1="57 Mount Pleasant St" Province="Nelson" City="Kaihoro" CountryCode="NZ"/>
    <Phone LocalNumber="499 2267" AreaCode="4" CountryCode="64"/>
  </RegistrantContact>
  <NameServers>
    <Server FQDN="ns1.registrar.net.nz"/>
    <Server FQDN="ns2.registrar.net.nz"/>
  </NameServers>
</DomainCreate>
0.00user 0.00system 0:00.17elapsed 2%CPU (0avgtext+0avgdata 4704maxresident)k
0inputs+0outputs (0major+376minor)pagefaults 0swaps
$

Great! Down to 170ms! That's much more like an acceptable start-up time :-). Knowing the code base, I happen to know that a lot of lazy evaluation is responsible for much of that remaining 170ms, so this could probably be improved upon further. But a better than 80% improvement in total run time is a pretty big win for adding a single character to the script.

That's one of the reasons I like Perl. It might suck for a number of reasons, but hey, most languages suck for one reason or another, and at least with Perl there's already a bunch of solutions available, either on CPAN or (in this case) packaged for Debian/Ubuntu.

Posted in the wee hours of Sunday night, March 1st, 2010

I have a dream! It's a dream about an editor...

To me, the perfect editor:

  1. would run under Parrot.
  2. would be a very simple reimplementation of emacs, in PIR - supporting: files, buffers, windows, frames, major and minor modes, and keyboard <-> function mapping
  3. would be extensible in any parrot-supported language
  4. would support syntax highlighting, by attaching highlighting hints to a TGE grammar, effectively allowing you to write a parsing grammar at the same time as a highlighting mode
  5. would have a keymapping that is identical to vi. Of course, being extensible, it is likely that people could contribute emacs-like keybindings (sick people - even I don't use those in emacs)

(in response to The Quest for the Perfect Editor)

Posted mid-morning Sunday, September 14th, 2008

My new Perforce Importer is Unstoppable!!

(with apologies to mnftiu)

Ok, so I've been working on this program to convert the Perl history from Perforce, with a view to tacking it on top of the previous conversion I worked on.

gitk looking at some of the converted Perforce history

Today, I had my first successful run that includes converting the integration data. It's a huge milestone for this project - which I initially agreed with Nicholas Clark to undertake way back in August 2006, over a pint on the banks of the river Thames, on my way back from my Catalyst-sponsored visit to Birmingham for YAPC::Europe. I certainly didn't think I'd still be working on it a good 16 months later.

This latest conversion is based on reverse engineering Perforce's back-end. In addition to its binary data files, Perforce keeps a nice, simple journal file, periodically checkpoints all the data in the database, and uses RCS files for holding the actual file data. This makes importing a "simple" matter of loading its checkpoint/journal files into a database and writing a few queries. Getting all the file images out was easy enough - I used the git fast-import interface, and within a few days of working on it came up with a version that would export all of the files for the 30k-odd revisions in the Perl repository (over 225k distinct images in about 20k RCS files) in about 5 minutes on my laptop.
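For anyone who hasn't met it, git fast-import just reads a plain text stream of blobs and commits on standard input. A toy emitter looks something like the sketch below; the file contents, author and path are made up here - the real importer digs them out of the loaded Perforce tables and the RCS files.

#!/usr/bin/perl
# pipe the output of this into `git fast-import` inside a fresh repository
use strict;
use warnings;

sub data_block {
    my ($payload) = @_;
    return "data " . length($payload) . "\n" . $payload . "\n";
}

# first, the file image as a blob, with a mark we can refer to later
print "blob\n", "mark :1\n",
    data_block(qq{print "hello from change 1\\n";\n});

# then a commit which attaches that blob to a path on a branch
print "commit refs/heads/master\n", "mark :2\n",
    "committer A U Thor <author\@example.com> 1197932400 +0000\n",
    data_block("Initial import of hello.pl\n"),
    "M 100644 :1 hello.pl\n", "\n";

Piping that into git fast-import inside a freshly created repository gives a one-commit history containing hello.pl; the real conversion does essentially this for every file revision and change, linking commits together with from and merge lines, which is how the 30k-odd changes go through in about 5 minutes.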

Perforce has supported branching and merging for over a decade

Perforce has a lot of model similarities with Subversion - principally, that it tracks branches in the same namespace used by the filesystem of your project (the "branching is just copying" snake-oil), so finding the list of branches is not a simple operation; in fact, it may require complicated history analysis. Another similarity is that they both store file history, rather than, say, tree history like git or a teetering pile of patches like darcs. I'd call Perforce's engineering quality a lot higher than Subversion's, though - the design is quite elegant, it just lacks ... well, the underlying simplicity of git. To compare, git has 5 important object types, if you include references, with very few fields. Perforce has 38 tables, and though only a handful of those are really required (my script loads 7 of them), the overall complexity of the schema is far higher.

As a result, managing branched development has been an art known well only to an elite minority of users - a minority which, I think, does include Perforce users. It's quite obvious looking through this history that the integration facilities have been there since the repository was imported, and that the pumpkings knew how to use them.

The challenge

The principal task at hand was to take the literally hundreds of integration records held in Perforce and try to convert them into git's native format, without throwing away any information that might be important.

In many instances, the conversion is quite clean

The logic was this: if you see a change in the history, such as "integrate changes from mainline", and there is an integration record for every file on the source branch that has changed since you last merged from it, then you mark that commit as a merge. If this single piece of information agrees with hundreds of per-file records, then you have just simplified the repository without losing any information. I thought I had this early on with this monster query, but I ran into several problems - the question "which files are outstanding for merging?" is quite difficult to answer. The first answer I came up with, to just do an index diff between the two trees, only worked for the first integration. I needed to write a merge base function capable of inspecting the history graph; it comes up with a list of files changed since the merge base, and uses that to decide which integration records should be present.
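The merge base function is doing essentially what git's own merge-base computation does, only over the importer's in-memory graph. A simplified sketch of the idea (ignoring the awkward multiple-merge-base cases) might look like this; the $parents structure is hypothetical, standing in for the history loaded from the Perforce tables.

# $parents maps a commit ID to an array ref of its parent commit IDs
sub merge_base {
    my ($parents, $a, $b) = @_;
    # collect every ancestor of $a, including $a itself
    my %seen_a;
    my @queue = ($a);
    while (@queue) {
        my $c = shift @queue;
        next if $seen_a{$c}++;
        push @queue, @{ $parents->{$c} || [] };
    }
    # walk back from $b breadth-first; the first commit which is also an
    # ancestor of $a is taken as the merge base
    my %seen_b;
    @queue = ($b);
    while (@queue) {
        my $c = shift @queue;
        next if $seen_b{$c}++;
        return $c if $seen_a{$c};
        push @queue, @{ $parents->{$c} || [] };
    }
    return undef;    # disjoint histories
}

The list of files changed since that commit is then just a diff between the merge base's tree and the tip of the branch being merged from, and it is that list which gets checked off against the integration records.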

The good news is that it appeared to work very well - a huge portion of the integration information in the history corresponds to complete cross-merges between branches.

The impedance mismatch

One of the biggest concerns is the so-called "impedance mismatch": the presence of important information that simply cannot be represented by git's model.

Integration FROM an unstable branch

In particular, one thing that you can't do with just tree tracking is detect cherry picking where the cherry-picked changes were modified along the way. If the cherry picking happened without changes, this is easily detectable without any metadata - but if the change was modified, you either need some kind of fuzzy logic, or you need to record out-of-band what happened.
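Git's own tool for spotting the unchanged case is the patch ID - a fingerprint of a commit's diff which survives cherry-picking so long as the change itself wasn't modified (it is what git cherry is built on). A rough sketch, assuming hypothetical mainline and topic branches in a finished conversion:

#!/usr/bin/perl
use strict;
use warnings;

# map each non-merge commit in a rev-list range to the patch ID of its diff
sub patch_ids {
    my ($range) = @_;
    my %id_of;    # commit sha => patch id
    for my $commit (split /\n/, qx(git rev-list --no-merges $range)) {
        my $line = qx(git show $commit | git patch-id);
        my ($id, $sha) = split ' ', $line;
        $id_of{$sha} = $id if $id;
    }
    return %id_of;
}

my %topic    = patch_ids('mainline..topic');
my %mainline = reverse patch_ids('topic..mainline');   # patch id => sha
# a patch ID present on both sides marks a change cherry-picked verbatim
for my $sha (sort keys %topic) {
    print "$sha was taken onto mainline unchanged\n"
        if $mainline{ $topic{$sha} };
}

In practice a matching patch ID can only ever confirm the unchanged case; the modified cherry-picks are exactly where Perforce's out-of-band records earn their keep.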

Git and Perforce couldn't be more chalk and cheese about this. Perl's development model is one that a colleague dubbed the ghetto merge model, where you have some kind of "hood" where the "flyist features go to battle it out". The features left standing are subsequently moved into trailer parks (aka integration branches), where they can make a new life for themselves and prove their stability.

Perforce seemed to be happy with this merge. But why did the integration records not mention some mergeable changes?

Git does support that model - for instance, Junio Hamano's "proposed updates" branch of git could be considered the "unstable" branch, and changes can even be completely dropped from it. There is a branch called "next" for features which are considered for the next minor release, and from which changes are not removed. There is a "maint" branch onto which bugfixes for existing versions go, and so on. The key difference here is that changes go in earliest-first, rather than newest-first, with variations between revisions of them showing up as the stable branches are re-merged into the newer branches. Git's development model has already proven itself to stimulate innovation and experimentation, as well as letting the core team produce a very stable product, though of course it is no magic bullet to make it all things to all people.

I call this "Cherry merging" - cherry picking almost every change in series

So, in the face of all of these things, I simply made the program display as much information as it could figure out at each point and ask the presumably infinitely more enlightened user for advice; as I got bored of answering the questions, I built a bit of fuzzy logic into it to make decisions based on the rules of thumb I'd figured out.

Remaining tasks

I'm very grateful to now have direct rsync access to the raw Perforce repository, courtesy of Robrt. With a little refactoring, this "raw" Perforce importer could be generally useful to other people stuck with Perforce who might find themselves completely unable to find a suitable replacement product.

There are always remaining tasks when it comes to Software Archaeology - more information that can be put in, and so on. For instance, I have not even looked in this conversion at doing the p5p archive scanning that I did for Chip's pumpking series, which allowed me to represent the difference between a patch as submitted and a patch as applied. But at some point you have to draw the line and cut a release.

I've made some history releases before - but this one is good enough that I know I have to be careful: it is already highly functional and capable of being used by people who want to track Perl development, or by people with proposed features who want to develop them on their own feature branches for future pumpkings to manage. Already there are several of these repositories out there with different histories - so I'd like to make sure that the next release I make has as much correct author attribution information in it as possible, so that, for instance, people's OHLOH stats are correct.

There is still one large chunk of history which, while covering dozens of releases, is represented by only a handful of commits; it will need the scripts used to import Chip Salzenberg's 5.003 to 5.004 series of patches extended to cover the slight variations in release style used by Tim Bunce. I'd also like to have that complete first. It would be nice to have this all finished by the looming Perl 5.10.0 release, but I don't really want to compromise on correct attributions to do so.

Other than that, I do want to make sure that whenever changes are referenced in commit messages, or in the metadata that appears in the commit message, the corresponding git commit ID is placed in the message - so that you can easily bounce around them in gitk and gitweb.

And then, I think I'll have achieved what might be the most complex Perforce to Git conversion in the Free Software world to date, as well as liberating the source code for a project which is dear to many people.

Posted at teatime on Tuesday, December 18th, 2007