Converting Python bytecode to XML

Over the last couple of days, I've been exploring Python bytecode in some detail. It started when I read "The structure of .pyc files", which led me to write pyc2xml, a tool for converting .pyc files into XML.

Complete tarball: pyc2xml-0.2.tar.gz

Here is how to use it:

$ ./pyc2xml foo.pyc 
Disassembling: foo.pyc  -->  foo.xml
$ head -n15 foo.xml 
<?xml version="1.0" ?>
<?xml-stylesheet type="text/xsl" href="bytecode.xsl" ?>
<!DOCTYPE bytecode SYSTEM "bytecode.dtd">
<bytecode>
    <head>
        <magic>0xb3f20d0a</magic>
        <modtime>Sat Apr 19 00:49:03 2008</modtime>
    </head>
    <code name="&lt;module&gt;">
        <co_argcount>0</co_argcount>
        <co_cellvars />
        <co_code><![CDATA[
  2           0 SETUP_LOOP              48 (to 51)
              3 LOAD_NAME                0 (xrange)
              6 LOAD_CONST               0 (5)

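The XML file contains all the information in the .pyc file: the magic number and modification time from the file header, followed by the module's code object and its attributes (co_argcount, co_cellvars, co_code, and so on).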

For more information, see the Python Reference Manual. The element co_consts contains all literal constants used by the bytecode. These are immutable objects such as integers, floats and strings, but also code objects, so code objects are nested data structures. The formal grammar of the XML file is described in bytecode.dtd.
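To make the file layout and the nesting concrete, here is a minimal sketch in Python 3. It is not the code pyc2xml uses. It assumes the header layout suggested by the sample output above (a 4-byte magic number followed by a 4-byte little-endian timestamp); newer interpreters write a longer header. The names split_pyc and iter_code_objects are hypothetical, and the "file" is built in memory rather than read from disk.

```python
import importlib.util
import marshal
import struct
import types


def split_pyc(data, header_size=8):
    """Split raw .pyc bytes into (magic, modtime, code object).

    Assumed layout: 4-byte magic, 4-byte little-endian timestamp, then a
    marshalled code object. Newer interpreters write a longer header, so
    header_size is a parameter rather than a constant.
    """
    magic = "0x" + data[:4].hex()
    (modtime,) = struct.unpack("<I", data[4:8])
    code = marshal.loads(data[header_size:])
    return magic, modtime, code


def iter_code_objects(code):
    """Yield a code object and, recursively, every code object in co_consts."""
    yield code
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from iter_code_objects(const)


# Build an in-memory stand-in for foo.pyc (hypothetical content).
source = "def f(x):\n    return x + 7\n"
module_code = compile(source, "foo.py", "exec")
fake_pyc = (
    importlib.util.MAGIC_NUMBER
    + struct.pack("<I", 1208558943)  # arbitrary example timestamp
    + marshal.dumps(module_code)
)

magic, modtime, code = split_pyc(fake_pyc)
names = [c.co_name for c in iter_code_objects(code)]
print(magic, modtime, names)
```

The module's code object comes first, and the code object of f is found inside its co_consts, which is the nesting that the XML mirrors.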

The co_code elements contain the sequence of bytecode instructions in disassembled form. This is simply the output given by the disassemble function of the dis module.
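As a rough illustration of where such a listing comes from, the following Python 3 sketch captures the output of dis.disassemble in a string. The helper name disassemble_to_text is made up for this example, and the opnames printed depend on the interpreter version, so they will not match the Python 2 listing above exactly.

```python
import dis
import io


def disassemble_to_text(code):
    """Return the dis listing of a code object as a string."""
    buffer = io.StringIO()
    dis.disassemble(code, file=buffer)
    return buffer.getvalue()


def add_seven(x):
    return x + 7


listing = disassemble_to_text(add_seven.__code__)
print(listing)
```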

The file disassembled by pyc2xml is usually a .pyc file, but can also be a Python source file. In the latter case, the Python source code is (internally) compiled to bytecode, and then this bytecode is disassembled into XML.

Representation

The XSLT stylesheet bytecode.xsl offers a way to neatly represent the XML file. If your browser is not capable of XSLT processing, you can view the resulting XML tree, which is HTML.

Converting back to .pyc

Since the XML file contains all information about the .pyc file, it is possible to convert back to .pyc files:

$ ./pyc2xml foo.xml
Assembling: foo.xml  -->  foo.pyc

The disassembly listing is of course somewhat redundant: the line numbers, byte offsets and parenthesized argument descriptions can all be derived from the rest. Therefore, the assembler considers only the column containing the opname and the column containing the oparg. For example:

<co_code><![CDATA[
33           0 LOAD_FAST                0 (x)
             3 LOAD_CONST               1 (7)
             6 BINARY_ADD          
             7 RETURN_VALUE        
               ]]></co_code>

and

<co_code><![CDATA[
 48          73 LOAD_FAST                0 (y22)
              4 LOAD_CONST               1 (324517)
#            26 ROT_TWO  
       ------ 6 BINARY_ADD          
=========== 217 RETURN_VALUE
                ]]></co_code>

will be assembled to the same bytecode. Blank lines and lines starting with a '#' are ignored. When assembling, all XML attributes are ignored as well; the disassembler inserts them only to make the XML more readable.
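The parsing rule can be sketched roughly as follows. This is a guess at the behaviour described above, not the parser pyc2xml actually ships: the regular expression and the name parse_listing are assumptions. It picks the first upper-case token on each line as the opname and an integer directly after it as the oparg, and ignores everything else.

```python
import re

# An opname, optionally followed by an integer oparg.
INSTRUCTION = re.compile(r"([A-Z][A-Z0-9_+]*)(?:\s+(\d+))?")


def parse_listing(text):
    """Reduce a disassembly listing to a list of (opname, oparg) pairs."""
    instructions = []
    for line in text.splitlines():
        # Blank lines and lines starting with '#' are ignored.
        if not line.strip() or line.startswith("#"):
            continue
        match = INSTRUCTION.search(line)
        if match is None:
            raise ValueError("no opname found in line: %r" % line)
        opname, oparg = match.groups()
        instructions.append((opname, None if oparg is None else int(oparg)))
    return instructions


first = """
 33           0 LOAD_FAST                0 (x)
              3 LOAD_CONST               1 (7)
              6 BINARY_ADD
              7 RETURN_VALUE
"""

second = """
 48          73 LOAD_FAST                0 (y22)
              4 LOAD_CONST               1 (324517)
#            26 ROT_TWO
       ------ 6 BINARY_ADD
=========== 217 RETURN_VALUE
"""

print(parse_listing(first) == parse_listing(second))
```

Both listings reduce to the same four instructions, which is why they assemble to the same bytecode.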

Problem with the assembler (.xml --> .pyc)

The program uses the code function of the new module, which provides an interface to the PyCode_New() C function. This function does not allow setting the read-only co_cellvars and co_freevars attributes of the new code object. Therefore, code using nested functions and closures cannot be converted back into bytecode, because the nested code object (i.e. the function inside another function) is missing information about its free variables. However, the disassembling direction of pyc2xml (.pyc --> .xml) works in all these cases.
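To see which code is affected, here is a small Python 3 sketch of a closure. The function names outer and inner are made up for the example. The enclosing function records the shared variable in co_cellvars, and the nested function records it in co_freevars.

```python
def outer():
    x = 7

    def inner():
        # x is not local to inner; it is a free variable bound in outer.
        return x

    return inner


inner = outer()
print(outer.__code__.co_cellvars)
print(inner.__code__.co_freevars)
```

Both tuples contain 'x'. If these attributes are lost when the code objects are rebuilt, inner no longer knows where to find x.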