Converting Python bytecode to XML
Over the last couple of days, I've been exploring Python bytecode in some detail. It started when I read "The structure of .pyc files", which led me to write pyc2xml, a tool for converting .pyc files into XML.
Complete tarball: pyc2xml-0.2.tar.gz
Here is how to use it:
$ ./pyc2xml foo.pyc
Disassembling: foo.pyc --> foo.xml
$ head -n15 foo.xml
<?xml version="1.0" ?>
<?xml-stylesheet type="text/xsl" href="bytecode.xsl" ?>
<!DOCTYPE bytecode SYSTEM "bytecode.dtd">
<bytecode>
<head>
<magic>0xb3f20d0a</magic>
<modtime>Sat Apr 19 00:49:03 2008</modtime>
</head>
<code name="<module>">
<co_argcount>0</co_argcount>
<co_cellvars />
<co_code><![CDATA[
2 0 SETUP_LOOP 48 (to 51)
3 LOAD_NAME 0 (xrange)
6 LOAD_CONST 0 (5)
The XML file contains all the information of the .pyc file. That is:
head -- Header:
* magic -- Four-byte magic number
* modtime -- Modification timestamp
code -- Code object:
* co_argcount -- Number of positional arguments
* co_cellvars -- Names of local variables that are referenced by nested functions
* co_code -- Sequence of bytecode instructions
* co_filename -- Filename from which the code was compiled
* co_firstlineno -- First line number of the function
* co_flags -- Flags for the interpreter
* co_freevars -- Names of free variables
* co_lnotab -- Mapping from bytecode offsets to line numbers
* co_name -- Function name
* co_names -- Names used by the bytecode
* co_nlocals -- Number of local variables used by the function
* co_stacksize -- Required stack size
* co_varnames -- Names of the local variables
* co_consts -- Literals used by the bytecode
For more information, see the Python Reference Manual.
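As a rough illustration of the two header fields, here is a minimal sketch that splits an 8-byte header of the kind described above. The function name parse_pyc_header is made up for this example, and the timestamp is assumed to be UTC. The sample bytes are built by hand rather than read from a real file. Note that .pyc files written by CPython 3.7 and later carry a longer 16-byte header, so this layout applies only to the older format discussed here.

```python
import struct
import time

def parse_pyc_header(data):
    """Split the 8-byte header of an old-style (Python 2.x) .pyc file.

    Hypothetical helper for illustration, not part of pyc2xml.
    """
    magic = data[:4]                            # four-byte magic number
    (mtime,) = struct.unpack("<I", data[4:8])   # little-endian 32-bit timestamp
    return {
        "magic": "0x" + magic.hex(),
        # Assumption: interpret the timestamp as UTC.
        "modtime": time.asctime(time.gmtime(mtime)),
    }

# Hand-built header: the magic number from the listing above plus a timestamp.
header = b"\xb3\xf2\x0d\x0a" + struct.pack("<I", 1208566143)
print(parse_pyc_header(header))
```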
The element co_consts contains all literal constants used by the bytecode. These are immutable objects like integers, floats and strings, but also code objects. Thus, code objects are nested data structures.
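The nesting can be seen directly from Python. The following sketch (the helper name walk_code is invented for this example) compiles a small source string and walks co_consts recursively, printing the name of every code object it finds:

```python
import types

def walk_code(code, depth=0):
    """Yield (depth, name) for a code object and all code objects nested in it."""
    yield depth, code.co_name
    for const in code.co_consts:
        # Nested functions show up as code objects among the constants.
        if isinstance(const, types.CodeType):
            yield from walk_code(const, depth + 1)

source = (
    "def outer():\n"
    "    def inner():\n"
    "        return 1\n"
    "    return inner\n"
)
module_code = compile(source, "<example>", "exec")

for depth, name in walk_code(module_code):
    print("  " * depth + name)
```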
The formal grammar of the XML file is described in bytecode.dtd.
Each co_code element contains the sequence of bytecode instructions in disassembled form. This is simply the output given by the disassemble function of the dis module.
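For reference, this is roughly how such a listing can be captured as a string. It is a sketch for a current Python 3 interpreter, where dis.disassemble accepts a file argument. The exact opcodes and columns differ between Python versions, so the output will not match the Python 2 listings shown in this post.

```python
import dis
import io

def add7(x):
    return x + 7

# Write the disassembly into a string buffer instead of stdout.
buffer = io.StringIO()
dis.disassemble(add7.__code__, file=buffer)
listing = buffer.getvalue()
print(listing)
```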
The file disassembled by pyc2xml is usually a .pyc file, but it can also be a Python source file. In the latter case, the source code is compiled to bytecode internally, and this bytecode is then disassembled into XML.
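The idea of accepting either kind of input can be sketched as follows. This is not the code from pyc2xml: load_code and PYC_HEADER_SIZE are names made up for this example, and the header size of 16 bytes is an assumption that holds for CPython 3.7 and later (the older files described in this post have an 8-byte header).

```python
import marshal
import os
import py_compile
import tempfile

# Assumption: CPython 3.7+ header size; old-style 2.x files use 8 bytes.
PYC_HEADER_SIZE = 16

def load_code(path):
    """Return a code object from a .py source file or a .pyc file."""
    if path.endswith(".py"):
        with open(path, "r") as f:
            return compile(f.read(), path, "exec")
    with open(path, "rb") as f:
        f.seek(PYC_HEADER_SIZE)   # skip the header
        return marshal.load(f)    # the rest is the marshalled code object

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "foo.py")
    with open(src, "w") as f:
        f.write("for i in range(5):\n    print(i)\n")
    pyc = py_compile.compile(src, cfile=os.path.join(tmp, "foo.pyc"))
    from_source = load_code(src)
    from_pyc = load_code(pyc)

# Both routes should lead to the same bytecode.
print(from_source.co_code == from_pyc.co_code)
```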
Representation
The XSLT stylesheet bytecode.xsl offers a way to present the XML file neatly. If your browser is not capable of XSLT processing, you can view the resulting tree, which is HTML.
Converting back to .pyc
Since the XML file contains all information about the .pyc file, it is possible to convert back to .pyc files:
$ ./pyc2xml foo.xml
Assembling: foo.xml --> foo.pyc
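The assembly language used is of course somewhat redundant: the line numbers, byte offsets and the resolved arguments in parentheses can all be derived from other data. Therefore, the assembler considers only the column containing the opname and the column containing the oparg. So for example: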
<co_code><![CDATA[
33 0 LOAD_FAST 0 (x)
3 LOAD_CONST 1 (7)
6 BINARY_ADD
7 RETURN_VALUE
]]></co_code>
and
<co_code><![CDATA[
48 73 LOAD_FAST 0 (y22)
4 LOAD_CONST 1 (324517)
# 26 ROT_TWO
------ 6 BINARY_ADD
=========== 217 RETURN_VALUE
]]></co_code>
will be assembled to the same bytecode. Blank lines and lines starting with a '#' are ignored. When assembling, all XML attributes are ignored as well. The disassembler inserts these attributes only to make the XML more readable.
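The rule can be sketched in a few lines. This is an illustration of the idea, not the parser used in pyc2xml: parse_listing is a hypothetical helper that treats the first upper-case token on a line as the opname and the integer token after it, if any, as the oparg.

```python
def parse_listing(text):
    """Extract (opname, oparg) pairs from a disassembly listing.

    Hypothetical sketch: everything except the opname and the oparg is ignored.
    """
    instructions = []
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comment lines.
        if not stripped or stripped.startswith("#"):
            continue
        tokens = stripped.split()
        for i, tok in enumerate(tokens):
            # The opname is the first token that starts with a letter
            # and is written in upper case.
            if tok[0].isalpha() and tok == tok.upper():
                nxt = tokens[i + 1] if i + 1 < len(tokens) else None
                oparg = int(nxt) if nxt is not None and nxt.isdigit() else None
                instructions.append((tok, oparg))
                break
    return instructions

first = """
33 0 LOAD_FAST 0 (x)
3 LOAD_CONST 1 (7)
6 BINARY_ADD
7 RETURN_VALUE
"""

second = """
48 73 LOAD_FAST 0 (y22)
4 LOAD_CONST 1 (324517)
# 26 ROT_TWO
------ 6 BINARY_ADD
=========== 217 RETURN_VALUE
"""

print(parse_listing(first))
print(parse_listing(first) == parse_listing(second))
```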
Problem with the assembler (.xml --> .pyc)
The program uses the code function of the new module, which provides an interface to the PyCode_New() C function. This function does not allow setting the read-only attributes co_cellvars and co_freevars of the new code object. Therefore, code using nested functions and closures cannot be converted back into bytecode, because the nested code object, i.e. the function inside another function, is missing information about the free variables.
However, pyc2xml works in all these cases.
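To see which information is at stake, here is a small closure and the two attributes in question. The function names are only for illustration:

```python
def outer():
    x = 1
    def inner():
        return x   # x is a free variable of inner
    return inner

outer_code = outer.__code__
inner_code = outer().__code__

# x is a cell variable in outer, because a nested function refers to it,
# and a free variable in inner, because it is defined in the enclosing scope.
print(outer_code.co_cellvars)
print(inner_code.co_freevars)
```

Without these two tuples, the nested code object cannot tell which of its names refer to variables of the enclosing function.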