ast of hello-world-border.c Olmar : Process C++ Programs in Ocaml

Warning: This is a legacy page that is only displayed for historical reasons. I stopped developing Olmar in 2008 because the Elsa maintainers successfully subverted any attempt to further develop Elsa and, more importantly, because gcc plugins provide a much better alternative for accessing C++ asts.

Olmar connects Elsa, the Elkhound based C/C++ parser and type checker, with Ocaml. More precisely, the Olmar extension can translate Elsa's internal abstract syntax tree into a value of an Ocaml variant type. This value can then be further processed with a pure Ocaml program. I prefer to have stand alone Ocaml programs. Therefore I let Elsa marshal the abstract syntax tree as an Ocaml value to disk. However, it is also possible to link the Ocaml code into the Elsa executable.

News

17 June 2007

30 Oct 2006: Version 0.2 released, changes:

Distribution

In principle Olmar is a patch for the astgen tool and for Elsa. In the future Olmar will hopefully get integrated into the Elsa/Elkhound distribution. At the moment I track the elsa-stack svn repository.

For simplicity I only distribute a complete smbase/Ast/Elkhound/Elsa/Olmar system now. If you want to have pure Elsa, please download it from the Oink repository.

Download / Compile / Use

System requirements

The Memcheck library is needed for the check_oast, the consitency checker for marshalled elsa abstract syntax trees. When encountering strange bugs in Olmar applications it is advisable to first check the consistency of the abstract syntax trees.

The Memcheck library is currently only available for Ocaml 3.09 (because it relies on camlp4). Without Memcheck, Olmar also runs fine under the latest Ocaml version.

See also Elsa's requirements (under point Download) and Elsa's success/failure matrix

(I was told that Elsa is now 64bit clean. However, the additional Olmar code is not (yet).)

Download Elsa+Olmar

choose from the following alternative:

Configure

The toplevel configure simply calls the configure scripts of all subdirectories with the same options. Therefore it is not possible to pass directory specific options to the toplevel configure. Therefore, if the automatic detection of the Memcheck library fails, you have to run configure twice: In the top-level directory and with specific options in the asttools directory.
top-level configuration
In Olmars top-level directory, run
configure -no-dash-O2
Leave out the -no-dash-O2 option if you want to compile the C++ code with -O2. You can use the environment variables CC and CXX to set the C and C++ compiler, respectively.

In order to see whether automatic Memcheck detection worked check the line starting with "Searching the memcheck library..." at the end of the configure output.

asttools configuration
Only necessary if automatic Memcheck detection failed and you do want to compile with Memcheck support. Do
(cd asttools/; ./configure -memcheck=/path/to/memcheck)

Compile

make
This will create the C++ parser elsa/ccparse (with buildin Ocaml reflection capabilities), the AST Graph utility asttools/ast_graph, and the Olmar example application asttools/count-ast and its variant asttools/count-ast-new. If the Memcheck library is available, the Ocaml Ast consistency checker check_oast is also built.

Try it

New features in the elsa parser ccparse

In ccparse the option -oc <file> will activate Olmar and set the filename to write the abstract syntax tree to. Alternatively one can use the tracing option marshalToOcaml (add -tr marshalToOcaml). Then ccparse will derive the filename for the abstract syntax tree itself (input-file.oast).

Olmar's contribution: ast_graph, visualising C++ syntax trees

At the moment the asttools subdirectory in the distribution contains only one useful tool: Ast graph. Ast graph generates the abstract syntax tree in the dot language. One can then use the tools from the graphviz package to visualise the syntax tree.

Usage

Normal C++ files tend to have abstract syntax trees with 10.000 to 1.000.000 nodes. Including iostream alone gives almost 250.000 nodes. Most of the graphics software I tried fails on the sheer size of these graphs. To visualise the tree I have found the following possibilities:
zgrviewer
zgrviewer can display dot files directly (relying on a dot background job). It has nice zooming and scrolling functions. It's a pity that Java runs out of memory on graphs with 10.000 nodes already.
convert to postscript (dot -Tps) and use gv
Works for huge graphs. Scrolling in gv is ok, zooming relatively poor. gv seems to allocate a pixmap in the X server. For huge graphs one has therefore to limit the bounding box using ast_graph's -size option (or putting a suitable size attribute into the dot file).
convert to xfig (dot -Tfig) and use xfig
xfig is the fastest of the alternatives. Zooming is relatively good in xfig, scrolling a bit poor. The display is cluttered with all sorts of handles (because xfig assumes you want to change the graph). (xfig worked great for me until some debian etch update pleased me the folowing bugs #373005, #387296, #387298)
kgraphviewer
In my first attempt kgraphviewer did not show any labels. Still investigating.

syntax tree of ocaml's
  minor garbage collector Technology

The goal of Olmar is to make the abstract syntax tree of a C or C++ program available as an Ocaml variant type, such that one can use pattern matching to process C and C++ programs.

Elsa can output its internal abstract syntax tree in XML, or (mainly for debugging purposes) in plain ASCII. In principle one could read the XML into Ocaml, for instance with PXP. PXP reads XML into an Ocaml object hierarchy. As far as I know, there is, however, no simple way to translate XML into an Ocaml variant type. With PXP one could either write a pull parser or a visitor on the Ocaml object tree. Both approaches are a kind of high-level XML parsing that require some form of type checking the XML and a lot of error code. I did not want to write this kind of XML type-checking code. Therefore Olmar uses a completely different approach.

Olmar simply adds a method toOcaml to each class in Elsa's abstract syntax tree. This method traverses the syntax tree, thereby reconstructing it in Ocaml. At the end the Ocaml value is marshaled into a file. Elsa is linked with some Ocaml code, the Ocaml runtime and some C++ glue code. (In reality the whole story is slightly more complicated, because Elsa's abstract syntax tree can be circular and because C++ pointers might be NULL. Anyway ...)

Elsa internal abstract syntax tree falls into two parts. 32 different node types (about 135 classes) describe the C++ syntax. Elsa's type checker adds 10 types of nodes to describe C++ types in a syntax independent way. A node type might be split into several subtypes (very similar to Ocaml variants). The node type for C++ expressions, for instance, is modelled with 36 classes, for each kind of expression one. In Ocaml such node types are of course modelled with a variant type. Elsa's abstract syntax tree contains also unstructured node types (i.e., without subtypes). In Ocaml those nodes are represented as a tuple or a record.

I wanted to keep Olmar mostly independent from the encoding of variant constructors in Ocaml. Therefore, I register an Ocaml call-back function for each variant constructor and each tuple type. The C++ code calls these call-backs in order to construct Ocaml values (instead of allocating memory itself and filling it). Only list and option values are created directly in C++. For now I prefer this hopefully less error prone variant over more efficient code.

The code for the 32 syntax node types is generated automatically from an ast description file. Therefore, to add the toOcaml method to these syntax classes one only needs to patch astgen. With Olmar astgen additionally generates an Ocaml type definition and Ocaml code for the above mentioned call-back functions. Finally astgen also generates the toOcaml method in C++.

The syntax tree nodes for Elsa's type checker are, unfortunately, not generated from ast descriptions. I had to write all the necessary Ocaml and C++ code myself. In the end this turned out to be much more work than improving astgen...

Using Olmar

You can use Olmar in two ways: In asttools/count-ast.ml and its variant asttools/count-ast-new.ml you find two very simple Olmar example applications.

Abstract syntax tree type definition

The type definition is in the following files
elsa/cc_ast_gen_type.ml
contains the type definition of all ast nodes
elsa/cc_ml_types.ml
flag types used in the syntax nodes
elsa/ml_ctype.ml
flag types used in type nodes
elsa/ast_annotation.mli
ast annotations (see below)
The whole abstract syntax tree that is marshaled from ccparse has type
annotated translationUnit_type = annotated * (annotated topForm_type list) * (annotated scope option)

Ast annotations

The whole abstract syntax tree is polymorphic in one type parameter, which is a placeholder for user defined annotations. Every node of the syntax tree carries a (unique) slot of this annotation type. Annotations are meant for client use. One can easily define a new annotation type and use it to store client data in it.

ccparse generates the abstract syntax tree with annotations of type annotated (see elsa/ast_annotation.mli). Every node is guaranteed to contain a unique annotation value. The annotation carries an (positive) integer (accessible via id_annotation) that uniquely identifies the syntax tree node of this annotation. These integers are dense, that is all integers between 1 and the maximal one are used in the abstract syntax tree. In addition to this unique integer the annotation contains the address of the C++ object from which the Ocaml value was generated (shifted to the right to fit into an Ocaml int).

Complications

The abstract syntax tree is circular. A naive iteration over the tree will therefore in general not terminate. Currently there are 14 fields that might make the tree circular: The fields in the first group (up to enclosing) have a (hopefully hinting) option ref type. The field variable.overload has type list ref and the fields variables and typeTags in scope have type (string, variable) Hashtabl.t. An iteration over the abstract syntax tree will terminate, if you do not recurse into these fields. However, there might be some tree nodes only reachable via one of these fields.

The Olmar example count-ast.ml shows how to use annotations and dense sets to traverse all nodes in a syntax tree.

Utilities

Dense sets of positive integers
The interface is a subset of the module Set.S of Ocamls standard library (of course with with type elt = int). Internally it uses an array of strings as bitmap.
Syntax tree utilities
Contains functions to access fields that are present in each variant of a given node type. For instance for annotations and source locations.

Problems/Questions/Suggestions

Feel free to contact me at hendrik@askra.de with anything that is Olmar or Elsa related.

Known problems

64-bit platforms
I was told that recent elsa versions are 64 bit clean now. However, not all Olmar changes are 64 bit clean.

inefficiency on large graphs
When generating the Ocaml tree every node in the C++ tree contains a pointer to the corresponding Ocaml node. These pointers are scanned for every minor collection, which results in a big performance hit for the default minor heap size of Ocaml (as distributed Olmar increases it from 32K to 8MB). See also the related discussion on the Ocaml list.

missing node types
The syntax node classes BaseClassSubobj and TemplateParams seems not to appear in all of Elsa's regression test programs. The toOcaml method of this classes currently just contains an assert(false). I am grateful for any example program that triggers these assertions.

ocaml 3.10
The memcheck library relies on some camlp4 code that has not yet been ported to ocaml 3.10. Therefore a complete Olmar installation with Memcheck requires ocaml 3.09. Without Memcheck Olmar runs fine with ocaml 3.10.

last changed on 22 Mar 2011 by Hendrik