<html><head><title>FYI: LAOLA file system, Mar/23/97</title></head>
<body bgcolor="#ffffff" link="#0000ff" alink="#ff0000" vlink="#0000c0">
<!--$Id: guide.html,v 1.1.1.1 1998/02/25 21:13:00 schwartz Exp $-->

&nbsp;<p>
<H1 align=left><font color="#7f007f">LAOLA file system</font></H1>

<H3 align=left><font color=#ff0000>"Structured Storage"</font></H3>

<H4>The binary structure of Ole / Compound Documents</H4>

<font size="-1"> &copy; 1996, 1997 by
<a href=mailto:schwartz@cs.tu-berlin.de>martin schwartz@cs.tu-berlin.de</a>
</font><p>

<H1>Hacking guide</H1>
<dl><dl>

A - <a href=#a>Preface</a><font size="+2"> <br></font>
B - <a href=#b>General</a><font size="+2"> <br></font>
C - <a href=#c>Basics</a><font size="+2"> <br></font>
<p>
1. <a href=#n1>Header Block</a><font size="+2"> <br></font>
2. <a href=#n2>Big Block Depot</a><font size="+2"> <br></font>
3. <a href=#n3>Big Data Blocks</a><font size="+2"> <br></font>
&nbsp;&nbsp;&nbsp;&nbsp;3.1 <a href=#n31>Small Block Depot</a><font size="+2"> <br></font>
&nbsp;&nbsp;&nbsp;&nbsp;3.2 <a href=#n32>Property Set Storage</a><font size="+2"> <br></font>
&nbsp;&nbsp;&nbsp;&nbsp;3.3 <a href=#n33>Property Storage</a><font size="+2"> <br></font>
&nbsp;&nbsp;&nbsp;&nbsp;3.4 <a href=#n34>Where Are The Small Data Blocks?</a><font size="+2"> <br></font>
&nbsp;&nbsp;&nbsp;&nbsp;3.5 <a href=#n35>Property Sets</a><font size="+2"> <br></font>
4. <a href=#n4>Trash Blocks</a><font size="+2"> <br><p></font>
<p>
Table 1: <a href=#t1>LAOLA Header</a><font size="+2"> <br></font>
Table 2: <a href=#t2>Property Storage</a><font size="+2"> <p></font>

</dl></dl>

<a href=ari.html>Whites only - a piece of apartheid in Berlin</a><p>

<a href="index.html">Back to Laola HomePage</a><p>

<a name=a>
&nbsp;<p>
<H3>A - Preface</H3>

One day I started writing a program that needed access to documents
created with Microsoft Word for Windows 6. I wanted to keep it portable, so
it could not rely on methods specific to any operating system. So I decided
to learn the binary structure of the documents.<p>

When looking at the document binaries I soon got confused. At first glance
the document file format seemed to differ very much from that of files stored
with Word 2. On closer inspection it became clear that some very familiar
binary pieces were stored within that new-looking data. In fact a Word 6
document is more or less a Word 2 format document stored among additional
data. This additional data makes up a kind of file system.<p>

As far as I know there is no publicly available source of information on how
this file system (document format) works. In general I hold that the producing
industry has a duty to show what ingredients are in its products. In the case
of authoring systems I think that people ought to know what kind of
information is in their (maybe publicly distributed) documents: for example,
whether documents contain information about the creation date, printer(s),
directory structure or serial numbers, or, even worse, whether documents
contain or might contain other private data.<p>

To summarize this topic for the file format described below: it always
stores the last modification date of objects. Because of a bad
implementation for Windows 3.x and older distributions of 32 bit Windows
systems, it still always contains some "data trash" sections. These sections
might contain personal data. Of course, depending on the authoring program,
other private data might be stored invisibly, too.<p>

This text does *not* explain how a Microsoft Word file is structured.
It does explain how the file system works that newer Windows programs
like Microsoft Word use to store their documents. So actually it
should be called the OLE file system, as the philosophy behind this file
system is Microsoft's OLE / COM technology. But for lack of any binary level
technical specification on this topic, my explanation might differ in
some cases or even be wrong. In these cases I would certainly not be
explaining the OLE file system, but something similar. So I decided to
choose a similar name, too. The name is LAOLA.

<p>
<font color="#009f00"><em>Copying.</em></font>
This file and the source code referenced here are distributed under the
terms of Version 2 of the GNU General Public License, June 1991. If you
have no copy you should find one <a href=COPYING>here</a>.<p>

This publication is made without regard to any possible patent
protection. Trade names are used without any guarantee that they may be
used freely.<p>


<a name=b>
&nbsp;<p>
<H3>B - General</H3>

Digital documents tend to consist of more than just one file. When storing
or exchanging such documents it used to be a problem to bind all those
files together and to agree on the same hierarchy conventions. The LAOLA
file format allows files to be stored in hierarchical order in a single
file. This file may contain one or more directories, each of which may
contain one or more files and directories.<p>

Actually this job could have been done well by promoting one of the
popular archive formats, like the whole "zip" family. But Microsoft went its
own way. I think this was not only because of its market philosophy, but also
because the intention to develop a file system seems to have been driven
by its "OLE philosophy", which in a way demands hierarchical
file structures.<p>

Unfortunately Microsoft did not include mechanisms to ensure the integrity
of a document. So if, for example, a Laola document gets corrupted, this
normally stays undetected. If somebody tampers with the document, normally
nobody can notice. If a document contains much unused space, it will
normally stay that way. Microsoft's strategy involved no compensation for
the disadvantages of a new file system.<p>

<a name=c>
&nbsp;<p>
<H3>C - Basics</H3>

<font color="#009f00"><em>No guaranty!</em></font>
The things described here are assumptions, based mainly on speculation and
experiment and only a little on documentation. Although I'm sure things are
pretty much OK, there might be errors!<p>

<font color="#009f00"><em>What is a Laola file?</em></font>
In short, a Laola file is an archive. The archive can hold files and
directories. Each archive entry has an info block 0x80 bytes long. To store
the files the archive maintains a list of big data blocks and a list of
small data blocks. Files smaller than 0x1000 bytes are stored in the small
data blocks, all other files in big data blocks.<p>

<font color="#009f00"><em>Data types.</em></font>
Laola files have three basic data types:<p>

<pre>
   1.  4 byte integers ("long") 0x12345678           ->  0x78 0x56 0x34 0x12
   2.  2 byte integers ("word") 0x1234 0x5678        ->  0x34 0x12 0x78 0x56
   3.  1 byte integers ("char") 0x12 0x34 0x56 0x78  ->  0x12 0x34 0x56 0x78
</pre>

Integers are stored in "little endian" ("VAX", "x86") mode. That means they
are stored byte by byte, least significant byte first. Char streams are
stored first in, first out.<p>
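The byte order described above can be tried out with a few lines of Python.
This is only a minimal sketch; the helper names read_long, read_word and
read_signed_long are made up for this illustration and do not come from the
Laola sources.<p>

```python
def read_long(data, offset):
    """Return the 4 byte little endian integer stored at offset."""
    return int.from_bytes(data[offset:offset + 4], "little")


def read_word(data, offset):
    """Return the 2 byte little endian integer stored at offset."""
    return int.from_bytes(data[offset:offset + 2], "little")


def read_signed_long(data, offset):
    """Like read_long, but reads the value as signed (for -1, -2, -3)."""
    return int.from_bytes(data[offset:offset + 4], "little", signed=True)


raw = bytes([0x78, 0x56, 0x34, 0x12])
print(hex(read_long(raw, 0)))   # 0x12345678
print(hex(read_word(raw, 0)))   # 0x5678
print(hex(read_word(raw, 2)))   # 0x1234
```

The signed variant is handy for the depot entries, where values like
0xfffffffe are easier to handle as -2.<p>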

<font color="#009f00"><em>Blocks.</em></font>
In a first step each Laola file is divided into 0x200 (512) byte "big
blocks", so each Laola file's size is a multiple of 0x200 bytes. Each
block is assigned a number, starting with -1 for the first 0x200 bytes
and counting upwards. So the file is made up of the set of blocks:<p>

<pre>
    file <=> union of big blocks {-1, 0, 1 .. $maxblock},

    $maxblock = (sizeof(file)-1) / 0x200 - 1  (integer division),  $maxblock in {1, 2, ..}

</pre>
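The block numbering can be sketched as follows. The function names are
hypothetical, and the file size 0x1a00 is only an assumption: it is what a
file would have if its last block were block 0xb, as in the examples of
this document.<p>

```python
BIG_BLOCK_SIZE = 0x200


def max_block(file_size):
    """Number of the last big block; the header is block -1."""
    return (file_size - 1) // BIG_BLOCK_SIZE - 1


def block_offset(block):
    """File offset of the first byte of a big block."""
    return (block + 1) * BIG_BLOCK_SIZE


print(max_block(0x1a00))        # 11
print(hex(block_offset(-1)))    # 0x0
print(hex(block_offset(0)))     # 0x200
print(hex(block_offset(1)))     # 0x400
```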


<font color="#009f00"><em>Basic parts.</em></font>
The big blocks divide the file into the four basic parts:<p>

<dl><dl><ol>
<li><a href="#n1">Header Block</a><font size="+1">&nbsp;</font>
<li><a href="#n2">Big Block Depot</a><font size="+1">&nbsp;</font>
<li><a href="#n3">Big Data Blocks</a><font size="+1">&nbsp;</font>
<li><a href="#n4">Trash Data Blocks</a><font size="+1">&nbsp;</font>
</ol></dl></dl>

<font color="#009f00"><em>Dead non trash information.</em></font>
The Laola blocks seem to contain several constant data entries. Some of them
seem to have (still?) no function at all; the functions of others are
still unclear. Anyway, for creating Laola files it would be enough
just to copy these values. In the tables at the end of this document the
constant values are marked with a dot ".", the known and changing values
with an exclamation mark "!".<p>


<a name=n1>
&nbsp;<p>
<H3>1. Header Block</H3>

The header consists of the first block (block -1, running from file offset
0x00 to 0x1ff). The header starts with the eight byte hex string
{d0 cf 11 e0 a1 b1 1a e1}. The header contains some initial structure
information. Its function is explained later in this document.<p>

Example:<p>

<pre>
    00000: d0 cf 11 e0  a1 b1 1a e1  00 00 00 00  00 00 00 00  
    00010: 00 00 00 00  00 00 00 00  3b 00 03 00  fe ff 09 00 
    00020: 06 00 00 00  00 00 00 00  00 00 00 00  01 00 00 00
    00030: 01 00 00 00  00 00 00 00  00 10 00 00  02 00 00 00
    00040: 01 00 00 00  fe ff ff ff  00 00 00 00  00 00 00 00
    00050: ff ff ff ff  ff ff ff ff  ff ff ff ff  ff ff ff ff *


00:     stream $laola_id          identifier {d0 cf 11 e0 a1 b1 1a e1}
2c:     long $num_of_bbd_blocks   number of big block depot blocks
30:     long $root_startblock     root chain's first big block
3c:     long $sbd_startblock      small block depot's first big block
4c[]:   long $bbd_list[i]         array of $num_of_bbd_blocks big block numbers

(for detailed info look at: <a href=#t1>Table 1</a>)

</pre>
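Reading these header fields could look like the sketch below. It uses
the bytes of the example header, padded with ff up to a full block. The
function name parse_header and the dictionary layout are assumptions of
this sketch, not part of the Laola sources.<p>

```python
MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def parse_header(block):
    """Extract the known header fields from the first 0x200 bytes of a file."""
    def long_at(offset):
        return int.from_bytes(block[offset:offset + 4], "little", signed=True)

    if block[:8] != MAGIC:
        raise ValueError("not a Laola file")
    num_bbd = long_at(0x2c)
    return {
        "num_of_bbd_blocks": num_bbd,
        "root_startblock": long_at(0x30),
        "sbd_startblock": long_at(0x3c),
        "bbd_list": [long_at(0x4c + 4 * i) for i in range(num_bbd)],
    }


# The example header, padded with ff up to 0x200 bytes.
example = bytes.fromhex(
    "d0cf11e0a1b11ae1" "0000000000000000"
    "0000000000000000" "3b000300feff0900"
    "0600000000000000" "0000000001000000"
    "0100000000000000" "0010000002000000"
    "01000000feffffff" "0000000000000000"
).ljust(0x200, b"\xff")

header = parse_header(example)
print(header)
```

The values are read as signed integers so that an absent small block depot
shows up as -2 rather than 0xfffffffe.<p>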


<a name=n2>
&nbsp;<p>
<H3>2. Big Block Depot</H3>

The big block depot manages the big blocks. Big blocks are exactly
0x200 (512) bytes long. Often the big block depot consists of exactly
one big block.<p>

<pre>
    big block depot <=> union of big blocks {bbd_list[i]},  

    bbd_list  consists of $num_of_bbd_blocks elements, stored from
              position &lt;header:4c&gt; on.
</pre>
Example:

<pre>
   00200: fd ff ff ff  05 00 00 00  fe ff ff ff  04 00 00 00
   00210: 06 00 00 00  fe ff ff ff  07 00 00 00  08 00 00 00
   00220: 09 00 00 00  0a 00 00 00  0b 00 00 00  fe ff ff ff
   00230: ff ff ff ff  ff ff ff ff  ff ff ff ff  ff ff ff ff *

</pre>

The big block depot represents a table of block numbers whose index
starts at zero. Entry 0, in this example with the value 0xfffffffd (-3),
refers to block 0. Entry 1, with the value 0x00000005, refers to block 1.
Entry 2 refers to block 2, and so on. Each entry may have one of these
values:<p>

<pre>
   0xfffffffd (-3) : this block is a special block
   0xfffffffe (-2) : end of chain
   0xffffffff (-1) : unused
   0 .. $maxblock  : next element of chain (a big block number)
   $maxblock+1 ..  : not defined

</pre>


In the header the variable $root_startblock has been initialized; the
example gives it the value 1. This value tells which block is the first
in the chain of blocks belonging to the "root". In the example it is
read out as follows:<p>

<font color="#009f00"><em>How to read a block chain.</em></font>
Start at position 1. So the first block belonging to root is block 1. The
value of the big block depot's entry at position 1 is 0x00000005. So the
next block belonging to root is block 5. The value of the big block depot's
entry at position 5 is 0xfffffffe (-2). That means: here is the end of
the chain. So "root" finally consists of the blocks {1, 5}.<p>

From the header the value of the variable $sbd_startblock is also known. Try
to find its value, then try to get the corresponding block chain! (If you
want to see the solution, look at the end of this document.)<p>

Note: when reading a chain, only the values "0 .. $maxblock" and "-2"
are valid. If other values occur in a chain, some error has happened.<p>

Note: The small block depot may be absent. In that case $sbd_startblock
is 0xfffffffe (-2).<p>

Summary: with the help of the header block and the big block depot, the
big block lists @root_list and @sbd_list are known.<p>
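Following a chain through the depot can be sketched like this, using the
first twelve entries of the example depot. The names parse_depot and
read_chain are hypothetical. The check against loops is an addition of this
sketch and is not described in the text, and the sketch validates against
the number of depot entries rather than against $maxblock.<p>

```python
END_OF_CHAIN = -2


def parse_depot(data):
    """Turn depot bytes into a list of signed 4 byte entries."""
    return [int.from_bytes(data[i:i + 4], "little", signed=True)
            for i in range(0, len(data), 4)]


def read_chain(depot, start):
    """Follow a block chain through a depot, starting at block `start`."""
    chain = []
    seen = set()
    block = start
    while block != END_OF_CHAIN:
        # Only valid entry numbers and -2 may occur; loops are rejected too.
        if block not in range(len(depot)) or block in seen:
            raise ValueError("corrupt chain at block %d" % block)
        seen.add(block)
        chain.append(block)
        block = depot[block]
    return chain


# First 12 entries of the example big block depot.
bbd = parse_depot(bytes.fromhex(
    "fdffffff05000000feffffff04000000"
    "06000000feffffff0700000008000000"
    "090000000a0000000b000000feffffff"
))
print(read_chain(bbd, 1))   # [1, 5]
```

A start value of -2 yields an empty list, which fits the case of an absent
small block depot. The same function can be used for any other start block.<p>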



<a name=n3>
&nbsp;<p>
<H3>3. Big Data Blocks</H3>

<a name=n31>
<H4>3.1  Small Block Depot</H4>

The small block depot manages the small blocks. Small blocks are exactly
0x40 (64) bytes long. Often the small block depot consists of exactly
one big block. Some documents do not have a small block depot at all.<p>

<pre>
    small block depot <=> union of big blocks {sbd_list[i]},  

    sbd_list  consists of (number of chain_elements(sbd_list))
              elements. The list is read out from the big block depot,
              the list's start is $sbd_startblock (-> Section 2)
</pre>

Example:
<pre>
    00600: 01 00 00 00  fe ff ff ff  ff ff ff ff  ff ff ff ff
    00610: ff ff ff ff  ff ff ff ff  ff ff ff ff  ff ff ff ff

</pre>

The entries of the small block depot do *not* refer to absolute positions
in the file, as the entries of the big block depot do. They refer to
positions in the file made out of the big blocks of the big block list
@sb_list. This list is given by the root entry of the <em>property
storage</em>. So first the <em>property storage</em> has to be explained.


<a name=n32>
<H4>3.2  Property Set Storage</H4>

From section 2 the values of the big block list @root_list are known.
These blocks contain the property set storage (ppss) blocks.<p>

<pre>
    property set storage blocks <=> union of big blocks {root_list[i]},  

    root_list  consists of (number of chain_elements(root_list))
               elements. The list is read out from the big block depot,
               the list's start is $root_startblock (-> Section 2)

</pre>


<a name=n33>
<H4>3.3  Property Storage</H4>

The ppss blocks (-> section 3.2) are split into 0x80 byte blocks. Each of
these 0x80 byte blocks forms a property storage (pps). Every pps gets a
number, starting with zero. A pps refers to a "file" or represents a
"directory".<p>

<pre>
Example:

   00400: 52 00 6f 00  6f 00 74 00  20 00 45 00  6e 00 74 00   R o o t  E n t
   00410: 72 00 79 00  00 00 00 00  00 00 00 00  00 00 00 00   r y 
   00420: 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00 
   00430: 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00 
   00440: 16 00 05 00  ff ff ff ff  ff ff ff ff  03 00 00 00
   00450: 00 09 02 00  00 00 00 00  c0 00 00 00  00 00 00 46
   00460: 00 00 00 00  00 00 00 00  00 00 00 00  86 29 f6 1f
   00470: ad 57 bb 01  03 00 00 00  00 0f 00 00  00 00 00 00

40:  word $pps_sizeofname      size of $pps_rawname
42:  byte $pps_type            type of pps (1=storage|2=stream|5=root)
44:  long $pps_prev            previous pps
48:  long $pps_next            next pps
4c:  long $pps_dir             directory pps
74:  long $pps_sb              starting block of property
78:  long $pps_size            size of property

(for detailed info look at: <a href="#t2">Table 2</a>)

</pre>

The first 0x40 bytes are reserved for the name of the pps. The length of the
name is given in $pps_sizeofname. The name can be converted to an ASCII string
$pps_name just by removing every second char. In this example the length
of the name is 0x16, and in the end $pps_name is "Root Entry\00". The
C-style terminating zero should be removed. If $pps_sizeofname is zero,
then this 0x80 block is not a pps and has to be ignored.<p>
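Decoding one pps could be sketched as below, fed with the bytes of the
example root entry. The name parse_pps and the dictionary keys are
assumptions of this sketch. The name is converted exactly as described, by
keeping every second byte, so it only works for plain ASCII names.<p>

```python
def parse_pps(block):
    """Decode the known fields of one 0x80 byte pps; None if it is unused."""
    def long_at(offset):
        return int.from_bytes(block[offset:offset + 4], "little", signed=True)

    sizeofname = int.from_bytes(block[0x40:0x42], "little")
    if sizeofname == 0:
        return None                       # not a pps, to be ignored
    rawname = block[:sizeofname]
    name = rawname[::2].decode("ascii")   # keep every second byte
    return {
        "name": name.rstrip("\x00"),      # drop the C-style zero
        "type": block[0x42],
        "prev": long_at(0x44),
        "next": long_at(0x48),
        "dir": long_at(0x4c),
        "sb": long_at(0x74),
        "size": long_at(0x78),
    }


example = bytes.fromhex(
    "52006f006f007400200045006e007400"
    "72007900000000000000000000000000"
    + "00" * 0x20 +
    "16000500ffffffffffffffff03000000"
    "0009020000000000c000000000000046"
    "0000000000000000000000008629f61f"
    "ad57bb0103000000000f000000000000"
)
print(parse_pps(example))
```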

Each pps can have a successor and a predecessor. Each pps can also be a
directory (or "storage"). $pps_prev, $pps_next and $pps_dir refer to the
ascending numbers of the 0x80 blocks as mentioned above. So in the example
the pps starting at 0x400 gets the handle (number) 0, the pps starting
at 0x480 gets the handle 1, the pps starting at 0x500 gets the handle 2,
and so on. When read skillfully, an ordered listing of the pps results
(look at function get_pps_chain in "<a href="laola.pm">laola.pm</a>").<p>
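One way to get such a listing is sketched below. It is not taken from
get_pps_chain. It assumes that -1 means "no such pps" (as the ff ff ff ff
values in the example suggest), that the predecessor is listed before and
the successor after a pps, and that $pps_dir leads to the contents of a
directory. The function name list_pps and the entries A, B and C are made
up for this illustration.<p>

```python
def list_pps(entries, handle, depth=0, out=None):
    """Walk prev, the pps itself, its directory contents, then next."""
    if out is None:
        out = []
    if handle == -1:                 # assumed marker for "no such pps"
        return out
    pps = entries[handle]
    list_pps(entries, pps["prev"], depth, out)
    out.append((depth, pps["name"]))
    list_pps(entries, pps["dir"], depth + 1, out)
    list_pps(entries, pps["next"], depth, out)
    return out


# Hypothetical property storages, indexed by handle.
entries = [
    {"name": "Root Entry", "prev": -1, "next": -1, "dir": 2},
    {"name": "A", "prev": -1, "next": -1, "dir": -1},
    {"name": "B", "prev": 1, "next": 3, "dir": -1},
    {"name": "C", "prev": -1, "next": -1, "dir": -1},
]
for depth, name in list_pps(entries, 0):
    print("  " * depth + name)
```

The sketch does not protect against handles that refer back to an earlier
pps, so a damaged file could make it run forever.<p>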

<font color="#009f00"><em>Property types.</em></font>
Each pps has one of these three types:<p>

<ol>
<li>Storage, that is a directory
<li>Stream, that is a file
<li>Root, that is the root directory
</ol><p>

If $pps_size is not zero, $pps_sb points to the starting block of the
corresponding property. The starting block refers to the big block depot
if $pps_size is greater than or equal to 0x1000 (4096) bytes. If the
property's size is smaller, $pps_sb refers to the small block depot. There
is one exception: $pps_sb of the Root entry (always pps 0) always refers
to the big block depot.<p>

It is now easy to read the "files": the big or small block list has
to be fetched (as was done before with root_list and sbd_list) from the big
or small block depot, the blocks referred to have to be read, and as a
last step the data may have to be truncated to fit $pps_size.<p>
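These steps can be sketched as follows. The names uses_big_blocks and
assemble are hypothetical, and the block chain is expected to be known
already. The test data is synthetic: three small blocks filled with the
bytes 1, 2 and 3.<p>

```python
BIG_BLOCK_SIZE = 0x200
SMALL_BLOCK_SIZE = 0x40
ROOT = 5


def uses_big_blocks(pps_type, size):
    """True if $pps_sb refers to the big block depot."""
    return pps_type == ROOT or size >= 0x1000


def assemble(container, first_offset, block_size, chain, size):
    """Concatenate the blocks of a chain and truncate to `size` bytes."""
    parts = []
    for block in chain:
        start = first_offset + block * block_size
        parts.append(container[start:start + block_size])
    return b"".join(parts)[:size]


# Synthetic "small block file" with three small blocks.
small_file = bytes([1]) * 0x40 + bytes([2]) * 0x40 + bytes([3]) * 0x40
data = assemble(small_file, 0, SMALL_BLOCK_SIZE, [2, 0], 0x50)
print(len(data))    # 80
```

For big blocks the container would be the whole file and first_offset would
be 0x200, because block 0 starts right after the header. For small blocks
the container is the "small block file" explained in section 3.4.<p>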

If the type of a pps is root or storage, at least the variables $pps_ts2d
and $pps_ts2s are initialized. Together these variables form a 64 bit
integer that represents time and date. This integer counts units of
10^-7 seconds, starting at 01/01/1601 00:00.<p>
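Converting such a timestamp could be sketched like this. It assumes that
$pps_ts2s is the low half and $pps_ts2d the high half of the 64 bit
integer, as their offsets suggest. The function name pps_timestamp is made
up, and nothing is said here about the time zone.<p>

```python
from datetime import datetime, timedelta


def pps_timestamp(ts_s, ts_d):
    """Combine the two longs into a datetime (no time zone attached)."""
    ticks = ts_d * 2**32 + ts_s              # units of 10^-7 seconds
    return datetime(1601, 1, 1) + timedelta(microseconds=ticks // 10)


# $pps_ts2s and $pps_ts2d of the example root entry
print(pps_timestamp(0x1ff62986, 0x01bb57ad))
```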

If the type is root, $pps_sb points to the first big block of the
small block list @sb_list. See just below:<p>


<a name=n34>
<H4>3.4  Where Are The Small Data Blocks?</H4>

The root entry is an exception to the 0x1000 byte rule of section 3.3.
The size in the example is 0xf00, so it actually should belong to the small
block depot. In fact the file of the root entry always refers to the big
block depot. The start block here is 3. When this is examined (copied from above):
<p>

<pre>
    00200: fd ff ff ff  05 00 00 00  fe ff ff ff  04 00 00 00
    00210: 06 00 00 00  fe ff ff ff  07 00 00 00  08 00 00 00
    00220: 09 00 00 00  0a 00 00 00  0b 00 00 00  fe ff ff ff
    00230: ff ff ff ff  ff ff ff ff  ff ff ff ff  ff ff ff ff *

</pre>

the result is the chain {3, 4, 6, 7, 8, 9, a, b}. The chain obtained this
way gives the blocks that provide the space for the small blocks. So small
block references refer to positions in the "small block file", which in
this example is built from the big block chain starting with block 3.<p>

<pre>
    small data blocks <=> union of big blocks {sb_list[i]},  

    sb_list  consists of (number of chain_elements(sb_list))
             elements. The list is read out from the big block depot,
             the list's start is 
             $root_startblock -> property storage 0 -> pps_sb

</pre>
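Finding the place of a small block in the file can be sketched as below,
using the chain of the example. The function name small_block_offset is an
assumption of this sketch.<p>

```python
BIG_BLOCK_SIZE = 0x200
SMALL_BLOCK_SIZE = 0x40
PER_BIG = BIG_BLOCK_SIZE // SMALL_BLOCK_SIZE     # 8 small blocks per big block


def small_block_offset(sb_list, small_block):
    """Absolute file offset of a small block, given the chain @sb_list."""
    big_block = sb_list[small_block // PER_BIG]
    within = (small_block % PER_BIG) * SMALL_BLOCK_SIZE
    return (big_block + 1) * BIG_BLOCK_SIZE + within


sb_list = [3, 4, 6, 7, 8, 9, 0xa, 0xb]      # chain from the example
print(hex(small_block_offset(sb_list, 0)))   # 0x800
print(hex(small_block_offset(sb_list, 17)))  # 0xe40
```

Small block 17 is the second small block within the third big block of the
chain, which is block 6, and block 6 starts at file offset 0xe00.<p>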

With this in mind, everything needed to pull a file out of a Laola
archive is available. You can test this with
"<a href=lls>lls -s</a>". But there is still more to it.<p>


<a name=n35>
<H4>3.5  Property Sets</H4>

Apart from just storing plain files it is also possible to store specially
structured "database" files. These structured files are called property sets.
There is a good article about how they are built on Microsoft's web site,
<a href=http://www.microsoft.com/oledev/olesrvr/propset.htm>"OLE Property Sets Exposed" by Charlie Kindel</a>. 
It claims that accurate information about property sets is provided with
the Win32 SDK, too.<p>

Summary and further information are
<font color=red>- still to be done ! -</font><p>


<a name=n4>
&nbsp;<p>
<H3>4. Trash Blocks</H3>

The last of the four basic parts is "trash data blocks". These trash blocks
are blocks stored in the document without being referenced by the Laola
system. There should be no such trash, but <em>sometimes</em> there is.
Notorious for this is the Microsoft Word option "fast saving" (switch it
off if you haven't yet). Documents stored this way usually consist of about
half garbage. Another example is Star Writer 3.1, which as a rule
produces 2 big blocks of trash.<p>

Some blocks consist only partially of trash; they could be called
stinky blocks. This is because the size of a property fits exactly to the
0x200 (0x40 for small blocks) byte boundary only by chance. So the
rest of the last block of a chain <em>nearly always</em> contains some
bytes of rubbish.<p>
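Both kinds of trash can be estimated with a small sketch. The function
names are hypothetical. The chains are those of the example file; a real
file would also need the chains of all properties stored in big blocks.<p>

```python
def unreferenced_blocks(maxblock, chains):
    """Big blocks 0 .. maxblock that appear in none of the given chains."""
    used = set()
    for chain in chains:
        used.update(chain)
    return sorted(set(range(maxblock + 1)) - used)


def slack_bytes(size, block_size):
    """Unused bytes at the end of the last block of a property."""
    return -size % block_size


chains = [
    [0],                               # $bbd_list
    [1, 5],                            # @root_list
    [2],                               # @sbd_list
    [3, 4, 6, 7, 8, 9, 0xa, 0xb],      # @sb_list
]
print(unreferenced_blocks(0xb, chains))    # []
print(hex(slack_bytes(0xf00, 0x200)))      # 0x100
```

So the example file has no trash blocks, but the last big block of the
small block file still carries 0x100 bytes that belong to no property.<p>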

Like all trash, data trash is troublesome. In some cases it is simply
annoying because it blows up the file's size. In every case it is relevant
to data security. Because you cannot know what's in there,
you lack control over your own data. It has even been reported on Usenet
that private mail and encrypted passwords ended up in
a Word file.<p>
 
The good thing is that, unlike nuclear waste, this data trash is removable.
Just look at the demonstration program "lclean" in the source code section of
this document. As for data trash, I've heard that Microsoft knows
about this OLE bug and provides a fix for 32 bit Windows systems. However,
if you use Windows 3.1 you probably have to rely on lclean.<p>

Comments appreciated!<p> 
<i>Martin</i><p>
<font color="#7f007f"><p align=center>- <i>The End</i> -</font><p>

<a name=n5>
&nbsp;<p>
<H3>TABLES</H3>

<pre>
<a name="t1">

Table 1: Block -1 (laola header)

offset  type value           const?  function
00:     stream $laola_id          !  identifier {d0 cf 11 e0 a1 b1 1a e1}
08:     long 0                    .  ?
0c:     long 0                    .  ?
10:     long 0                    .  ?
14:     long 0                    .  ?
18:     word 3b                   .  ? revision ?
1a:     word 3                    .  ? version  ?
1c:     word -2                   .  ?
1e:     byte 9                    .  ?
1f:     byte 0                    .  ?
20:     long 6                    .  ?
24:     long 0                    .  ?
28:     long 0                    .  ?
2c:     long $num_of_bbd_blocks   !  Number of big block depot blocks
30:     long $root_startblock     !  Root chain 1st block
34:     long 0                    .  ?
38:     long 1000                 .  ?
3c:     long $sbd_startblock      !  small block depot 1st block
40:     long 1                    .  ?
44:     long -2                   .  ?
48:     long 0                    .  ?
4c[]:   long $bbd_list[i]         !  array of $num_of_bbd_blocks big block
                                     numbers
The rest of block -1 should be:
        long -1                   .  

####
<a name="t2">

Table 2: Property Storage

offset  type value           const?  function
00:     stream $pps_rawname       !  name of the pps
40:     word $pps_sizeofname      !  size of $pps_rawname
42:     byte $pps_type            !  type of pps (1=storage|2=stream|5=root)
43:     byte $pps_uk0             !  ?
44:     long $pps_prev            !  previous pps
48:     long $pps_next            !  next pps
4c:     long $pps_dir             !  directory pps
50:     stream 00 09 02 00        .  ?
54:     long 0                    .  ?
58:     long c0                   .  ?
5c:     stream 00 00 00 46        .  ?
60:     long 0                    .  ?
64:     long $pps_ts1s            !  timestamp 1 : "seconds"
68:     long $pps_ts1d            !  timestamp 1 : "days"
6c:     long $pps_ts2s            !  timestamp 2 : "seconds"
70:     long $pps_ts2d            !  timestamp 2 : "days"
74:     long $pps_sb              !  starting block of property
78:     long $pps_size            !  size of property
7c:     long                      .  ?

</pre>
<hr size=2 noshade><p>

<H4>Solutions:</H4>

<dl>
<dt><b>Lesson 1:</b>
<dd>Find the block chain belonging to variable: $sbd_startblock<p>

<dt><b>Solution:</b>
<dd>The value of $sbd_startblock stands at position 0x3c of 
    the header. It's value is: 0x00000002 == 2. The value of 
    the big block depots entry with position 2 is: 0xfffffffe,
    that means: end of chain. So the sbd consists only out of
    big block 2.<p>
</dl>

<p></body></html>