Search                        Top                                  Index
TEACH DATASTRUCTURES                                   A.Sloman May 1988


DATA STRUCTURES

This teach file explains what data-structures are in general and gives
an introduction to the data-structures available in POP-11, whilst
mentioning facilities in other languages.

CONTENTS - (Use <ENTER> g to access required sections)

 -- Introduction
 -- What are data-structures?
 -- Data types
 -- Primitive and derived data-types
 -- Formal specificaton of a data-type
 -- Data-types in POP-11
 -- Compile time vs run time types
 -- How are higher level types implemented?
 -- Common data types
 -- Lazy evaluation
 -- Virtual data-types in POP-11
 -- Implicit structures - the POP-11 user stack, etc.
 -- Implementation issues
 -- Where does a structure end?
 -- Tagged architectures
 -- Garbage collection
 -- Association structures
 -- Structure sharing: context mechanisms
 -- Control structures
 -- Operating systems require data-structures
 -- Compilers require data-structures
 -- Conclusion
 -- Reading
 -- Relevant TEACH, HELP and REF files

-- Introduction -------------------------------------------------------

In a typical course on datastructures there would be a survey of some of
the main kinds of datastructures used in computing systems, techniques
for implementing them on digital computers, and algorithms for operating
on them.

For example, such a survey might include a discussion of integers,
floating point numbers, strings, words, vectors, arrays, records, lists,
trees and graphs (networks), and ways in which they could be implemented
in terms of bit patterns, the basic structures provided in a computer.
Algorithms discussed might include methods for searching a structure for
an element, re-ordering elements of a structure into numerical or
alphabetical order, matching two structures to see if they have the same
elements, and so on.

The books listed at the end include such surveys. However, this file not
an attempt to summarise or duplicate such standard material, but is
concerned with more general, almost philosophical, issues about what
datastructures are and their uses both inside and outside computers.

-- What are data-structures? ------------------------------------------

Data-structures are the entities that programs operate on. They are
metaphorically referred to as things in the machine (in memory or on
disk), but are usually "virtual" entities like lists, trees, networks,
strings, words, numbers, or even bit-patterns, rather than physical
entities like transistors, electrons, voltages, or bits of iron oxide.

This is analogous to the way in which a book contains virtual objects
like words, sentences, arguments, accounts, stories, etc. rather than
simply bits of paper, ink, atoms, molecules, etc. Similarly a person's
estate may include a piece of land, various buildings, money in the
bank, partial ownership of a company and so on. The contents of the
estate would change if some of the land were sold. The change is not one
that would show up as a visible new boundary on some field: it might be
accomplished simply by alterations to some paper (or computer-based)
records. The estate is a virtual data-structure, as are most of the
things in it.

Data-structures in a computer are usually used to represent things
outside the machine, and operations on them are usually intended to
represent operations on external things. This may be done for purposes
of keeping records, making predictions, controlling things, etc.
However, some structures in the machine are often used to refer to other
structures in the machine, and are therefore called "pointers".

Some data-structures are "persistent" or "permanent", i.e. they can be
stored in files which continue to exist whether a program is running or
not. Others are created within the program's address space while a
process runs, and are lost when the program finishes. An example of the
former would be an online library catalogue, or database of employee
information in a company. An example of the latter would be a POP-11
list, or a procedure activation record. (Structures stored in "saved
images" are an intermediate case.) (The POP-11 DATAFILE library program
allows some structures, like lists, to be treated as permanent.)

-- Data types ---------------------------------------------------------

Data-structures are usually divided into different "types", where each
type may have many instances. There are some exceptions. In POP-11, the
type "termin" has exactly one instance, the object -termin-, and the
type "boolean" has exactly two instances -true- and -false-.

Each type has a class of procedures which operate on its instances. In
some ways that is what defines the type, although there may be
similarities between different types because similar operations on them
are possible. Thus in POP-11 the list [a b c d] is similar to the vector
{a b c d} because you can ask for the -length- of either, or access or
update the third element of either, print either, and so on. But they
are different types.

This is because although you can ask for the first element of either,
the list has a special feature. You can ask for its "tail" (using the
procedure -tl- or -back-). If you take the tail of a list, and update
its first element, this will update the second element of the original
list. I.e. the tail is part of the original. For a vector
    allbutfirst(1, vector)
is superficially like tl(list):

    allbutfirst(1, {a b c d}) =>
    ** {b c d}

but this produces a new vector containing the last three elements. It is
not part of the original vector, unlike the tail of a list.

    vars list1=[a b c d];
    vars list2=tl(list1);
    list2 =>
    ** [b c d]
    99 -> hd(list2);
    list2 =>
    ** [99 c d]
    list1=>
    ** [a 99 c d]

I.e. updating list2 changed list1.

Moreover with a list, but not a vector, you can splice in new elements
and extend it, e.g.
    vars list = [a b c d];
    list =>
    ** [a b c d]
    [99 ^^(tl(tl(list)))] -> tl(tl(list));
    list =>
    ** [a b 99 c d]

So -list- has been lengthened by 1, by adding a new list link (a new
pair) into the chain. (See TEACH * WAL).

In addition to the procedures specific to a particular data-type there
are usually general procedures that operate on a range of different
data-types: e.g. the printing procedure or the equality tester operate
on all types, as do the procedures -datakey-, -dataword-, -isinteger-,
-islist-, -isvector- and other "recognizer" procedures. Even though the
same procedures can be used on two different types of objects, they will
generally produce different results. In POP-11 the concatenator <> works
on instances of most but not all types.

Summary:
    What defines a data-type is primarily its functionality, i.e.
    the class of procedures that can be used to create instances or
    operate on instances, and the results they produce; more generally
    what defines a data-type is what can be done with its instances.

Describing a data-type then involves specifying the procedures that can
operate on it, including the procedures for constructing instances,
recognising instances, accessing components (of a compled data type),
changing components, and so on. Sometimes this would include specifying
the relationships of the data-type to other types. E.g. in POP-11 the
components of strings and words are characters (i.e. 8-bit numbers). So
part of the specification of a string is that the string constructor
-consstring- takes characters, and that the component accessor -subscrs-
returns characters. (See HELP * STRINGS)

-- Primitive and derived data-types -----------------------------------

Some data-structures in a language are "primitive", whereas others
are "derived", i.e. define in terms of them. For example, in POP-11
pairs are given as primitive structures, but lists are a derived kind
of data-structure built out of pairs. If lists did not exist, users
could define them in terms of pairs. (See TEACH * WAL - What Are Lists).

Similarly, arrays are constructed from vectors.

From the point of view of the language implementor the only primitive
datatype is a bit, or perhaps a pattern of bits, since the machine will
provide procedures for operating on collections of bits 8, 16, or 32
bits long, (or more). So from an implementors point of view all the
familiar data-types are derived, data-types, though from the user's
point of view they may be primitive. Similarly the electronic engineers
who build the computers will treat the bit patterns as derived
abstractios, built on more primitive electronic structures and their
operations.

-- Formal specificaton of a data-type ---------------------------------

It is generally not important for the user HOW a data-type and its
procedures are implemented, as long as they work as specified. The
notion of "data-abstraction" is sometimes used to refer to this kind of
specification. It aids modular programming and allows the implementation
of a particular type to be changed without changing the other programs
that use it. For example, if all you know about pairs is that they are
created using -conspair- and that the procedures -front- and -back- are
used for accessing or updating their elements, then as long as you
simply use these three procedures on pairs your programs should not be
affected if a new technique for representing pairs in the machine is
implemented. Similarly it should matter if you move to another machine
where instead  of being implemented using 32-bit addresses they use
64-bits.

Alas, sometimes the method of implementation can affect users. For
example, when things go wrong deep in the system, the error message may
not be intelligible except in relation to how they are implemented.
Similarly performance issues may be affected, such as how much space
they use, or how fast programs run.

Occasionally different types are provided which are identical in their
abstract functionality, but differ in implementation details that make a
difference only to efficiency. E.g. there may be two ways of
implementing a kind of structure, one of which makes it faster to access
components, while the other makes instances take up much less space.
Which should be used would depend on whether space or time is more
important. An example in POP-11 would be -assoc- and -newproperty- both
of which create structures for storing relationships between objects.
However -assoc- does it by making a list of the associations, and
-newproperty- builds a "hash-coded" association table. For more on this
see HELP * ASSOC. The structures used by assoc will be smaller and
faster to search when they contain very few elements. The structures
built by newproperty will be faster if they contain lots of elements.
(There are other functional differences which are not relevant to this
discussion.)

It is useful if the nature of a data-type and the kinds of procedures
that can operate on it can be specified mathematically. This makes it
possible to have a well defined notion of what the implementation is
supposed to do, so that it can be tested. More importantly, if
application programs use data-types that are mathematically specified,
then makes it possible to give a mathematical specification of what the
programs are supposed to do, and this leads to the hope that:

    (a) it will be possible formally to prove that the program itself
    is correct according to the specification

    (b) it may even be possible (in some cases) for the program to be
    automatically generated from the specification.

This kind of work on "formal methods" is becoming increasingly
important in software engineering.

An example of the sort of thing that might go into a mathematical
specification would be an equation defining the relationships between
various functions. For examble if "<>" is a structure concatenator then
its specification might include the the following:

    length(S1 <> S2) = length(S1) + length(S2)

The specification for pairs might include

    front(conspair(X,Y)) = X

I.e. if you build a pair containing two objects then front gets you back
the first of them.

Alternatively a formal specification might indicate the behaviour of
pieces of program by relating prior states, the program fragment, and
the subsequent states.

E.g. if A is an array, N is an integer, X is any object, and
V is a variable, and initially A(N) = X, then after the assignment

    A(N) -> V

the following will be true
    V = X

-- Data-types in POP-11 -----------------------------------------------

POP-11 has a wide variety of built in data-types including
    numbers
        integers, bigintegers, decimals, ddecimals,
        ratios, complex numbers
    booleans
        i.e. true and false
    pairs (used to implement lists)
    strings
    vectors
    arrays
    procedures (yes they are a kind of data)
    closures (of procedures)
    properties

and more. For a full list of POP-11 data-types see REF * DATA.


Two important kinds of data-types are record-types and vector-types.
Examples of record-types are pairs and references. Examples of
vector-types are strings and full vectors. A record type has instances
with a number of fields, each field having a separate procedure for
accessing or updating it, whereas a vector type has instances whose
fields are accessed only via numerical subscripts, possibly using a
special subscripting procedure.

Thus, the field names for pairs are FRONT and BACK. For vectors the
subscripting procedure is -subscrv-, and for strings -subscrs-. Both
are vector classes.

Procedures defined to work on one data-type will not necessarily
work on another. E.g.

    vars pair,array,int;

    conspair(3,4) -> pair;
    newarray([1 5 1 5]) -> array;
    12345678 -> int;

    pair, array, int =>
    ** [3|4] <array [1 5 1 5]> 12345678

    front(pair) =>
    ** 3
    front(array) =>
    ;;; MISHAP - PAIR NEEDED
    ;;; INVOLVING:  <array [1 5 1 5]>
    ;;; DOING    :  front compile

    boundslist(array) =>
    ** [1 5 1 5]
    boundslist(pair) =>
    ;;; MISHAP - ARRAY NEEDED
    ;;; INVOLVING: [3|4]
    ;;; DOING    :  mishap boundslist compile

    int * int =>
    ** 152415765279684

    pair * pair =>
    ;;; MISHAP - NUMBER(S) NEEDED
    ;;; INVOLVING:  [3|4] [3|4]
    ;;; DOING    :  * compile

POP-11 provides recognisers for different data-types:

    ispair(pair), ispair(array), ispair(int) =>
    ** <true> <false> <false>

    isarray(pair), isarray(array), isarray(int) =>
    ** <false> <array [1 5 1 5]> <false>

    isinteger(pair), isinteger(array), isinteger(int) =>
    ** <false> <false> <true>

A recogniser should work with everything. The only recogniser that
recognises itself is -isprocedure-
    isprocedure(isprocedure) =>
    ** <true>

Besides the built-in data-types, POP-11, like many other languages,
provides facilities for the user to define new classes of records or
vectors. Higher level virtual data-types can be defined by the user in
ways sketched below.


-- Compile time vs run time types -------------------------------------

In some languages all identifiers have types and these are used by
the compiler to check consistency, allocate store, and do optimisations.

Standard ML is an example of a more recent language with a "polymorphic
type" system. Instead of defining one list membership function for lists
of integers and another for lists of words you can define a single
list membership function which will, at compile time, be specialised
according to the type of the arguments, where possible.

In POP-11 there are some identifiers of type "procedure", e.g. all infix
operations and identifiers declared using

    vars procedure f;  or   vars procedure (f1, f2, f3);

When such identifiers are used in procedure calls the compiler can
produce instructions for a fast procedure call, whereas for others it
has to produce instructions that check that the value really is a
procedure.

Apart from this, POP-11 has (at present) no syntactic typing. So only
when a value has been assigned to a variable is it possible to find out
what kind of object it has as its value. Lisp and Prolog are like this
too. It gives more flexibility and generality (and faster compilation)
at the cost of slower running because of the type checking.

The distinction between compile time and run time types is sometimes
described as a difference between syntactic and semantic types.

-- How are higher level types implemented? ----------------------------

The people who built the computers did not take account of the needs of
POP-11 and its data-types. So how does the computer get to know aout
them?

Computers are generally built with certain kinds of primitive structures
and resources for manipulating them. E.g. they come with mechanisms for
handling bit-patterns of various lengths and provide operations for
doing things like:

    tests:      are p1 and p2 equal?
                is the N'th bit of p1 set?

    copying:    copy contents of loc1 to loc2

    alteration: set the N'th bit of p1
                shift p1 N bits left or right

    arithmetic: treat p1 and p2 as numbers and store the result
                of adding them in loc1.


Unfortunately bit patterns are not an adequate representation for most
of the things we want to apply computers to. I.e. we need to represent
the shapes of objects, the structures of plans, sentences, family trees,
the contents of images, the organisation of a factory, the current state
of an order book, the stock in a warehouse, the contents of a
dictionary, etc. etc.

Similarly the operations that we need to perform on representations are
not necessarily the operations provided by the computer. E.g. we may
want to match two networks (graphs) and produce a description of how
they differ.

Computer scientists have therefore devised ways of implementing more
abstract data-types (virtual data-types) in terms of the primitive ones.
This enables us to talk about the computer as containing such things as
strings of characters, words, arrays, lists of words, trees, networks,
inference rules, data-bases, association tables, procedures, etc.

Associated with each class of data, or data-type, will be a family of
procedures, which help to define what that data-type is. E.g. associated
with a data-type may be procedures concerned with:

    constructing instances,
    recognising them,
    matching two structures,
    interrogating:
        e.g. how big is it? what is its N'th component?, etc.
    traversing (e.g. doing something to every component),
    searching,
    changing the contents of instances:
        e.g. inserting , deleting, replacing a component,
    sorting components
        (e.g. into alphabetic or numeric order)
    merging instances,
    building networks of instances,
    printing,
    reading in external representations,
    associating other entities with instances,
    sending messages to instances.


-- Common data types --------------------------------------------------

Most programming languages support booleans, integers, reals (decimals),
strings (one-dimensional arrays or vectors of characters),
multi-dimensional arrays, records of fixed length.

Usually it is up to the user to implement additional structures in terms
of these. Most books on data-structures show various ways to implement:

    stacks
    queues
    linked lists
    trees
    graphs (networks - directed or undirected)

From books on data-structures and algorithsm you will discover how to
implement a range of procedures associated with particular classes of
structures. E.g. there are algorithms for searching a large string to
find out whether a given string is contained in it. Similarly searching
a graph to see if a  smaller graph is containined in it, or checking
whether two graphs match.

Algorithms have to be very carefully designed: a bad design may be so
much slower, or require so much more space, that it is unusable on
larger data-structures.


-- Lazy evaluation ----------------------------------------------------

Sometimes structures are created only as required. This is sometimes
called "lazy evaluation". For example POP-11 allows the creation of
dynamic lists, defined in terms of a procedure which returns a new
result each time it is called. (A generator procedure). The procedure
PDTOLIST (ProceDure TO LIST) when applied to such a generator returns a
dynamic list.

    lvars counter=0;
    define count()-> result;
        lvars result=counter;
        counter + 1 -> counter;
    enddefine;

    vars dlist=pdtolist(count);
    dlist =>
    ** [...]    ;;; unexpanded dynamic list
    hd(dlist) =>
    ** 0
    hd(tl(dlist)) =>
    ** 1
    dlist =>
    ** [0 1 ...]
    hd(tl(tl(dlist))) =>
    ** 2
    dlist =>
    ** [0 1 2 ...]
    "cat" -> dlist(5);
    dlist =>
    ** [0 1 2 3 cat ...]

DLIST is a list of (potentially) infinite length.

AI programs often need to create structures that are potentially
infinite. For example, a knowledge store may contain some axioms and
inference rules, and may be used as if all the consequences derivable
from the axioms via the rules were in the store already. Each time you
ask for some information, if it is not explictly there the data-base
manager checks whether it is derivable, and if so adds it explicitly,
reporting that it is there. The user cannot tell that it wasn't there
all along, except that there may be a delay before the answer comes.


-- Virtual data-types in POP-11 ---------------------------------------

In POP-11 the range of procedures associated with a data-type can be
found in the HELP CLASSES file. You can find more technical detail in
REF DATA and REF KEYS.

POP-11 is unusual in that it allows data-structures to be treated as
procedures, in which case the "class_apply" procedure for the data type
will be invoked. This is analogous to the approach of "object oriented"
systems, which allow "messages" to be sent to data.

In addition to general procedures associated with a range of data-types
there will also be very specific procedures. E.g. arithmetic operations
such as subtraction and multiplication are not defined on lists, though
they are defined on a range of number types:

    integers, big integers, ratios,
    decimals, ddecimals, complex numbers.

Similarly there is a collection of procedures associated with arrays
(See HELP ARRAYS), which will not work on other objects.

Some languages have a fixed set of data-types. Others, like POP-11, C,
Pascal, Algol68, ML, and modern versions of Lisp, allow the user to
define new data-types, since the built in set may not meet the needs of
all applications.

However, even the built-in methods of extending data-types may be too
restricted.

For example, the most common extensions are to allow new record classes,
and new vector classes (or array classes). The built in mechanisms will
then provide a means of building a new vector or record of the newly
defined class.

However, if what you want is to be able to represent a more complex
structure e.g. in the form of a network or tree, then you will usually
have to define for yourself the procedures which create such entities
out of those the system can already handle. Similarly, the built in
mechanisms, which usually provide procedures for recognition, accessing
and modification of structures, may not provide all the kinds of
procedures required. E.g. you may wish to have an accessing procedure
which not only tells you what the first element of a structure is, but
also records the number of times it has been invoked on that structure
by modifying one of the fields of that structure. If so you will have to
define it yourself.

In this way, programs can define more and more abstract kinds of
"virtual" structures built layer by layer on top of those given by the
machine. An interesting question for AI is whether the kinds of things
that we refer to in talking about the mental states of intelligeng
agents could be defined in terms of abstract virtual data-structures.
For example beliefs, desires, intentions, skills, personality traits,
might all be virtual data-structures, perhaps often made up of component
virtual data-structures. E.g. the belief that food is in the oven, the
desire to eat the food and the intention to eat it may all share a
common sub-structure. However these are controversial issues.


-- Implicit structures - the POP-11 user stack, etc. ------------------

The POP-11 stack is not a data-structure that users can point to. It is
a unique part of the address space used by the system. It can (with
caution!) be treated as a variable length vector, and the procedure
SUBSCR_STACK (and its updater) can be used to access its components.

It is accessed implicitly by:
    assignments:            -> x      (remove top of stack)
    variable accesses:      x;        (copy value to top of stack)
    calls of procedures:    f(x)      (arguments taken off stack,
                                        results left on stack)
    calls of updaters:      -> f(x)   (if something is taken off stack)

TEACH * STACK explains more. HELP * STACK summarises stack operations,
including explicit stack operatoins like * DUP, SUBSCR_STACK and =>


Besides the built in user stack, there are other "hidden" stacks used
by POP-11, e.g. the stack of procedure activations, the Prolog
continuation stack.

In addition users may create their own stacks, e.g. using lists (with a
garbage collection overhead) or vectors.


-- Implementation issues ----------------------------------------------

Given that the computer is provided with a linear (virtual) memory
composed of bit patterns (bytes, words, long-words), how can a rich
variety of data-types be created?

The usual technique is to set aside a portion of the memory called
the "heap", and to divide it up into sub-portions representing different
sorts of structures. Systems vary in the degree of flexibility. Issues
include:

1. Does the heap have to have a fixed size or can it be dynamically
enlarged or compressed?

2. Does the allocation of the heap to different data-types have to be
rigid or can the system determine allocation dynamically as required?

3. Can the system automatically reclaim and re-allocate space that is
no longer required? (I.e. is garbage collection provided.)

4. Must everything always be created explicitly or can some structures
be represented implicitly and extended "on demand" (e.g. dynamic lists
in POP-11 or sparse arrays, most of whose locations do not exist).

5. Can type-checking (i.e. ensuring that a procedure is not applied to
the wrong sort of data) be done at compile time or must it be done at
run time? Languages like Pascal, C and ML do it at compile time, whereas
Pop-11 and Lisp do it at run-time.

6. Should run-time checking be implemented by means of "tags" or by
"keys". This issue is discussed below.

If a structure is represented by a portion of the heap of arbitrary size
then it will not generally be able to fit into the computer's registers,
nor be able to be operated on as a single chunk. It is therefore
normally represented by a "pointer" i.e. a smaller structure (usually
the size of a machine word) that says where it is in the heap, i.e.
gives its address.

Entities that can fit into a single machine word need not be represented
by pointers. In POP-11 there are two such types: integers and decimals:
they are called "simple" and all other types are called "compound".
(But BIGINTEGERS and DDECIMALS are compound.)


-- Where does a structure end? ----------------------------------------

Given a pointer that indicates where a structure is in the heap a
procedure to operate on that structure will need to know how much of the
heap is part of the same structure, i.e. what its size is.

There are three main cases, depending on the type of structure:

1. Contiguous fixed length (record classes):
    All instances of the type have the same length. E.g. all PAIRS have
    two elements. All RATIOS have two elements. All REFERENCES have only
    one element. A user-defined record calss might consist of three-
    element structures.

    In this the type of structure and its starting point in the heap
    (i.e. its address) suffices to determine how much of the heap
    the structure occupies.

2. Contiguous variable length (vector classes):
    Here different instances of the same type have different lengths.
    E.g. 'cat' and 'catch' are strings (character vectors) with three
    and five elements respectively. So knowing that something is a
    string does not tell you how long it is. Usually the length is
    indicated near the beginning of the structure. E.g. it might be
    in the first word of the structure.

3. Non-contiguous structures.
    Lists in POP-11 are made of two-element links called pairs. The
    pairs comprising one list may be scattered all over the heap. In
    this case the basic mechanisms manage only the contiguous structures
    and special purpose list processing procedures have to be defined
    for dealing with the non-contiguous virtual structures. E.g. the
    LISTLENGTH procedure chains down the list counting elements until
    it gets to the end. (Would it be possible to associate the remaining
    length with each list link, if each link had three, instead of two
    fields?)

There are also cases where the whole structure is never created in full.
Instead bits of it are created as they are needed. The dynamic list,
illustrated above, demonstrates this.

There are (at least) two reasons why non-contiguous structures are
sometimes needed.

a. At the time the structure is created it may not be known how many
elements it is going to have eventually. So the initial size allocated
on the heap may not be enough. If it later has to be extended there
may not be spare space adjacent to it on the heap. Instead of shifting
everything about to make more room it may be simpler to "chain"
different structures together, treating them as one large structure.

b. It may be necessary for one substructure to be part of two or more
larger structures. This can be achieved by allowing larger structures to
share sub-structures in different parts of the heap. E.g suppose you want
to make a list representing all the subsets of a set of elements in a
given list [a b c]. Including the empty set, which is a subset of
everything, the eight sub-sets would be (using the convention that
elements are stored in alphabetical order):

    1   2    3    4    5      6      7      8
    []  [a]  [b]  [c]  [a b]  [a c]  [b c]  [a b c]

By letting list 3 be the "tail" lf list 5, and letting list 4 be the
tail of lists 6 and 7, and letting 7 be the tail of 8, we can save quite
a lot of construction of new lists.


-- Tagged architectures -----------------------------------------------

Sometimes the type of a structure determines its length. Sometimes it
doesn't. But in any case it is necessary to determine the type of a
structure, e.g. in order to decide how an operation is to be applied
to it (e.g. how should it be printed? or how should two of them be
compared by the equality test?).

One way to indicate type is to use some extra bits in the "pointer",
which then has a type-field (or "tag") and an address-field saying where
the stucture is in the heap. Any procedure attempting to identify the
type need only look at the tag bits, and use them as a key into a
table.

A related technique is to divide the address space into different
"cages" and put data of different types in the different cages. The
different cage boundaries can then, in principle, be shifted around
depending on how many different instances there are of each type.

Another alternative, used in POPLOG, is to add an extra field to each
structure (apart from the simple ones). The extra field points to a
"key" for the data-type. The key is itself a record of a special type,
and contains information about the type, e.g. how to print it, how to
select or update fields, etc. This is a very simple mechanism but has
the disadvantage that checking the type of any item requires getting
the key field out of memory, and this can be slower than simply
inspecting the pointer to examine the tag bits or check which address
range it points to.

The problem with tag bits is that it may require a non-standard memory
(e.g. lisp machines have 36 bit word length), and moreover there may not
be enough tag bits to cater for all the data-types required.


-- Garbage collection -------------------------------------------------

There are many different techniques for this, depending on the kind of
data representation used. POPLOG uses a "stop and copy" method, which
requires more space to be available during garbage collection but has
the advantage that only the non-garbage items are ever looked at.

Alternatives involve shuffling items down the heap to compact them,
which has the advantage that the order of items on the heap always
corresponds to the order in which they were created, which can be
useful.

Incremental garbage collectors are useful for programs that need to
operate in "real time" and therefore cannot suddenly stop for a length
garbage collection.

Some garbage collection methods require an extra bit of every item to be
made available to mark it as already examined during the sweep through
the heap.


-- Association structures ---------------------------------------------

These are of various kinds. E.g.
    two arrays with N'th elements of each associated
    one array of two element fields
    association lists (see HELP *ASSOC)
    hash tables (see HELP * NEWPROPERTY, HELP * NEWANYPROPERTY)


-- Structure sharing: context mechanisms ------------------------------

Sometimes it is useful to associate values with objects in such a way
that the value can change over time, or be different in different
"contexts", e.g. different hypothetical situations considered during
planning.

"context" or "viewpoint" mechanisms make this possible without having to
duplicate everything that changes, by allowing one context to be a
descendent of another in a context tree, and only storing in a lower
level context items that have changed from higher level contexts.
An illustrative library package is described in TEACH * VIEWS and
HELP * VIEWS (in POPLOG 12.6 onwards).


-- Control structures -------------------------------------------------

Different control structures are required for operations on different
sorts of data. For example, traversing a linear list or vector can
easily be done by means of a loop:

    for x in list do .... x.... endfor

    for x from 1 to datalength(vector) do
        .....subscrv(x,vector)...
    endfor


However, for traversing a tree it is normally useful to use recursion.
E.g. to count all the atoms in a tree:

    define count(tree) -> total;
        lvars tree,total;
        if tree == [] then 0 -> total
        elseif atom(tree) then 1 -> total
        else count(hd(tree)) + count(tl(tree)) -> total
        endif
    enddefine;

    vars tree=[1 2 [3 [4 [5] ] 6]];

    count(tree) =>
    ** 6

    trace count;
    count(tree) =>
    >count [1 2 [3 [4 [5]] 6]]
    !>count 1
    !<count 1
    !>count [2 [3 [4 [5]] 6]]
    !!>count 2
    !!<count 1
    !!>count [[3 [4 [5]] 6]]
    !!!>count [3 [4 [5]] 6]
    !!!!>count 3
    !!!!<count 1
    !!!!>count [[4 [5]] 6]
    !!!!!>count [4 [5]]
    !!!!!!>count 4
    !!!!!!<count 1
    !!!!!!>count [[5]]
    !!!!!!!>count [5]
    !!!!!!!!>count 5
    !!!!!!!!<count 1
    !!!!!!!!>count []
    !!!!!!!!<count 0
    !!!!!!!<count 1
    !!!!!!!>count []
    !!!!!!!<count 0
    !!!!!!<count 1
    !!!!!<count 2
    !!!!!>count [6]
    !!!!!!>count 6
    !!!!!!<count 1
    !!!!!!>count []
    !!!!!!<count 0
    !!!!!<count 1
    !!!!<count 3
    !!!<count 4
    !!!>count []
    !!!<count 0
    !!<count 4
    !<count 5
    <count 6
    ** 6


-- Operating systems require data-structures --------------------------

An operating system will generally require a range of data-structures
representing the current state of the file-store, the different
processes currently running, the allocation of processes to main memory
and to disk, information about various users and the groups they belong
to, the various devices attached to the computer (e.g. disks, terminal
channels, frame-stores, etc.). In many systems there will be accounting
information associated with each user, e.g. a record of the amount of
CPU time, the number of disk accesses, the size of memory used for
various lengths of time, etc.


-- Compilers require data-structures ----------------------------------

A compiler needs to store information about various symbols that can be
recognised in the input stream (a symbol table), information about the
syntactic rules of the language, information about the machine code
equivalents of various kinds of operations in the language being
compiled (i.e. its semantics). During compilation it has to build
temporary structures representing what it has read in so far and how it
has analysed it. Some compilers build parse-trees as an intermediate
representation.

The result of compilation may be a new structure, a record consisting of
machine instructions to be run and some additional data-structures
representing information required when the procedure runs.

In POP-11 procedures are compiled into records which contain a range of
information, e.g.
    the key (i.e. a pointer to the procedure key)
    what type of procedure it is (there are several types in POP-11)
    the PDPROPS (usually containing the name of the procedure0
    the PDNARGS (a field specifying the number of arguments)
    the UPDATER (a field specifying the action to be executed when
        the procedure is called to the right of an assignment arrow.)

In addition there will be information not accessible to the user, about
the local variables and other structures used by the procedure, and the
other procedures it invokes.

-- Conclusion ---------------------------------------------------------

This file does not pretend to be a complete survey of the concepts,
problems and techniques associated with the notion of a datastructure.
It merely introduces the issues. A particularly important issue that has
not been addressed, and which is specially relevant to the design of
intelligent systems, is how a machine can use structures within it to
refer to things outside itself. But this problem arises also for human
beings. At least we know that machines can use pointers to refer to
things inside themselves.

-- Reading ------------------------------------------------------------

In general books on this topic tend to be rather dated and simplistic.
But you should read some anyway, in order to see what the "standard"
material looks like. New books are coming out all the time, and it is
very likely that there are some very good new ones that I don't know
about.

The first boook is good and cheap.

Seymour Lipschutz
    Theory and problems of data structures
    Schaum's outline series
    McGraw-Hill 1986            Paperback 7.95

Here are more, culled from a reading list produced by Jon Cunningham,
with his comments, plus a few cross-references found in other texts:

Mark Elson
    Data Structures
    Science Research Associates, 1975
    This one looks quite good, but the library only has three copies.

E Horowitz & S Sahni.
    Fundamentals of Data Structures
    Apparently they invent their own language, which could be
    confusing.

The following two books use PASCAL, but you should be able to make
sense of them.

Page & Wilson
    Information representation and manipulation in a computer
    (2nd Edition) Cambridge Univ Press, 1978

Aho, Hopcroft & Ullman.
    Data Structures and Algorithms

D. Knuth
    Fundamental Algorithms, (chapter 2).

M.J.R Shave
    Data Structures
    McGraw-Hill, London 1975.
    Elementary, but dated and uses ALGOLW

D. Coleman
    A structured approach to data
    Macmillan, London 1978

-- Relevant TEACH, HELP and REF files ---------------------------------

There are several TEACH, HELP, and REF files in POPLOG that provide
additional information that you may find useful:

TEACH * WAL tells you how lists are represented in terms of PAIRS in
            POP-11.

TEACH * LISTS, * SETS, * SETS2 (how to represent sets in terms of lists)

The following explain some commonly used data types
HELP * PAIR, * LISTS, * VECTORS, * STRINGS, * ARRAYS, * REF

HELP * PDTOLIST     - on dynamic lists
HELP * PROPERTIES   - on structures mapping objects to objects
HELP * CLOSURES     - structures combining a procedure and some data

Overview help files
HELP * DATASTRUCTURES, * CLASSES,

Defining new record or vector types
HELP * RECORDCLASS, * VECTORCLASS

HELP * MATH, and REF * NUMBERS, describe number data types in POPLOG
    and the operations on them.

REF  * DATA, * RECORDS, * VECTORS, * STRINGS, * KEYS, * PROPS

--- C.all/teach/datastructures -----------------------------------------
--- Copyright University of Sussex 1988. All rights reserved. ----------