Finding the i'th Letter of the n'th Word of a DOL-Sequence

FINDING THE iTH LETTER OF THE nTH WORD OF A DOL-SEQUENCE

David Swart, University of Waterloo

December 1998

The following is the documentation for the algorithm my MMath thesis was based on. The technical report outlining the main results of my thesis should give a good idea of the purpose and the structure of the algorithm. I have provided the code for the implementation that uses the C++ double data type as well as a version which handles arbitrarily large integers.

We chose to implement our algorithm in C++, although most other procedural languages would have also sufficed. If the reader wishes to implement the algorithm or perhaps even improve upon what we provide, this chapter provides details of our implementation. This chapter also serves as the documentation for our program.

Running the Program

To run our program type:

./main [-bfmp]

Our program takes the specification for a homomorphism Phi, the integers i and n, and the letter a, from standard input. Our program then proceeds to compute the ith letter of Phiⁿ(a) and then outputs the result to standard output

Including the flag b tells the program to compute the letter using the Basic Tree Descent Algorithm.

Including m allows the user to specify a range of values for i such as '1-5,13,28' instead just one integer.

The f flag is used to tell the program that the given homomorphism has an infinite fixed point starting with a and allows the program to calculate the value of i accordingly.

Finally, the p flag is used to let the program prompt the user for the required input, which is useful when entering data using a keyboard rather than a file.

Sample Input and Output
The following illustrates the required input format.

a bc
c da
b be
d dd
e e
@
100000 
a 
10000

The homomorphism is specified with one production per line. Each line is of the form ax (spaces and tabs are ignored) where Phi(a)=x. The @ symbol signifies the end of Phi's specification. Finally n, a, and i follow, delimited by white space.

Here we have the coresponding output.

*************The Alphabet:
abcde
*************The Homomorphism:
e => e
d => dd
b => be
c => da
a => bc

Vital: abcde
Mortal:
Recursive: abcde
Expanding: d
Exponential: acd
Find:
  Find called with n=100000,a=a,i=10000.
  bb
Case 3
  -1
The output begins by printing the alphabet, the homomorphism, and the
sets of vital letters, mortal letters, and so on.
When Find is called, it reports the values of n, a, and i,
and then starts descending the derivation tree.
In this case, the derivation path starts at 'a', and proceeds to 'b'
and then encounters another 'b'; a repeated letter is found.
We see that our algorithm enters Case 3 and does a binary search for 
t = 9999.  Since the value of x is polynomial, 
the algorithm starts again with n=9999, a='b' and 
i=10000.
This time we enter Case 2, BasicTreeDescent is called, and
the result 'e' is reported.


Program Modules

The data types that our program deals with are not too complex. We want tools 
available to manipulate sets of letters, arbitrarily large integers, 
and matrices and vectors with such integers as elements.
Our algorithm builds upon these to manage higher levels of abstraction
such as homomorphisms and derivation trees.  
Here, we discuss each of the data types mentioned above as well as their
associated methods. 



A set of symbols is needed to represent things such as the letters accessible 
from 'a', or the set of mortal letters with respect to Phi.
The SymbolSet class stores this data as an array of sixty-two integers 
which correspond to the occurrences of each of the upper and lower 
case letters and ten numerals.
The SymbolSet class has methods to add and delete letters to a set, to
test whether a set contains a specified letter, to test whether two 
sets are equal or disjoint, 
to supply the Parikh vector or a string corresponding to a given set,
and to supply the cardinality of the set itself.



With the intention of eventually replacing the data type used to store 
integers with a data type more capable of holding larger numbers, 
a separate module is used to define the LargeNumber data type.
Initially, LargeNumbers were simply defined as doubles and,
as we will see below, replaced with
a data type capable of storing arbitrarily large numbers.



The code dealing with matrices and vectors is found in the 
MatrixUtil module.  While there are many publicly available
methods to implement matrices, we also require that our matrices
have LargeNumbers as elements.
Naturally, we use an array of LargeNumbers to store the data. 
In addition to the standard operations one might expect in a Matrix class 
(such as specifying the value of an element, or multiplying two matrices), 
we include the procedures Power and SuperPower as special methods.


Recall that we require Power and SuperPower to restrict
each incidence matrix to only the letters accessible from a given word.
Hence, we also include another method Clean to remove the rows and 
columns corresponding to these unaccessible letters.
At first, Clean merely set the elements of each unnecessary row and
column to zero, but too much time was wasted adding and multiplying zeros 
together.  The method we finally arrived at stores two different values 
for the size of the array: one value to describe the
number of elements allocated in memory, the other to specify the how many
rows and columns are actually being used.  
Thus, the procedure Clean entails copying the useful elements of the 
matrix to the top left corner, and then reducing the number of 
rows and columns begin used.



Our most important data structure is our homomorphism.
Paired with an alphabet, a homomorphism becomes a D0L-scheme and hence the
DOLScheme is the most important class our algorithm deals with.
The DOLScheme class has methods which allow us to create its incidence 
matrix and to calculate the various sets of symbols
such as the polynomially growing letters.
More importantly, however, the procedures Length, BasicTreeDescent,
GenericLength, and Find are also part of the DOLScheme class.
The DOLScheme stores the alphabet Sigma as a SymbolSet
and the homomorphism (implemented separately as the class Homom)
is a linked list of character-string pairs.


As mentioned above, the DOLScheme module contains a 
substantial amount of our algorithm.  
The details of our algorithm have been explained and defined in 
the technical report mentioned above and our program 
is mostly a straightforward implementation of the given pseudocode.
We say `mostly' since the pseudocode 
does not provide tedious details such as those showing how 
to manage the allocation and deallocation of memory,
to restrict a matrix to the letters accessible from a given word,
or to implement a binary search.
Since such details are not terribly enlightening, we omit them here.



In BasicTreeDescent it is not necessary to store any details
of the derivation tree. In Find, however, we need to keep track of 
the derivation path and the branches not taken at level j, that is,
a_j, x_j, and y_j.
The class DerivTree is written for such a purpose and contains several
useful methods. 
For instance, we have methods to set and read the values of 
a_j, x_j, and y_j,
to find whether a repeated letter has been 
encountered yet, and to return the set of letters contained in 
x and y (useful for determining whether x 
or y is exponentially growing, for example).
Again, we simply use two arrays of SymbolSets and one array 
of letters to store this data.



While the Interface module and the Main module do not define any 
data structures or classes, they do provide a means 
to call the algorithm with a variety of inputs and parameters.  
The Interface module allows the user to 
input homomorphisms and arbitrarily large integers.
The Main module determines whether the user should be prompted for 
input or not, whether BasicTreeDescent or Find is called, and
whether tracing should be turned on or off.
Once all the necessary data has been input, then the ith letter of 
Phiⁿ(a) is calculated.  
Since both the Interface module and the Main module 
are not vital to our algorithm,
they can be replaced by any other program that wishes to call Find.

Multiple-precision Version
 
It is necessary to ensure that our algorithm can handle arbitrarily large 
integers. For instance, on the machine our algorithm was implemented 
on, the double data type does not have enough precision to calculate that
Phi^2¹⁰⁰+1(b)=1 for 
'b'->'b1', '1'->'10', and '0'->'00'.
Hence we needed a package by providing a multiple-precision 
data type, capable of storing arbitrarily large numbers.  

The requirements for our package are fairly simple.
In addition to being able to assign the value of one integer a to another
integer b, we required the ability to compute 
a+b, a-b, a*b, 
a mod b, floor(a-b), 
and to determine whether 
a<b, a=b, a>b. 
Since most multiple precision packages already have this capability, I 
chose to use the Berkeley MP package
which some Unix systems provide (in our case, SunOS version 4.1).


There are many other multiple precision packages available if the reader 
wishes to use some other package. 
To do so, the LargeNumber module need only incorporate the new multiple 
precision package to provide the following functions.


move(a,b) : moves the value of a to b 

madd(a,b,c) : puts the value a+b into c 

msub(a,b,c) : puts the value a-b into c 

mult(a,b,c) : puts the value a*b into c 

mdiv(a,b,c,d) : puts the value floor(a/b) into c 
                       and the value a mod b into d 

mcmp(a,b) : returns 0 if a=b, 1 if a>b, 
                   and -1 if a<b 
 
mprint(a) : simply outputs the value of a in decimal 

itom(i) : returns a pointer to a new variable initialized to i

mfree(a) : frees the memory allocated for the variable a.



An unfortunate circumstance of Berkeley MP occurs when 
calculating expressions such as j<- tq+r.  
Such calculations must be 
broken up into the two statements: mult(t,q,j); madd(j,r,j);.
which render the code a little less readable.
In retrospect, the capabilities of C++ could have been used to define 
operators between a newly defined multiple-precision class --- as opposed to 
just the data type it is now --- which would have allowed for more 
complex expressions.
          
Testing for Correctness

If the reader plans on implementing our algorithm, or just modifying our 
implementation it may be useful to have the test cases we developed.
Our original test set was meant to realize each case of our algorithm.
As the program was being developed, various values of 
Phi, n, a, and i would produce erroneous output.  
A new test case was added for each bug 
fixed until we finally arrived at our
final set of test cases.



 DMS 1998