FINDING THE iTH LETTER OF THE nTH WORD OF A DOL-SEQUENCE


David Swart, University of Waterloo

December 1998

The following is the documentation for the algorithm my MMath thesis was based on. The technical report outlining the main results of my thesis should give a good idea of the purpose and the structure of the algorithm. I have provided the code for the implementation that uses the C++ double data type as well as a version which handles arbitrarily large integers.

We chose to implement our algorithm in C++, although most other procedural languages would have also sufficed. If the reader wishes to implement the algorithm or perhaps even improve upon what we provide, this chapter provides details of our implementation. This chapter also serves as the documentation for our program.

Running the Program

To run our program type: Our program takes the specification for a homomorphism Phi, the integers i and n, and the letter a, from standard input. Our program then proceeds to compute the ith letter of Phin(a) and then outputs the result to standard output

  • Including the flag b tells the program to compute the letter using the Basic Tree Descent Algorithm.
  • Including m allows the user to specify a range of values for i such as '1-5,13,28' instead just one integer.
  • The f flag is used to tell the program that the given homomorphism has an infinite fixed point starting with a and allows the program to calculate the value of i accordingly.
  • Finally, the p flag is used to let the program prompt the user for the required input, which is useful when entering data using a keyboard rather than a file.

    Sample Input and Output
    The following illustrates the required input format.

    a bc
    c da
    b be
    d dd
    e e
    @
    100000 
    a 
    10000
    
    The homomorphism is specified with one production per line. Each line is of the form ax (spaces and tabs are ignored) where Phi(a)=x. The @ symbol signifies the end of Phi's specification. Finally n, a, and i follow, delimited by white space.

    Here we have the coresponding output.

    *************The Alphabet:
    abcde
    *************The Homomorphism:
    e => e
    d => dd
    b => be
    c => da
    a => bc
    
    Vital: abcde
    Mortal:
    Recursive: abcde
    Expanding: d
    Exponential: acd
    Find:
      Find called with n=100000,a=a,i=10000.
      bb
    Case 3
      -1
    The output begins by printing the alphabet, the homomorphism, and the
    sets of vital letters, mortal letters, and so on.
    When Find is called, it reports the values of n, a, and i,
    and then starts descending the derivation tree.
    In this case, the derivation path starts at 'a', and proceeds to 'b'
    and then encounters another 'b'; a repeated letter is found.
    We see that our algorithm enters Case 3 and does a binary search for 
    t = 9999.  Since the value of x is polynomial, 
    the algorithm starts again with n=9999, a='b' and 
    i=10000.
    This time we enter Case 2, BasicTreeDescent is called, and
    the result 'e' is reported.
    
    
    

    Program Modules

    The data types that our program deals with are not too complex. We want tools available to manipulate sets of letters, arbitrarily large integers, and matrices and vectors with such integers as elements. Our algorithm builds upon these to manage higher levels of abstraction such as homomorphisms and derivation trees. Here, we discuss each of the data types mentioned above as well as their associated methods.


    A set of symbols is needed to represent things such as the letters accessible from 'a', or the set of mortal letters with respect to Phi. The SymbolSet class stores this data as an array of sixty-two integers which correspond to the occurrences of each of the upper and lower case letters and ten numerals. The SymbolSet class has methods to add and delete letters to a set, to test whether a set contains a specified letter, to test whether two sets are equal or disjoint, to supply the Parikh vector or a string corresponding to a given set, and to supply the cardinality of the set itself.


    With the intention of eventually replacing the data type used to store integers with a data type more capable of holding larger numbers, a separate module is used to define the LargeNumber data type. Initially, LargeNumbers were simply defined as doubles and, as we will see below, replaced with a data type capable of storing arbitrarily large numbers.


    The code dealing with matrices and vectors is found in the MatrixUtil module. While there are many publicly available methods to implement matrices, we also require that our matrices have LargeNumbers as elements. Naturally, we use an array of LargeNumbers to store the data. In addition to the standard operations one might expect in a Matrix class (such as specifying the value of an element, or multiplying two matrices), we include the procedures Power and SuperPower as special methods.

    Recall that we require Power and SuperPower to restrict each incidence matrix to only the letters accessible from a given word. Hence, we also include another method Clean to remove the rows and columns corresponding to these unaccessible letters. At first, Clean merely set the elements of each unnecessary row and column to zero, but too much time was wasted adding and multiplying zeros together. The method we finally arrived at stores two different values for the size of the array: one value to describe the number of elements allocated in memory, the other to specify the how many rows and columns are actually being used. Thus, the procedure Clean entails copying the useful elements of the matrix to the top left corner, and then reducing the number of rows and columns begin used.


    Our most important data structure is our homomorphism. Paired with an alphabet, a homomorphism becomes a D0L-scheme and hence the DOLScheme is the most important class our algorithm deals with. The DOLScheme class has methods which allow us to create its incidence matrix and to calculate the various sets of symbols such as the polynomially growing letters. More importantly, however, the procedures Length, BasicTreeDescent, GenericLength, and Find are also part of the DOLScheme class. The DOLScheme stores the alphabet Sigma as a SymbolSet and the homomorphism (implemented separately as the class Homom) is a linked list of character-string pairs.

    As mentioned above, the DOLScheme module contains a substantial amount of our algorithm. The details of our algorithm have been explained and defined in the technical report mentioned above and our program is mostly a straightforward implementation of the given pseudocode. We say `mostly' since the pseudocode does not provide tedious details such as those showing how to manage the allocation and deallocation of memory, to restrict a matrix to the letters accessible from a given word, or to implement a binary search. Since such details are not terribly enlightening, we omit them here.


    In BasicTreeDescent it is not necessary to store any details of the derivation tree. In Find, however, we need to keep track of the derivation path and the branches not taken at level j, that is, aj, xj, and yj. The class DerivTree is written for such a purpose and contains several useful methods. For instance, we have methods to set and read the values of aj, xj, and yj, to find whether a repeated letter has been encountered yet, and to return the set of letters contained in x and y (useful for determining whether x or y is exponentially growing, for example). Again, we simply use two arrays of SymbolSets and one array of letters to store this data.


    While the Interface module and the Main module do not define any data structures or classes, they do provide a means to call the algorithm with a variety of inputs and parameters. The Interface module allows the user to input homomorphisms and arbitrarily large integers. The Main module determines whether the user should be prompted for input or not, whether BasicTreeDescent or Find is called, and whether tracing should be turned on or off. Once all the necessary data has been input, then the ith letter of Phin(a) is calculated. Since both the Interface module and the Main module are not vital to our algorithm, they can be replaced by any other program that wishes to call Find.

    Multiple-precision Version

    It is necessary to ensure that our algorithm can handle arbitrarily large integers. For instance, on the machine our algorithm was implemented on, the double data type does not have enough precision to calculate that Phi2100+1(b)=1 for 'b'->'b1', '1'->'10', and '0'->'00'. Hence we needed a package by providing a multiple-precision data type, capable of storing arbitrarily large numbers.

    The requirements for our package are fairly simple. In addition to being able to assign the value of one integer a to another integer b, we required the ability to compute a+b, a-b, a*b, a mod b, floor(a-b), and to determine whether a<b, a=b, a>b. Since most multiple precision packages already have this capability, I chose to use the Berkeley MP package which some Unix systems provide (in our case, SunOS version 4.1).

    There are many other multiple precision packages available if the reader wishes to use some other package. To do so, the LargeNumber module need only incorporate the new multiple precision package to provide the following functions.

    move(a,b) : moves the value of a to b
    madd(a,b,c) : puts the value a+b into c
    msub(a,b,c) : puts the value a-b into c
    mult(a,b,c) : puts the value a*b into c
    mdiv(a,b,c,d) : puts the value floor(a/b) into c and the value a mod b into d
    mcmp(a,b) : returns 0 if a=b, 1 if a>b, and -1 if a<b
    mprint(a) : simply outputs the value of a in decimal
    itom(i) : returns a pointer to a new variable initialized to i
    mfree(a) : frees the memory allocated for the variable a.

    An unfortunate circumstance of Berkeley MP occurs when calculating expressions such as j<- tq+r. Such calculations must be broken up into the two statements: mult(t,q,j); madd(j,r,j);. which render the code a little less readable. In retrospect, the capabilities of C++ could have been used to define operators between a newly defined multiple-precision class --- as opposed to just the data type it is now --- which would have allowed for more complex expressions.

    Testing for Correctness

    If the reader plans on implementing our algorithm, or just modifying our implementation it may be useful to have the test cases we developed. Our original test set was meant to realize each case of our algorithm. As the program was being developed, various values of Phi, n, a, and i would produce erroneous output. A new test case was added for each bug fixed until we finally arrived at our final set of test cases.
    DMS 1998