Grad Office Specs

Functional specification (users' demand)

Main article: OGSASPhaseOneDesign.

Application records come from the UW Graduate Studies Office.

In addition, CS-specific paper forms come directly from the applicants to the CS GO. These will be scanned and added to the appropriate records in the CS graduate applications database.

The database will be available in different views to applicants, members of the CS faculty, and the UW GSO. In particular, the GradSubmissionSummary will be generated for any applicant on demand.

See the resulting data model in the GradOfficeDataModel page, the state diagram in GradAdmissionsSystemDesign and the data flow in GradOfficeApplicationProcess.

The webapp's URLs should be formed concisely according to the kind of query and sub-query, and it should be possible to re-use the URL of any query later. The webapp should verify the following conditions:

  • The user is authenticated.
  • The user is authorized to run the provided query.
  • The query contains consistent parameters.

The records coming from the UW GSO, as well as the load files for the CS database, should be archived in a version control system so that potential data loss can be traced back.

According to the University's retention policies, no backups of the admission data should be kept longer than 2 years past the end of studies.

Limitations and complications of input data

The UW GSO records will be scanned periodically by Quest. Records specific to MAT-CS will be emailed in 8 pieces, or extracts:

  • Subplans.
  • Employment.
  • Education.
  • Financial Support.
  • Admission Conditions.
  • Referees.
  • Test Scores.
  • Address/Email.

The extracts are not normalized, so some fields contain duplicate values. Certain values are spelled differently even though they are supposed to be the same; it would be useful to be able to group differently spelled values together.
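
For illustration only (this is a sketch, not part of the specification), such grouping could be driven by a simple normalization key; the actual grouping rules would have to be decided by the CS GO:

    from collections import defaultdict

    def normalize(value):
        """Hypothetical normalization key: case-insensitive, whitespace-collapsed."""
        return " ".join(value.split()).lower()

    def group_spellings(values):
        """Group differently spelled values that normalize to the same key."""
        groups = defaultdict(set)
        for v in values:
            groups[normalize(v)].add(v)
        return groups

    # Both spellings end up in the same group.
    print(group_spellings(["Univ. of Waterloo", "univ.  of  waterloo"]))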

The extracts may present more than one version of the expected entity, such as Person or Application, blurring the relationships into the "many to many" kind. This might happen for the following reasons.

  • The same person submitted a web form more than once.
  • The same person registered under different OUAC accounts.
  • The same person received multiple UW IDs.

Implementation (programmer's supply)

Storing incoming extract emails in a version control repository

Automatic extracts will be mailed to ijmorlan@uwaterloo.ca. A procmail rule will detect extracts by the sender's address, sarecords@uwaterloo.ca, and the subject line, ^UW Grad - .*$. Detected messages will be forwarded to an artificial email account, odyssey-gso-extract@cs, which will pipe each message to the ogsas-read-mail script running as user odyssey. ST#58054.

The ogsas-read-mail script will run etl.py, which will save CSV attachments into a Subversion repository. The names of the stored attachments will be derived from the subject line of the corresponding message. ST#58056.
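
The following sketch illustrates, in broad strokes, what this step amounts to. It is not the actual ogsas-read-mail or etl.py code; the subject-to-file mapping, the target directory, and the latin1 override shown here are assumptions based on the configuration description below.

    import email
    import os
    import re
    import sys

    # Hypothetical mapping from subject-line patterns to extract file names;
    # in practice this comes from the e_* sections of gso.conf.
    SUBJECTS = {r"UW Grad - .*Test Scores": "e_score.txt"}

    def store_attachment(raw_message, extract_dir):
        """Save the CSV attachment of one extract message under a name
        derived from its subject line; return the path or None."""
        msg = email.message_from_string(raw_message)
        subject = msg.get("Subject", "")
        for pattern, filename in SUBJECTS.items():
            if re.search(pattern, subject):
                break
        else:
            return None  # Not a recognized extract; ignore the message.
        for part in msg.walk():
            if part.get_content_type() == "text/csv" or part.get_filename():
                payload = part.get_payload(decode=True)
                # Override the (sometimes wrong) declared charset, store as UTF-8.
                text = payload.decode("latin1").encode("utf-8")
                path = os.path.join(extract_dir, filename)
                with open(path, "wb") as f:
                    f.write(text)
                return path
        return None

    if __name__ == "__main__":
        store_attachment(sys.stdin.read(), "/tmp/extracts")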

The etl.py script will be controlled by configuration parameters in gso.conf. The parameters are listed below; a brief illustration of how they might be read follows the list.

  • Associations between the subject lines and the CSV file names. CSV file names, without their ".txt" extension, are given by the configuration file's section names starting with e_, such as e_score. Each such section will contain a parameter subject holding a regular expression unique to the subject line of the corresponding messages.
  • Location of the Subversion repository, repository. For simplicity, only the server-side API will be used, so the location needs to be a file path such as /u/odyssey/svn/gso. If not specified, the current working directory will be used instead as the root of the extract directories.
  • Location of the batch directory, batch_dir, relative to the above repository.
  • Location of the daily batch assembly directory, track_dir, relative to the repository. When the daily directory is deemed full, its files are moved (copied and deleted) on top of the older batch in batch_dir. If a previous copy of a daily file is found in track_dir when the new copy is added, a warning is sent to the administrator.
  • Override of the CSV attachments' character set, e_mail_charset. It appears that the attachments were encoded in latin1 but their envelope indicated us-ascii. The e_mail_charset parameter, if specified, will override the character set shown by the envelope. The etl.py script will decode CSV attachments into wide Unicode, then encode to e_store_charset (usually, "utf-8") before storing the files.
  • A pattern to filter messages by their sender, e_from. Currently, this is set to sarecords@uwaterloo\.ca. May be omitted.
  • A pattern of the trusted email domain, trusted_network. This is set to ^.*\.uwaterloo\.ca$. When not empty, the option restricts processing to messages sent by and relayed through the hosts matching the mask. The relays are trusted to tag messages with proper Received or X-Received headers.
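
As an illustration only, the parameters described above could be read with Python's ConfigParser; the option names follow the list above, but the actual etl.py may load and use them differently.

    import configparser   # Python 3; the original etl.py may use the older ConfigParser module.
    import re

    conf = configparser.ConfigParser()
    conf.read("gso.conf")

    repository = conf.get("DEFAULT", "repository", fallback=".")
    batch_dir = conf.get("DEFAULT", "batch_dir", fallback="batch")
    track_dir = conf.get("DEFAULT", "track_dir", fallback="track")
    mail_charset = conf.get("DEFAULT", "e_mail_charset", fallback=None)

    def extract_name_for(subject):
        """Return the extract file name (section name + '.txt') whose 'subject'
        regular expression matches the message subject, or None."""
        for section in conf.sections():
            if section.startswith("e_") and conf.has_option(section, "subject"):
                if re.search(conf.get(section, "subject"), subject):
                    return section + ".txt"
        return None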

The etl.py script can be seen in the OdysseyApplicationsSVN repository odyssey under the directory ogsas/trunk/etl/.

The configuration file describing transformations of the GSO extracts lives in ogsas/trunk/.

The data repository ogsas can be seen through the web interface by a limited number of users (see OdysseyApplicationsSVN on modifying the ViewVC access list).

Pre-processing

The CSV files will be preprocessed before being stored into the repository. Currently, CSV rows will be sorted according to the sort fields specified with the presort configuration parameter. This will minimize deltas stored by the repository and shown in the notification emails.
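
A minimal sketch of such a pre-sorting pass, assuming presort names the sort fields of one extract file (the field names in the example are hypothetical):

    import csv

    def presort_csv(path, sort_fields):
        """Sort the rows of a CSV file by the configured presort fields so that
        consecutive revisions of the file differ only where the data differs."""
        with open(path, newline="") as f:
            reader = csv.DictReader(f)
            fieldnames = reader.fieldnames
            rows = sorted(reader, key=lambda r: tuple(r[k] for k in sort_fields))
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)

    # Example (hypothetical field names):
    # presort_csv("e_score.txt", ["appID", "testType"])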

Purging backups

(a) Purging version control repository
To prevent the version control repository from keeping historical data forever, one should purge the oldest revisions from it. Subversion's Migrating Repository Data Elsewhere chapter shows how to dump selected revisions. The range of revisions to dump should be given with the -r switch of the svnadmin dump command. The dump may then be loaded into a new repository with svnadmin load.
    set -e   # A bash option to exit on first error.
    # OLDREV is the oldest revision to keep; REPOS_PATH points at the repository;
    # OBLITERATING_DUMP is the dump file that will hold the kept revisions.
    svnadmin dump -r${OLDREV}:HEAD ${REPOS_PATH} > ${OBLITERATING_DUMP}
    # Recreate the repository and load only the kept revisions back into it.
    rm -rf ${REPOS_PATH}
    svnadmin create ${REPOS_PATH}
    svnadmin load ${REPOS_PATH} < ${OBLITERATING_DUMP}

(b) Purging file system backups
As long as the age of the existing file-level backups is limited to a few weeks, there is no need to limit it further. See man 5 nsr on creating a Legato control file .nsr.

(c) Purging the faculty database
The CSV transformation script etl.py is configured to save all recent records of program status changes in the data files. Two factors account for only recent (6 years and younger) records entering the database: the date filter in gso.conf is configured to consider only CSV records from the last 6 years, and the incoming CSV extracts might corrupt details of the preceding records.

Although only recent (6 years and younger) records are stored in the data file, the July 2007 version of the database update script db_update.py does not remove outdated records from the database, which results in stale database records. A separate script may be written to purge the stale records from the database; there is no ST task for such a script yet.

Batch de-synchronization

It is difficult to associate messages with their batches based only on the message order in the stream. The currently implemented best effort will flush the queue when a new message replaces an older message in the queue. This technique will result in permanent batch de-synchronization when the first message of a batch corresponds to a missing message in the previous batch: from then on, every day's batch will be broken, as its first message will be considered to belong to the older batch and flushed.

As an out-of-band work-around, a crontab-scheduled check will warn the administrator if track_dir is not empty at the time of day when all pieces of the batch are supposed to have been received and moved to batch_dir. The check should run twice a day, before and after the batch is expected. This discards incomplete batches and batches containing early duplicates; late duplicates will be safely discarded by the "before batch period" check, preserving both the previous and the new batches.
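
A minimal sketch of the warning part of such a check; the directory path and administrator address are placeholders, and the actual cron job may also clean up the directory:

    import os
    import smtplib
    from email.mime.text import MIMEText

    TRACK_DIR = "/u/odyssey/svn/gso/track"   # assumed location of track_dir
    ADMIN = "odyssey-admin@cs"               # placeholder address

    def check_track_dir():
        """Warn the administrator if the daily assembly directory is not empty
        at a time when all pieces of the batch should already have moved on."""
        leftovers = [f for f in os.listdir(TRACK_DIR) if not f.startswith(".")]
        if leftovers:
            msg = MIMEText("Incomplete or duplicated batch left in %s:\n%s"
                           % (TRACK_DIR, "\n".join(sorted(leftovers))))
            msg["Subject"] = "ogsas: track_dir not empty"
            msg["To"] = ADMIN
            msg["From"] = ADMIN
            with smtplib.SMTP("localhost") as smtp:
                smtp.send_message(msg)

    if __name__ == "__main__":
        check_track_dir()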

Ideally, messages should carry their batch identifier.

Parsing and storing the incoming extracts in the database

The extracts are described in GradAdmissionCSVExtracts. The GradOfficeDataModel page shows the data model for the local database.

The configuration file gso.conf will hold the following parameters.

  • The list of joining fields of the CSV extract files. See the output primary-key and payload field syntax further below for the "single-join" operation, during which records with duplicate tuples of the fields named in "using" are logged and discarded.
                [DEFAULT]
                # The fields to join by.
                using = appID, faculty, department, plan, effectiveDate, decision,
                    uwID, lastName, firstName, middleName,
                    admitTerm, level, partTime,
  • The default filtering "where" clause. This is an ordered list of "fixed" and "aggregate" (cumulative) conditions.
                where = 
                    effectiveDate: >= _today - 6years,
                    effectiveDate: _max,
  • The primary key and the payload fields for each of the transformed tables, according to the data model. Payload fields may be "single-joined" from any source table by using the TARGET_FIELD: JOINING_TABLE.JOINING_FIELD syntax. "Single-joining" means that every record with a unique "using" tuple in the source table will be matched against a single record in the joining table. Both target and joining records with duplicate tuples of the fields named in "using" will be logged and discarded (a sketch of this matching appears after this list). For example, the output table t_gso_referee is described as follows:
                [t_gso_referee]     #####################
                pk_source = e_referee
                pk = 
                    appID, 
                    ID: refID,

                unjoin =
                    name: refName,
                    title: refTitle,
                    organization: refOrgName,
                    email: refEmail,
                    phone: refPhone,
                    address1: refAddress1,
                    address2: refAddress2,
                    address3: refAddress3,
                    address4: refAddress4,
                    city: refCity,
                    
                #   FIXME: the refProvState field is not provided by CSV extracts.
                    provState: refProvState,
                    
                    postalCode: refPostalCode,
                    country: refCountry,
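
The following sketch illustrates the single-join matching referred to above, assuming each extract has already been read into a list of dicts; the helper names and record representation are illustrative and are not taken from etl.py.

    import logging
    from collections import Counter

    def index_by_using(records, using):
        """Index records by their 'using' tuple; log and drop tuples that occur
        more than once, so every surviving key maps to exactly one record."""
        counts = Counter(tuple(r[k] for k in using) for r in records)
        index = {}
        for r in records:
            key = tuple(r[k] for k in using)
            if counts[key] > 1:
                logging.warning("discarding duplicate 'using' tuple %s", key)
                continue
            index[key] = r
        return index

    def single_join(target_records, joining_records, using, mapping):
        """Copy payload fields (mapping of TARGET_FIELD -> JOINING_FIELD) onto
        target records whose 'using' tuple matches exactly one joining record."""
        joined = index_by_using(joining_records, using)
        targets = index_by_using(target_records, using)
        for key, rec in targets.items():
            match = joined.get(key)
            if match is None:
                continue
            for target_field, joining_field in mapping.items():
                rec[target_field] = match[joining_field]
        return list(targets.values())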

In addition to the above, a few more details will be implemented.

  • Certain t_... transformed tables will have a modified filtering clause.
                [t_gso_program_status]  #################
                pk_source = e_contact

                where = 
                #   Keep all status changes.
                    effectiveDate: >= _today - 6years,

                pk = 
                    progAppID: appID, 
                    progFaculty: faculty, 
                    progDept: department, 
                    progPlan: plan, 
                    effectiveDate, 
                    decision,

                fkeys = 
                    (progAppID, progFaculty, progDept, progPlan): t_gso_program.(appID, faculty, dept, plan),

                unjoin =
                    decisionReason,

  • Relationships with other tables will be verified based on the foreign keys fkeys. Non-resolving records will be logged and discarded (a sketch of this check appears after this list).
                [t_gso_referee]     #####################
                pk_source = e_referee
                pk = 
                    appID, 
                    ID: refID,

                unjoin =
                    name: refName,
                    title: refTitle,

                fkeys = 
                    appID: t_gso_applicant.ID,

  • Should a limited number of records for a certain column be transposed into fields, the following configuration option might be helpful.
                [t_gso_program]     #####################
                pk_source = e_contact

                pk = 
                    appID, 
                    faculty, 
                    dept: department, 
                    plan,

                # Syntax:
                # records_to_fields =
                #   [t_field:] [j_table.]j_field if j_multivalue_field v|(v,..),
                #   ..
                # For example,
                # records_to_fields = 
                #     subPlan_%(_recNo)s: e_subplan.subPlan if _recNo (1, 2, 3),
                #     subPlanDescr_%(_recNo)s: e_subplan.subPlanDescr if _recNo (1, 2, 3),

                unjoin =
                    admitTerm,
                    admitType,
                    progLevel: level,
                    partTime,

  • Non-string field types will be specified with the types and types_more options so that the following requirements are met.
    • The database system and its bulk load command require that non-string empty values are NULLs (\N in the CSV file).
    • The parameter-separating, type-strict DB API implementation pgsql requires explicit type conversion for non-string fields.
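
The following sketch illustrates the foreign-key verification mentioned above, assuming transformed tables are held as lists of dicts; the function and its arguments are illustrative, not the actual etl.py interface.

    import logging

    def verify_fkeys(records, fkey_fields, referenced_records, referenced_fields):
        """Keep only records whose foreign-key tuple resolves to a primary-key
        tuple of the referenced table; log and discard the rest."""
        referenced_keys = {
            tuple(r[k] for k in referenced_fields) for r in referenced_records
        }
        kept = []
        for r in records:
            key = tuple(r[k] for k in fkey_fields)
            if key in referenced_keys:
                kept.append(r)
            else:
                logging.warning("discarding record with unresolved fkey %s", key)
        return kept

    # Example mirroring the t_gso_referee configuration:
    # referees = verify_fkeys(referees, ["appID"], applicants, ["ID"])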

Implementation choices

There are multiple ways to implement the "join-unjoin" transformation of extracts to data load files. The following algorithms were considered:
  • Dump the CSV extracts into a temporary database and execute SQL queries against the CSV database to build the required database.
  • Sort the CSV extracts according to the join fields and step through corresponding rows of all CSV extracts in parallel, adding records to the transformed tables.
  • Use the Entity-Attribute-Value idea and step through the records of each CSV extract, populating the transformed tables.
    • To find a row in the resulting data file, the latter algorithm requires pairing of the current CSV record with pk_source of the data file.

The last idea was chosen because it was the greatest common denominator of all three. More details on the implementation are available in the comments in etl.py and transform.py; the scripts' outlines and comments can be extracted with pydoc etl and pydoc transform.
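
As a rough illustration of the chosen approach (not the actual etl.py or transform.py code), stepping through the pk_source extract and renaming fields according to the configuration might look like this:

    def build_table(pk_source_records, pk_mapping, unjoin_mapping):
        """Step through the pk_source extract and emit one transformed record
        per row, renaming fields according to the pk and unjoin mappings
        (TARGET_FIELD -> SOURCE_FIELD)."""
        table = []
        for src in pk_source_records:
            row = {}
            for target, source in list(pk_mapping.items()) + list(unjoin_mapping.items()):
                row[target] = src[source]
            table.append(row)
        return table

    # Example mirroring t_gso_program_status:
    # rows = build_table(e_contact_rows,
    #                    {"progAppID": "appID", "progFaculty": "faculty",
    #                     "progDept": "department", "progPlan": "plan",
    #                     "effectiveDate": "effectiveDate", "decision": "decision"},
    #                    {"decisionReason": "decisionReason"})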

Transformation filters expressed in SQL

Overview

The location of a condition inside the SQL statement depends on the condition's type (based on my observations. -- IL).

  • Fixed:
                SELECT *
                FROM t
                WHERE fixed_field = 'value'

  • Aggregate:
                SELECT field_1,..,max(cumulative_field)
                FROM t
                GROUP BY field_1,..
                ORDER BY field_1,..

  • Pre-aggregate fixed (narrows the aggregate's scope of definition by fixing field_1):
                SELECT field_1,..,max(cumulative_field)
                FROM t
                WHERE field_1 = 'value'
                GROUP BY field_1,..
                ORDER BY field_1,..

  • Post-aggregate fixed (widens the former aggregate's scope by withdrawing field_1):
                SELECT field_1,..,cumulative_field as cum_field_max
                FROM t
                WHERE (field_2,..,cumulative_field) IN
                (
                    SELECT field_2,..,max(cumulative_field)
                    FROM t
                    GROUP BY field_2,..
                    ORDER BY field_2,..
                )
                AND field_1 = 'value'

  • Post-aggregate aggregate (widens the former aggregate's scope by withdrawing field_1):
                SELECT field_2,..,cumulative_field as cum_field_max, max(field_1)
                FROM t
                WHERE (field_2,..,cumulative_field) IN
                (
                    SELECT field_2,..,max(cumulative_field)
                    FROM t
                    GROUP BY field_2,..
                    ORDER BY field_2,..
                )
                GROUP BY field_2,..,cum_field_max
                ORDER BY field_2,..,cum_field_max

Single aggregate filters

Consider a pair of fixed and cumulative filter conditions in the script's configuration file. Note that a fixed condition filters out a record based solely on the record's own fields; therefore fixed conditions include not only equality constraints but also lower and upper boundary constraints.

                pk_source = e_contact

                where =
                    effectiveDate: >= _today - 6years,
                    effectiveDate: _max,

                pk = 
                    progAppID: appID, 
                    progFaculty: faculty, 
                    progDept: department, 
                    progPlan: plan, 
                    effectiveDate, 
                    decision,

                unjoin =
                    decisionReason,

The "area of definition" (or "free") fields for the aggregate (cumulative) condition effectiveDate: _max is pk minus fields referred by cumulative where conditions. The script will automatically treat the free pk variables as GROUP BY and ORDER BY fields as well.

The maximum effectiveDate will be collected over all records with the same tuple (appID, faculty, department, plan, decision). The following SQL statement should perform a similar task: it should find the latest record for every decision in a table with the same structure as e_contact.

                select progAppId, progFaculty, progDept, progPlan, max(effectiveDate), decision
                    from gso_program_status 
                    where effectiveDate >= current_date - interval '6 years'
                    group by progAppId, progFaculty, progDept, progPlan, decision
                    order by progAppId, progFaculty, progDept, progPlan, decision

To attach the missing payload field decisionReason with a single value per group, the above query may be rewritten as a correlated subquery.

                select progAppId, progFaculty, progDept, progPlan, effectiveDate, decision, decisionReason
                    from gso_program_status
                    where (progAppID, progFaculty, progDept, progPlan, effectiveDate, decision) in
                        (
                            select progAppID, progFaculty, progDept, progPlan,
                                max(effectiveDate), decision
                            from gso_program_status
                            where effectiveDate >= current_date - interval '6 years'
                            group by progAppID, progFaculty, progDept, progPlan, decision
                            order by progAppID, progFaculty, progDept, progPlan, decision
                        )
                        and decisionReason in (select decisionReason limit 1)
                    group by progAppId, progFaculty, progDept, progPlan, effectiveDate, decision, decisionReason
                    order by progAppId, progFaculty, progDept, progPlan, effectiveDate, decision, decisionReason

Another way to do the same is a self-join:

                select ps1.progAppID, ps1.progFaculty, ps1.progDept, ps1.progPlan, 
                        ps1.effectiveDate, ps1.decision, ps1.decisionReason
                    from gso_program_status as ps1, gso_program_status as ps2
                    where ps2.progAppID = ps1.progAppID
                        and ps2.progFaculty = ps1.progFaculty
                        and ps2.progDept = ps1.progDept
                        and ps2.progPlan = ps1.progPlan
                        and ps2.effectiveDate >= current_date - interval '6 years'
                        and ps2.effectiveDate >= ps1.effectiveDate
                        and ps2.decision = ps1.decision
                        and ps2.decisionReason in (select ps2.decisionReason limit 1)
                    group by ps1.progAppID, ps1.progFaculty, ps1.progDept, ps1.progPlan, 
                        ps1.effectiveDate, ps1.decision, ps1.decisionReason
                    having count(*) = 1
                    order by ps1.progAppID, ps1.progFaculty, ps1.progDept, ps1.progPlan, 
                        ps1.effectiveDate, ps1.decision, ps1.decisionReason

Composite aggregate filters

Suppose we need to remove one of the fields, such as decision, from the "area of definition" of max(effectiveDate) and reconnect it once a maximum is determined. To resolve the ambiguity of multiple records matching the subquery's records, we apply another max() to decision, despite the input data's promise of having exactly one decision for every effective date.

Using the script syntax, this might be accomplished by adding another cumulative "where" condition. The fact that the new "where" condition is cumulative automatically removes the referred field from the "areas of definition" for max(effectiveDate) and max(decision). The positions of cumulative and fixed conditions in the "where" list define the order of free variable eliminations. (The result of the following script configuration was not tested).

                pk_source = e_contact

                where =
                    effectiveDate: >= _today - 6years,
                    effectiveDate: _max,
                    decision: _max(*, APPL, COND, ADMT),

                pk = 
                    progAppID: appID, 
                    progFaculty: faculty, 
                    progDept: department, 
                    progPlan: plan, 
                    effectiveDate, 
                    decision,

                unjoin =
                    decisionReason,

Should the decision field be limited by a fixed condition such as decision: APPL, the field has to be removed from the pk list. This is because fixed conditions, unlike cumulative conditions, don't treat the referred fields as free.

Leaving a fixed decision in pk would result in extra records that had APPL decision in the past. If this is a desired result, it would be better to explicitly place the new fixed condition before effectiveDate: _max.

The following SQL statement should perform the same filtering as the above piece of script configuration syntax, except that it is run against a transformed table rather than the input CSV extract. (Actual equivalence has been neither proven via EXPLAIN nor tested.)

                select progAppID, progFaculty, progDept, progPlan, effectiveDate,
                    substr(max(lpad(position(decision in ' APPL COND ADMT'), 2, '0') || decision), 3),
                    decisionReason
                from gso_program_status
                where (progAppID, progFaculty, progDept, progPlan, effectiveDate) in
                    (
                        select progAppID, progFaculty, progDept, progPlan, 
                            max(effectiveDate)
                        from gso_program_status
                        where effectiveDate >= current_date - interval '6 years'
                        group by progAppID, progFaculty, progDept, progPlan
                        order by progAppID, progFaculty, progDept, progPlan
                    )
                    and decisionReason in (select decisionReason limit 1)
                group by progAppID, progFaculty, progDept, progPlan, effectiveDate, decisionReason
                order by progAppID, progFaculty, progDept, progPlan, effectiveDate, decisionReason

Automatic data feed
