Main article: OGSASPhaseOneDesign.
Application records come from the UW Graduate Studies Office (UW GSO).
In addition, CS-specific paper forms come directly from the applicants to the CS Graduate Office (CS GO). These will be scanned and added to the appropriate records in the CS graduate applications database.
The database will be available in different views to the applicants, members of the CS faculty, and the UW GSO. In particular, the GradSubmissionSummary will be generated for any applicant on demand.
See the resulting data model on the GradOfficeDataModel page, the state diagram in GradAdmissionsSystemDesign, and the data flow in GradOfficeApplicationProcess.
The webapp's URLs should be formed concisely, according to the kind of queries and sub-queries, and it should be possible to reuse the URL of any query later. The following conditions should be verified by the webapp:
The records coming from the UW GSO, as well as the load files for the CS database, should be archived in a version control system so that potential data loss can be traced back.
No backups of the admission data older than 2 years past the end of studies should exist, according to the University's retention policies.
The UW GSO records will be scanned periodically by Quest. Records specific to MAT-CS will be emailed in 8 pieces, or extracts.
The extracts are not normalized: some fields contain duplicate values, and certain values that are spelled differently are supposed to be the same. It would be useful to be able to group differently spelled values together.
The extracts may present more than one version of an expected entity, such as Person or Application, blurring the relationships into the "many-to-many" kind. This might happen for the following reasons.
Automatic extracts will be mailed to ijmorlan@uwaterloo.ca. A procmail rule will detect extracts by the sender's address, sarecords@uwaterloo.ca, and the subject line, ^UW Grad - .*$. Detected messages will be forwarded to an artificial email account, odyssey-gso-extract@cs. The artificial account will handle messages by sending them to a script, ogsas-read-mail, running as user odyssey. ST#58054.
The ogsas-read-mail script will run etl.py, which will save CSV attachments into a Subversion repository. The names of the attachments will be constructed from the subject line of the corresponding message. ST#58056.
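For illustration only (the actual etl.py lives in the repository noted below), here is a minimal sketch of this step, assuming Python's standard email module; the function name, the subject-to-filename rule, and the output directory are assumptions:

    # Hypothetical sketch only -- not the actual etl.py code. Names, the
    # subject-to-filename rule, and the output directory are assumptions.
    import email
    import re
    import sys

    def save_csv_attachments(raw_message, out_dir="/tmp/ogsas-extracts"):
        msg = email.message_from_string(raw_message)
        # Derive a file name from the subject, e.g. "UW Grad - score" -> "score.txt".
        stem = re.sub(r"^UW Grad - ", "", msg.get("Subject", "")).strip()
        stem = re.sub(r"\W+", "_", stem) or "extract"
        for part in msg.walk():
            name = part.get_filename() or ""
            if part.get_content_type() == "text/csv" or name.endswith(".csv"):
                payload = part.get_payload(decode=True)
                with open("%s/%s.txt" % (out_dir, stem), "wb") as f:
                    f.write(payload)

    if __name__ == "__main__":
        save_csv_attachments(sys.stdin.read())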
The etl.py script will be controlled by configuration parameters in gso.conf. The list of the parameters follows.
".txt"
extension, are presented by configuration file's section names starting with e_
, such as e_score
. These sections will contain a parameter subject
with a regular expression unique to the subject line of corresponding messages.
- repository. For simplicity, only the server-side API will be utilized, so the location needs to be a file path such as /u/odyssey/svn/gso. If not specified, the file system's current directory will be used instead as the root of the extract directories.
- batch_dir, relative to the above repository.
- track_dir, relative to the repository. When the daily directory is deemed full, its files are moved (copied and deleted) on top of the older batch in batch_dir. If a previous copy of a daily file is found in track_dir when adding the new copy, a warning is sent to the administrator.
- e_mail_charset. It appears that the attachments were encoded in latin1 but their envelope indicated us-ascii. The e_mail_charset parameter, if specified, will override the character set shown by the envelope. The etl.py script will decode CSV attachments into wide Unicode, then encode them to e_store_charset (usually "utf-8") before storing the files; see the recoding sketch after this list.
- e_from. Currently, this is set to sarecords@uwaterloo\.ca. May be omitted.
- trusted_network. This is set to ^.*\.uwaterloo\.ca$. When not empty, the option restricts processing to messages sent by and relayed through the hosts matching the mask; the relays are trusted to tag messages with proper Received or X-Received headers. See the check sketched after this list.
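As referenced in the e_mail_charset item above, a minimal sketch of the intended recoding, assuming the parameter semantics described there; the function name is an assumption:

    # Hypothetical recoding helper mirroring the e_mail_charset /
    # e_store_charset behaviour described above; the name is an assumption.
    def recode_attachment(payload, envelope_charset,
                          e_mail_charset=None, e_store_charset="utf-8"):
        # e_mail_charset, when configured, overrides the (unreliable)
        # charset declared by the message envelope.
        source = e_mail_charset or envelope_charset or "us-ascii"
        text = payload.decode(source)        # bytes -> wide Unicode
        return text.encode(e_store_charset)  # Unicode -> stored encoding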
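And a sketch of the trusted_network check, assuming the relays record their host names in Received/X-Received headers as described; the header-parsing heuristic is an assumption:

    import re

    # Hypothetical sketch: accept a message only if every relay named in its
    # Received/X-Received headers matches the trusted_network mask.
    def relays_trusted(msg, mask=r"^.*\.uwaterloo\.ca$"):
        pattern = re.compile(mask)
        hosts = []
        for header in msg.get_all("Received", []) + msg.get_all("X-Received", []):
            found = re.search(r"from\s+([\w.-]+)", header)
            if found:
                hosts.append(found.group(1))
        return bool(hosts) and all(pattern.match(h) for h in hosts)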
The etl.py script can be seen in the OdysseyApplicationsSVN repository odyssey under the directory ogsas/trunk/etl/.
The data repository ogsas
can be seen through the web interface by a limited number of users (see OdysseyApplicationsSVN on modifying the ViewVC access list).
Extract files will be sorted before they are stored, as controlled by the presort configuration parameter. This will minimize the deltas stored by the repository and shown in the notification emails.
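A possible reading of presort, sketched below: rewrite each CSV in a stable row order so that a re-received extract with reshuffled rows produces an empty Subversion delta. The exact sort key used by etl.py is an assumption.

    import csv
    import io

    # Hypothetical presort: emit the CSV with its data rows sorted so that
    # reshuffled but otherwise identical extracts produce no repository delta.
    def presorted(csv_text):
        rows = list(csv.reader(csv_text.splitlines()))
        header, body = rows[0], rows[1:]
        out = io.StringIO()
        writer = csv.writer(out)
        writer.writerow(header)
        writer.writerows(sorted(body))
        return out.getvalue()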
Old revisions can be obliterated by dumping only the recent ones with the -r switch in the svnadmin dump command. The dump may then be loaded into a new repository with svnadmin load.
    set -e  # A bash option to exit on first error.
    svnadmin dump -r${OLDREV}:HEAD $REPOS_PATH > ${OBLITERATING_DUMP}
    rm -rf ${REPOS_PATH}
    svnadmin create ${REPOS_PATH}
    svnadmin load ${REPOS_PATH} < ${OBLITERATING_DUMP}
See man 5 nsr on creating a Legato .nsr file.
etl.py is configured to save all recent records of program status changes in the data files. Two factors limit the database to recent records (6 years and younger): the date filter in gso.conf is configured to consider CSV records from the last 6 years only, and the incoming CSV extracts might corrupt details of the preceding records.
Although only recent records (6 years and younger) are stored in the data file, the July 2007 version of the database update script db_update.py does not remove outdated records from the database, which results in stale database records. A separate script may be written to purge the stale records from the database; there is no ST task for such a script yet.
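A sketch of what such a purge script might look like, assuming a PostgreSQL backend (suggested by the interval syntax in the SQL examples below); the table name follows the later examples, and the 6-year cutoff mirrors the gso.conf date filter:

    import psycopg2  # assuming a PostgreSQL backend; an assumption, not a given

    # Hypothetical purge of stale rows; no ST task exists for this yet.
    def purge_stale(dsn="dbname=gso"):
        conn = psycopg2.connect(dsn)
        try:
            cur = conn.cursor()
            cur.execute(
                "delete from gso_program_status "
                "where effectiveDate < current_date - interval '6 years'")
            conn.commit()
        finally:
            conn.close()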
As an out-of-band work-around, a crontab-scheduled check will warn the administrator if track_dir is not empty at the time of day when all pieces of the batch are supposed to have been received and moved to batch_dir. The check should run twice a day, before and after the batch is expected. This will discard incomplete batches and batches containing early duplicates; the late duplicates will be safely discarded by the "before batch period" check, preserving the previous and the new batches. The check is sketched below.
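A minimal sketch of that check; the track_dir path and the use of stderr (cron mails any output to the owner) are assumptions:

    import os
    import sys

    # Hypothetical twice-daily check: warn if track_dir still holds files
    # outside the expected batch window. The path is an assumption.
    TRACK_DIR = "/u/odyssey/gso/track"

    def warn_if_not_empty(track_dir=TRACK_DIR):
        leftovers = sorted(os.listdir(track_dir))
        if leftovers:
            # Under cron, anything written to stderr is mailed to the owner.
            sys.stderr.write("ogsas: %d stale file(s) in %s: %s\n"
                             % (len(leftovers), track_dir, ", ".join(leftovers)))
            return 1
        return 0

    if __name__ == "__main__":
        sys.exit(warn_if_not_empty())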
Ideally, messages should carry their batch identifier.
The extracts are described in GradAdmissionCSVExtracts. The GradOfficeDataModel page shows the data model for the local database.
The configuration file gso.conf will hold the following parameters.
    [DEFAULT]
    # The fields to join by.
    using =
        appID, faculty, department, plan, effectiveDate, decision,
        uwID, lastName, firstName, middleName, admitTerm, level, partTime,
    where =
        effectiveDate: >= _today - 6years,
        effectiveDate: _max,
Joined fields are referenced with the TARGET_FIELD: JOINING_TABLE.JOINING_FIELD syntax. "Single-joining" means that every record with a unique "using" tuple in the source table will be matched against a single record in the joining table. Both target and joining records with duplicate tuples named in "using" will be logged and discarded (a sketch of this handling follows the example). The following example describes the output table t_gso_referee:
    [t_gso_referee] #####################
    pk_source = e_referee
    pk =
        appID,
        ID: refID,
    unjoin =
        name: refName,
        title: refTitle,
        organization: refOrgName,
        email: refEmail,
        phone: refPhone,
        address1: refAddress1,
        address2: refAddress2,
        address3: refAddress3,
        address4: refAddress4,
        city: refCity,
        # FIXME: the refProvState field is not provided by CSV extracts.
        provState: refProvState,
        postalCode: refPostalCode,
        country: refCountry,
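As referenced above, a sketch of the "single-joining" duplicate handling, assuming rows arrive as dictionaries keyed by field name; the function name and the logging style are assumptions:

    # Hypothetical sketch of "single-joining": rows whose "using" tuple is
    # not unique are logged and discarded, applied to each side of the join.
    def single_join_filter(rows, using):
        groups = {}
        for row in rows:
            key = tuple(row[field] for field in using)
            groups.setdefault(key, []).append(row)
        kept = []
        for key, dups in groups.items():
            if len(dups) == 1:
                kept.extend(dups)
            else:
                print("discarding %d records sharing the tuple %r"
                      % (len(dups), key))
        return kept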
In addition to the above, a few more details will be implemented.
Some of the t_... transformed tables will have a modified filtering clause.
    [t_gso_program_status] #################
    pk_source = e_contact
    where =
        # Keep all status changes.
        effectiveDate: >= _today - 6years,
    pk =
        progAppID: appID,
        progFaculty: faculty,
        progDept: department,
        progPlan: plan,
        effectiveDate,
        decision,
    fkeys =
        (progAppID, progFaculty, progDept, progPlan): t_gso_program.(appID, faculty, dept, plan),
    unjoin =
        decisionReason,
Foreign key references will be resolved according to fkeys. Non-resolving records will be logged and discarded.
    [t_gso_referee] #####################
    pk_source = e_referee
    pk =
        appID,
        ID: refID,
    unjoin =
        name: refName,
        title: refTitle,
    fkeys =
        appID: t_gso_applicant.ID,
    [t_gso_program] #####################
    pk_source = e_contact
    pk =
        appID,
        faculty,
        dept: department,
        plan,
    # Syntax:
    #   records_to_fields =
    #       [t_field:] [j_table.]j_field if j_multivalue_field v|(v,..),
    #       ..
    # For example,
    #   records_to_fields =
    #       subPlan_%(_recNo)s: e_subplan.subPlan if _recNo (1, 2, 3),
    #       subPlanDescr_%(_recNo)s: e_subplan.subPlanDescr if _recNo (1, 2, 3),
    unjoin =
        admitTerm,
        admitType,
        progLevel: level,
        partTime,
Column types will be specified via the types and types_more options so that the following requirements are met.
pk_source of the data file.
The latter idea was chosen because it was the greatest common denominator of all three. More details on the implementation are available in the comments to etl.py and transform.py. The scripts' outlines and comments can be extracted with pydoc etl and pydoc transform.
The location of a condition inside the SQL statement depends on the condition's type (based on my observations -- IL).
| Condition type | SQL statement | Comment |
|---|---|---|
| Fixed | SELECT * | |
| Aggregate | SELECT field_1,..,max(cumulative_field) | |
| Pre-aggregate fixed | SELECT field_1,..,max(cumulative_field) | Narrowed the aggregate's scope of definition by fixing field_1. |
| Post-aggregate fixed | SELECT field_1,..,cumulative_field as cum_field_max | Widened the former aggregate's scope by withdrawing field_1. |
| Post-aggregate aggregate | SELECT field_2,..,cumulative_field as cum_field_max, max(field_1) | Widened the former aggregate's scope by withdrawing field_1. |
Consider a pair of fixed and cumulative filter conditions in the script's configuration file. Note that a fixed condition filters out records based solely on their own fields; therefore fixed conditions include not only equality constraints but also lower and upper boundary constraints.
    pk_source = e_contact
    where =
        effectiveDate: >= _today - 6years,
        effectiveDate: _max,
    pk =
        progAppID: appID,
        progFaculty: faculty,
        progDept: department,
        progPlan: plan,
        effectiveDate,
        decision,
    unjoin =
        decisionReason,
The "area of definition" (or "free") fields for the aggregate (cumulative) condition effectiveDate: _max
is pk
minus fields referred by cumulative where
conditions. The script will automatically treat the free pk
variables as GROUP BY and ORDER BY fields as well.
The maximum effectiveDate will be collected over all records with the same tuples (appID, faculty, department, plan, decision). The following SQL statement should perform a similar task: it should find the latest record for every decision from a table with the same structure as e_contact.
    select progAppID, progFaculty, progDept, progPlan, max(effectiveDate), decision
    from gso_program_status
    where effectiveDate >= current_date - interval '6 years'
    group by progAppID, progFaculty, progDept, progPlan, decision
    order by progAppID, progFaculty, progDept, progPlan, decision
To attach the missing payload field decisionReason with a single value per group, the above query may be rewritten as a correlated subquery.
    select progAppID, progFaculty, progDept, progPlan, effectiveDate, decision, decisionReason
    from gso_program_status
    where (progAppID, progFaculty, progDept, progPlan, effectiveDate, decision) in (
        select progAppID, progFaculty, progDept, progPlan, max(effectiveDate), decision
        from gso_program_status
        where effectiveDate >= current_date - interval '6 years'
        group by progAppID, progFaculty, progDept, progPlan, decision
        order by progAppID, progFaculty, progDept, progPlan, decision
    )
    and decisionReason in (select decisionReason limit 1)
    group by progAppID, progFaculty, progDept, progPlan, effectiveDate, decision, decisionReason
    order by progAppID, progFaculty, progDept, progPlan, effectiveDate, decision, decisionReason
Another way to do the same is a self-join:

    select ps1.progAppID, ps1.progFaculty, ps1.progDept, ps1.progPlan,
           ps1.effectiveDate, ps1.decision, ps1.decisionReason
    from gso_program_status as ps1, gso_program_status as ps2
    where ps2.progAppID = ps1.progAppID
      and ps2.progFaculty = ps1.progFaculty
      and ps2.progDept = ps1.progDept
      and ps2.progPlan = ps1.progPlan
      and ps2.effectiveDate >= current_date - interval '6 years'
      and ps2.effectiveDate >= ps1.effectiveDate
      and ps2.decision = ps1.decision
      and ps2.decisionReason in (select ps2.decisionReason limit 1)
    group by ps1.progAppID, ps1.progFaculty, ps1.progDept, ps1.progPlan,
             ps1.effectiveDate, ps1.decision, ps1.decisionReason
    having count(*) = 1
    order by ps1.progAppID, ps1.progFaculty, ps1.progDept, ps1.progPlan,
             ps1.effectiveDate, ps1.decision, ps1.decisionReason
Suppose we need to remove one of the fields, such as decision, from the "area of definition" of max(effectiveDate) and reconnect it once a maximum is determined. To resolve the ambiguity of multiple records matching the subquery's records, we apply another max() against decision, despite the input data's promise of exactly one decision per effective date.
Using the script syntax, this might be accomplished by adding another cumulative "where" condition. The fact that the new "where" condition is cumulative automatically removes the referred field from the "areas of definition" of max(effectiveDate) and max(decision). The positions of the cumulative and fixed conditions in the "where" list define the order of free-variable eliminations. (The result of the following script configuration was not tested; a sketch of the intended ordering follows the configuration below.)
    pk_source = e_contact
    where =
        effectiveDate: >= _today - 6years,
        effectiveDate: _max,
        decision: _max(*, APPL, COND, ADMT),
    pk =
        progAppID: appID,
        progFaculty: faculty,
        progDept: department,
        progPlan: plan,
        effectiveDate,
        decision,
    unjoin =
        decisionReason,
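My reading of the _max(*, APPL, COND, ADMT) ordering, sketched below: decisions are ranked by their position in the list, with * collecting any unlisted value at the lowest rank (this mirrors the position()-based encoding in the SQL at the end of this page). The semantics are an assumption:

    # Hypothetical reading of _max(*, APPL, COND, ADMT): pick the decision
    # ranked highest by list position; "*" matches any unlisted value.
    def decision_max(decisions, order=("*", "APPL", "COND", "ADMT")):
        def rank(value):
            return order.index(value) if value in order else order.index("*")
        return max(decisions, key=rank)

    # e.g. decision_max(["APPL", "ADMT", "COND"]) -> "ADMT"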
Should the decision field be limited by a fixed condition such as decision: APPL, the field has to be removed from the pk list. This is because fixed conditions, unlike cumulative conditions, don't treat the referred fields as free.
Leaving a fixed decision in pk would result in extra records that had an APPL decision in the past. If this is the desired result, it would be better to explicitly place the new fixed condition before effectiveDate: _max.
The following SQL statement should perform the same filtering as the above piece of script configuration, except that it runs against a transformed table rather than the input CSV extract. (Actual equivalence has been neither proven via EXPLAIN nor tested.)
    select progAppID, progFaculty, progDept, progPlan, effectiveDate,
           substr(max(lpad(position(decision in ' APPL COND ADMT'), 2, '0') || decision), 3),
           decisionReason
    from gso_program_status
    where (progAppID, progFaculty, progDept, progPlan, effectiveDate) in (
        select progAppID, progFaculty, progDept, progPlan, max(effectiveDate)
        from gso_program_status
        where effectiveDate >= current_date - interval '6 years'
        group by progAppID, progFaculty, progDept, progPlan
        order by progAppID, progFaculty, progDept, progPlan
    )
    and decisionReason in (select decisionReason limit 1)
    group by progAppID, progFaculty, progDept, progPlan, effectiveDate, decisionReason
    order by progAppID, progFaculty, progDept, progPlan, effectiveDate, decisionReason