CS 240E Welcome Post

This is a forum dedicated to course-related announcements and answering any student doubts or clarifications. Please have a look at the Piazza guidelines listed here before posting. For each assignment, also make sure to check its corresponding Piazza post, which will contain answers to frequently asked questions and other clarifications.

Link to the official course website: https://www.student.cs.uwaterloo.ca/~cs240e/s19
Link to the CS 240 website in case you're interested: https://www.student.cs.uwaterloo.ca/~cs240/s19

We will be using MarkUs for assignment submissions: https://markus.student.cs.uwaterloo.ca/markus_cs240e_s/en/main

Hope you enjoy the term!


First Tutorial

There will be no tutorials held in the first week of classes for both versions of CS 240. They will start on Mon. May 13 with Tutorial 0.


Assignment 0 Official Post and FAQ

Assignment 0 has been posted to the course website and is due Wed., May 15 at 5:00 PM. It introduces Latex, the recommended tool for creating your assignment submission PDFs. We will cover a bit of this in next Monday's tutorial. The assignment is optional, but awards you up to 6 bonus marks which will be added to your Assignment 1 mark.


What is this?

This section of the post will be updated with any clarifications or answers to common questions as students go through the assignment. Be sure to check here regularly for any updates.


CS 240E/240R Enrollment

The course add deadline for the Spring 2019 term is May 17, 2019. Students who wish to transfer between the two CS 240 versions or otherwise enroll in one must make sure to follow the steps in this post for successful enrollment.

To enroll in CS 240E:
You must send an email to course instructor Simon Abelard at sabelard@uwaterloo.ca including a copy of your most recent transcript before May 15. If applicable, please specify any prerequisites that you don't satisfy and a justification of why you think you will succeed despite that. Then add the course on Quest before May 17 (this may require additional permission from a CS advisor).

To enroll in CS 240R:
Be sure to make any required changes on Quest before the add deadline on May 17.

For both CS 240E and CS 240R:
After you've been enrolled in your desired course on Quest, notify Karen Anderson by email at kaanders@uwaterloo.ca so that your marks can be transferred correctly.

If you already intend to switch to CS 240R or CS 240E, it is recommended to get access to the correct Piazza and MarkUs in advance by sending an email to cs240@uwaterloo.ca. Even if you haven't been officially enrolled, this way you can keep up to date with assignments and submit to the proper course version on time. You will be able to access the PDFs for assignment 0 and 1 on both the 240R and 240E course websites. Note that although A0 is essentially identical for both versions, A1 will be different and you will be expected to submit the one you're enrolled in at the due date.


Assignment 1 Official Post and FAQ

Assignment 1 has been posted to the course website and is due Wed., May 22 at 5:00 PM. As always, be sure to follow the assignment guidelines while working on your submission.



What is meant in Q1 by "All functions map positive integers to real numbers"?

The question has been updated, and previously said "All functions map positive integers to positive integers". The "positive integers" was just a simplification we allowed you to assume for the unknown functions f(n) and g(n) in your proofs, to make them easier to work with (i.e. you can drop some absolute value signs). Neither version of the sentence is meant to restrict you on counterexamples you may come up with f(n) or g(n); it only applies for generalized f(n) and g(n) which have not been defined by you and are unknown.


Equivalence in the limit rule

Following a few questions I had during class, I will recall what is true or not about the limit rule.

The limit rule states that if we have information on the limit of $$f(n)/g(n)$$, we can deduce information about $$f$$ being in $$o(g)$$, $$\omega(g)$$ or $$\Theta(g)$$.

In class we have proven that conversely, if $$f(n)\in o(g(n))$$, then $$\lim_{n\to\infty}f(n)/g(n)=0 $$ so we have an equivalence.

We can also prove that if $$f(n)\in\omega(g(n))$$, then $$\lim_{n\to\infty}f(n)/g(n)=\infty $$.

Indeed, this is equivalent to $$g(n)\in o(f(n))$$ so by what we proved in class about o-notation, we have $$\lim_{n\to\infty}g(n)/f(n)=0$$.

Then, inverting the quotient we get $$\lim_{n\to\infty}f(n)/g(n)=\infty $$

However, if $$f(n)\in\Theta(g(n))$$, it is not true that $$\lim_{n\to\infty}f(n)/g(n)$$ is a constant.

The Example 2 slide gives a counterexample where this limit does not even exist.

Conclusion: in slide 23 about limit rule, the statements about $$o$$ and $$\omega$$ are equivalences but the statement about $$\Theta$$ is not.


Example of functions with no comparison possible

I was asked in class to provide an example of functions $$f(n)$$ and $$g(n)$$ for which strictly less than two relationships (o,O,$$\omega$$, $$\Omega$$) held.

Example 1: $$f(n)=O(g(n))$$ and nothing else is true

Define $$f(n)$$ as 1 if $$n$$ is odd and $$n$$ if $$n$$ is even (try to draw it to see how it looks like).

Define $$g(n)=n$$.

Since $$|f(n)|\le |g(n)|$$ we have $$f(n)\in O(g(n))$$.

Since $$f(n)/g(n)$$ does not have a limit, $$f(n)\notin o(g(n))$$.

Assume that $$f(n)\in\Omega(g(n))$$. Then $$g(n)\in O(f(n))$$ so by first principle there is a $$c$$ such that for $$n$$ large enough, $$|g(n)|< c|f(n)|$$.

This means that for any large enough odd number we have $$n < c$$, which is impossible because c is a constant, so $$f(n)\notin\Omega(g(n))$$.

Since being a $$\omega$$ is stronger than being a $$\Omega$$, we deduce that $$f(n)\notin\omega(g(n))$$.

Thus, only the $$O$$ relationship is satisfied.

Example 2: no relationship between f and g

Keep the previous $$f(n)$$ and define $$g(n)$$ as 1 if $$n$$ is even and $$n$$ if $$n$$ is odd (we switch the parity in the definition of $$f$$).

Then for $$n$$ odd we have $$|f(n)| < |g(n)|$$ and for $$n$$ even we have $$|f(n)| > |g(n)|$$.

This constant switching will prevent any comparison result of the type $$|f(n)| < c|g(n)|$$ or $$|f(n)| > c|g(n)|$$  to hold even when $$n$$ is large enough.

Thus, the definitions of $$O$$ and $$o$$ cannot be satisfied by $$f$$ and $$g$$ and so there is no comparison result between $$f$$ and $$g$$.


Office hours starting next week

All instructor and ISA office hours have been posted to the Personnel page of the course website, and are effective starting next week. Bill's hours are open to both versions of CS 240, but students from the enriched section will be given priority in the indicated timeslots.


Tutorial 1 first principles proof with log(n)

A couple of students were asking how to finish off today's Q1 proof with a $$c$$ and $$n_0$$. I will give some hints here which should be enough to do the last step.

Generally, it's easier to find $$c$$ when working with products and quotients rather than sums and differences, because you can just change the value of $$c$$ to match the coefficient you want. If we have $$\log{\frac{n}{2}}$$ and want to clean it up, this is annoying because it simplifies into a difference $$\log{n} - \log{2}$$. We would much rather have it be in the form $$c\cdot \log{n}$$. Thus, you can show that $$\frac{n}{2} \ge \sqrt{n}$$ for a certain $$n_0$$, which gives you $$\log{\frac{n}{2}} \ge \log{n^{1/2}} = \frac{1}{2}\log n$$.


Tutorial 01 Solutions

Tutorial 01 partial solutions have been posted here: https://www.student.cs.uwaterloo.ca/~cs240e/protect/


Reference for amortized cost analysis and heap merge

The textbook Algorithm Design and Applications, by Goodrich and Tamassia covers some amortized analysis in section 1.4.1 (page 36 of the 2014 edition).

The accounting technique is what I presented about dynamic arrays having amortized cost in $$O(1)$$, but the analysis is more imaged so you may find it helpful.

Then it deals with potential function technique which has been introduced in previous offerings of the course.

This is just a reference for people who are curious about theses aspects, it is absolutely not a priority to study because coming up with appropriate potential functions is way beyond the scope of the course.

For your culture, you may want to have a look at an extremely efficient structure to achieve an even faster merge: https://en.wikipedia.org/wiki/Shadow_heap

This is research material beyond the scope of our class, just to show that it is possible to achieve much better than $$O(\log n)$$ amortized cost.


Proof that deterministic choice yields a linear worst case

In class we discussed about the possibility of choosing the rules to decide between left and right subtrees when merging. This time, I will try to convince you in full generality that for every rule there is a linear worst-case scenario.

1-Formal definition of a decision rule

First let us formalize what is a rule: I define it as a function $$R$$ taking as input a node $$e$$ and its children $$c_1$$ and $$c_2$$ and returning either left or right. It should not depend of any other factor and in particular it only takes local information (a node and its children): there is no possible memory of the previous nodes encountered and no possible knowledge of the children of $$c_1$$ and $$c_2$$ (otherwise it breaks the recursion and make the code more greedy in terms of memory).

Note that to a rule $$R$$ we can associate a walk on the tree that we build recursively by choosing the appropriate child and stopping at a leaf.

The length of the walk is defined as the number of nodes visited during the process, it is trivially bounded by the height of the tree.

2-Proof that the worst case is linear, by induction

The statement we want to prove is the following: for any $$n$$ and any decision rule $$R$$, there exists a tree such that the runtime of the walk on that tree is at least $$n/2$$.

We can assume now that $$R$$ is fixed, but since it can have any value the proof will hold for any $$R$$.

Initialization: trivial if $$n=1$$, $$n=2$$ or $$n=3$$ (only one of them is enough however)

Induction hypothesis: for any $$k<n$$, there is a tree $$T$$ such that the walk determined by the decision rule $$R$$ has length at least $$n/2$$.

Let us now build a tree of size $$n$$. Let us consider a tree of size $$n-2$$ on which the walk determined by $$R$$ has length at least $$(n-2)/2=n/2-1$$ (it exists by induction hypothesis).

Let $$r$$ be the root of that tree, let $$e$$ and $$c$$ be some elements, we design a tree whose root is $$e$$ and take a look at $$R(e,c,r)$$. If for the values taken the result is right, then the right child of $$e$$ will be $$r$$ and $$c$$ will be the left child. Otherwise we do the exact opposite. Thus, our newly built tree has size $$n$$ and following the rule $$R$$ the first step is by construction to reach the root of the tree $$T$$.

Thus, the length of the walk is at least 1 plus the length of the walk on $$T$$. So it is at least $$n/2-1+1=n/2$$ by definition of $$T$$. And the result is proven by induction.


No matter which deterministic rule you pick, I can find a worst case running in time at least $$n/2$$.

This means that randomization is an absolute necessity.


Reminder of Victoria Day Holiday - Mon. May 20

Due to the holiday, there will be no tutorials or consulting hours on Monday, May 20. Students should not expect to get responses from course staff on the day of the holiday.


Assignment 2 Official Post and FAQ

Assignment 2 has been posted to the course website and is due Wed. June 5 at 5:00 PM. As always, be sure to follow the assignment guidelines while working on your submission - there is a programming component this time!



Are we allowed to use auxiliary structures in our insert and delete algorithms for Q1a)?

For part a), you should stick to the ADT and its exact implementation as given at the beginning of Q1. You can make modifications to improve efficiency in b) and c).

See @44 for an update to Q2c

What does Q3c mean by replacing count sort with partition?

You can use your solution from b) and assume it takes an additional parameter to specify which digit is being partitioned on (@52). The partition function should fulfill the same purpose as count sort to group elements based on a certain digit, but not necessarily in the same way of having one bucket for every possible digit.

Do we have to justify correctness in any of the algorithm questions?

This is the general policy for algorithm design, but all the A2 questions should provide enough guidance that no additional proofs of correctness are required (@54).

Can I only modify the PQ-sort pseudocode in Q1b?

You're allowed to indirectly modify it by changing how you insert and delete items from the bucket queue.


A1 Solution Release

Solutions for each assignment will be made available shortly after the due date. You can access them through the protected files area on the website, where A1 solutions have now been uploaded.

Many questions have multiple acceptable answers; the posted solutions are not meant to comprehensively cover these answers. They will generally have one or two example responses with at least the main ideas present.


A0 Marks Available

All marks for Assignment 0 have been released on MarkUs. Your assignment grade will be converted up to 6 bonus marks added to your Assignment 1 grade.

See the Mark Appeals policy for information on remark requests.


New Monday office hours

The Personnel page has been updated with my new office hours which will be on Mondays 4-5 PM starting next week. They are open to both the regular and enriched sections.


In-place MSD radix sort

Following the interesting question asked in class about in-place MSD radix, I looked on the Internet and found it is still a subject of research.

The short answer is that it is possible but not necessarily interesting in practice, I found a report about an implementation using other ideas that make their algorithm different from the MSD-sort we saw in class. The implementation suggested does not beat the regular MSD sort but the author refers to projects of IBM to exploit distributed computing.

Here is a link to the report (it looks like a student report so it is quite readable): https://stanford.edu/~rezab/classes/cme323/S16/projects_reports/he.pdf


Update of A2 Question 2.c)

There was a commented hint in the .tex file which was not included in the pdf. This hint was originally not supposed to be here but it is now in both files.


Analysis of quickselect (induction skipped in class)

Here is a detailed analysis for the average-case complexity of quickselect, if you have not tried the proof by induction, I highly recommend you try it first on your own before reading the solution.

1- Justification of the recurrence formula of slide 11

We want to compute the average runtime $$T(n,k)$$ of Quickselect(k) over all the possible instances of size $$n$$. Assuming all the elements in the array are distinct, each situation can be represented by a permutation. Indeed, we care only about relative order of elements, not their value so each instance can be associated to a permutation $$(i_0, i_1, \ldots, i_{n-1})$$ where $$i_k$$ represents the rank of $$A[k]$$, i.e. its position in the sorted array. Thus for any instance of size $$n$$, the runtime depends only on the associated permutation $$\sigma$$ in the symmetric group $$S_n$$, we denote this runtime by $$T(\sigma,k)$$.

There are $$n!$$ such permutations, so the average complexity is $$T(n,k)=1/n! \sum_{\sigma\in\S_n} T(\sigma,k)$$. Now we split cases depending on where is the pivot index $$i$$ and say that the average runtime is the sum for any $$i$$ of the average runtime for instances having pivot index $$i$$. The number of permutations having pivot index $$i$$ is the number of permutations who do not change the rank of the $$i$$-th element, it is in bijection with the number of permutations on $$n-1$$ elements.

Thus we have $$T(n,k)=(n-1)!/n!\sum_{i=0}^{n-1} T(\sigma,k \text{ such that pivot index is }i)$$.

Then several cases occur depending on how $$i$$ compares to $$k$$:

-if $$i=k$$ we have found the $$k$$-th element and the complexity to do is $$cn$$, the cost of partitioning

-if $$i < k$$, we recurse in the right subarray so $$T(n,k)= cn+T(n-i-1,k-i-1)$$

-if $$i > k$$, we recurse in the left subarray so $$T(n,k)=cn+ T(i,k)$$

Substituting these in the above sum, we get the result of slide 11.

2-Proof by induction that $$T(n,k) \le 4cn$$

Initialization:   for $$n=1$$, $$T(1,k)=0 \le 4c$$.

Induction hypothesis: we fix $$n > 1$$ and assume that for all $$j < n$$ and all $$k < j$$ we have $$T(j,k)\le 4cj$$.

Let $$k$$ be an integer smaller than $$n$$, by the recurrence relation we have

$$T(n,k)=cn+\frac{1}{n}\sum_{i=0}^{k-1} T(n-i-1,k-i-1)+\frac{1}{n}\sum_{i=k+1}^{n-1} T(i,k).$$

Using the induction hypothesis on the T() in these sums we get

$$T(n,k)\le cn+\frac{4c}{n}\sum_{i=0}^{k-1} (n-i-1)+\frac{4c}{n}\sum_{i=k+1}^{n-1} i.$$

These are sums of consecutive integers so you can compute them precisely, but it is possible to make lighter computations by first noting that the sum of these two parts is bounded by $$S=2\sum_{i=n/2}^{n-1} i$$ (write the sums as we did in class). We thus have $$T(n,k)\le cn+\frac{4c}{n}S$$.

Let us compute $$S$$ by writing that $$S=2\sum_{i=1}^{n-1} i-2\sum_{i=1}^{n/2-1} i$$ and the using the result from module 1 we have $$S= (n-1)(n-2)-(n/2-1)(n/2-2)$$. By expanding we see that $$S\le 3n^2/4$$.

Bounding $$S$$ in the above inequality we get $$T(n,k)\le cn+\frac{4c}{n}\frac{3n^2}{4}=4cn$$. The induction is therefore proven.


[fail] Height bound for AVL trees


We want to prove that any AVL of height $$h$$ on $$n$$ nodes satisfies $$h\le \log_{\psi}(n+1)$$.

To do that, we lower bound $$N(h)$$ the minimal number of nodes in an AVL of height $$h$$ by the quantity $$b(h)=\psi^{h}-1$$.

Then we say that any AVL tree of height $$h$$ with $$n$$ nodes is such that $$n\ge N(h)\ge b(h)$$.

Thus, we have $$\log_\psi( \psi^{h}) \le \log_{\psi}(n+1)$$ which means $$h \le \log_{\psi}(n+1)$$.

And this holds for any AVL, so even in the worst case we have $$h=O(\log n)$$.

2-Lower bound on $$N(h)$$:

By the relation proven in class, $$N(h)=1+N(h-1)+N(h-2)$$.

Let us denote $$\psi=\frac{\sqrt{5}+1}{2}$$ and remark that $$\psi^2=\psi+1$$.

By induction, we prove that $$N(h)\ge \psi^h-1$$.

Initialization: $$N(0)=1\ge \psi^0-1$$ because there is one node in a tree of height 0.

Induction hypothesis: For any $$k < h$$ we have $$N(k)\ge \psi^k-1$$.

By the relation we have $$N(h)=1+N(h-1)+N(h-2)$$, and applying the induction hypothesis we get

$$N(h) \ge 1+\psi^{h-1}-1+\psi^{h-2}-1 \ge \psi^{h-1}+\psi^{h-2}-1 \ge \psi^{h-2}(\psi+1)-1$$.

Since $$\psi+1=\psi^2$$, we conclude $$N(h)\ge \psi^{h}-1$$ and the induction is proven.


Any AVL of height $$h$$ has a number of nodes $$n$$ at least equal to $$N(h)$$, by definition.

Thus, for any AVL of height $$h$$, we have $$n\ge N(h)\ge \psi^h-1$$, which means $$\psi^h \le n+1$$.

Apply the non-decreasing $$\log_\psi$$ function, we get $$h \le \log_{\psi}(n+1)$$.

The worst-case height of an AVL tree with $$n$$ nodes is then in $$O(\log n)$$.


Assignment 1 Marks Available

All marks for Assignment 1 have been released on MarkUs. Due to limitations of MarkUs grading, the denominator of your displayed mark still includes the 4 bonus marks from Q2d) - your official mark will be recorded with the non-bonus denominator.

The Post-Mortem can be found on the Assignments page, outlining common mistakes found during marking.


See the Mark Appeals policy for information on remark requests. You must send an e-mail to cs240e@uwaterloo.ca within two weeks of this post for your request to be handled.


Tutorial 2 Solution

Tutorial 02 partial solutions have been posted here: https://www.student.cs.uwaterloo.ca/~cs240e/protect/


Assignment 3 Official Post and FAQ

Assignment 3 has been posted to the course website and is due Wed. June 19 at 5:00 PM. As always, be sure to follow the assignment guidelines while working on your submission.



A set of conventions has been added to Question 3 to clarify the WAVL structure.


Question/error in final proof

I had an interesting question after class about the computation of $$F$$, a primitive of $$f$$, and it made me aware of a mistake written on the board.

Remember we have $$f(x)=\sum_{i=0}^\infty ix^{-i-1}$$ and I chose $$F(x)=-\sum_{i=0}^\infty x^{-i}$$.

Then I wrote $$F(x)(1-x)=-1$$ which is actually false, if you do the computation carefully you will find that $$F(x)(1-x)=-x$$.

Thus, $$F(x)=-\frac{x}{1-x}=-\frac{1}{1-1/x}$$ which is coherent with the other way to compute $$F$$ using geometric summation $$-\sum_{i=0}^\infty (1/x)^{i}$$.

However, this mistake has no impact on the rest of the proof because $$-\frac{x}{1-x}-1=-\frac{1}{1-x}$$ so both $$-\frac{x}{1-x}$$ and $$-\frac{1}{1-x}$$ are primitives of the function $$f$$.

Do not worry if you did not follow everything, I went a bit fast on this one because it is less important than the second proof we will see on Tuesday.


A2 Solution Release

Solutions for Assignment 2 have now been posted to the protected files area on the website.


Tutorial 3 Solution

Tutorial 03 partial solutions have been posted here: https://www.student.cs.uwaterloo.ca/~cs240e/protect/


Subtlety about complexity of interpolation search

I had a good remark after class: if you directly study the complexity of the algorithm described in the class, you will not have the same recurrence relation because it splits the array into two parts instead of four. To code it properly you would need to compute not only $$m$$ but the two indices $$m_l=m-\sqrt{n}$$ and $$m_r=m+\sqrt{n}$$ and then add comparisons to decide whether to recurse in $$A[m_l,m]$$, $$A[m+1,m_r]$$ (both are good cases) or in $$A[l,m_l-1]$$ or $$A[m_r+1,r]$$ (both are bad cases).

However in practice you probably want to stick to the code written in the slides because it has much less comparisons and if statements.

It is another example of a common fact we already discussed: the algorithms that reach the best complexity bounds can be (very) different from the algorithms that have the best performance in practice.


Tutorial 04 Solution

Tutorial 04 partial solutions have been posted here: https://www.student.cs.uwaterloo.ca/~cs240e/protect/


Assignment 2 Marks Available

All marks for Assignment 2 are now released on MarkUs. As before, your grade on MarkUs will be displayed as if the Q4d) bonus marks are included in the denominator - the official mark will be recorded with the corrected denominator.


Update: The Post-Mortem can now be found on the Assignments page, going over some common mistakes found during marking.


See the Mark Appeals policy for information on remark requests. You must send an e-mail to cs240e@uwaterloo.ca within two weeks of this post for your request to be handled.


Information about midterm

During last lecture I made an announcement about the coming midterm, I recall the information here so that nobody misses it.


Modular hashing choice of $$M$$ prime

I had a very good remark at the end of the class about why it matters to choose $$M$$ a prime number when designing a hash function of the form $$h(k) = k \bmod M$$.

In fact if your keys are uniformly distributed, it will not matter whether $$M$$ is prime or not, because the remainders will also be uniformly distributed.

But if an abnormally high proportion of the keys are such that $$k=a\bmod d$$ with $$a$$ fixed and $$d$$ dividing $$M$$, then we will have an abnormally high proportion of keys which will collide because they are all hashed to $$a$$ via the hash function $$h(k) = k \bmod M$$.

A solution to avoid that is to choose an $$M$$ with almost no divisors, i.e. $$M$$ prime. This corresponds to the necessity that your hash function must not be sensitive to possible patterns of the data.


Fixing this morning's proof

First a recall of the lemma we use: if $$X$$ is a random variable taking non-negative integer values then we have

$$\mathbb{E}(X)=\sum_{i=0}^{\infty} i\mathbb{P}(X=i)=\sum_{i=1}^{\infty} i\mathbb{P}(X=i)=\sum_{i=1}^{\infty}\mathbb{P}(X \ge i)=\sum_{i=0}^{\infty}\mathbb{P}(X > i).$$

Note the various possible statements depending on indices and large/strict inequalities.

Then define $$X$$ to be the number of probes, a probe being defined as an access to hash table. This way, even if the first spot we look for is empty we still make one probe.

With this definition, $$\mathbb{P}(X \ge 1)=1$$ (in any case we probe at least one slot) then $$\mathbb{P}(X \ge 2)=\alpha$$ because $$X\ge 2$$ if and only if the first slot contains something (which happens with probability $$\alpha$$) and the same reasoning leads to

$$\mathbb{E}(X)= 1+\alpha+ \alpha^2+\cdots=\frac{1}{1-\alpha}$$.

If we instead defined $$X$$ as the number of "steps", i.e. the number of times we look at the next spot, then this definition is exactly the previous definition minus 1 and in that case we would get $$\frac{1}{1-\alpha}-1=\frac{\alpha}{1-\alpha}$$.


Midterm Help Session

Since the Friday timeslot was unfortunately not available, I will be holding our help session on Sat. June 22, 2:00-3:30 PM in MC 4064 (can go later if people have questions at the end).

The session will roughly consist of:

1) Practice problems for us to discuss and work through together

2) Brief notes and review of key topics from the modules

The list of practice questions will be posted to the Tutorials page by the end of Friday (Update: The questions have now been posted). Feel free to try them out in advance and/or use them to think of course topics you have questions about. More details on midterm coverage can be found here and here: @78

Hope to see you there!


RSA cryptosystem and its difference with hashing

The RSA cryptosystem was mentioned in class, so if you are curious you can look on the very detailed Wikipedia page https://en.wikipedia.org/wiki/RSA_(cryptosystem).

Material is also found in the textbook by Goodrich and Tamassia, section 24.4.

The fundamental difference with hash functions is that the main point of RSA is encrypting.

Cryptographic hashing will transform a message into something that nobody can recover (the author can only show that his text precisely gives the same hash and use this as a proof that the hash he sent really corresponds to the message).

Encrypting however, will alter the message in such a way that some people (those who know some shared secret) are still able to recover the original message easily, while others shouldn't be able to recover it.


A3 Solution Release

Solutions for Assignment 3 have been posted to the protected files area on the website.


Midterm Reference Sheet

The reference sheet for the CS 240E midterm has been posted to the Exams page. This is the same sheet you will get at the exam, so you won't have to worry about memorizing as many definitions.

A reminder to check your assigned seating on Odyssey because Portal...doesn't quite work right now.


Tutorial Partial Solutions

Tutorial 04 solutions have been updated to include skip lists. Tut 05 solutions have been posted.


Midterm week logistics

The CS 240E midterm exam is Tue. June 25, 4:30-6:20 PM. Your assigned seating can be found on Odyssey.

A lecture will be held at the usual time on Tuesday, covering tries from Module 06. Tomorrow's tutorial is also being held as scheduled.

I will not have my scheduled Wednesday office hours because I'll be busy marking your exams! I'll still be in on Monday and Tuesday. Simon also has his Monday office hours.

For those wondering about the midterm help session, I don't think I'll have time to type up official solutions, unfortunately. You are still free to ask about the questions on Piazza or in office hours; I am very open to answering things like "How do I get started on this question?" or "Can I have a hint on this?".


Assignment 3 Post-mortem

The A3 post-mortem has been posted here to give you some extra feedback on common mistakes before the midterm.


Assignment 4 Official Post and FAQ

Assignment 4 has been posted to the course website and is due Wed. July 10 at 5:00 PM. As always, be sure to follow the assignment guidelines while working on your submission. Note that the single-file submission should now follow the convention a4submit.pdf, not a4.pdf.

MarkUs will be set up later this week; you will not be able to submit any files or run public tests before then. UPDATE: MarkUs is now available with public tests for search.cpp.



The input asked in Q5f is input2 from question c) not d)

Q5d intends for "the best option" to be $$\min ( [l_{bin}, r_{bin}], [l_{interp}, r_{interp}] )$$, but it's also acceptable to improve it and use $$[ \max (l_{bin}, l_{interp}), \min(r_{bin}, r_{interp} ) ]$$ - just make sure you catch when this results in an invalid range.

To get "significant runtimes" in Q5bc, note that the file size limit on MarkUs is 5 MB; you only need runtimes which are enough to differentiate between binary and interpolation search. This can be achieved by using higher precision (cout << fixed << ((float)t/CLOCKS_PER_SEC); where t is your clock_t variable) or comparing based on running each some number of times (1000 repetitions of the same binary search vs. 1000 of the same interpolation search)

In Q3c, you still have access to the files on the server and can compute any of their hash values, but you should not reconstruct the entire Merkle tree since you want to stay within the $$O(\log n)$$ bounds.


Midterm Question 4 hints

I think it could be interesting for you to spend more time on the challenging parts of the midterm, and tackle them as if they were assignment problems (when you have access to your marked exams). No solution will be released but you can ask me or come see the solution in my office.

For Question 6 it should be detailed enough so you just have to spend time on it (feel free to ask if you are blocked somewhere).

For Question 4 I give you two hints: first, your buckets should be concentric "rings" and second, the size of these buckets should be designed such that they all have the same area (and thus they contain the same expected number of points). It is important that you convince yourself (and remember) that equidistribution among buckets is precisely what makes bucket sort work in linear time.


Canada Day on July 1

The coming Monday is a holiday; there will be no classes or office hours. To make matters more confusing, Tuesday follows a Monday schedule, so on July 2 we will have our regular tutorial rather than a lecture, and I'll have office hours at 4:00 - 5:00 PM (Simon will have his at 3:00 - 4:00 PM). The rest of the week proceeds as normal.


Tutorial 06 Partal Solutions

Tutorial 06 solution for quadratic probing is now posted here https://www.student.cs.uwaterloo.ca/~cs240e/protect/index.php


A bound on height/spread factor of quadtrees

I gave a second thought to this question and I reached a cleaner bound. This is not a very subtle reasoning so it is certainly possible to do better.

Let us assume that we have $$n$$ points randomly picked and equidistributed in a circle of radius $$R$$. Note that I chose a circle rather than a box to simplify analysis, but the quad tree is built using a box (of size $$2R$$ by $$2R$$) as in class, it's just that we won't have points in some parts. We denote by $$P_i$$ these points and $$||P_i-P_j||$$ denotes the Euclidean distance between two such points.

Fix $$\varepsilon>0$$, using equidistribution property, if a point is fixed the probability that another point is at distance smaller than $$\varepsilon$$ of this point is $$\frac{\varepsilon^2}{R^2}$$ (just the quotient of the areas of the two circles).

Thus, the probability that the minimal distance is smaller than $$\varepsilon$$ is bounded by $$\sum_{i<j}\mathbb{P}(||P_i-P_j||<\varepsilon)\le n^2\frac{\varepsilon^2}{R^2}$$.

Therefore, denoting by $$\beta$$ the spread factor we have $$\mathbb{P}(\beta > b)=\mathbb{P}(min distance < 2R/b)\le 4n^2/b^2$$. (we just use the definition of $$\beta$$ and replace $$\varepsilon$$ by $$2R/b$$ in the above result).

Finally, since the height of the quadtree is in $$\Theta(\log \beta)$$ we rephrase the result as $$\mathbb{P}(\log\beta \ge H) \le 4n^22^{-2H}$$.

Conclusion: the probability that the height of a quadtree storing $$n$$ equidistributed points is greater than $$H$$ decreases exponentially with $$H$$.

If we do like with skiplists, we have $$\mathbb{P}(\log\beta \ge 2\log n) \le 4\frac{n^2}{2^{4\log n}}=4\frac{n^2}{n^4}=\frac{4}{n^2}$$.

I bound the spread factor here but the height is bounded by $$c\log \beta$$ for some $$c$$ so it is the same thing.


Assignment 3 Marks Available

All marks for Assignment 3 are now released on MarkUs. As before, your grade on MarkUs will be displayed as if the Q5d) bonus marks are included in the denominator - the official mark will be recorded with the corrected denominator.


The Post-Mortem can be found on the Assignments page, going over some common mistakes found during marking.


See the Mark Appeals policy for information on remark requests. You must send an e-mail to cs240e@uwaterloo.ca within two weeks of this post for your request to be handled.


CS 240E Midterm Results

The CS 240E midterm marking has been released from Crowdmark. The CS 240E midterm is out of 50 marks and the average is 75% after some post-Crowdmark adjustments. See summary of grades (https://www.student.cs.uwaterloo.ca/~cs240e/cgi-bin/displayMarks.cgi) on the Course Info page for your actual midterm mark as a percentage and what will be used in course grade calculations. If you have concerns about your performance on the midterm, please speak with your instructor.

Midterm re-mark requests:

When to submit: Do NOT submit until an announcement on procedures is made (Update: This has now been posted at @124). Also do not submit before reading the midterm post mortem. The deadline for submitting Re-mark requests will be two weeks after the announcement that we are accepting re-mark requests.

When to expect results: all re-marks will be processed together after the re-mark request deadline. Requesters will be notified by email or a post when the results are ready and after the request is answered in Crowdmark.


Midterm Remark Requests

We are now accepting midterm re-mark requests via an on-line form on the exams page. Students should check the postmortem (also on the exams page) before submitting. Only requests using the form will be accepted.

The deadline to submit the form is Fri. July 19, 5:00 PM. All re-marks will be processed together after this deadline. Requesters will be notified by email or a post when the results are ready and after the request is answered in Crowdmark.


Tutorial 07 Solution

Tutorial 07 partial solution has been posted here : https://www.student.cs.uwaterloo.ca/~cs240e/protect/index.php


Assignment 5 Official Post and FAQ

Assignment 5 has been posted to the course website and is due Wed. July 24 at 5:00 PM. As always, be sure to follow the assignment guidelines while working on your submission.


Question 3 should use the bad character heuristic @149

Q3d)i) is the number of comparisons done for a specific guess within T for P, not the entire algorithm (although you will compute this in part iii)


A5Q3 Clarification

For all parts of question 3 on the assignment, the "Boyer-Moore algorithm" refers to the bad character heuristic only, and does not include the good suffix heuristic.


Evaluation of CS 240E

The course evaluation is now open on https://evaluate.uwaterloo.ca/ until July 30.

You are all encouraged to answer this survey as it is of importance for both the school and myself (feedback is an important way to improve).

Your answers are anonymous and I will see them only after the final grades are released so there is absolutely no risk for you to speak freely.


Tutorial 08 Solutions

Tut08 Solutions have been released here: https://www.student.cs.uwaterloo.ca/~cs240e/protect/index.php


Justification of compression ratio

The formula presented in class comparing the $$|text|\log|alphabet|$$ may seem a bit strange so here is the justification.

Given a text (which can be either the source or the encoded text), the good measure for its length is the number of characters times the size of each character (i.e. the number of bits used to store a character).

Then you can convince yourself that if your characters are among an alphabet of size $$N$$, you can manage to represent them using $$\log N$$ bits. For instance by ordering them, and giving them numbers from 0 to $$N-1$$, meaning that the number corresponding to a character is $$< N$$ and so each character corresponds to $$\log N$$ bits (if you don't want trouble you need to do fixed length and pad the begininning of small numbers with 0, or use a trick like Elias encoding which costs at most a factor 2 in size).


Assignment 4 Marks Available

All marks for Assignment 4 are now released on MarkUs. Since there were no bonus marks this time, your grade should be fairly accurate besides minor adjustments.


The Post-Mortem can be found on the Assignments page, going over some common mistakes found during marking.


See the Mark Appeals policy for information on remark requests. You must send an e-mail to cs240e@uwaterloo.ca within two weeks of this post for your request to be handled.


Difference between LZ77 and LZ78

Actually the ideas behind both are phrased very differently in the original papers, what we saw in class is very close to LZ78.

LZ77 uses sliding windows and reasoning based on sequence productibility while LZ78 uses the adaptive dictionary as we did.

The main difference is that in the worst case LZ78 will encode a single character by a single character of the dictionary (in the version  of our slides it is still bad because we encode an ASCII character on 7 bits by a block of 12 bits) whereas LZ77 will add more information to signal that the character has not been previously encountered, which accounts for a worse compression ratio.

I haven't found any advantage of LZ77 over LZ78 so you can safely assume that LZ78 is better than LZ77 though the latter is still commonly used (deflate or NTFS, the Windows file system).

Edit: As Felipe mentioned these algorithms inspired newer algorithms and in some cases descendants of LZ77 can outperform descendants of LZ78.

Giving precise comparisons other than runtimes would require to study them more deeply so I can't give more information on that, however as often it seems that there is no absolute comparison like "one is always better than another". I found people claiming to have inputs on which LZW beats DEFLATE.


Typo in Karp-Rabin algorithm

Following a discussion during office hours, I think there is a typo in the slide about Karp-Rabin with rolling hash (slide 11 of module 9).

Computing the hash value for next guess should also be done for i=0 (otherwise the first character never leaves), so the if statement should be removed.

To avoid missing the first guess, the if statement $$h_T=h_P$$ should be moved before the update of $$h_T$$.

This is mostly implementation detail, it does not change the philosophy of the algorithm which is "compare hashes of guesses and pattern and only compare the actual guess and pattern when an equality is found", which is in that case coupled with the use of a rolling-hash so that the hash of a shift by one can easily be computed from the previous hash value.


Tutorial 09 Solution

Tutorial 09 partial solutions have been released here https://www.student.cs.uwaterloo.ca/~cs240e/protect/index.php


Office hours before final exams

Next Monday being a holiday and the following Monday being the exam, I will have two office hours next week:

- Tuesday, August 6 from 3 pm to 4 pm

- Friday, August 9 from 3 pm to 4 pm

If you have exams in the meantime, feel free to come a bit earlier or a bit later.

You can also come to my office if you need (I am usually there between 9am-5pm and often earlier and later but not consistently).

You can of course send me an email before you come to make sure that I will be there.

Bill's exam period office hours in MC 4065:

Wed. Jul. 31, 10 AM - 12 PM

Tue. Aug. 6, 4 PM - 5 PM

Wed. Aug. 7, 1 PM - 2 PM

Fri. Aug. 9, 1 PM - 2 PM


Final exam information

The Exams page has been updated with some relevant information regarding the final. Note that Module 11 on External Memory is not part of the exam coverage. As with the midterm, the provided reference sheet will be posted sometime next week and the help session will also be announced (please vote in @182 before end of today if you haven't already).

Simon will continue to have office hours next week (@186); I will announce my schedule in a later post, but for this week, I will have my Wednesday 10-12 as usual.


Tutorial 10 Solution

Tutorial 10 partial solutions (suffix array building) have been released here https://www.student.cs.uwaterloo.ca/~cs240e/protect/index.php


Final exam help session

The help session is scheduled for Sat. Aug. 10, 4:00-5:30 PM in MC 4064 (or later if there are still questions). The format will be comparable to the midterm session with both practice problems and brief review notes.


The list of practice questions will be posted to the Tutorials page by next week Friday (Update: The file has now been posted). Feel free to try them out in advance and/or use them to think of course topics you have questions about. More exam details can be found on the website.


Also, I have added my next week office hours to @186.


Final exam reference sheet

The reference sheet for the CS 240E final exam has been posted to the Exams page. This is the same sheet you will get at the actual exam, so you can plan your studying accordingly.


Assignment 5 Marks Available

All marks for Assignment 5 are now released on MarkUs. Your grade should be fairly accurate besides minor adjustments.


The Post-Mortem can be found on the Assignments page, going over some common mistakes found during marking.


See the Mark Appeals policy for information on remark requests. You must send an e-mail to cs240e@uwaterloo.ca before the final exam next Monday for your request to be handled.


Official marks on course website

The summary of grades has been updated with all official marks for assignments, the midterm, and remark results for the midterm + A1 to A4. It also shows your overall assignment average.

These will be your grades going into the final exam with the exception of A5 remarks, which may be submitted until the day of the exam (@191). If you believe there to be any errors, please e-mail cs240e@uwaterloo.ca. You must pass the weighted exam average to pass the course.


Tutorial 11 Solutions

Tutorial 11 solutions have been posted here https://www.student.cs.uwaterloo.ca/~cs240e/protect/index.php

Good luck for your final!

-- Bani


Help session problems

The list of questions for the final exam help session has been posted to the Tutorials page. Feel free to use them for practice or think about which ones you would like to have covered during the session.


Piazza soft deadline

Tonight at 11:59 PM will be the latest time that a Piazza question is still guaranteed an answer. You can still ask questions after, but we may not be able to answer in time for the exam.

It's been our pleasure watching the CS 240E class grow as students over the last 12 weeks, getting through some challenging content from what I consider the most abstract course in the first half of Waterloo's CS/SE degree. Best of luck to everyone on the final exam tomorrow!


Remarks on the exam

We have finished marking and overall you did pretty good, the average is close to 80% and the median is a bit above 80%. Everybody scored more than 50%.

I did not grade every question myself (obviously) so I don't have an exhaustive knowledge but here are my comments:

-Question 1 was mostly well done, many of you answered that a dictionary has order on the keys and can return sorted data. I did not like this argument very much because building a BST is also some work that's quite comparable to sorting. Since sorted data is nonetheless quite close to range-searching ideas I did not remove mark.

-Question 2 was overall not bad but I was still surprised by the proportion of people making big mistakes on this one

-Questions 5 and 7 were understood well but redaction was often the reason why you lost marks. We were not as picky as for assignments but still, it had to be as precise as possible (in Q5 the idea was to divide by two but you had to explain why it was possible)

-Question 6 was intended to be hard, but 6.d) and 6.e) were quite easy provided that you used previous results, those who skipped them missed a chance of getting 4 marks. However, a non-negligible number of students wrote something like $$n\in O(\log n)$$. I am sure it was the stress or the rush of the last minutes but it's pretty scary :/

For 6.c) many people wrote complicated things and then wrote non-trivial (or blatantly false) inequalities without proof and claimed that this gave the expected result. I was nice so I assumed that you made a mistake and not that you were trying to trick me into believing that your solution worked but for similar questions in future exams I really advise you to be honest when you can't prove something. It saves lots of time for the grader and avoids putting him/her in a bad mood or having him/her distrust you every time you don't justify enough.

Just like 6d), 6c) could be done quite fast by intuiting what mattered. You have $$Q_k=\frac{n!}{k!(n-k)!}1/n^k(1-1/n)^{n-k}$$. The last term is really close to one, bound it by 1. The $$k!$$ is really close to what you want, so keep it apart. Then notice that the other terms combine well with each other: it is very easy to see that $$\frac{n!}{n^k(n-k)!}$$ is bounded by 1. Thus you first see that $$Q_k\le \frac{1}{k!}$$ and only now you apply Stirling's formula. Once again, you want to make your life easier so instead of applying Stirling three times, first use some tricks to simplify the expression and avoid making your inequalities messier.

-The other questions were done pretty well in general. #pin