CS 240 Web Crawler at BYU

I recently polished off the web crawler project for CS 240 at BYU. It’s probably the most talked-about project in the CS major, and the reason so many students end up retaking the class.

The specification for the web crawler assignment can be found here. Basically, given a start URL, the crawler downloads the page, indexes every word on it that isn’t in a given stop-words file, finds all of the links on the page, and then does the same thing for each linked page, and so on. All of the indexed information is written out to XML files. The code also has to conform to proper style, and no memory leaks are allowed.
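Stripped of the details, the control flow is a breadth-first traversal over pages. Here’s a rough sketch; the function names are my own placeholders, not the assignment’s API, the downloading and parsing are stubbed out, and std::map/std::set stand in for the collection classes the project makes you write yourself:

#include <cstddef>
#include <iostream>
#include <map>
#include <queue>
#include <set>
#include <string>
#include <vector>

// Hypothetical stubs: the real crawler downloads the page at a URL and
// parses it for links and words. Empty bodies keep the sketch compilable.
std::string fetchPage(const std::string & url) { return ""; }
std::vector<std::string> extractLinks(const std::string & page) { return std::vector<std::string>(); }
std::vector<std::string> extractWords(const std::string & page) { return std::vector<std::string>(); }

void crawl(const std::string & startUrl,
           const std::set<std::string> & stopWords,
           std::map<std::string, std::set<std::string> > & index) {
    std::queue<std::string> frontier;
    std::set<std::string> visited;
    frontier.push(startUrl);
    visited.insert(startUrl);

    while (!frontier.empty()) {
        std::string url = frontier.front();
        frontier.pop();
        std::string page = fetchPage(url);

        // Index every word on the page unless it is a stop word.
        std::vector<std::string> words = extractWords(page);
        for (std::size_t i = 0; i < words.size(); i++) {
            if (stopWords.count(words[i]) == 0) {
                index[words[i]].insert(url);
            }
        }

        // Queue each newly discovered link exactly once.
        std::vector<std::string> links = extractLinks(page);
        for (std::size_t i = 0; i < links.size(); i++) {
            if (visited.insert(links[i]).second) {
                frontier.push(links[i]);
            }
        }
    }
}

int main() {
    std::set<std::string> stopWords;
    stopWords.insert("the");
    std::map<std::string, std::set<std::string> > index;
    crawl("http://example.com/", stopWords, index);
    std::cout << "Indexed " << index.size() << " distinct words." << std::endl;
    return 0;
}

The visited set is what keeps the crawler from looping forever on pages that link back to each other.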
For those who still need to do the project or haven’t taken the following exam yet, I thought I’d post a few notes that might help.

First off, check your constructors! While initializing a templatized BST node, I had been invoking the compiler-generated default copy constructor without realizing it. A copy constructor looks like this:

T(const T & other)

In the contained class, I had implemented only operator=. My class T held pointers to objects allocated on the heap with the new keyword. The default copy constructor copied those pointers verbatim, so when the copy of the T object was deleted, the structures its pointers pointed to were deleted along with it. Since the original object pointed to the same structures, destroying it caused a segfault: its destructor tried to delete structures that no longer existed. Ouch!
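Here’s a toy, self-contained version of the trap (not the project’s code). The class owns heap memory, so the “rule of three” applies: if a class needs any of a destructor, a copy constructor, or an operator=, it almost certainly needs all three.

#include <cstddef>

class Buffer {
public:
    explicit Buffer(std::size_t n) : size(n), data(new int[n]) {}

    // Without this, the compiler-generated copy constructor copies just
    // the pointer, so two Buffers own the same array and the second
    // destructor double-deletes it.
    Buffer(const Buffer & other) : size(other.size), data(new int[other.size]) {
        for (std::size_t i = 0; i < size; i++) {
            data[i] = other.data[i];
        }
    }

    Buffer & operator=(const Buffer & other) {
        if (this != &other) {
            int * fresh = new int[other.size];
            for (std::size_t i = 0; i < other.size; i++) {
                fresh[i] = other.data[i];
            }
            delete [] data;
            data = fresh;
            size = other.size;
        }
        return *this;
    }

    ~Buffer() { delete [] data; }

private:
    std::size_t size;
    int * data;
};

With the copy constructor left out, a statement like Buffer b = a; copies only the pointer, and destroying both objects deletes the same array twice.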

That bug wasted a good 6 hours of my life. Needless to say, I was a little scared of the next assignment: a debugging exam. The class TAs put 4 bugs into our code (they didn’t touch comments, asserts, or unit tests), and we had 3 hours to find them. Here’s how the hunt went:

The first three bugs I was able to find through unit testing. The last one I tried to pinpoint using valgrind and print statements; even though it was staring me in the face, I couldn’t find it, and I only got 75% on the exam.
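For what it’s worth, the valgrind run that exposes this class of bug (invalid deletes, leaks) looks roughly like this, using the binary from my Makefile:

valgrind --leak-check=full bin/crawler <start url> <stopwords file> <output file>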

In case somebody finds the code interesting/useful, I’ll post it here (no cheating!). Build with make bin. Run with bin/crawler <start url> <stopwords file> <output file>.
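For example, with made-up arguments:

make bin
bin/crawler http://example.com/index.html stopwords.txt output.xml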