Course Description
In April of 1995, Lycos had the largest index of the web with 3.6 million web pages. Today, all of the major search engines index several billion pages, as well as images, video, real-time blogs, etc.
The proliferation of online data in the past ten years has increased the visibility and importance of data mining, and has also caused some fundamental changes in methods for data mining.
This project course will focus on methods for mining large-scale unstructured data sets. The format is seminar-style, and students will read recent research papers in data mining and present them in class.
Students should have basic knowledge of machine learning and statistics. A large portion of this course is the quarter-long project, in which students work in groups of 2-3 on a self-defined data mining project.
We will provide some basic datasets for these projects, including the Enron e-mail corpus and a portion of Google's crawl.
All projects will be presented to a panel of VCs and thought leaders in industry and academia.
Prerequisites
- Familiarity with the basic concepts of probability theory. (Stat116 is sufficient but not necessary.)
- Familiarity with linear algebra. (Math 113 or CS237A is sufficient but not necessary.)
- Knowledge of basic computer science principles and skills at the level of CS103.
- Mathematical ability and the ability to understand and analyze fairly complicated algorithms and data structures. (CS161 is sufficient but not necessary.)
Syllabus
- Week 1 - Search
- Authoritative Sources in a Hyperlinked Environment - JM Kleinberg.
- The Anatomy of a Large-Scale Hypertextual Web Search Engine - S Brin, L Page.
- Week 2 - Personalized Search
- Topic-Sensitive PageRank - TH Haveliwala.
- Scaling Personalized Web Search - G Jeh, J Widom.
- An Analytical Comparison of Approaches to Personalizing PageRank - TH Haveliwala, SD Kamvar, G Jeh.
- Week 3 - Collaborative Filtering / Recommender Systems
- Amazon.com Recommendations: Item-to-Item Collaborative Filtering - G Linden, B Smith, J York.
- Item-Based Collaborative Filtering Recommendation Algorithms - B Sarwar, G Karypis, J Konstan, J Riedl.
- Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions - G Adomavicius, A Tuzhilin.
- Week 4 - Latent Semantic Indexing
- Indexing by Latent Semantic Analysis (Journal of the American Society for Information Science) - Deerwester, Dumais, Furnas, Landauer, Harshman.
- Probabilistic Latent Semantic Indexing - T Hofmann.
- Week 5 - Classification and Feature Selection
- Hierarchically Classifying Documents Using Very Few Words - D Koller, M Sahami.
- Training Linear SVMs in Linear Time - T Joachims.
- Week 6 - Clustering
- Principal Direction Divisive Partitioning - D Boley.
- Scalable Techniques for Clustering the Web - T Haveliwala, A Gionis, P Indyk.
- Evaluating Strategies for Similarity Search on the Web - T Haveliwala, A Gionis, D Klein, P Indyk.
- Week 7 - Information Visualization
- Polaris: A System for Query, Analysis, and Visualization of Multidimensional Relational Databases - C Stolte, D Tang, P Hanrahan.
- Week 8 - Peer-to-peer Networks
- Designing a Super-Peer Network - B Yang, H Garcia-Molina.
- Incentives Build Robustness in BitTorrent - B Cohen.
- Dissecting BitTorrent: Five Months in a Torrent's Lifetime - M Izal et al.