COL 864 Special Topics in AI (Information Retrieval) 2018-19 Semester 1

Course slot: AA (Mon., Thu. 2:00-3:30)
Piazza link (use the access-code shared in the class)


Jump to: Texbook Calendar Attendance Policy
Overview:

Information retrieval -aka "search"- plays a central role in our modern digital lives. In this course we cover the fundamental concepts of information retrieval as well as some of the recent advances in the field such as the use of knowledge graphs for retrieval, neural methods for retrieval tasks, and the use of succinct data-structures in building efficient search systems.

The course is mostly lecture-driven, with a few student-presentations, homework assignments and paper-reading tasks spread throughout the semester.

  1. Part I: Basic Concepts

    • Basic retrieval models - Boolean, tf-idf & vector-space models
    • Inverted Indexes - basic structure, ranked retrieval DAAT/TAAT
    • Retrieval effectiveness evaluation - precision/recall, nDCG
    • Benchmarking - Cranfield paradigm, pooled evaluations, TREC collections
  2. Part II: Improving Retrieval Models

    • Probabilistic retrieval, Language models,
    • Relevance feedback - Rocchio's method, implicit relevance feedback, query expansion
    • Revisiting inverted indexes - compact representation, top-k processing
    • Latent Semantic Indexing and Topic Models
  3. Part III: Advanced topics in IR (if time permits and topics are flexible)
    • Learning to rank - pointwise, pairwise, listwise approaches
    • Knowledge graphs for retrieval
    • Neural retrieval models
    • Web search engine architecture
Textbook and references:

Introduction to Information Retrieval
by Christopher Manning, Prabhakar Raghavan, and Hinrich Sch├╝tze,
Cambridge University Press.

I strongly encourage you to own a copy of this book. A high-quality preprint of the book is available from the book website

References

Apart from the text book, the course will use material from the following books (among others):

  1. Modern Information Retrieval : The Concepts and Technology behind Search
    by Ricardo Baeza-Yates and Ribeiro-Neto, 2010.
  2. Search Engines: Information Retrieval in Practice
    by Croft, Metzler and Strohman, 2149.
Prerequisites:
  • datastructures and algorithms
  • comfortable with programming in Java/C++, probability and statistics, linear algebra
  • background in machine learning and/or NLP is desired but not mandatory

Calendar

Dates of Lecture Topic Slides Notes
23-07 Introduction If you want to register and there are no slots left, send me an email. I can try to extend by another 5 slots (fcfs). Assignments will be either 2 prog + 3 written or 3 prog + 2 written - will let you know by end of next week.
26-07 Document Representation and Boolean Retr.
30-07, 02-08 Query Evaluation, Vector-space model Assignment-1 posted on Piazza
6-08 Evaluating Information Retrieval Systems
  • Note that the error in computing nDCG has been fixed.
  • There are additional reading tasks at the end of the lecture slides.
  • started taking attendance
13-08 No Lecture
9-08, 16-08 Relevance Feedback and intro. Probabilistic Retrieval
  • It was decided in the class that we will consider the first assignment as a "two part" assignment and accordingly give it twice the weight of other assignments.
  • Also, out of the remaining assignments there will be one programming assignment and two written assignments.
16-08
  • It was decided that for Assignment 1, the late penalty will be applied from the end of 20th August.
  • The deadline for Assignment 2 was corrected to 20th August.
20-08 Probabilistic Retrieval (contd.)
27-08 Probabilistic Retrieval (contd.) Revisited Probabilistic retrieval to clear up doubts
30-08 Language Modeling for IR Posted the paper by Zhai and Lafferty for self-reading.
3-09 Holiday due to Janmashtami
6-09, 10-09 Inverted Indexes Construction and Query Processing in IR using Inv. Indexes. Posted material on:
  • Compression in Posting Lists
  • Link to video of talk by Jeff Dean at WSDM 2149.
  • Techreport on the anatomy of Google search engine.

reminor-1 will be on 12th September.

13-09 No Lecture
17-09 Learning to Rank
20-09 Temporal Information Retrieval
24-09 Introduction to Knowledge Graphs for IR Slides Further reference: Tutorial on Utilizing Knowledge Graphs for IR
27-09 Guest Lecture by Dr. Sumit Bhatia from IBM, on "Knowledge Graphs From Theory to Practice" Slides Further reference: Tutorial Slides at ESWC 2018
08-10 Introduction to Neural Methods in Information Retrieval Slides
11-10 Introduction to Word2Vec

Paper Presentations

The list of papers that were covered through student presentations and reports are as follows:

  1. Enriching Word Vectors with Subword Information (https://arxiv.org/pdf/1607.04606.pdf)
  2. GloVe: Global Vectors for Word Representation (https://pdfs.semanticscholar.org/0825/788b9b5a18e3dfea5b0af123b5e939a4f564.pdf)
  3. Embedding-based Query Language Models (https://arxiv.org/pdf/1705.03556.pdf)
  4. Relevance-based Word Embedding (https://arxiv.org/pdf/1705.03556.pdf)
  5. Learning to Match Using Local and Distributed Representations of Text for Web Search (https://arxiv.org/pdf/1610.08136.pdf)
  6. PACRR: A Position-Aware Neural IR Model for Relevance Matching (https://arxiv.org/pdf/1704.03940.pdf)
  7. DeepRank: A New Deep Architecture for Relevance Ranking in Information Retrieval (https://arxiv.org/pdf/1710.05649.pdf)
  8. A deep relevance matching model for ad-hoc retrieval (https://arxiv.org/abs/1711.08611)
  9. Novelty and diversity in information retrieval evaluation (https://dl.acm.org/citation.cfm?id=1390446)
  10. Term level search result diversification (https://dl.acm.org/citation.cfm?id=2484095)
  11. Partitioned Elias-Fano indexes (https://dl.acm.org/citation.cfm?id=2609615)
  12. Inverted Index Compression and Query Processing with Optimized Document Ordering (https://dl.acm.org/citation.cfm?id=1526764)
  13. Type less, find more: fast autocompletion search with a succinct index (https://dl.acm.org/citation.cfm?id=1148234)
  14. Analyzing User's Sequential Behavior in Query Auto-Completion via Markov Processes (https://dl.acm.org/citation.cfm?id=2767723)

Attendance Policy

We will follow the commonly followed Mark weight scheme followed in many other CS courses in IITD (following description courtesy Prof. Arun Kumar)

  1. We propose attendance to carry individual weight in the course marks breakup, maybe 4 to 5%
  2. Less than 75% leads to one grade demarcation.
  3. Less than 50% leads to course failure.
  4. Proven to work in classes like: Prof. Naveen Garg - TOC 2016-2017 Semester 2 and Prof. Vinay Ribeiro - Computer Networks 2016-2017 Semester 1