Information Retrieval and Natural Language Processing

Course Description

Information Retrieval (IR) is the field concerned with answering users' information needs with the most relevant information. In this course we study how search applications, e.g. Google, compute relevant search results from a repository of Web information.

This course starts by dissecting a search engine, and discusses the fundamental techniques currently used in information retrieval. Afterwards, the most relevant algorithms and retrieval models are discussed in detail.

The current demand for intuitive search processes and language comprehension has been aligning Natural Language Processing (NLP) and IR. In this course you will learn fundamental techniques that are used to encode syntax, grammar, and semantics in machines.

This course includes extensive hands-on laboratories where key retrieval and NLP algorithms are examined. The goal is to strengthen students’ experimental analysis and critical thinking skills concerning search performance metrics and experimental results.

Objectives

Learn the concept of information relevance.
Learn how to parse and analyze text data.
Learn how to rank information by relevance.
Learn how to analyze experimental results.

Grading

Exam (40%) + Lab work (60% with three submissions)

Online Lectures and Discussion Forum

All lectures and labs are taught over Zoom. Please contact the instructors to obtain the meeting ID and password.

A discussion forum (Discord) is set up to let students and lecturers discuss course and project issues. Please ask the instructors for an invitation to join.

Discussion Forum Rules

When registering for the discussion forum, please follow the username schema “FirstName Surname-StudentNr”, e.g. “Gustavo Gonçalves-40000”.

Schedule

23/Sep/20 Introduction (video)
23/Sep/20 Text processing, n-grams, cosine distance (video)
30/Sep/20 Language models (video)
07/Oct/20 Evaluation (video)
14/Oct/20 Relevance-based language models (video)
21/Oct/20 Document categorization (video)
28/Oct/20 Learning to rank (video)
04/Nov/20 Word embeddings (video)
11/Nov/20 Information extraction (video)
18/Nov/20 Question answering (video)
25/Nov/20 Conversational search (video)
02/Dec/20 Computational Ethics for NLP and IR (video)
09/Dec/20 Project support
16/Dec/20 Project support

Labs

Project starting point Colab Notebook or Jupyter Notebook
Sentiment classification with Scikit Learn [Colab Link] or with PyTorch [Colab Link]
Word and Sentence embeddings Part 1 or Part 2
Named Entities [Colab Link]
Query Re-Writing with T5 [Colab Link]

Project guidelines for milestone 2:

Model training:
- Build triplets (topic_turn, passage, relevance judgment), using only the annotated pairs;
- Convert the relevance labels to binary values (0 or 1);
- Feed all the pairs (topic_turn, passage) to BERT and get their embeddings;
- Train the classifier on the embeddings, using the relevance judgments as labels to separate relevant from non-relevant pairs (use the provided classifier).
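The training steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's reference code: `toy_embed` is a hypothetical stand-in for the BERT embedding step (it returns a bias term plus a word-overlap ratio), `train_logistic` is a plain logistic regression standing in for the provided classifier, and the rule "grade >= 1 means relevant" is an assumed way of converting graded labels to 0/1.

```python
import math
from typing import Dict, List, Optional, Tuple

Vector = List[float]
Triplet = Tuple[str, str, int]  # (topic_turn text, passage text, binary label)


def build_triplets(
    judgments: Dict[Tuple[str, str], Optional[int]],
    turns: Dict[str, str],
    passages: Dict[str, str],
    threshold: int = 1,  # assumption: grades >= 1 count as relevant
) -> List[Triplet]:
    """Keep only annotated (turn, passage) pairs and binarize their labels."""
    triplets = []
    for (turn_id, passage_id), grade in judgments.items():
        if grade is None:  # pair was never annotated: skip it
            continue
        label = 1 if grade >= threshold else 0
        triplets.append((turns[turn_id], passages[passage_id], label))
    return triplets


def toy_embed(topic_turn: str, passage: str) -> Vector:
    """Hypothetical stand-in for BERT: [bias, fraction of turn words in passage]."""
    q = set(topic_turn.lower().split())
    p = set(passage.lower().split())
    return [1.0, len(q & p) / max(len(q), 1)]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X: List[Vector], y: List[int],
                   lr: float = 0.5, epochs: int = 500) -> Vector:
    """Batch gradient descent on the mean logistic loss; returns the weights."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, target in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
            for j, xj in enumerate(x):
                grad[j] += (p - target) * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


def score(w: Vector, x: Vector) -> float:
    """Probability that the pair is relevant under the trained weights."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
```

In the actual project, `toy_embed` would be replaced by the embedding extracted from BERT for each (topic_turn, passage) pair, and `train_logistic` by the classifier provided with the course material.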
Answer retrieval with LMD + re-ranking with the trained classifier:
- Get the top-1000 passages from Elasticsearch;
- Feed BERT with <topic_turn, passage> and extract the CLS embedding for each passage;
- Feed the 1000 CLS embeddings to the classifier and extract the scores;
- Sort the passages by score.
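A minimal sketch of the re-ranking steps above, under stated assumptions: the candidate list is passed in as plain (passage_id, text) tuples rather than fetched from Elasticsearch, `toy_embed` is a hypothetical stand-in for extracting the CLS embedding from BERT, and `weights` are made-up parameters of a linear classifier standing in for the trained one.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def toy_embed(topic_turn: str, passage: str) -> Vector:
    """Hypothetical stand-in for the BERT CLS embedding of a (turn, passage) pair."""
    q = set(topic_turn.lower().split())
    p = set(passage.lower().split())
    return [1.0, len(q & p) / max(len(q), 1)]


def rerank(
    topic_turn: str,
    candidates: Sequence[Tuple[str, str]],  # (passage_id, text) from first-stage retrieval
    embed_fn: Callable[[str, str], Vector],
    weights: Vector,
) -> List[Tuple[str, float]]:
    """Score every candidate with the classifier and sort by descending score."""
    scored = []
    for passage_id, text in candidates:
        x = embed_fn(topic_turn, text)
        z = sum(w * xi for w, xi in zip(weights, x))
        scored.append((passage_id, 1.0 / (1.0 + math.exp(-z))))
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    candidates = [
        ("p2", "the weather in lisbon is sunny"),
        ("p1", "neural ranking is learning to rank"),
    ]
    weights = [-2.0, 4.0]  # illustrative values, not trained parameters
    for passage_id, s in rerank("what is neural ranking", candidates, toy_embed, weights):
        print(passage_id, round(s, 3))
```

The first-stage order returned by the search engine is discarded here; only the classifier score decides the final ranking, which matches the last step in the list.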

Exercises

Exercises Sheet

Lecturers

Joao Magalhaes ([email protected] - remove the ‘x’s to mail us)

Gustavo Gonçalves (ggoncalvcs.cmu.edu)