Teaching & Talks | Long Chen @ HKUST

COMP6411C: Advanced Topics in Multimodal Machine Learning

Spring 2024, Computer Science and Engineering, Hong Kong University of Science and Technology

Instructor: Long Chen
Class Time: Monday 13:30pm - 14:50pm, Friday 9:00am - 10:20am (Room 2504)

Email: longchen@ust.hk (For course-related queries, please use the subject starting from [COMP6411C])
Office Hour: Friday 13:00 pm - 14:00 pm (Room 3508, ZOOM: 986 8247 0232 (Passcode: 6411C))

Teaching Assistant: Mr. Chaoyang Zhu (cy.zhu@connect.ust.hk)
TA Office Hour: Monday 14:00 pm - 15:00 pm (Room 4204)
For those who have enrolled the COMP6411C course, if you want to get the recorded videos for absent classes, you can direct sent emails to the TA.

Student Presentation Schedule: Google Docs

Course Description: This course provides a comprehensive introduction to recent advances in multimodal machine learning, with a focus on vision-language research. Major topics include multimodal translation, multimodal reasoning, multimodal alignment, multimodal information extraction, and recent deep learning techniques in multimodal research (such as graph convolution network, Transformer architecture, deep reinforcement learning, and causal inference). The course structure will primarily consist of instructor presentation, student presentation, in-class discussion, and a course final project.

Course Objectives: After completion of this course, students will understand mainstream multimodal topics and tasks, and develop their critical thinking and problem solving, such as identifying and explaining the state-of-the-art approaches for multimodal applications.

Pre-requisite: Basic understanding of probability and linear algebra is required. Familiarity or experience with machine learning (especially deep learning) and computer vision basics are preferred.

Grading scheme:

Class attendance and in-class discussion: 20%
Project presentation: 30%
Final project report: 50%

Reference books/materials:

Conferences: Proceedings of CVPR/ICCV/ECCV, ICLR/ICML/NeurIPS, ACL/EMNLP/ACM Multimedia
Book: Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.

Student Presentation Slides

Syllabus / Schedule

Data

Lecture

Materials

2024-02-02

1.1 Course Structure and Introduction
Course overview
Key research problems in multimodal learning
Multimodal learning tasks

1. Foundations & Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions. arXiv'23.
2. Multimodal Learning Overview Slides (CMU 11-777)

2024-02-05

1.2 Unimodal Representation (Image & Video)
Image representation
Convolutional neural networks
Image-based scene understanding
Video representation

1. Convolutional Neural Networks (Stanford CS231n)
2. Object Detection Slides (Stanford CS231n)
3. Video Understanding Slides (Stanford CS231n)

2024-02-09

1.3 Unimodal Representation (Language & Video)
Word representation
Recurrent neural networks
Revisit long video encoding

1. Word Vectors Slides (Stanford CS224n)
2. RNN Slides (Stanford CS224n)
3. Language Models Slides (Stanford CS224n)

2024-02-16

2.1 Multimodal Translation (Visual to Text)
Encoder-decoder framework
Attention mechanism
Image captioning
New directions on image captioning

1. From Show to Tell: A Survey on Deep Learning-based Image Captioning. TPAMI'23.
2. Video Description: A Survey of Methods, Datasets and Evaluation Metrics. ACM Computing Surveys'19
3. Seq2seq Models and Attention Mechanism Slides (Stanford CS224n)

2024-02-19

2.2 Multimodal Translation (Text to Visual)
Diffusion mdels
Controllable text-to-image generation
Image editing
Concept customization

1. CVPR 2023 Tutorial (Alignments in Text-to-Image Genration)
2. CVPR 2023 Tutorial (Denoising Diffusion Models: A Generative Learning Big Bang)
3. CVPR 2022 Tutorial (Denoising Diffusion-based Generative Modeling: Foundations and Applications)

2024-02-23

2.3 Multimodal Reasoning
VQA tasks
VQA methods
Multi-step VQA
Robust VQA

1. Visual Question Answering: A Survey of Methods and Datasets. CVIU'17.

2024-02-26

2.4 Multimodal Alignment (Video)
Video retrieval
Temporal grounding
Weakly-supervised temporal grounding

1. Temporal Sentence Grounding in Videos: A Survey and Future Directions. TPAMI'23

2024-03-01

2.5 Multimodal Alignment (Image)
Referring expression comprehension and phrase grounding
Referring expression segmentation
Weakly-supervised visual grounding

2024-03-04

2.6 Multimodal Information Extraction
Scene graphs
Event Extraction
Event-event Relation

2024-03-08

Student Presentation Session (1)

2024-03-11

Student Presentation Session (2)

2024-03-15

3.1 Graph Convolution Network Background
Graph Basics
Graph Convolution Network
General Perspective on GNN

2024-03-18

3.2 GCN in Multimodal Research

2024-03-22

3.3 Transformer Background
Self-Attention and Transformer
Vision Transformer (ViT)

1. Self-Attention and Transformers (Stanford CS224N)
2. Vision Transformer (UMichigan EECS498)

2024-03-25

3.4 Transformer in Multimodal Research

2024-04-08

3.5 Reinforcement Learning Background
Policy gradient
Q-Learning

2024-04-12

Cancelled

2024-04-15

3.6 RL in Multimodal Research

2024-04-19

3.7 Causal Inference Background
Backdoor adjustment
Frontdoor adjustment
Conterfactual thinking

2024-04-22

3.8 Causal Inference in Multimodal Research

2024-04-26

Student Presentation Session (3)

2024-04-29

4.1 Multimodal Pretraining (Image-Text)
BERT
Image-Text Pretreaining
Unified Pretraining

1. CVPR 2022 Tutorial (VL Pretraining)

2024-05-03

4.2 Multimodal Pretraining (Video-Text)
Video-Text Pretraining
Multi-channel Videos

2024-05-09

4.3 Adapting Pretrained Models
Adapter, LoRA
Prompt tuning

2024-05-10

5.1 In the era of LLMs
Multimodal LLMs
LLM-enhanced multimodal learning

Acknowledgements

This course was inspired by and/or uses reserouces from the following courses:

MultiModal Machine Learning by Louis-Philippe Morency, Carnegie Mellon University, Fall 2023.
Advanced Topics in MultiModal Machine Learning by Louis-Philippe Morency, Carnegie Mellon University, Spring 2023.
Advances in Computer Vision by Bill Freeman, MIT, Spring, 2023.
Deep Learning for Computer Vision by Fei-Fei Li, Stanford University, Spring 2023.
Natural Language Processing with Deep Learning by Christopher Manning, Stanford University, Winter 2023.
Deep Learning for Computer Visionby Justin Johnson, University of Michigan, Winter 2022.