Teaching & Talks | Long Chen @ HKUST

COMP6411C: Advanced Topics in Multimodal Machine Learning

Spring 2025, Computer Science and Engineering, Hong Kong University of Science and Technology

Instructor: Long Chen
Class Time: Tuesday 10:30pm - 11:50pm, Thursday 10:30am - 11:50am (Room 4580)
Email: longchen@ust.hk (For course-related queries, please use the subject starting from [COMP6411C])
Teaching Assistant: Mr. Chaolei Tan (ctanak@connect.ust.hk) and Mr. Jiazhen Liu (jliugj@connect.ust.hk)
For those who have enrolled the COMP6411C course, if you want to get the recorded videos for absent classes, you can direct sent emails to the TA.

Course Description: This course provides a comprehensive introduction to recent advances in multimodal machine learning, with a focus on vision-language research. Major topics include multimodal understanding (including translation, multimodal reasoning, multimodal alignment, multimodal information extraction), multimodal generation, multimodal pretraining and adaptation, and recent techniques and trends in multimodal research. The course structure will primarily consist of instructor presentation, student presentation, in-class discussion, and a course final project.

Course Objectives: After completion of this course, students will understand mainstream multimodal topics and tasks, and develop their critical thinking and problem solving, such as identifying and explaining the state-of-the-art approaches for multimodal applications.

Pre-requisite: Basic understanding of probability and linear algebra is required. Familiarity or experience with machine learning (especially deep learning) and computer vision basics are preferred.

Grading scheme:

Class attendance and in-class discussion: 20%
Project presentation: 30%
Final project report: 50%

Reference books/materials:

Conferences: Proceedings of CVPR/ICCV/ECCV, ICLR/ICML/NeurIPS, ACL/EMNLP/ACM Multimedia
Book: Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.

Lecture Syllabus / Schedule

Lecture

Data

Reading Materials

Lec: 1.1 Course Introduction and Overview
Course overview

Feb 04

Lec: 1.2 Deep Learning Basics (CNN/RNN)

Feb 06, 11

Lec: 1.3 Deep Learning Basics (Transformer)

Feb 11, 13

Lec: 2.1 Multimodal Translation
Captioning

Feb 18, 20

Lec: 2.2 Multimodal Reasoning
Visual Question Answering

Feb 20

Lec: 2.3 Multimodal Alignment (Video)
Grounding, Matching

Feb 25, 27

Lec: 2.4 Multimodal Alignment (Image)
Grounding, Matching

Mar 4

Lec: 2.5 Multimodal Information Extraction

Mar 6

Lec: 3.1 Generation Basics (VAE/DDPM)
Diffusion Models

Mar 11, 18

Lec: 3.2 Generation Basics (SGM/Guidance)
Score-based Matching

Mar 20

Lec: 3.3 Image Generation and Editing

Mar 20, 25

Lec: 4.1 Multimodal Pretraining (Image-Text)

Mar 25, 27

Lec: 4.2 Multimodal Pretraining (Video-Text)

Mar 27, Apr 8

Lec: 4.3 Adapting Pretrained Models

Apr 8, 10

Presentation Topics / Schedule

Presentation topcis including: Image-based Multimodal Understanding, Video-based Multimodal Understanding, Image Generation and Editing, Video Generation and Editing, RLHF for Multimodal Generation Model, Multimodal Pretraining, Adapting Pretrained Models, Building Multimodal LLMs (MLLMs), LLM-enhanced Multimodal Understanding, LLM-enhanced Multimodal Generation, Limitations in Today’s MLLM, RLHF for MLLM.

Paper Title

Data

Reading Materials

Presentation Session (1)

Apr 15

Presentation Session (2)

Apr 17

Presentation Session (3)

Apr 22

Presentation Session (4)

May 6

Presentation Session (5)

May 8

Presentation Session (6)

May 10

Student Presentation Slides

Acknowledgements

This course was inspired by and/or uses reserouces from the following courses:

Multimodal Foundation Models by Jia-Bin Huang, University of Maryland, Fall 2024.
MultiModal Machine Learning by Louis-Philippe Morency, Carnegie Mellon University, Fall 2023.
Advanced Topics in MultiModal Machine Learning by Louis-Philippe Morency, Carnegie Mellon University, Spring 2023.
Advances in Computer Vision by Bill Freeman, MIT, Spring, 2023.
Deep Learning for Computer Vision by Fei-Fei Li, Stanford University, Spring 2023.
Natural Language Processing with Deep Learning by Christopher Manning, Stanford University, Winter 2023.
Deep Learning for Computer Visionby Justin Johnson, University of Michigan, Winter 2022.