Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing 100084, China
[ " HUANG Yu(y-huang20@mails.tsinghua.edu.cn) " ]
HUANG Longbo(longbohuang@tsinghua.edu.cn);
Print publication date: 2023-09-25
Online publication date: 2023-09-04
Received: 2023-03-28
Accepted: 2023-04-11
Cite as: HUANG Yu, HUANG Longbo. A survey of multi-modal learning theory[J]. Acta Scientiarum Naturalium Universitatis Sunyatseni, 2023, 62(5): 38-49. DOI: 10.13471/j.cnki.acta.snus.2023A022.
Abstract: Deep multi-modal learning, a rapidly growing field with a wide range of practical applications, aims to effectively utilize and integrate information from multiple sources, known as modalities. Despite its impressive empirical performance, the theoretical foundations of deep multi-modal learning have yet to be fully explored. In this paper, we undertake a comprehensive survey of recent developments in multi-modal learning theory, focusing on the fundamental properties that govern this field. Our goal is to provide a thorough collection of current theoretical tools for analyzing multi-modal learning, to clarify their implications for practitioners, and to suggest future directions for establishing a solid theoretical foundation for deep multi-modal learning.
Keywords: multi-modal learning; machine learning theory; optimization; generalization