Deep Model Fusion: A Survey

Weishi Li1†  Yong Peng1†  Miao Zhang1  Liang Ding2  Han Hu3  Li Shen2
1 National University of Defense Technology, China
2 JD Explore Academy, China
3 Beijing Institute of Technology, China
† Equal contribution
Corresponding author
liweishi.wh@foxmail.com; {yongpeng,zhangmiao15}@nudt.edu.cn; {liangding.liam,mathshenli}@gmail.com; hhu@bit.edu.cn
Abstract

Deep model fusion/merging is an emerging technique that merges the parameters or predictions of multiple deep learning models into a single one. It combines the abilities of different models to compensate for the biases and errors of any single model and thereby achieve better performance. However, deep model fusion on large-scale deep learning models (e.g., LLMs and foundation models) faces several challenges, including high computational cost, high-dimensional parameter spaces, and interference between heterogeneous models. Although model fusion has attracted widespread attention due to its potential to solve complex real-world tasks, a complete and detailed survey of the technique is still lacking. Accordingly, to better understand model fusion methods and promote their development, we present a comprehensive survey that summarizes recent progress. Specifically, we categorize existing deep model fusion methods into four groups: (1) “Mode connectivity”, which connects the solutions in weight space via a path of non-increasing loss in order to obtain a better initialization for model fusion; (2) “Alignment”, which matches units between neural networks to create better conditions for fusion; (3) “Weight average”, a classical model fusion method, which averages the weights of multiple models to obtain more accurate results closer to the optimal solution; (4) “Ensemble learning”, which combines the outputs of diverse models and is a foundational technique for improving the accuracy and robustness of the final model. In addition, we analyze the challenges faced by deep model fusion and propose possible directions for future research. Our review helps in understanding the relationships between different model fusion methods and their practical applications, and can inform further research in the field of deep model fusion.

1 Introduction

In recent years, deep neural networks (DNNs) [129] have developed remarkably and are widely used in computer vision (CV) [175], natural language processing (NLP) [30] and other fields. Generally speaking, a single deep learning model often has certain limitations and cannot fully capture all the underlying information behind complex networks [195]. Therefore, classic ensemble learning [198, 15, 193] combines the outputs of multiple models to improve the final performance in deep learning (DL). However, it suffers from the high cost of storing and running multiple models at test time [204, 65], especially as the complexity and size of models increase. For example, GPT-3 [172] has billions of parameters, and PaLM [31] reaches 540 billion parameters trained on 780 billion tokens. In addition, from the perspective of the loss landscape of DNNs [134, 196], gradient-optimized solutions usually converge to points near the boundary of a wide flat region instead of its central point [99]. This means that a trained network is not exactly at the optimal solution with minimum test error, and the solutions near the relative optimum need to be fused for a better result. This inspires researchers not to limit the fusion scope to predictions (e.g., logits), but to also include the fusion of model parameters without accessing the training data or maintaining all individual models [110]. Accordingly, deep model fusion [159, 111] aims at fusing several DNNs into a single network that preserves their original capabilities and can even outperform multi-task training [135, 3]. In addition, deep model fusion can reduce the tendency of a single model to overfit particular samples or noise, and thus improve the accuracy, diversity and robustness of predictions [223, 207].

Deep model fusion has attracted increasing interest because of data privacy and practical resource-saving concerns. Although the development of deep model fusion has brought many technical breakthroughs, it also raises a series of challenges, such as high computational load, model heterogeneity, and slow alignment via combinatorial optimization [133, 204]. Some approaches are limited to specific scenarios [254, 227], which inspires researchers to investigate the principles of model fusion in different cases. Nevertheless, there is currently a lack of comprehensive reviews that summarize the approaches and explain the internal mechanisms of deep model fusion. Some work focuses only on model fusion from a single perspective (e.g., feature fusion) [45, 195] or in a specific scenario [213], or on the fusion of information from different sources (multi-modal fusion [1, 103]) rather than the fusion of parameters. In order to give developers insight into deep model fusion, we analyze its principles and methodologies. In addition, we review recent progress and representative applications, such as federated learning (FL) [160] and fine-tuning [29]. Our survey aims to illustrate the latest trends and potential directions in deep model fusion and to provide a guideline for researchers to enhance performance and reduce costs. Accordingly, we group the approaches into four categories according to their internal mechanisms and purposes, as shown in Figure 1. For models trained independently that are not in the vicinity of each other, “mode connectivity” and “alignment” bring the solutions closer so as to obtain better initial conditions for averaging. For similar models with certain differences in weight space, “weight average (WA)” tends to average the models directly and obtain solutions closer to the optimal point in the region of the parameter space where the value of the loss function is low [118]. Furthermore, for the predictions of existing models, “ensemble learning” integrates different forms of model predictions to get better results. Specifically, the four categories are as follows:

Figure 1: Schematic diagram of the overall model fusion process, as well as the classification of the various methods and the connections between them.
  • Mode connectivity. The solutions obtained by gradient-based optimization can be connected in weight space by a path (connector) with no obstacles, which is referred to as mode connectivity [50, 46]. Along the low-loss path we can obtain other models that are more suitable for model fusion. According to the mathematical form of the path and the space in which the connector is located, we divide this section into three parts: “linear mode connectivity (LMC) [66]”, “non-linear mode connectivity” and “mode connectivity in subspace”. Mode connectivity can help address local optimum problems during training. The geometric relationships of the paths of mode connectivity [162, 61] could also be used to improve the convergence speed, stability and accuracy of optimization procedures such as stochastic gradient descent (SGD). In short, mode connectivity provides a new perspective for interpreting and understanding the behaviors of model fusion [66]. However, the difficulties of computational complexity and parameter tuning remain to be solved, especially when training models on large datasets.

  • Alignment. Alignment [140, 218] matches the units of multiple models and then averages the models to obtain the final model. After alignment, the models become closer under specific mathematical metrics (e.g., Euclidean distance [218]), which reduces the differences between models and thus enhances the effect of deep model fusion. Alignment can be divided into “activation matching” and “weight matching” depending on whether the data distribution needs to be considered. Moreover, Re-basin [3] is introduced on the basis of alignment; it explores the mechanism by which solutions can be transported into a single basin (i.e., a flat area of the parameter space with relatively low loss [61, 96]) through permutation invariance [50]. However, alignment often faces the obstacles of heavy computation, slow combinatorial optimization and architecture differences, which make it hard to extend to other scenarios with different objectives. For example, the memory burden that comes with graph matching [142, 230] limits the application of deep model fusion.

  • Weight average. WA [227] is the most direct and efficient way to fuse several parent networks into a single network [159, 204]. Compared to mode connectivity and alignment, WA does not require additional computation or training to find a superior starting point, and it performs well on models that share a degree of similarity. According to the space of aggregation, WA can be classified into two parts: “weight average” and “average in subspace”. In addition, the typical approaches “model soup”, “model arithmetic” and “stochastic weight averaging (SWA)” also provide significant improvements over existing methods. However, some bias may be introduced when the parameters are normalized and merged if the models differ greatly in structure or number of parameters. Nonetheless, WA is still the mainstream method of deep model fusion because of its simplicity and efficiency.

  • Ensemble Learning. The outputs of several different models are combined to improve prediction performance and robustness, which is referred to as “ensemble learning” [195]. In this review, we focus on ensemble learning in DL. Based on ensemble learning, “model reuse” provides specifications for each model so that useful models can be identified and merged from the pool of models when new learning tasks are given [266, 177]. Ensemble learning has various frameworks with convenient interfaces and is often used in practical areas such as object detection [20]. Although ensemble learning requires maintaining multiple trained models and running each of them at test time [204], it is still one of the powerful techniques that have been widely adopted in DL.

  • Applications of Model Fusion. As a technology for improving the accuracy and robustness of deep models, model fusion promotes progress in many application fields. “Federated learning [160]”, an application that aggregates clients’ models on a central server, makes it possible for various parties to contribute data to the computation of functions (e.g., various statistics, classifiers [177]) without the risk of privacy disclosure. “Fine-tuning” makes small adjustments to pre-trained models and is combined with model fusion to reduce training costs and adapt to the needs of a specific task or domain. Model fusion is also involved in “distillation”, which combines soft target knowledge from multiple complex models (teachers) to train a small model for specific requirements. “Model fusion on foundation/LLMs” covers the work on large foundation models or large language models (LLMs), such as vision transformers (ViT) [79] and GPT [17]. The applications of model fusion help developers adapt to the needs of various tasks and domains and promote the development of DL.

In brief, our survey reviews deep model fusion techniques. In the first three sections, “mode connectivity”, “alignment” and “weight average”, we mainly conduct a comprehensive study from the perspective of the fusion of model parameters. In the “ensemble learning” section, we mainly investigate the issue from the perspective of aggregating model outputs. The main contributions of this work are summarized as:

  • We propose a new classification of deep model fusion methods from the perspectives of “mode connectivity”, “alignment”, “weight average” and “ensemble learning”, which covers the theoretical synthesis approaches of model fusion and provides guidance for training DNNs with high generalization and accuracy.

  • We compare the advantages and disadvantages of fusion approaches, and explain the mechanism and relationship between them, which provides inspiration for designing advanced model fusion methods in the future.

  • We summarize extensive applications of deep model fusion. We also discuss current research trends so as to attract more attention and reflection in the future.

The remainder of the paper is organized as follows: in Section 2 to Section 5, we introduce the approaches of deep model fusion according to the four perspectives “mode connectivity”, “alignment”, “weight average” and “ensemble learning”. Section 6 introduces the applications of deep model fusion: “federated learning”, “fine-tuning”, “distillation” and “model fusion on foundation/LLMs”. Finally, in Section 7, we summarize deep model fusion and discuss the challenges and potential directions in the future.

In addition, we list the notations used throughout the text and their corresponding definitions. $\boldsymbol{W}_i$ is the $i$-th neural network with weights $W_i \in \mathbb{R}^d$ $(i = 1, 2, \ldots, k)$ and bias term $\boldsymbol{b}$. $\lambda$ denotes weighting parameters. $\sigma$ denotes a non-linear neuron activation function. $\mathcal{L}$ is the loss function that quantifies the discrepancy between the predicted and actual values.

2 Mode Connectivity

Table 1: The summary of standard training pipelines of LMC and non-linear mode connectivity.

| Mode connectivity | The form of path | Ref. | Eq. |
|---|---|---|---|
| Linear path | segment | [58, 54] | $\phi(t)=(1-t)w_1+tw_2$ |
| Linear path | polygonal chain | [69, 66] | $\phi(t)=\left\{\begin{array}{ll}2\left(tw+(0.5-t)w_1\right), & 0\leq t\leq 0.5\\ 2\left((t-0.5)w_2+(1-t)w\right), & 0.5\leq t\leq 1\end{array}\right.$ |
| Non-linear path | quadratic Bezier curve | [152, 52] | $\phi(t)=(1-t)^2w_1+2t(1-t)w+t^2w_2,\quad 0\leq t\leq 1$ |
| Non-linear path | Fourier series approximate curves | [234] | $\hat{\phi}(t)=\frac{\beta_0}{2}+\sum_{i=1}^{n}\beta_i\cos\left(w_it+\zeta_i\right)$ |

In this section, we introduce the definition, principles and related methods of mode connectivity. When training neural networks, the solutions found by gradient-based optimization algorithms (e.g., SGD) can be merged directly, but often without superior results [61, 46]. It has been discovered that solutions can be connected via continuous paths (connectors) in the network weight space without increasing the loss, which is referred to as mode connectivity [50, 66]. Through mode connectivity, the models on the low-loss path can be fused to leverage the advantages of multiple models, which is of great significance for producing a better aggregated model.

First, we explain the principles of mode connectivity. In a typical DL training process, a minimum is usually described as a point at the bottom of a convex valley, and the network parameters are determined by the location of that minimum [118, 85, 116]. The traditional view is that the number of local minima and saddle points is large [71, 228], and that different training runs converge to local minima in different isolated regions of the parameter space [10, 27, 39]. Recent work [196, 125] demonstrates that the minima obtained by gradient-based optimizers are not walled off in isolated valleys [61]. Gotmare et al. [72] explore the potential relationship between the minima found by different training processes. Other work [46, 169, 33, 182] shows that neural network solutions form a connected manifold (i.e., solutions in the loss landscape are connected by pipelines in weight space). In contrast to such low-loss paths, a direct linear path connecting two independently trained networks usually leaves the low-loss manifold, which creates a high loss barrier at points on the linear path. For example, the error at the midpoint of the line segment directly connecting two solutions is close to 90% (VGG-16 on CIFAR-10 [66]). The above work supports the existence and effect of mode connectivity.

Second, some work [59, 66, 50] quantifies the pipelines of mode connectivity. Let $\mathcal{L}\left(tw_1+(1-t)w_2\right)$ for $t\in(0,1)$ be the loss (train or test error) of a neural network created by linearly interpolating between $\boldsymbol{W}_1$ and $\boldsymbol{W}_2$. When SGD is run with the initialization and hyperparameters fixed, the random data augmentations in each epoch can be seen as noise. To determine whether the result of a trained network is stable to SGD noise, the loss barrier (error barrier) $B\left(w_1,w_2\right)$ [60] is defined as the maximum difference between the loss of the linearly interpolated network and the linear interpolation of the losses at the two endpoints [50], as shown in Eq. (1):

$$B\left(w_1,w_2\right)=\sup_t\left[\mathcal{L}\left(tw_1+(1-t)w_2\right)\right]-\left[t\mathcal{L}\left(w_1\right)+(1-t)\mathcal{L}\left(w_2\right)\right]. \tag{1}$$
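As a minimal sketch of how the loss barrier in Eq. (1) can be estimated numerically, the snippet below evaluates the gap on a grid of $t$ values. It assumes that the supremum is taken over the whole difference, and it uses a hypothetical two-dimensional double-well function `double_well` in place of a real network loss; both the function and the chosen points are illustrative only.

```python
import numpy as np

def loss_barrier(loss_fn, w1, w2, num_points=101):
    """Grid estimate of B(w1, w2): max over t of
    L(t*w1 + (1-t)*w2) - (t*L(w1) + (1-t)*L(w2))."""
    ts = np.linspace(0.0, 1.0, num_points)
    l1, l2 = loss_fn(w1), loss_fn(w2)
    gaps = [loss_fn(t * w1 + (1 - t) * w2) - (t * l1 + (1 - t) * l2) for t in ts]
    return max(gaps)

def double_well(w):
    # Hypothetical toy loss with two minima at (-1, 0) and (1, 0).
    return (w[0] ** 2 - 1.0) ** 2 + w[1] ** 2

# Two minima in different basins: the straight line crosses the ridge at (0, 0).
barrier_across = loss_barrier(double_well, np.array([-1.0, 0.0]), np.array([1.0, 0.0]))

# Two points in the same basin: the loss is convex along the segment, so no barrier.
barrier_within = loss_barrier(double_well, np.array([1.0, -0.5]), np.array([1.0, 0.5]))

print(barrier_across, barrier_within)
```

In this toy setting a barrier near zero corresponds to the linearly connected case, while a clearly positive barrier signals that the straight segment leaves the low-loss region.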

The loss barrier indicates whether the error stays constant or increases as we move through the loss landscape [56, 61] along the path between $\boldsymbol{W}_1$ and $\boldsymbol{W}_2$. If there is a tunnel between the two networks with a barrier approximately equal to 0, the networks are said to be mode connected [60, 46, 59]. That is to say, the local minima obtained by SGD can be connected by a path $\phi$ with the lowest maximum loss, as shown in Eq. (2):

$$\phi\left(w_1,w_2\right)=\underset{\phi\text{ from }\boldsymbol{W}_1\text{ to }\boldsymbol{W}_2}{\operatorname{argmin}}\left\{\max_{w\in\phi}\mathcal{L}(w)\right\}, \tag{2}$$

which means that the loss is low along the pathway and the network is stable to SGD noise [46], as shown in Figure 2. Mode connectivity is carried out in two steps: first determine the form of the tunnel (e.g., polygonal chain, Bezier curve [66], etc.) as in Table 1; then find the optimal low-loss pathway that connects the different solutions, as shown in Table 2. According to the form of the path and the space in which it is located, this section introduces “Linear mode connectivity”, “Non-linear mode connectivity” and “Mode connectivity in subspace”.

2.1 Linear Mode Connectivity

Figure 2: Schematic diagram of mode connectivity in a two-dimensional loss landscape and in a subspace of other dimension. Left: Linear interpolation of the minima in the two basins results in high-loss barriers [46]. The lower two optima are connected by a path of near-constant low loss (e.g., Bezier curve, polygonal chain, etc.) [66]. $\pi(W_2)$ is the equivalent model of $W_2$ by permutation symmetry, which is located in the same basin as $W_1$. Re-Basin merges models by delivering solutions to a single basin [3]. Right: Low-loss paths connect multiple minima in a subspace (e.g., a low-loss manifold composed of $d$-dim wedges [56]).

In order to connect two points by an optimized low-loss path, we first need to determine the form of the tunnel. If the optimal path $\phi^{*}$ is linear, it is called LMC. Common linear paths are the line segment and the polygonal chain, the latter given by Eq. (3):

$$\phi_w(t)=\left\{\begin{array}{ll}2\left(tw+(0.5-t)w_1\right), & 0\leq t\leq 0.5\\ 2\left((t-0.5)w_2+(1-t)w\right), & 0.5\leq t\leq 1\end{array}\right., \tag{3}$$

The endpoint networks are trained with the same hyperparameters from different random initializations, and the parametric path satisfies $\phi_w(0)=w_1$ and $\phi_w(1)=w_2$. After deciding on the mathematical form of the tunnel, its specific parameters need to be determined. Garipov et al. [66] suggest minimizing the expectation of the loss $\ell(w)$ over a uniform distribution, as in Eq. (4):

$$\min_w\ell(w)=\min_w\mathbb{E}_{t\sim U(0,1)}\left[\mathcal{L}\left(\phi_w(t)\right)\right], \tag{4}$$
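The two steps above (choose the path form in Eq. (3), then minimize the expected loss in Eq. (4)) can be sketched on a toy problem. The snippet below is an illustrative assumption rather than the procedure of [66]: `ring_loss` is a hypothetical loss whose minima form the unit circle, the endpoints are fixed, and only the bend point $w$ is updated by stochastic gradient steps with $t$ sampled from $U(0,1)$.

```python
import numpy as np

def polygonal_chain(t, w1, w2, bend):
    """Eq. (3): piecewise-linear path w1 -> bend -> w2 for t in [0, 1]."""
    if t <= 0.5:
        return 2.0 * (t * bend + (0.5 - t) * w1)
    return 2.0 * ((t - 0.5) * w2 + (1.0 - t) * bend)

def ring_loss(w):
    # Hypothetical toy loss: zero on the unit circle, one at the origin.
    return (np.dot(w, w) - 1.0) ** 2

def ring_grad(w):
    return 4.0 * (np.dot(w, w) - 1.0) * w

def fit_bend(w1, w2, steps=5000, lr=0.02, seed=0):
    """Stochastic minimization of E_t[L(phi_w(t))] over the bend point only."""
    rng = np.random.default_rng(seed)
    bend = 0.5 * (w1 + w2) + np.array([0.0, 0.1])  # start just off the straight segment
    for _ in range(steps):
        t = rng.uniform()
        point = polygonal_chain(t, w1, w2, bend)
        # d(point)/d(bend) is 2t on the first segment and 2(1 - t) on the second.
        coeff = 2.0 * t if t <= 0.5 else 2.0 * (1.0 - t)
        bend = bend - lr * coeff * ring_grad(point)
    return bend

def mean_path_loss(w1, w2, bend, num_points=201):
    ts = np.linspace(0.0, 1.0, num_points)
    return float(np.mean([ring_loss(polygonal_chain(t, w1, w2, bend)) for t in ts]))

w1, w2 = np.array([-1.0, 0.0]), np.array([1.0, 0.0])
straight_loss = mean_path_loss(w1, w2, 0.5 * (w1 + w2))  # bend at the midpoint = line segment
fitted_bend = fit_bend(w1, w2)
fitted_loss = mean_path_loss(w1, w2, fitted_bend)
print(straight_loss, fitted_loss)
```

With a real network, `ring_grad` would be replaced by a mini-batch gradient of the training loss evaluated at the sampled point on the path; the endpoints stay frozen and only the bend is trained.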

In addition, the tunnel found in this way is not unique. Moreover, vanilla mode connectivity is not robust against various types of adversarial attacks. Robust mode connectivity (RMC) [229] uses adversarial training (AT) [156] to find tunnels between neural networks that exhibit robustness to different types of adversarial attacks, as in Eq. (5):

$$\min_w\ell(w)=\min_w\mathbb{E}_{t\sim U(0,1)}\sum_{i}\max_{\operatorname{Dist}_i\left(\mathbf{x}^{\prime},\mathbf{x}\right)\leq\delta_i}\mathcal{L}\left(\phi_w(t);\left(\mathbf{x}^{\prime},y\right)\right), \tag{5}$$

where the $\delta_i$ are small values and $\operatorname{Dist}_i$ denotes a distance measurement function. The RMC path in the parameter space improves robustness to different types of attack. Some work complements LMC from a global connectivity perspective. Nguyen et al. [168] prove that when the number of neurons in a hidden layer is larger than a certain number of training samples, the loss function has no so-called bad local valleys, and all the global minima are connected in a large global valley. Shevchenko et al. [202] demonstrate that as the number of neurons increases (over-parameterization), the landscape of the multi-layer network becomes connected, which is more conducive to LMC. Previous studies speculate that the interconnection of local minima in over-parameterized networks implies mode connectivity of the loss function, but this does not always hold true (e.g., for over-parameterized two-layer networks [125]). Kuditipudi et al. [125] explain mode connectivity by noise stability [60, 8], which is somewhat equivalent to dropout stability. In other words, all noise-stable solutions can be connected in a sufficiently over-parameterized network.

As for the practical application of LMC, Zhao et al. [263] suggest using LMC to repair backdoored or error-injected models. Neyshabur et al. [167] show the application of LMC to pre-trained visual models. Qin et al. [186] explore the relationship between different downstream configurations and the mode connectivity of language models.

2.2 Non-linear Mode Connectivity

In this subsection, we focus on non-linear pathways that connect solutions in weight space, which is known as non-linear mode connectivity [112, 186]. The Bezier curve is one of the representative forms of non-linear path, as in Eq. (6):

$$\phi_w(t)=(1-t)^2w_1+2t(1-t)w+t^2w_2,\quad 0\leq t\leq 1. \tag{6}$$
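To make Eq. (6) concrete, the short sketch below evaluates a quadratic Bezier path and compares its midpoint with the midpoint of the straight segment. The loss `ring_loss` and the hand-picked control point are hypothetical and only serve to show how a curved path can stay in a low-loss region that the straight line leaves.

```python
import numpy as np

def bezier_path(t, w1, w2, bend):
    """Eq. (6): quadratic Bezier curve with endpoints w1, w2 and control point bend."""
    return (1 - t) ** 2 * w1 + 2 * t * (1 - t) * bend + t ** 2 * w2

def ring_loss(w):
    # Hypothetical toy loss: zero on the unit circle, one at the origin.
    return (np.dot(w, w) - 1.0) ** 2

w1, w2 = np.array([-1.0, 0.0]), np.array([1.0, 0.0])
bend = np.array([0.0, 2.0])  # hand-picked control point, not a trained one

mid_curve = bezier_path(0.5, w1, w2, bend)   # 0.25*w1 + 0.5*bend + 0.25*w2
mid_line = 0.5 * (w1 + w2)                   # midpoint of the straight segment
print(ring_loss(mid_curve), ring_loss(mid_line))
```

In practice the control point would be trained rather than chosen by hand; the derivative of the path with respect to it is simply $2t(1-t)$ times the identity.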

Under LMC, convex combinations of minima within a loss basin remain in the same basin. In contrast, minima that are connected only by non-linear paths are not located in the same basin, which means that LMC is not available in some cases.

Recent work [125, 46, 152] shows that different independently trained networks can be connected by non-linear pathways that remain in the low-loss manifold in weight space. Qin et al. [186] speculate that there may be multiple loss basins connected by low-loss non-linear paths. Yun et al. [253] indicate that, in the Bridge network, outputs along the Bezier curve connecting the parameters of two networks can be obtained without an actual forward pass through the interpolated network. Gotmare et al. [72] show that non-linear mode connectivity holds widely for networks trained with different optimizers, data augmentation strategies and learning rate schedules. Furthermore, Lubana et al. [152] explain the principle of mode connectivity by mechanistic similarity: two models are mechanistically similar if they make predictions using the same properties (e.g., shape or background) of the input. The mechanistic similarity of the induced models is related to the LMC of two minimizers (minima). There is no LMC between mechanistically dissimilar minimizers, but mode connections can be made via non-linear paths. The representative approach for finding a non-linear path [66] is similar to that for LMC, as in Eq. (7):

$$\min_w\ell(w)=\min_w\mathbb{E}_{t\sim q_w(t)}\left[\mathcal{L}\left(\phi_w(t)\right)\right], \tag{7}$$

where $q_w(t)$ is the distribution for sampling the models along the path. Moreover, Draxler et al. [46] use AutoNEB [122] and a minimum spanning tree (MST) to generate an approximation of $\phi^{*}$ connecting the minima of networks on CIFAR-10 and CIFAR-100. AutoNEB connects two solutions and updates the pivots after each iteration until the AutoNEB tunnel approaches the optimal low-loss path $\phi^{*}$. Nevertheless, the approximation of $\phi^{*}$ may fall into a locally optimal tunnel with unreasonably high saddle point losses.

To sum up, both linear and non-linear paths can result in low test errors. While linearly connected pathways are simple, they can have certain limitations. As for non-linear mode connectivity, it is difficult to calculate the gradient on some non-linear paths such as the Bezier curve.

2.3 Mode Connectivity in Subspace

Table 2: The methods of finding tunnels between different local minima.

| Connectors | Methods | Ref. | Introduction |
|---|---|---|---|
| 2-dim path | line segment | [71] | produce big error |
| 2-dim path | GDSS | [61] | approximate the geodesic paths via GDSS |
| 2-dim path | AutoNEB | [122, 46] | minimize MST to obtain approximation of $\phi^{*}$ |
| 2-dim path | minimize the expectation | [66] | representative approach that connects solutions in a simple way |
| 2-dim path | RMC | [229] | enhance the robustness of DNNs against different perturbations |
| N-dim space | MPO | [206, 36] | obtain substantial memory savings |
| N-dim space | N-dimensional connectors | [56] | connect low-dimensional wedges |
| N-dim space | train parametric subspace | [238] | learn the parameters of lines, curves and simplexes |
| N-dim space | SPRO, ESPRO | [11] | find simplexes and simplicial complexes to seek connectors |
| N-dim space | geodesic optimization | [215] | speculate the geodesics in the curved distribution space |

Previous work on mode connectivity [66, 56, 54] focuses on low-loss tunnels in weight space without explicitly addressing structure in other dimensions. This subsection explores mode connectivity and model training in a subspace of another dimension rather than in the native parameter space. A subspace in machine learning typically describes a linear structure generated by vectors in the initial vector space. There are also concepts of non-linear subspace, such as non-linear dimensionality reduction and manifold learning [98]. Standard neural network training is performed on the full parameter space $\mathbb{R}^{D}$. Limiting the optimization to a random low-dimensional affine subspace (e.g., low-dimensional hyperplanes and hyperspheres) leads to results similar to full-space optimization in some cases [57, 132], which lays the foundation for mode connectivity in subspace. Specifically, mode connectivity in an oriented subspace constrains the representation ability of the model and the value range of the weights, which helps to mitigate the over-fitting problem of model fusion.
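A minimal sketch of optimization restricted to a random affine subspace is given below. It parameterizes the weights as $\theta=\theta_0+Pz$ with a fixed random matrix $P\in\mathbb{R}^{D\times d}$ and trains only $z$. The under-determined least-squares loss, the dimensions and the step-size rule are assumptions chosen for illustration; they are not taken from [57, 132].

```python
import numpy as np

rng = np.random.default_rng(0)
D, n = 200, 10                       # full parameter dimension, number of constraints
A = rng.normal(size=(n, D)) / np.sqrt(D)
y = rng.normal(size=n)
theta0 = rng.normal(size=D)          # random starting point (offset of the affine subspace)

def loss(theta):
    r = A @ theta - y
    return 0.5 * float(r @ r)

def grad(theta):
    return A.T @ (A @ theta - y)

def train_in_subspace(d, steps=2000, seed=1):
    """Gradient descent on z only, with theta = theta0 + P z and P fixed."""
    P = np.random.default_rng(seed).normal(size=(D, d)) / np.sqrt(D)
    lr = 1.0 / np.linalg.norm(A @ P, 2) ** 2      # step size from the largest singular value
    z = np.zeros(d)
    for _ in range(steps):
        z -= lr * (P.T @ grad(theta0 + P @ z))    # chain rule: dL/dz = P^T dL/dtheta
    return theta0 + P @ z

loss_init = loss(theta0)
loss_d40 = loss(train_in_subspace(40))   # subspace larger than the number of constraints
loss_d5 = loss(train_in_subspace(5))     # subspace too small to satisfy all constraints
print(loss_init, loss_d40, loss_d5)
```

In this toy problem the random subspace reaches a near-zero loss once its dimension exceeds the number of constraints, which loosely mirrors the observation that a sufficiently large random subspace can contain good solutions.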

Recent work attempts to implement mode connectivity in different subspaces. Fort et al. [56] extend the concept of low-loss connectors (tunnels) between solutions to $m$-dimensional connectors ($m$ is smaller than the dimension of the full parameter space). Randomly initialized points that are not on the same wedge (i.e., a union of $m$-dimensional manifolds) can always pass through the intersection of their wedges, thus building a low-loss path between the different minima, as shown in Figure 2. Based on this speculation, $m$-dimensional hyperplanes are constructed on the piece-wise linear interpolation between the points, in which the low-loss connectors can be found. Benton et al. [11] propose simplicial point-wise random optimization (SPRO) to connect models through a multi-dimensional manifold. $\mathcal{K}\left(S_{\left(w_0,\varepsilon_0\right)},S_{\left(w_1,\varepsilon_0\right)}\right)$ denotes a simplicial complex composed of disjoint 0-simplexes. SPRO iteratively adds join points $\varepsilon_i$ to connect the 0-simplexes in the complex so as to keep the loss low within the simplicial complex. It obtains a complex $\mathcal{K}$ by sharing multiple $\varepsilon_i$. When a join point $\varepsilon_k$ connects two modes, the pathway of the complex $\mathcal{K}\left(S_{\left(w_0,\varepsilon_0\right)},\ldots,S_{\left(w_n,\varepsilon_0\right)}\right)$ can be found by the previous method [66]. When several join points connect multiple modes, the solution to $\mathcal{K}\left(S_{\left(w_0,\varepsilon_0,\varepsilon_1,\varepsilon_2\right)},\ldots,S_{\left(w_n,\varepsilon_0,\varepsilon_1,\varepsilon_2\right)}\right)$ is similar to the above work [56]. For narrow network architectures, geodesic optimization [215] finds a low-loss pathway connecting the solutions where ordinary mode connectivity tunnels are blocked by a region of high loss. The mode connectivity pathways in weight space are associated with the geodesics $\gamma$ (i.e., shortest paths in the space of parameterized distributions, which is regarded as a Riemannian manifold with Fisher information matrix $f_{ij}$). The geodesic $\gamma$ is obtained by minimizing the loss $\mathcal{L}(\gamma)$, which is equivalent to the integral of the square root of the Jensen-Shannon divergence (JSD) [35], as in Eq. (8):

$$\mathcal{L}(\gamma)=\int_t\sqrt{\frac{d\gamma^i}{dt}f_{ij}\frac{d\gamma^j}{dt}}\,dt=\sqrt{8}\int_\gamma\sqrt{d\mathrm{JSD}} \tag{8}$$

Further, mode connectivity in subspace is affected by the properties of the subspace, such as the relationship between the dimension of the plane and the intrinsic dimension specific to the problem, the radius in the weight space, and the dimension of the hyperplane [132]. Moreover, Fort et al. [55] explore training trajectories and subspace sampling (e.g., dropout, diagonal Gaussian, low-rank Gaussian and random subspace), which further complements relevant work on mode connectivity in subspace. In addition, recent work [42] inspires the exploration of mode connectivity on the Pareto manifold for multi-task learning. In sum, in most cases trained solutions can be found both in the full parameter space and in a random low-dimensional hyperplane, as long as the points are distributed densely enough.

2.4 Discussion

In summary, mode connectivity provides a novel and flexible perspective for deep model fusion. The training of neural networks tends to fall into local optima, which leads to degraded performance. On the basis of mode connectivity, we can find other models with better performance and use them as starting points for further optimization and fusion. We can also start from an already trained model and move in the parameter space to reach a new target model, which saves time and computing overhead and is suitable for situations where data is limited. Nevertheless, connecting different models may introduce additional complexity and flexibility, which increases the risk of overfitting. Therefore, the relevant hyperparameters and the degree of variation should be carefully controlled. Also, mode connectivity requires fine-tuning or parameter changes, which can increase training time and resource consumption. Overall, mode connectivity has many advantages in model fusion, including helping to overcome local optimum problems and providing new perspectives for explaining network behavior. In the future, mode connectivity is expected to help in understanding the inner mechanisms of neural networks and to provide guidance for more efficient deep model fusion designs.

3 Alignment

Table 3: Comparison of representative alignment methods.

| Alignment | | Methods | Ref. |
|---|---|---|---|
| Activation matching | metrics | coefficient of correlation | [140, 218] |
| | | mutual information | [140] |
| | | $\ell_2$ distance | [218, 204, 3] |
| | pre & post activation | pre-activation | [218, 204] |
| | | post-activation | [140, 218] |
| Weight matching | metrics | Wasserstein distance | [204, 4, 232] |
| | | Euclidean distance | [3, 178] |
| | graph matching | bipartite matching | [140, 127] |
| | | graph matching | [142] |
| | other alignment | Bayesian | [254, 227] |
| | | Sinkhorn Re-basin | [178] |
| | | SA | [50] |

Due to the randomness of channels and components across networks, the active components of different networks interfere with each other [204]. Unaligned weight averaging can therefore ignore the correspondence between units of different models and damage useful information. For example, two neurons in different models may look completely different yet be functionally similar. Alignment matches the units of different models so as to obtain better initial conditions for deep model fusion. It aims to reduce the differences between the models, thus enhancing the effect of deep model fusion. In essence, alignment can be regarded as a combinatorial optimization problem. In this section, we introduce a representative mechanism, “Re-basin”, which transports solutions into a single basin so that models can be merged under better initial conditions. Following this, we divide alignment into two types, “Activation matching” and “Weight matching”, depending on whether the alignment objective is data-driven, as shown in Table 3.

3.1 Re-basin

Before introducing the specifics, we illustrate permutation symmetry and Re-basin, which are the basic premises of alignment. Generally speaking, the number of saddle points and local optima can increase exponentially with the number of parameters, even for shallow neural networks [10, 66]. It has been found that invariances in training lead to the same representation at some points among these local optima [81, 22, 140]. Specifically, the function of the network does not change if the units of a hidden layer are exchanged by a permutation, which is referred to as permutation symmetry [43, 50]. Formally, the $\ell$-th layer of a DNN, $f^{(\ell)}(x,w)=\sigma(W^{(\ell-1)}f^{(\ell-1)}+b^{(\ell-1)})$, can be described as Eq.(9) [3]:

$$f^{(\ell)}(x,w)=\boldsymbol{P}^{T}\sigma\left(\boldsymbol{P}W^{(\ell-1)}f^{(\ell-1)}+\boldsymbol{P}b^{(\ell-1)}\right), \tag{9}$$

where $\boldsymbol{P}$ denotes the permutation matrix. We can obtain the functionally equivalent model $f(x;w)=f\left(x;\pi(w)\right)$ by rearranging the weights with a permutation $\pi$. On the basis of permutation symmetry, solutions from diverse areas of weight space can generate equivalent solutions. An equivalent solution can be located in the same region as the original solution, separated only by a low loss barrier (basin), as shown in Figure 2, which is referred to as “Re-basin” [3], as in Eq.(10):

$$\text{Re-basin: }f^{(\ell)}(x,w)=\sigma\left(P^{(\ell)}W^{(\ell)}(P^{(\ell-1)})^{T}f^{(\ell-1)}+P^{(\ell)}b^{(\ell)}\right) \tag{10}$$

Once the optimal permutation matrix $\boldsymbol{P}^{*}$ is obtained, it is theoretically possible to implement model fusion: $W=\lambda_{1}W_{1}^{(\ell)}+\lambda_{2}P^{(\ell)}W_{2}^{(\ell)}(P^{(\ell-1)})^{T}$. Compared with mode connectivity, Re-basin transports the points into one basin by permutation instead of connecting them through low-loss tunnels. At present, alignment is a representative approach to Re-basin [3, 178]. However, how to efficiently search over all permutation symmetries so that all solutions are mapped to the same basin remains a challenge.
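The permutation symmetry behind Eq.(9) and Eq.(10) can be checked numerically. The following is a minimal sketch, assuming a two-layer ReLU network stored as plain Python lists; the helper names (`mlp`, `permute_hidden`) and the toy weights are illustrative assumptions and are not taken from [3].

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def mlp(x, W1, b1, W2, b2):
    """Two-layer network: W2 * relu(W1 * x + b1) + b2."""
    hidden = [max(0.0, z + b) for z, b in zip(matvec(W1, x), b1)]
    return [z + b for z, b in zip(matvec(W2, hidden), b2)]


def permute_hidden(W1, b1, W2, perm):
    """Reorder hidden units so that new unit i is old unit perm[i].

    Rows of W1 and entries of b1 are permuted (P W1, P b1), and the
    columns of W2 are permuted to match (W2 P^T).
    """
    W1_p = [W1[j] for j in perm]
    b1_p = [b1[j] for j in perm]
    W2_p = [[row[j] for j in perm] for row in W2]
    return W1_p, b1_p, W2_p


# Toy weights (hypothetical values chosen only for illustration).
W1 = [[1.0, -2.0], [0.5, 1.0], [-1.0, 3.0]]
b1 = [0.0, 1.0, -0.5]
W2 = [[2.0, -1.0, 0.5]]
b2 = [0.25]
x = [1.0, 2.0]

W1_p, b1_p, W2_p = permute_hidden(W1, b1, W2, perm=[2, 0, 1])
original_out = mlp(x, W1, b1, W2, b2)
permuted_out = mlp(x, W1_p, b1_p, W2_p, b2)
```

Here `original_out` and `permuted_out` agree even though `W1_p` differs from `W1`, which is why averaging two networks coordinate by coordinate can mix mismatched units unless one of them is permuted first.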

Permutation symmetries imposed by these invariances help us understand the structure of loss landscapes better [22, 66]. The invariances can also be seen as a source of saddle points in loss landscapes [14]. Godfrey et al. [68] investigate the algebraic structure of symmetries in neural networks and how this structure manifests itself in loss landscape geometry. Brea et al. [14] introduce permutation points in high-dimensional plateaus, at which neurons can be exchanged without increasing the loss or making parameter jumps, as in Figure 3. Gradient descent is conducted on the loss to adjust the parameter vectors $\vartheta_{m}$ and $\vartheta_{n}$ of neurons $m$ and $n$ until they reach the permutation point, i.e., the parameter configuration at which the parameter vectors and the functions of the two neurons are the same. Furthermore, Tatro et al. [218] explore the permutation symmetry of nonlinear mode connectivity. Benzing et al. [12] speculate that two random initializations of a network can lead to good performance after permutation. However, alignment does not always generate good low-loss connections between solutions, due to the variance collapse of activations. REnormalizing Permuted Activations for Interpolation Repair (REPAIR) [111] mitigates the variance collapse by rescaling the pre-activations of networks, which eliminates 90% of the barrier for ResNet-18 on CIFAR-10 after alignment.

Figure 3: Left: general alignment process. Model $A$ is transformed into model $A_{p}$ by reference to model $B$. Then the linear combination of $A_{p}$ and $B$ produces $C$. Right: the parameter vectors $\vartheta_{m}$, $\vartheta_{n}$ of two neurons in different hidden layers are adjusted until they are close to the permutation point. At the permutation point [14], $\vartheta_{m}^{\prime}=\vartheta_{n}^{\prime}$ and the two neurons compute the same function, which means that the two neurons can be exchanged.

3.2 Activation Matching

In this subsection, based on permutation symmetry, we focus on the matching of activation values. The initial models for fusion can be improved by reducing the differences between their activations. Minimizing a cost function between activations is a representative way to calculate $\boldsymbol{P}^{*}$; it can be cast as an assignment problem, such as the linear assignment problem (LAP) or the quadratic assignment problem (QAP), which can be solved by the Hungarian algorithm or the Sinkhorn algorithm. Common cost functions $\mathcal{C}$ used in alignment are cross-correlation [218] as Eq.(11), mutual information (information entropy) [140] as Eq.(12), $\ell_2$ distance [3] as Eq.(13), KL divergence, Wasserstein distance, etc.

$$\mathcal{C}(A_{m},A_{n})_{cor}=\mathbb{E}\left[\left(A_{m}-\mathbb{E}\left[A_{m}\right]\right)\left(A_{n}-\mathbb{E}\left[A_{n}\right]\right)\right]/\xi_{m}\xi_{n}, \tag{11}$$

$$\mathcal{C}(A_{m},A_{n})_{info}=\sum_{a\in A_{m}^{(\boldsymbol{W}_{1})}}\sum_{b\in A_{n}^{(\boldsymbol{W}_{2})}}p(a,b)\log\left(\frac{p(a,b)}{p(a)p(b)}\right), \tag{12}$$

$$\mathcal{C}(A_{m},A_{n})_{\ell_2}=\left\|A_{m}^{(\boldsymbol{W}_{1})}-\boldsymbol{P}A_{n}^{(\boldsymbol{W}_{2})}\right\|^{2}, \tag{13}$$

where $A_{m}$ denotes the activation of unit $m$ with standard deviation $\xi_{m}$, and $p(a)$ denotes the marginal probability distribution. In addition, it has been found that using post-activations is better than using pre-activations in some cases [218]. Besides these cost functions, Singh et al. [204] use optimal transport (OT) and the Wasserstein barycenter to match the activations of different neural networks. The transport map $\boldsymbol{T}\in\mathbb{R}^{n\times m}$ optimally transports the neurons of $\boldsymbol{W}_{1}$ to the neurons of $\boldsymbol{W}_{2}$ in the same layer. $\boldsymbol{T}$ plays a role similar to that of the permutation matrix and can be obtained as Eq.(14):

$$\boldsymbol{T}\leftarrow\mathrm{OT}\left(\mu,\nu,d_{s}\right), \tag{14}$$

where $d_{s}$ denotes the distance measure on the supports (here reflecting the $\ell_2$ distance between activations), and $\mu$ and $\nu$ are probability measures. OT-based methods of this kind lay the foundation for some recent work [4, 178, 3]. Nevertheless, if alignment is simply formulated as a linear problem, the second-order proximity of weights and the abundant edge information between channels could be ignored [142].
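Activation matching with the correlation measure of Eq.(11) can be sketched as follows. This is a minimal illustration under stated assumptions: each unit is described by its activations on a shared set of inputs, the correlation is maximized rather than minimized, and an exhaustive search over permutations stands in for the Hungarian algorithm, so it is only usable for a handful of units. The function names and the toy activations are hypothetical.

```python
from itertools import permutations
from math import sqrt


def correlation(a, b):
    """Pearson correlation between two activation traces (cf. Eq. 11).

    Assumes neither trace is constant, otherwise the denominator is zero.
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    std_a = sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    std_b = sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    return cov / (std_a * std_b)


def match_units(acts_a, acts_b):
    """Return perm such that unit m of model A is paired with unit perm[m] of B.

    acts_a[m] and acts_b[k] hold the activations of one unit over the same
    inputs. The assignment maximizing the total correlation is found by
    brute force over all permutations.
    """
    n = len(acts_a)
    corr = [[correlation(acts_a[m], acts_b[k]) for k in range(n)]
            for m in range(n)]
    return max(permutations(range(n)),
               key=lambda p: sum(corr[m][p[m]] for m in range(n)))


# Toy activations: model B holds rescaled or shifted copies of A's units
# in a different order (values are made up for illustration).
acts_a = [[0.0, 1.0, 2.0, 3.0], [1.0, 0.0, 1.0, 0.0], [3.0, 1.0, 4.0, 1.0]]
acts_b = [[6.0, 2.0, 8.0, 2.0], [0.1, 1.1, 2.1, 3.1], [2.0, 0.0, 2.0, 0.0]]
perm = match_units(acts_a, acts_b)
```

In this toy case the search pairs unit 0 of A with unit 1 of B, unit 1 with unit 2, and unit 2 with unit 0. For realistic layer widths the brute-force search would have to be replaced by a proper LAP solver.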

3.3 Weight Matching

Instead of matching activations, we can alternatively align models based on their weights, without relying on the data distribution. First, the basic approaches to weight matching are also based on minimizing a cost function to obtain $\boldsymbol{P}^{*}$. Singh et al. [204] use the weights of the incoming edges to calculate the support and probability measures and obtain the transport map $\boldsymbol{T}$ as Eq.(14). Ainsworth et al. [3] permute the rows and columns of the model weights to minimize the $\ell_2$ distance between the weight vectors (restricted by ordinary least squares) as Eq.(15):

$$\mathcal{C}(w_{1},w_{2})_{\ell_2}=\left\|\operatorname{vec}\left(w_{1}\right)-\operatorname{vec}\left(\pi\left(w_{2}\right)\right)\right\|^{2}. \tag{15}$$

This results in a sum of bilinear assignments problem (SOBLAP), which can be divided into sub-problems that are each solved as an LAP. Unlike activation matching, weight matching is not affected by the data distribution. However, every $\boldsymbol{P}$ then has to be obtained by solving an LAP, which is a complicated problem in essence and makes it difficult to leverage gradient-based optimization. Pena et al. [178] extend the scope of the cost function to all differentiable objectives, such as the cost at the midpoint as Eq.(16) and at a random point between $w_{1}$ and $w_{2}$ as Eq.(17):

$$\mathcal{C}_{mid}\left(w_{1},w_{2}\right)=\mathcal{C}\left(\frac{w_{1}+\pi\left(w_{2}\right)}{2}\right), \tag{16}$$

$$\mathcal{C}_{random}\left(w_{1},w_{2}\right)=\mathcal{C}\left[(1-\alpha)w_{1}+\alpha\pi\left(w_{2}\right)\right], \tag{17}$$

where $\alpha\sim U(0,1)$. Moreover, the Sinkhorn operator $S_{\tau}$ is added to the LAP process, and Sinkhorn Re-basin is shown as Eq.(18):

$$f^{(\ell)}(x,w)=\sigma\left[S_{\tau}\left(P^{(\ell)}\right)W^{(\ell)}S_{\tau}\left((P^{(\ell-1)})^{T}\right)f^{(\ell-1)}+S_{\tau}\left(P^{(\ell)}\right)b^{(\ell)}\right]. \tag{18}$$
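To make the role of $S_{\tau}$ in Eq.(18) concrete, the sketch below implements a Sinkhorn normalization in plain Python. It assumes that $S_{\tau}$ is the usual operator that exponentiates the scores divided by a temperature $\tau$ and then alternately normalizes rows and columns; the iteration count, the function name and the score matrix are illustrative assumptions, not values from [178].

```python
from math import exp


def sinkhorn(scores, tau=1.0, n_iters=200):
    """Map a square score matrix to an approximately doubly stochastic one.

    Starts from exp(scores / tau) and alternately normalizes rows and
    columns. A smaller tau pushes the result towards a hard permutation.
    """
    S = [[exp(v / tau) for v in row] for row in scores]
    for _ in range(n_iters):
        row_sums = [sum(row) for row in S]
        S = [[v / r for v in row] for row, r in zip(S, row_sums)]
        col_sums = [sum(col) for col in zip(*S)]
        S = [[v / c for v, c in zip(row, col_sums)] for row in S]
    return S


# Hypothetical unit-to-unit scores with a dominant diagonal.
scores = [[2.0, 0.0, 0.5], [0.0, 1.5, 0.2], [0.3, 0.1, 1.0]]
soft_perm = sinkhorn(scores, tau=0.5)
# Rounding each row to its largest entry recovers a hard assignment.
hard_perm = [max(range(len(row)), key=lambda j: row[j]) for row in soft_perm]
```

Because every step is a smooth function of `scores`, a soft permutation of this kind can sit inside a gradient-based training loop, whereas a hard LAP solution cannot.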

Sinkhorn Re-basin sidesteps the non-differentiability of the permutation search and can be applied to more scenarios, such as FL [160]. Based on the Beta-Bernoulli Process (BBP) [219], Yurochkin et al. [254] maximize the posterior of the random variables $p_{i}$ that match the neurons of any batch to the global neurons; the Hungarian algorithm can be used to solve this problem and obtain $P_{i}$. Rather than minimizing a cost function, Wang et al. [227] regard the units of a model as a random permutation of global nodes based on the BBP [219], so that the permutation matrix can be obtained by BBP-MAP [254]. A simulated annealing (SA)-based method [50] searches for valid permutations in weight space for Re-basin. Due to its high cost, it is unrealistic to apply, especially for large models. Stoica et al. [211] calculate a merge matrix $P_{i}$ and an unmerge matrix $\bar{P}_{i}$ to perform the merge and unmerge operations, which can be applied within a model or across models. Instead of calculating the optimal matrix, Ainsworth et al. [3] iteratively optimize an approximately equivalent model $\tilde{w}_{2}$ and keep looking for the closest equivalent model until convergence, which minimizes $\mathcal{L}$ as Eq.(19):

$$\min_{\tilde{w}_{2}}\mathcal{L}\left(\frac{1}{2}\left(w_{1}+\operatorname{proj}\left(\tilde{w}_{2}\right)\right)\right), \tag{19}$$

where the projection can be handled with a straight-through estimator (STE), which is expensive in practice. Based on the Gromov-Wasserstein barycenter (GWB) [179], Akash et al. [4] alternately update the coupling matrix $\Pi$ and $W$ to optimize the Gromov-Wasserstein barycenter distance until convergence. Let $k^{\ell}$ be the number of nodes in layer $\ell$; the final aligned model can be obtained as Eq.(20):

$$W^{\ell}\leftarrow k^{\ell}k^{\ell-1}\frac{1}{\mathbb{1}_{k^{\ell-1}}\mathbb{1}_{k^{\ell}}^{T}}\frac{1}{n}\sum_{i=1}^{n}\Pi^{\ell}_{i}W^{\ell}_{i}\left(\Pi^{\ell-1}_{i}\right)^{*T} \tag{20}$$

Moreover, recent research [227, 254] proposes alternating, for a number of iterations, between finding an alignment and retraining, in order to minimize the loss barriers between SGD minima.

Furthermore, another significant approach to alignment is graph matching (GM) [150], which aims to match the nodes of graphs using their structural characteristics. Since network channels and weights can be treated as nodes and edges, the alignment problem can be turned into GM [247, 142]. General approaches use bipartite semi-matching or bipartite matching [127, 140] to solve GM. Liu et al. [142] propose graduated assignment model fusion (GAMF) [230], which uses the second-order similarity of model weights to align neurons and builds on graduated assignment, as Eq.(21):

$$\max_{\boldsymbol{P}}\sum_{i=0}^{d_{\Sigma}-1}\sum_{j=0}^{d_{\Sigma}-1}\sum_{a=0}^{d_{\Sigma}-1}\sum_{b=0}^{d_{\Sigma}-1}\boldsymbol{P}_{[i,j]}\boldsymbol{K}_{[i,j,a,b]}\boldsymbol{P}_{[a,b]}, \tag{21}$$

where $d_{\Sigma}$ denotes the sum of dimensions and $\boldsymbol{K}$ denotes the affinity tensor that measures the affinity between the edges $(i,a)$ and $(j,b)$. The problem can be transformed into a QAP by unifying the relationships of nodes and edges into an incidence matrix. In contrast, multi-graph matching (MGM) [246, 108, 130] ensures that the matching of two graphs is not affected by another graph, and it applies to the alignment of multiple models. Further, Uriot et al. [222] explore merging models while taking more possible permutations into account.

3.4 Discussion

Alignment makes models more similar by adjusting their parameters, which can improve information sharing between the models and thus the generalization ability of the fused model. In addition, alignment helps improve the performance and robustness of the model on complex tasks. However, alignment methods suffer from slow combinatorial optimization. Alignment requires additional computational overhead to adjust the model parameters, which can lead to a more complex and time-consuming training process, especially for large and deep models [204, 142].

In summary, alignment can improve the consistency between different models and the overall effect of fusion. With the diversification of DL application scenarios, alignment will become one of the key methods for optimizing deep model fusion and improving generalization ability. In the future, alignment could play a role in areas such as transfer learning, domain adaptation [63], knowledge distillation, etc. For example, in transfer learning, alignment can reduce the differences between source and target domains and improve learning on new domains.

4 Weight Average

“Weight average” (WA) combines the weights of multiple networks into a final model with better performance, robustness and generalization. It is also known as vanilla average [204] or weight summation [131], as shown in Eq.(22):

$$\sum\lambda_{i}W_{i}, \tag{22}$$

where each model is assigned a weighting coefficient $\lambda_{i}$ that controls how much it contributes to the fused model. However, unlike alignment or mode connectivity, the preconditions of WA are relatively strict. For example, the original models must share part of the training trajectory or be located in the same basin [99, 133]. This means that the final model can benefit from all models when their weights are similar enough yet retain certain differences [110]. In a flat basin, solutions tend to demonstrate good performance. Conversely, points in narrow regions easily run into energy barriers, resulting in increased loss [167]. Previous sections focus on transporting solutions from different regions to the same basin through mode connectivity or alignment. This section focuses on fusing convex combinations of solutions in the same basin, which brings the merged solution closer to the midpoint (optimum) of the basin, with better generalization performance than the endpoints, as in SWA [99], model soup [239], etc. The models discussed in this section include the following cases:

  • Multiple similar models with certain differences.

  • Multiple models after appropriate fine-tuning on foundation models (e.g., model soup, model arithmetic, etc.).

  • Multiple checkpoints from networks with the same architectures and sharing part of the training trajectory (e.g. SWA [99], tail average [166], etc.).

Accordingly, in this section, we review two kinds of weight-average approaches, “Weight average” and “Average in subspace”. Next, we introduce the representative WA approaches “Model soup”, “Model arithmetic” and “SWA”. The representative approaches are listed in Table 4.

4.1 Weight Average

Because of the high redundancy of neural network parameters, there is usually no one-to-one correspondence between the weights of different neural networks. Accordingly, there is usually no guarantee that WA will perform well by default. For trained networks with widely varying weights, the vanilla average performs poorly [204]. From a statistical point of view, WA allows the individual parameters of the model to be controlled, which reduces the variance of the final model and yields reliable regularization properties and outputs [77, 166].

Table 4: Summary of representative methods and formulas of weight average.

| Method | Ref. | Formula | Introduction |
|---|---|---|---|
| choose the best | [239] | $\operatorname{argmax}_{i}ValAcc(W_{i})$ | simple but without the advantages of WA |
| vanilla average | [204] | $W=\sum\lambda_{i}W_{i}$ | often have bad performance |
| Fisher | [159] | $W=\frac{\sum\lambda_{i}f_{i}w_{i}}{\sum\lambda_{i}f_{i}}$ | maximize joint likelihood of the posterior distribution |
| RegMean | [109] | $W=\left(\sum X_{i}^{T}X_{i}\right)^{-1}\sum\left(X_{i}^{T}X_{i}W_{i}\right)$ | minimize differences between merged model and individual models |
| MLP fusion | [232] | $W=\sum\left[\sigma\left(\boldsymbol{X}W_{1,\cdot,i}+\boldsymbol{b}_{1,i}\mathbf{1}\right)W_{2,i,\cdot}\right]+\mathbf{1}\boldsymbol{b}_{2}^{T}$ | cluster the sub MLPs via NTK |
| BTM | [135] | $W=\sum\lambda_{i}W_{i}$ | combine the expert LMs on different domains of corpora |
| PAPA | [110] | $W\leftarrow\text{Averaging}(W,N)$ | average a mass of models trained on slightly different datasets |
| ratatouille | [188] | $W=\sum\lambda_{i}\left(w_{i},\phi_{featurizer}\right)$ | use the diversity of auxiliary tasks to enrich the diversity of weights |
| Lookahead | [259] | $W_{slow,t+1}=ema(W_{fast})+(1-\alpha)^{t}W_{slow,0}$ | combine fast weight and slow weight |
| SMA | [9] | $W=\frac{t-t_{0}}{t-t_{0}+1}\cdot W_{t-1}+\frac{1}{t-t_{0}+1}\cdot W_{t}$ | conduct tail average in later stages |
| WiSE-FT | [240] | $W=(1-\lambda)\cdot W_{0}+\lambda\cdot W_{ft}$ | interpolation models before and after fine-tuning |
| EWC | [131] | $W=\frac{H_{1}W_{1}+H_{2}W_{2}}{H_{1}+H_{2}}$ | minimize the weight variation required by model fusion |
| gradient information | [65] | $W=\sum\lambda_{i}W_{i}-\frac{1}{i}\nabla X_{gradient}$ | exploit other information in checkpoints |
| PAINT | [95] | $W_{\text{patch}}=\left(1-\sum\lambda_{i}\right)W_{0}+\sum\lambda_{i}W_{\mathrm{ft}}$ | linear interpolation of the models before and after fine-tuning on tasks to be patched |
| HiPro | [147] | $w_{i}=\frac{\sum w\left(\boldsymbol{p}_{i}\right)\mathbb{I}\left(\tau_{j}\in\mathcal{T}_{i}\right)}{\sum\mathbb{I}\left(\tau_{i}\in\mathcal{T}_{i}\right)}$ | obtain classifier weights from the individual prompt and the shared prompt |
| EWR | [37] | $W=\frac{\lambda_{0}\cdot\mathrm{f}_{W_{0}}\cdot W_{0}-\lambda_{1}\cdot\mathrm{f}_{\tau_{1}}\cdot\tau_{1}+\lambda_{2}\cdot\mathrm{f}_{\tau_{2}}\cdot\tau_{2}}{\lambda_{0}\cdot\mathrm{f}_{W_{0}}+\lambda_{1}\cdot\mathrm{f}_{\tau_{1}}+\lambda_{2}\cdot\mathrm{f}_{\tau_{2}}}$ | use Fisher to combine model with task vectors |
| experts merging | [102] | $W=W_{\text{pre}}+\left(\sum\lambda_{i}\tau_{i}\right)$ | find efficient fine-tuning via adapters to train experts |

First, the weights of neural networks can be merged directly. Generally speaking, the linear interpolation of two well-trained models in different regions does not necessarily produce a well-performing model, because of the nonlinear structure of neural networks [167]. However, since the solutions before and after fine-tuning usually lie within one basin [95, 240], linearly interpolating them can improve the accuracy of the fused model and its robustness to distribution shift, as Eq.(23):

$$W=(1-t)\cdot W_{0}+t\cdot W_{ft}. \tag{23}$$
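Eq.(22) and its two-model special case Eq.(23) amount to a parameter-wise weighted sum. The following is a minimal sketch, assuming every model is a dictionary that maps parameter names to flat lists of floats with identical keys and lengths; the parameter names and values are hypothetical.

```python
def weighted_average(models, lambdas):
    """Parameter-wise sum_i lambda_i * W_i (Eq. 22).

    All models must share the same parameter names and sizes.
    """
    merged = {}
    for name in models[0]:
        size = len(models[0][name])
        merged[name] = [
            sum(lam * model[name][k] for lam, model in zip(lambdas, models))
            for k in range(size)
        ]
    return merged


def interpolate(w_0, w_ft, t):
    """Eq. (23): (1 - t) * W_0 + t * W_ft."""
    return weighted_average([w_0, w_ft], [1.0 - t, t])


# Hypothetical weights before (w_0) and after (w_ft) fine-tuning.
w_0 = {"layer.weight": [0.0, 2.0, -1.0], "layer.bias": [1.0]}
w_ft = {"layer.weight": [1.0, 4.0, -3.0], "layer.bias": [0.0]}
w_mid = interpolate(w_0, w_ft, t=0.5)
```

With `t=0.5` the result is the midpoint of the two solutions; `t=0` returns the original weights and `t=1` the fine-tuned ones, so `t` can be swept to trade off accuracy on the fine-tuning task against robustness.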

In addition to simple linear interpolation, the fusion of weights can be cast as other forms of aggregation. Matena et al. [159] propose Fisher merging, which regards model fusion as approximately maximizing the joint likelihood of the posterior distributions over parameters. It uses the Fisher information $F_{i}$ of each model as the posterior precision matrix in a Laplace approximation, so as to obtain the Gaussian approximation $\log p\left(w\mid w_{i},F_{i}\right)$ of the posterior distribution, as Eq.(24):

$$\max_{w}\sum\lambda_{scale}\log p\left(w\mid w_{i},F_{i}\right), \tag{24}$$

where $\lambda_{scale}$ denotes the model-specific scalar hyperparameters. Jin et al. [109] instead minimize the $\ell_2$ distance between the outputs of the merged model and those of the individual models trained on different datasets $\left\langle X_{i},Y_{i}\right\rangle$, which is called Regression Mean (RegMean). Accordingly, the optimization problem can be converted into a linear regression problem as Eq.(25):

$$\min_{W}\left\|W^{T}X_{1}-W_{1}^{T}X_{1}\right\|^{2}+\left\|W^{T}X_{2}-W_{2}^{T}X_{2}\right\|^{2}. \tag{25}$$
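A short worked step shows why Eq.(25) admits a closed-form solution. It assumes the convention of Eq.(25), in which the columns of $X_{i}$ are input samples and the norm is the Frobenius norm; the symbol $G_{i}$ for the Gram matrix is introduced here only for illustration. Setting the gradient with respect to $W$ to zero gives

$$2\sum_{i}X_{i}X_{i}^{T}\left(W-W_{i}\right)=0\quad\Longrightarrow\quad W=\left(\sum_{i}G_{i}\right)^{-1}\sum_{i}G_{i}W_{i},\qquad G_{i}=X_{i}X_{i}^{T},$$

provided that $\sum_{i}G_{i}$ is invertible. If samples are stored as rows instead, $G_{i}$ becomes $X_{i}^{T}X_{i}$, which matches the RegMean entry in Table 4.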

Compared with Fisher merging [159], RegMean obtains the inner product matrices of the linear layer inputs during the forward pass, which improves efficiency. Besides, Wei et al. [232] regard each layer of a multi-layer perceptron (MLP) as a distribution over the corresponding weights. The sub-MLPs can be clustered via a neural tangent kernel (NTK) approximation, which can be solved with GWB [179]. Moreover, other works average the weights of multiple experts [135] or leverage Bayesian algorithms [254] to improve generalization and efficiency.
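For Fisher merging, the Fisher entry in Table 4 gives an element-wise weighted average when the Fisher information is approximated by its diagonal. The sketch below illustrates that formula under stated assumptions: weights and diagonal Fisher estimates are flat lists, the Fisher values are non-negative, and the small constant `eps` is an added safeguard against division by zero that is not part of the formula in the text.

```python
def fisher_merge(weights, fishers, lambdas=None, eps=1e-12):
    """Element-wise W = sum_i(lam_i * f_i * w_i) / sum_i(lam_i * f_i).

    weights[i] and fishers[i] are flat lists of equal length for model i.
    """
    if lambdas is None:
        lambdas = [1.0] * len(weights)
    merged = []
    for k in range(len(weights[0])):
        num = sum(lam * f[k] * w[k]
                  for lam, f, w in zip(lambdas, fishers, weights))
        den = sum(lam * f[k] for lam, f in zip(lambdas, fishers))
        merged.append(num / (den + eps))
    return merged


# Two hypothetical models with two parameters each.
weights = [[1.0, 0.0], [3.0, 2.0]]
fishers = [[3.0, 1.0], [1.0, 1.0]]  # model 0 is more certain about parameter 0
merged = fisher_merge(weights, fishers)
```

The first merged parameter is pulled towards model 0, whose Fisher value is larger, while the second is the plain mean because both models report the same Fisher value. With equal Fisher values everywhere, the rule reduces to the vanilla average.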

Also, some recent work focuses on increasing the diversity of models with well-behaved and varied weights. PopulAtion Parameter Averaging (PAPA) [110] starts from the same initialization, trains each model on a slightly different variation of the data (e.g., data orderings, augmentations, regularizations, etc.), and averages these models every few epochs. It is equivalent to training with a larger batch size, which helps improve the generalization of the model [86]. Another possible interpretation is that PAPA fuses the models under better initial conditions by improving the cosine similarity between networks (from 29%-61% to 95%-99%), which is similar to some work on alignment [3]. Based on the idea of maximizing the diversity of weights, Rame et al. [188] fine-tune the base model multiple times on different auxiliary tasks and then fine-tune these auxiliary weights again, so as to obtain a variety of weights. Gao et al. [65] utilize development data and a softmax-normalized logarithm with temperature to adjust the parameters. The models are re-parameterized and updated iteratively to ensure normalization, which can reduce overfitting and increase robustness. In addition, the mean of the gradient information $\nabla X_{gradient}$ can be used to optimize WA [65]. Let $\eta$ be the step size. The merged model is shown as Eq.(26):

$$W=\sum\lambda_{i}W_{i}-\eta\nabla X_{gradient}. \tag{26}$$

Next, from the perspective of iterative averaging, we can average the weights at different times during the training of the same model or of architecturally identical models [65, 149, 131]. This reduces the variance and updates the model more smoothly, but the averaged weights need to share a portion of the training history [207]. Early iterative averaging suffers from a slow convergence rate [183, 194], especially for high-dimensional problems. Geometric Polyak-Ruppert averaging [166] then uses a weighted average instead of a uniform average, with weights that decay geometrically. It uses regularization properties (control of the deviation characteristics of the corresponding SGD estimators) to produce stable fusion results. Geometric Polyak-Ruppert averaging helps to capture the overall trend of the gradient when training conditions are poor. In contrast, tail averaging [101] is more appropriate when data conditions are good. Tail averaging averages the weights of each iteration during the last period of training, which prevents large parameter fluctuations in the late stage. When the model is close to convergence, the tail part of the gradient may contain information closer to the real gradient. Moreover, many factors in iterative averaging (e.g., decaying step size [183], constant step size [165], the form of linear interpolation, etc.) affect the final result. Further, checkpoint averaging [91, 25, 226, 149] uses checkpoints from the same training run.

Nevertheless, simple coordinate-wise weight averaging may result in poor performance. Hierarchical aggregation improves model performance by combining the parameters of multiple models at different layers or structures. The network architectures suitable for a specific aggregation approach are limited [254, 159], so recursively processing layers with matched averaging may affect the final performance. Wang et al. [227] propose a hierarchical aggregation scheme. The server obtains the first-layer weights of the models and broadcasts them to the clients, which continue to train all subsequent layers with the matched layers frozen. This procedure is then repeated until the last layer before aggregation. Hierarchical Prompt learning (HiPro) [147] constructs a hierarchical task tree and averages the classifier weights generated from the global prompt and the individual prompts $\boldsymbol{p}_{i}$. The averaged classifier weights on the $i$-th task $\tau_{i}$ are shown as Eq.(27):

$$W_{i}=\frac{\sum W\left(\boldsymbol{p}_{i}\right)\mathbb{I}\tau_{j}}{\sum\mathbb{I}\tau_{i}}, \tag{27}$$

where $\mathbb{I}$ is the indicator function. Its layer-wise structure helps to gain knowledge of diverse granularity. Some other works [203, 186] propose layer-wise, module-wise and matrix-wise parameter division, which reduces the cost of computation and storage and inspires more directions for WA.

Further, WA is often used in weight scaling rules, which average the predictions of the distribution over the weights [164, 209]. To ensure the efficiency of model averaging, Akhlaghi et al. [5] propose that activation functions should restrict postsynaptic activity to a limited range (e.g., sigmoid, hyperbolic tangent, etc.). Leontev et al. [131] propose further constraints: the network should generate presynaptic activity in the presence of native features, and the mean of the probability distribution of the weights should be zero [13]. In addition, heterogeneous networks can be approximated by introducing additional zero-valued weights [131].

4.2 SWA

Table 5: Comparison of characteristics of weight average methods based on SWA.

| Method | Ref. | Introduction |
|---|---|---|
| SWA | [99] | weight can be manually weighted after training |
| EMA | [242, 115] | smoothing model weights |
| EWA | [93] | improve the performance without increasing inference delay and weights |
| SWAG | [155] | approximate Bayesian model averaging in Bayesian DL and achieve state-of-the-art uncertainty calibration results in various settings |
| SWALP | [248] | match the performance of SGD training with quantized parameters |
| SWAP | [78] | speed up the training of NN by using large batch size |
| SWAD | [21] | improve the OOD generalization performance of DNNs |
| LAWA | [113] | record up-to-date checkpoints at the end of each epoch |
| HWA | [75] | combine online WA and offline WA |
| PSWA | [77] | find high-quality local optima quickly |
| TWA | [137] | conduct subspace training to implicitly adjust the averaging coefficients and better approach the minima |

Inspired by Fast Geometric Ensembling (FGE) [66] and checkpoint averaging [149], Izmailov et al. [99] utilize a constant or cyclical learning rate to average multiple points along the SGD trajectory, which is referred to as SWA. SWA improves training on a series of important baselines and provides better time scalability. Instead of training a set of models like vanilla fusion, SWA trains a single model to find smoother solutions than SGD. In Table 5, we list the approaches related to SWA. Also, SWA can be applied to any architecture or dataset and demonstrates better performance than snapshot ensemble (SSE) [91] and FGE. At the end of each cycle, the SWA model $W_{\text{SWA}}$ is updated by averaging the newly obtained weights with the existing weights, as shown in Eq.(28):

$$W_{\text{SWA}} \leftarrow \frac{W_{\text{SWA}} \cdot n + W}{n+1}. \tag{28}$$
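The running update in Eq.(28) can be sketched in a few lines. This is only an illustrative sketch: plain Python lists stand in for parameter tensors, and the function name `swa_update` and the toy checkpoints are assumptions, not part of the original SWA implementation.

```python
def swa_update(w_swa, w_new, n):
    """Eq. (28): fold a newly sampled weight vector into the running average.

    w_swa: current average of the n weight vectors collected so far
    w_new: weights sampled at the end of the current cycle
    """
    return [(a * n + b) / (n + 1) for a, b in zip(w_swa, w_new)]


# Toy checkpoints standing in for weights sampled along the SGD trajectory.
checkpoints = [[1.0, 4.0], [3.0, 0.0], [5.0, 2.0]]

w_swa = checkpoints[0]
for n, w in enumerate(checkpoints[1:], start=1):
    w_swa = swa_update(w_swa, w, n)

# The running update reproduces the plain arithmetic mean of all checkpoints.
mean = [sum(col) / len(checkpoints) for col in zip(*checkpoints)]
```

The incremental form only keeps one extra copy of the weights in memory, which is why it is convenient for large models.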

Nevertheless, SWA can only average the points near a local optimum, and it finally reaches a relatively small value rather than accurately approximating the optimum. Also, the deviation of the sampled weights could be large, or the samples insufficient, due to several factors (e.g., poor convergence at the early stage, a large learning rate, a fast rate of weight change, etc.), which degrades the overall result. A good deal of work changes the sampling schedule of SWA. For example, SWA-Densely (SWAD) [21] samples more densely to solve the problem of insufficient random weights. Periodic-SWA (PSWA) [77] is initialized during the early stage of SGD instead of in the late convergence phase like SWA. Latest weight averaging (LAWA) [113] averages only the latest checkpoints collected at the end of each epoch, given the large weight variation during the initial training phase. In Figure 4, we summarize several ways to optimize SWA with different sampling schedules.

Figure 4: Comparison of sampling and learning rate schedules of different SWA-related methods. (a) SWA: constant learning rates. (b) SWA: cyclical learning rates $c$. (c) SWAD: dense sampling. (d) HWA: leverages both online and offline WA, which are sampled at different synchronization cycles with a sliding window of length $h$, i.e., $\overline{\overline{w_{i}}}=\frac{\sum_{t=i-h+1}^{i}\overline{w_{t}}}{h}$.

Some work based on SWA optimizes the aggregation process to obtain competitive results. SWA in Low-Precision (SWALP) [248] reduces the influence of quantization noise and low learning rates so as to converge to the optimum. SWA-Gaussian (SWAG) [155] fits a Gaussian distribution to the points of SWA and then averages the Bayesian models sampled from this distribution. Trainable Weight Averaging (TWA) [137] adjusts the fused solution to better approximate the minimum by projecting the gradient onto a subspace as Eq.(29):

$$W_{\mathrm{TWA}} \leftarrow W_{\mathrm{TWA}} - \eta_{l}\boldsymbol{B}\left(\boldsymbol{B}^{\top} g\right), \tag{29}$$

where $\boldsymbol{B}$ denotes the matrix of a set of base vectors, $\eta_{l}$ is the learning rate, and $g$ is the gradient. TWA can eliminate errors caused by static averaging in the full parameter space. Different from the above approaches, Hierarchical Weighted Average (HWA) [75] combines online and offline WA into a common training framework: online WA is designed to speed up convergence, whereas offline WA tends to increase generalization performance, and HWA combines the advantages of both. Similar to SWA, Exponential Moving Average (EMA) [184, 214] is often used to smooth the model weights in order to reduce the noise and volatility of weight updates as Eq.(30):

$$W_{EMA} \leftarrow \lambda_{d} W_{EMA} + (1-\lambda_{d}) W, \tag{30}$$

where $\lambda_{d}$ denotes the decay rate ($\approx 0.99$). Some recent work [19] combines KD with EMA, using the EMA weights (e.g., of student models [217] or branches [242]) as teacher models to transfer knowledge. Huang et al. [93] replace the networks with Mixture-of-Experts (MoEs) [201] and perform EMA on the MoEs at the end of each iteration, which improves generalization on a variety of 2D and 3D vision tasks with ViT architectures. Arpit et al. [9] propose the simple moving average (SMA), which performs a moving average in the later stages of training (after $t_{0}$ iterations) to improve out-of-domain performance as Eq.(31):

$$\hat{W}_{t}=\begin{cases} W_{t}, & t \leq t_{0} \\ \frac{t-t_{0}}{t-t_{0}+1}\hat{W}_{t-1}+\frac{1}{t-t_{0}+1}W_{t}, & t > t_{0} \end{cases} \tag{31}$$
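As a rough illustration of Eq.(30) and Eq.(31), the two moving averages can be written as small update functions. The names `ema_update` and `sma_update`, the use of Python lists as weight vectors, and the toy trajectory are assumptions made for this sketch only.

```python
def ema_update(w_ema, w, decay=0.99):
    """Eq. (30): exponential moving average with decay rate lambda_d."""
    return [decay * a + (1.0 - decay) * b for a, b in zip(w_ema, w)]


def sma_update(w_hat_prev, w_t, t, t0):
    """Eq. (31): keep the raw weights up to step t0, then average uniformly."""
    if t <= t0:
        return list(w_t)
    k = t - t0  # number of iterates already folded into the average
    return [(k * a + b) / (k + 1) for a, b in zip(w_hat_prev, w_t)]


# Toy one-dimensional trajectory W_1, ..., W_5 (hypothetical values).
trajectory = [[10.0], [8.0], [4.0], [2.0], [0.0]]
t0 = 2

w_hat = None
for t, w_t in enumerate(trajectory, start=1):
    w_hat = sma_update(w_hat, w_t, t, t0)
```

In this toy run the SMA ends up as the uniform mean of the iterates from step $t_0$ onward, whereas EMA would weight recent iterates more heavily and never fully forget earlier ones.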

The Lookahead algorithm [259] linearly interpolates fast and slow weights along the optimization trajectory as Eq.(32):

$$w_{slow,t+1}=\alpha\left[w_{fast,t}+(1-\alpha)\,w_{fast,(t-1)}+\ldots+(1-\alpha)^{(t-1)}w_{fast,0}\right]+(1-\alpha)^{(t)}w_{slow,0}, \tag{32}$$

where $\alpha$ denotes the interpolation coefficient (step size) of the slow weights.

The trajectories of the fast weights $w_{fast,t}$ are updated quickly by EMA in the direction of low curvature. The slow weights $w_{slow,t}$ smooth the oscillations by interpolating the parameters. Lookahead reduces variance, speeds up convergence, and brings the results closer to the regions with high test accuracy.
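The interpolation behind Lookahead can be checked numerically. The sketch below is a simplification under stated assumptions: the fast weights at the end of each inner loop are given as fixed toy values (a real optimizer would restart the fast weights from the slow weights after every synchronization), and `alpha`, `lookahead_slow_step`, and the data are hypothetical names and values.

```python
def lookahead_slow_step(w_slow, w_fast, alpha):
    """Move the slow weights a fraction alpha toward the latest fast weights."""
    return [s + alpha * (f - s) for s, f in zip(w_slow, w_fast)]


alpha = 0.5
w_slow0 = [0.0]
# Hypothetical fast weights reached at the end of each inner loop.
fast_weights = [[4.0], [2.0], [8.0]]

w_slow = w_slow0
for w_fast in fast_weights:
    w_slow = lookahead_slow_step(w_slow, w_fast, alpha)

# Unrolling the recursion gives an exponentially weighted sum of the past
# fast weights plus the decayed initial slow weights.
T = len(fast_weights)
closed_form = [
    alpha * sum((1 - alpha) ** (T - 1 - k) * fast_weights[k][0] for k in range(T))
    + (1 - alpha) ** T * w_slow0[0]
]
```

The agreement between the recursion and the unrolled sum shows that the slow weights behave like an exponential moving average of the fast weights.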

4.3 Model Soup

Table 6: Summary of different methods of Model Soup.
Method Ref. Introduction
Uniform Soup [239] average the fine-tuned models directly
Greedy Soup [239] simple operation, good performance
Learned Soup [239] high memory cost (especially in large-scale model)
Sparse Soup [269] flexible and transparent; alleviates scaling issues
Adversarially-robust soup [34] improve adversarial robustness to multiple threat models
Rewarded Soup [189] merge networks according to user preferences
DiWA [190] leverage the full potential of WA
Fed Soup [26] alleviate overfitting and seek flat minima
Adapter Soup [32] maintain performance on in-domain and new domains.

Model soup [239] refers to the method of averaging the models fine-tuned with different hyperparameters. It is simple but effective, achieving an accuracy of 90.94% on ImageNet-1K, which surpasses the previous work on CoAtNet-7 (90.88%) [38] and ViT-G (90.45%) [255]. In Table 6, we summarize the different soups. Model soup reduces the inference time required by ensemble learning, $\frac{1}{n}\sum_{i=1}^{n}f\left(x,W_{i}\right)$ [195], and includes three soups as follows. The uniform soup averages all the weights of the models directly, $f\left(x,\frac{1}{n}\sum_{i=1}^{n}W_{i}\right)$. The greedy soup adds the models to the soup in sequence, keeping a model in the soup if the accuracy on the validation set does not decrease; it performs the best of the three soups, as Eq.(33):

$$\text{ingredients} \leftarrow \text{ingredients} \cup \left\{W_{i}\right\} \ \text{ if } \ \mathrm{Acc}\left(\mathrm{Avg}\left(\text{ingredients} \cup \left\{W_{i}\right\}\right)\right) \geq \mathrm{Acc}\left(\mathrm{Avg}\left(\text{ingredients}\right)\right). \tag{33}$$
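The greedy rule in Eq.(33) can be sketched as follows. This is a minimal sketch, not the reference implementation: the candidates are assumed to be sorted by decreasing individual validation accuracy beforehand, and `toy_val_acc` is a hypothetical stand-in for evaluating a model on a held-out validation set.

```python
def average(models):
    """Uniform average of a list of weight vectors."""
    return [sum(col) / len(models) for col in zip(*models)]


def greedy_soup(models, val_acc):
    """Eq. (33): add a model only if held-out accuracy does not decrease.

    models are assumed to be sorted by decreasing individual accuracy;
    val_acc maps a weight vector to its validation accuracy.
    """
    ingredients = [models[0]]
    for w in models[1:]:
        if val_acc(average(ingredients + [w])) >= val_acc(average(ingredients)):
            ingredients.append(w)
    return average(ingredients)


# Hypothetical stand-in for validation accuracy: closer to `target` is better.
target = [1.0, 1.0]


def toy_val_acc(w):
    return -sum((a - b) ** 2 for a, b in zip(w, target))


candidates = [[1.2, 0.9], [0.8, 1.1], [3.0, 3.0]]
candidates.sort(key=toy_val_acc, reverse=True)
soup = greedy_soup(candidates, toy_val_acc)
```

In this toy run the first two candidates are kept and the outlier is rejected, because adding it would lower the score of the averaged weights.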

Greedy soups [239] can be regarded as another form of SWA [99], which takes a subset of weights as the input samples of SWA. The learned soup removes the sequential constraint of the greedy soup: it learns the mixing coefficients $\lambda_{mix}$ and the temperature scaling parameter $\lambda_{temp}$ on the validation set, and optimizes the soup by gradient-based optimization as Eq.(34):

$$\underset{\lambda_{mix}\in\mathbb{R}^{k},\,\lambda_{temp}\in\mathbb{R}}{\arg\min}\ \sum_{j=1}^{n}\ell\left(\lambda_{temp}\cdot f\left(x_{j},\sum_{i=1}^{n}\lambda_{mix,i}W_{i}\right),y_{j}\right). \tag{34}$$

The adversarially-robust model soup [34] moves within the convex hull of the parameters of each classifier to adjust the weights of the soup, in order to balance the robustness to different threat models and adapt to potential attacks. Based on reinforcement learning from human feedback (RLHF), rewarded soup [189] fine-tunes the models according to diverse rewards. It selects the interpolating coefficients $\left\{\lambda_{i}^{j}\right\}_{i=1}^{N}$ from the $N$-simplex that maximize the reward $\hat{R}$ as Eq.(35):

$$\operatorname{argmax}_{j=1}^{n}\ \hat{R}\left(\sum_{i=1}^{N}\lambda_{i}^{j}W_{i}\right). \tag{35}$$

4.4 Model Arithmetic

Different from traditional single-task learning, MTL is a kind of joint learning. Multiple tasks are learned in parallel so as to take advantage of the data resources of different tasks [261, 44]. In general, MTL can be regarded as a form of parameter sharing, or ensembling [42], that includes the major information of multiple individual tasks. In the process of MTL, participants fine-tune the latest model on the corresponding task in each iteration. The multiple fine-tuned models are then merged to produce the final model or the base model for the next iteration [29, 44]. The general fusion method adopted in MTL is linear combination. Patching with interpolation (PAINT) [95] combines the fine-tuned and initial models so as to improve performance on a specific task while maintaining accuracy on other tasks. PAINT reduces the time of migration and adaptation between multiple tasks. HiPro [147] explores the shared information from plenty of tasks via a hierarchical structure, which adapts pre-trained vision-language models (VLMs) to multiple downstream tasks. In addition, some other approaches group similar tasks together, which makes it convenient to obtain shared model parameters [210, 53, 158, 48]. Moreover, recent work sets up metrics to measure the performance of the shared model, such as uncertainty to weight tasks [117], loss weighting strategies [128], etc. Huang et al. [90] introduce Low-rank adaptations Hub (LoraHub), a framework that ensembles LoRA modules trained on different given tasks, which improves flexibility and scalability in MTL.

In MTL, the pre-trained model and task vectors (i.e., $\tau_{i}=W_{ft}-W_{pre}$, the difference between the fine-tuned model and the pre-trained model) are combined to achieve better performance on all tasks. Based on this observation, task arithmetic [94] improves the performance of the model on tasks by adding and linearly combining fine-tuned task vectors, which has become a flexible and efficient method for editing pre-trained models directly, as shown in Figure 5. Ortiz et al. [174] fine-tune the pre-trained model in the tangent space and provide a more reliable way to edit the pre-trained model by NTK linearization [100], improving task arithmetic significantly by reducing the accuracy gap of individual tasks [205]. Similar to task arithmetic, Daheim et al. [37] propose elastic weight removal (EWR), which calculates difference vectors between original models and expert models (fine-tuned on positive behaviours). EWR uses Fisher information [159] to average the weights of the model and the task vectors as Eq.(36):

$$W=\frac{\lambda_{0}\cdot\mathrm{f}_{W_{0}}\cdot W_{0}-\lambda_{1}\cdot\mathrm{f}_{\tau_{1}}\cdot\tau_{1}+\lambda_{2}\cdot\mathrm{f}_{\tau_{2}}\cdot\tau_{2}}{\lambda_{0}\cdot\mathrm{f}_{W_{0}}+\lambda_{1}\cdot\mathrm{f}_{\tau_{1}}+\lambda_{2}\cdot\mathrm{f}_{\tau_{2}}} \tag{36}$$

It combines Fisher merging and task arithmetic to preserve positive behaviours in the model while removing negative behaviours. Jang et al. [102] add the sum of the vectors of particular experts to the pre-trained language models (LMs) so as to cover the information from multiple experts trained on diverse tasks. In sum, the essence of task arithmetic is to preserve pre-trained model behavior, thereby avoiding expensive joint fine-tuning on multiple tasks [95, 135, 240].
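The basic operations of task arithmetic, building a task vector and adding or negating it, can be illustrated with toy weights. The function names, the scaling coefficients, and the weight values below are assumptions made for this sketch; real task vectors are computed per parameter tensor of the network.

```python
def task_vector(w_ft, w_pre):
    """tau = W_ft - W_pre: what fine-tuning changed relative to the base."""
    return [a - b for a, b in zip(w_ft, w_pre)]


def apply_task_vectors(w_pre, taus, coeffs):
    """Edit the pre-trained weights with a linear combination of task vectors.

    A positive coefficient adds a task's behaviour; a negative one removes it.
    """
    w = list(w_pre)
    for tau, lam in zip(taus, coeffs):
        w = [a + lam * t for a, t in zip(w, tau)]
    return w


# Hypothetical toy weights.
w_pre = [1.0, 1.0, 1.0]
w_task_a = [2.0, 1.0, 1.0]  # fine-tuned on task A
w_task_b = [1.0, 1.0, 3.0]  # fine-tuned on task B

tau_a = task_vector(w_task_a, w_pre)
tau_b = task_vector(w_task_b, w_pre)

multi_task = apply_task_vectors(w_pre, [tau_a, tau_b], [1.0, 1.0])
forget_b = apply_task_vectors(w_pre, [tau_b], [-1.0])
```

Because the edit is a plain linear combination, no joint fine-tuning is needed; in practice the coefficients are usually tuned on held-out data.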

Figure 5: The flow chart of Task Arithmetic and LoraHub [90] in multi-task scenarios.

4.5 Average in Subspace

Due to the large dimension of the conventional full-parameter space, ranging from tens of millions to hundreds of millions of dimensions, model fusion in a subspace constrains the training trajectory to a low-dimensional subspace so as to reduce the load and difficulty [132, 138, 73, 136]. In general, DNNs are over-parameterized. The Low-dimensional Trajectory Hypothesis [138] speculates that the intrinsic dimension required for network training is not as large as the number of parameters. In a subspace, the number of trained parameters and the redundant information are reduced, which can accelerate convergence and improve robustness and generalization [136, 138]. Recently, Li et al. [137] demonstrate that each point in the subspace corresponds to a base. The linear combination of bases is equivalent to a weighted average [132]. Liu et al. [145] extract submodels by sparse training to fuse multiple local models in a low-dimensional subspace. Leontev et al. [131] employ Elastic Weight Consolidation (EWC) to average the models in multi-dimensional space as Eq.(37):

$$W=\frac{H_{1}W_{1}+H_{2}W_{2}}{H_{1}+H_{2}}, \tag{37}$$

where $H_{i}=\mathbb{E}_{p(x\mid w)}\left[\left(\frac{\partial L}{\partial w_{i}}\right)^{2}\right]$ represents the Hessian matrix. EWC changes the weights of individual models in the direction of the minimum change in the loss function so as to prevent catastrophic forgetting [121].
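A per-parameter reading of Eq.(37) is sketched below, assuming that the importance terms $H_i$ are diagonal and estimated as mean squared gradients. The function names, the small `eps` guard, and the toy gradients are assumptions for illustration and are not taken from [131].

```python
def importance_weighted_average(weights, importances, eps=1e-12):
    """Eq. (37): per-parameter average weighted by the importance H_i."""
    merged = []
    for params, hs in zip(zip(*weights), zip(*importances)):
        total = sum(hs) + eps  # eps guards against an all-zero importance
        merged.append(sum(h * w for h, w in zip(hs, params)) / total)
    return merged


def diagonal_importance(per_sample_grads):
    """Estimate H as the mean squared gradient over samples."""
    n = len(per_sample_grads)
    return [sum(g * g for g in col) / n for col in zip(*per_sample_grads)]


w1, w2 = [1.0, 0.0], [3.0, 4.0]
h1 = diagonal_importance([[1.0, 0.0], [1.0, 0.0]])  # second parameter unused
h2 = diagonal_importance([[1.0, 2.0], [1.0, 2.0]])
merged = importance_weighted_average([w1, w2], [h1, h2])
```

Where both models consider a parameter equally important, the result is the plain mean; where only one model relies on a parameter, that model's value is kept almost unchanged.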

However, there are difficulties in applying WA in subspaces, such as the low efficiency of random bases [132] or expensive computation costs [138]. Moreover, when working with high-dimensional or large models, the projection matrix for projecting the gradient into the subspace can be too large for a single GPU to hold. Wortsman et al. [238] provide a way to learn model subspaces in supervised learning. Gaya et al. [67] learn a convex subspace for online adaptation in reinforcement learning. In short, how to explore the mechanism of vanilla averaging in subspaces, given the numerous examples of training DNNs in subspaces, is a challenge for the future.

4.6 Discussion

WA obtains the final model by averaging the weights of different deep models without additional computational complexity or training processes [109, 159]. In general, if random models differ significantly in representation capability, structure, or training data, the result of fusion may not achieve the expected performance. The linear interpolation of models trained from scratch with the same hyperparameter configuration but different data orders is even less effective than stochastic models [59]. Therefore, a large number of approaches described in this section aim to optimize the WA process in other mathematical ways. Further, when models share part of their optimization trajectories (e.g., checkpoint averaging, tail averaging, SWA [99, 149], etc.) or are fine-tuned from the same pre-trained model (e.g., model soup [239], etc.), the interpolated models achieve better accuracy [167]. Moreover, model soup [239] averages models with different hyperparameter configurations to get the final result. In addition, the selection of proper weights in model averaging can also be a challenge, which is often fraught with subjectivity. More complex weight selection mechanisms may need plenty of trials and cross-validation.

WA is a promising technique in DL, which can be used as model optimization techniques in the future to reduce the weight fluctuation between different iterations, and improve the stability and convergence rate. WA can improve the aggregation stage of FL to protect privacy better and reduce communication costs in the future. Moreover, it is expected to reduce the storage space and computing overhead of the model on resource-constrained devices by implementing network compression on the terminal devices [250]. In short, WA is a promising and cost-effective DL technique, which can be applied in areas such as FL to improve performance and reduce storage overhead.

5 Ensemble Learning

Ensemble learning, or multi-classifier system, is a technique that integrates multiple single models to generate final predictions, including voting, averaging [195], etc. It improves overall performance and reduces the variance of the models, addressing issues such as overfitting, instability, and limited data volume. In this section, we discuss “Ensemble learning” in DL and the related technique “Model reuse”.

5.1 Ensemble Learning

Ensemble learning combines the outputs of networks, which surpasses the result obtained from any model alone [198, 7, 225]. The general WA averages the model weights, that is, $f\left(x,\frac{1}{n}\sum_{i=1}^{n}W_{i}\right)$, which ends up with only one model. In contrast, ensemble learning averages the output values after inference, $\frac{1}{n}\sum_{i=1}^{n}f\left(x,W_{i}\right)$, which requires keeping multiple models [239]. Ensemble learning has a long history of research. There are plenty of typical algorithms, such as Adaboost [62], Bagging [15], Stacking [236], etc.
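The difference between averaging weights and averaging outputs can be seen on a toy nonlinear model. The model `f`, its single scalar weight, and the numerical values below are hypothetical and serve only to show that the two expressions generally disagree.

```python
import math


def f(x, w):
    """Toy one-parameter nonlinear model (hypothetical): tanh(w * x)."""
    return math.tanh(w * x)


models = [0.5, 4.0]  # weights W_1, W_2 of two trained models
x = 1.0

# Weight averaging: one forward pass through the averaged weights.
wa_output = f(x, sum(models) / len(models))

# Ensembling: one forward pass per model, then average the outputs.
ens_output = sum(f(x, w) for w in models) / len(models)
```

For a linear model the two quantities coincide; the nonlinearity is what makes them differ, and it is also why the ensemble needs $n$ forward passes while WA needs only one.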

In order to improve the generalization ability of networks, some previous work [80, 16] applies ensemble learning (e.g., random forest, etc.) to DNNs, which can be used to adjust the output and take full advantage of feature selection and noise filtering. Kontschieder et al. [123] propose deep neural decision forests, which use the random decision function in the optimization algorithm of CNNs to reduce the complexity of parameters. Zhou et al. [267] introduce a decision-tree ensemble approach to demonstrate the possibility of building models without backpropagation, which needs fewer hyperparameters than a typical deep neural network. Moreover, Dropout [209] typically needs to ensemble the outputs of all sub-nets to reduce prediction errors. Nevertheless, if multiple models are too similar, the predictions of different networks will be too close for ensemble learning to be useful. To find sufficiently diverse models, snapshot ensemble [91] uses long learning rate cycles, combining the predictions of multiple neural networks saved at the end of each learning rate cycle to produce one final result. As an improvement on snapshot ensemble, FGE [66] uses a piecewise linear cyclical learning rate and smaller steps to find models along the low-loss path [46], which inspires the relevant work on LMC. Similarly, Laine et al. [126] ensemble the predictions over multiple previous training epochs. Arpit et al. [9] ensemble a set of independent models and their corresponding moving average models, which is referred to as ensemble of averages (EoA) as Eq.(38):

$$\hat{y}=\arg\max_{n}\ \operatorname{Softmax}\left(\sum f\left(x;\hat{W}_{i}\right)\right)_{n} \tag{38}$$
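A minimal sketch of the prediction rule in Eq.(38) follows, assuming that each moving-average model has already produced a logit vector for the input. The helper names and the logit values are hypothetical.

```python
import math


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]


def eoa_predict(logits_per_model):
    """Sum the logits of the moving-average models, then pick the top class."""
    summed = [sum(col) for col in zip(*logits_per_model)]
    probs = softmax(summed)
    return max(range(len(probs)), key=probs.__getitem__), probs


# Hypothetical logits f(x; W_hat_i) from three moving-average models.
logits = [[2.0, 1.0, 0.0], [0.0, 3.0, 0.5], [1.0, 1.5, 0.0]]
label, probs = eoa_predict(logits)
```

Since softmax is monotone, the predicted label equals the arg max of the summed logits; the softmax only matters when calibrated probabilities are needed.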

WAK et al. [231] present a distributed robust optimization (DRO) framework to learn from a black-box model, fusing multiple models using a distributed robust optimization approach. Hoang et al. [84] demonstrate the ensemble of black-box experts with no access to the black-box architectures. Besides, a variety of work [135, 126] combines ensemble learning with WA. Ensemble learning in DL achieves remarkable results and is widely used in facial recognition [233], speech recognition [40], and other practical fields.

5.2 Model Reuse

Table 7: Summary of multiple model reuse methods based on model fusion.
Methods Ref. Introduction
FMR [249] reuse the fixed models to reduce the cost during training
PM2R [245] utilize consistency on different modalities
MMR [151] reuse multiple source models
NMMR [153] take advantage of the nonlinear relationship
RKME [244] identify available pre-trained models by specifications in the deployment stage
HMR [243] combine and adjust the output of local models to generate a global model
HMR for ML [216] reuse of biased models trained on local datasets to construct a global model
RKHS [244] does not require calibration
ZhiJian [260] the merge module integrates the features, weights, or predictions of the pre-trained models

Based on existing pre-trained source models, model reuse [266] provides a required model for a new task without having to train a new model from scratch. It can save time and computing resources and provide better performance in the case of limited resources [249]. In addition, because the focus of transfer learning is to solve prediction tasks on the target domain, model reuse can be regarded as a kind of transfer learning. However, transfer learning requires labeled data for both the source and target domains, while in model reuse, only unlabeled data can be collected and data from the source domain cannot be used [153].

Different from multi-classifier ensemble learning, most current approaches reuse existing features, labels, or modalities to obtain the final prediction [176, 266] without storing a large amount of training data [245]. Fixed model reuse (FMR) [249] can essentially be regarded as feature reuse. Based on fixed models or features, FMR decreases the data required during training and provides privacy protection for the fixed components, but it can only use one type of source feature. Jha et al. [105] present the Bag of Experts (BoE) architecture, which reuses annotated data from reusable slots, rather than from a single source domain, to train the target model. Pre-trained multi-model reuse (PM2R) forms the predictions from pre-trained models into matrices and obtains the final predictions based on the consistency among different modalities. However, these types of methods ignore potential information and can only be applied to limited scenarios. Another crucial challenge of model reuse is to identify useful models from a set of pre-trained models for a given learning task. Wu et al. [244] propose the reduced kernel mean embedding (RKME) specification to identify available pre-trained models in the deployment stage. Tang et al. [216] use optional calibration strategies and types of specifications, which combines the advantages of RKME and HMR.

Using a single model for model reuse produces too much homogeneous information (e.g., a model trained in one domain may not fit data in another domain), and it is difficult to find a single pre-trained model that is perfectly suited to the target domain. In general, using a set of similar models produces better performance than a single model, which is denoted as Multiple Model-Reuse (MMR) [153]. Based on MMR, Xiang et al. [245] propose PM2R, which works without training data or validation instances. Heterogeneous model reuse (HMR) [244] first reuses the local models for global predictions and then improves the local models via the multiparty multiclass margin (MPMC-margin). Instead of using the output features or labels, Lou et al. [151] improve the representation and use the hidden layer representation of the source model to train the target deep model, which is superior to approaches that use only the limited data in the target domain. Nevertheless, some MMR methods strictly assume a linear relationship between the source models and the target model, which is difficult to guarantee in practice. Nonlinear multi-model reuse (NMMR) [153] improves performance significantly by introducing a manifold regularization scheme to take advantage of arbitrary nonlinear relationships between the source and target models. We compare the characteristics of different reuse methods in Table 7. Briefly, model reuse can significantly reduce the amount of data required by using pre-trained models, which solves the problem of consuming a lot of bandwidth when transferring data between different ends. Multi-model reuse also has a wide range of applications, such as speech recognition, security and privacy interaction systems, digital retina [64], etc.

5.3 Discussion

Compared with related model fusion algorithms such as federated learning [88, 89, 160], which have certain requirements on model parameters and sizes, ensemble methods use predictions to combine multiple heterogeneous weak classifiers without such limitations. In addition, networks with different architectures in ensemble approaches will have a more obvious comparison effect than weight averaging. Ensemble methods, however, require maintaining multiple trained models and running them together at test time. Given the large scale and complexity of deep learning models, this approach is not suitable for applications with limited computational resources and budgets [204]. Due to the diversity of ensemble learning frameworks, it is possible to achieve model diversity and enhance generalization. In the future, this will be important for dealing with changes in data and adversarial attacks. Ensemble learning in DL is expected to provide confidence estimation and uncertainty measurement for model predictions, which is critical for safety and reliability in decision support systems, autonomous driving [74], medical diagnostics, etc.

6 Application

In recent years, plenty of new research has appeared in the field of deep model fusion, which has also promoted the development of related application fields. Based on the review of the development of model fusion and the current mainstream methods, this section summarizes some representative applications of existing model fusion research: “Federated Learning”, “Fine-tuning”, “Distillation”, and “Model Fusion on Foundation Models/LLMs”. In the future, more work will try to further improve the accuracy and ease of model fusion, and gradually apply model fusion methods to real-world problems.

6.1 Federated Learning

Figure 6: Two aggregation modes of federated learning. Left: Centralized federated learning transfers models or gradients between the central server and the client terminals, and aggregates them on the server. Right: Decentralized federated learning transfers and aggregates models between client terminals without a central server.

With the development of artificial intelligence, mobile devices, edge devices (e.g., IoT devices, sensors, etc.), and cloud computing platforms have access to large amounts of data. However, due to the restrictions of practical scenarios and network bandwidth, it is fraught with risk to collect all data from edge devices [139, 208]. To address the challenges of security and centralization of data storage, FL [160, 170] allows many participants to collaboratively train shared global models while protecting data privacy, without the need to centralize datasets on a central server. It can also be regarded as a multi-party learning problem [177]. In particular, aggregation is a significant procedure of FL, which incorporates model or parameter updates trained by various parties (such as devices, organizations, or individuals). In Figure 6, we demonstrate two different aggregation approaches in centralized and decentralized FL. Because of its efficient use of computing resources and low cost (i.e., no need to transfer entire datasets or maintain local parameters during training, etc.), Federated Averaging (FedAvg) [160] is the most influential FL algorithm. In the process of FedAvg, the local clients update the weights as Eq.(39):

$$\mathbf{w}_{i}^{(t+1)}=\mathbf{w}_{i}^{(t)}-\eta\nabla g_{i}\left(\mathbf{w}_{i}^{(t)},\xi_{i}^{(t)}\right), \tag{39}$$

where $\nabla g_{i}\left(\mathbf{w}_{i}^{(t)},\xi_{i}^{(t)}\right)$ represents the stochastic gradient on the mini-batch $\xi_{i}^{(t)}$ at the $t$-th round [160, 114]. The global model $\mathbf{w}^{(t)}$ is updated as Eq.(40):

$$\mathbf{w}^{(t+1)}=\frac{1}{n}\sum_{i=1}^{n}\mathbf{w}_{i}^{(t)}. \tag{40}$$
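One communication round of this scheme, a local SGD step as in Eq.(39) followed by uniform averaging as in Eq.(40), can be sketched as below. The quadratic toy loss, the single local step per round, the learning rate, and all names are assumptions for illustration; practical FedAvg typically runs several local steps and may weight clients by their data size.

```python
def local_sgd_step(w, grad_fn, batch, lr):
    """Eq. (39): one client-side SGD step on a mini-batch."""
    g = grad_fn(w, batch)
    return [wi - lr * gi for wi, gi in zip(w, g)]


def fedavg_aggregate(client_weights):
    """Eq. (40): the server averages the client weights uniformly."""
    n = len(client_weights)
    return [sum(col) / n for col in zip(*client_weights)]


def grad_fn(w, batch):
    """Gradient of the toy loss 0.5 * (w - mean(batch))**2 (hypothetical)."""
    target = sum(batch) / len(batch)
    return [w[0] - target]


client_data = [[0.0, 2.0], [4.0, 6.0]]  # one private mini-batch per client
w_global = [0.0]
for _ in range(50):
    local = [local_sgd_step(w_global, grad_fn, batch, lr=0.5) for batch in client_data]
    w_global = fedavg_aggregate(local)
```

Only weights travel between clients and server; the raw mini-batches never leave the clients, and the global model still converges to the minimizer of the average client loss in this toy setting.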

Due to heterogeneity (e.g., in data distribution, bandwidth environment, network structure, permutation invariance [50], etc.), a simple aggregation of weights can adversely affect the performance of the final model and put pressure on communication [161]. We list the common aggregation methods in Table 8. Probabilistic federated neural matching (PFNM) [254] uses a Bayesian nonparametric mechanism to adjust the global model size to accommodate the heterogeneity of data, but it can only be applied to simple architectures. FedMA [227] proposes to hierarchically match the neurons of a network, which is quite difficult in practice (participant models need to have the same number of layers and structure). FedBABU [171] aggregates only the body in the aggregation phase instead of the whole network, where the body is related to the generality of the network and the head represents personalization. It adapts better to the heterogeneous data of each client and improves the representation and personalization ability of a single global model.

Moreover, centralized gradient aggregation puts pressure on communication bandwidth and computing costs. In order to avoid the risk of failure of large-scale centralized fusion of local models, Hoang et al. [84] compare centralized and distributed gradient aggregation that occurs only in the local experts. Other recent work [192, 88] regards client updates as pseudo-gradients $\Delta_{i}$, which are aggregated as Eq.(41), and the global model is updated as Eq.(42):

$$\bar{\Delta}^{(t)}=\frac{1}{n}\sum_{i=1}^{n}\Delta_{i}^{(t)} \tag{41}$$
$$\mathbf{w}^{(t+1)}=\mathbf{w}^{(t)}-\eta\bar{\Delta}^{(t)} \tag{42}$$
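The pseudo-gradient view of Eq.(41) and Eq.(42) can be sketched as follows. The sign convention $\Delta_i = \mathbf{w} - \mathbf{w}_i$ (global weights minus locally updated client weights) is an assumption chosen so that the minus sign in Eq.(42) moves the global model toward the clients; the function names and toy values are likewise hypothetical.

```python
def pseudo_gradient(w_global, w_client):
    """Delta_i = w_global - w_client (sign convention assumed here)."""
    return [g - c for g, c in zip(w_global, w_client)]


def server_update(w_global, client_weights, server_lr):
    """Eqs. (41)-(42): average the pseudo-gradients, then take a server step."""
    deltas = [pseudo_gradient(w_global, w) for w in client_weights]
    n = len(deltas)
    mean_delta = [sum(col) / n for col in zip(*deltas)]
    return [g - server_lr * d for g, d in zip(w_global, mean_delta)]


w_global = [1.0, 1.0]
clients = [[0.0, 2.0], [1.0, 4.0]]  # locally updated client weights

same_as_fedavg = server_update(w_global, clients, server_lr=1.0)
extrapolated = server_update(w_global, clients, server_lr=2.0)
```

With a server step size of 1 the update reduces to plain averaging of the client weights, while a larger step size extrapolates beyond the average, which is the degree of freedom that adaptive server step sizes exploit.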

Based on this pseudo-gradient view, Jhunjhunwala et al. [106] propose FedExp, a self-adaptive method that dynamically calculates the server step size from the pseudo-gradients. FedExp accelerates convergence and reduces the overhead, using extrapolation to accelerate Projection Onto Convex Sets (POCS) as Eq.(43):

$$\mathbf{w}_{\mathrm{POCS}}^{(t+1)}=\mathbf{w}_{\mathrm{POCS}}^{(t)}-\lambda\left(\frac{1}{n}\sum_{i=1}^{n}P_{i}\left(\mathbf{w}_{\mathrm{POCS}}^{(t)}\right)-\mathbf{w}_{\mathrm{POCS}}^{(t)}\right) \tag{43}$$

Huang et al. [92] aggregate personalized sparse gradients and masks trained from local models to generate a new global model as Eq.(44):

$$\mathbf{w}^{(t+1)}=\mathbf{w}^{(t)}-\frac{1}{\left|S_{t}\right|}\sum\left(\tilde{\mathbf{w}}_{0}^{(t)}-\tilde{\mathbf{w}}_{n}^{(t)}\right), \tag{44}$$

where $S_{t}$ denotes the set of clients. It reduces the communication overhead and solves the issues of sparse personalized FL. In addition, applying personalized models to FL can adapt to the preferences of local users and decrease costs [51, 43, 127].

Table 8: The different aggregation approaches in Federated Learning

| Model fusion in FL | Methods | Ref. | Aggregation approach |
|---|---|---|---|
| Aggregation | FedAvg | [160] | aggregate the parameters of the participants directly |
| | FedExp | [106] | determine the server step size based on pseudo-gradients |
| | FedMA | [227] | hierarchically match neurons of a network |
| | PFNM | [254] | match the neurons of the networks |
| | FedBABU | [171] | updates only the body of the models during training |
| | FedSPA | [92] | aggregate the sparse gradients and masks from local clients |
| Ensemble | FedCVAE-ENS | [82] | leverage CVAE to address statistical heterogeneity |
| | one-shot | [76] | ensemble the predictions of clients in a single iteration |
| | DENSE | [257] | ensemble local models for the global model |
| Distillation | FedDF | [141] | address the quality loss [87] of BN and heterogeneous client models |
| | FedFTG | [258] | use a data-free KD method to fine-tune the global model |
| | FedCVAE-KD | [82] | compress the ensemble of client decoders into a decoder |
| | FedBE | [24] | use Bayesian methods and ensemble distillation |
| | FedAUX | [197] | weight the logits of local models by certainty score |

Since ensemble learning does not require averaging weights, it can be a good tool for aggregation and supports heterogeneous client models. One-shot [76] uses ensemble learning to aggregate the local models, achieving a relative gain of 51.5% in AUC over the baseline. Similarly, many studies apply ensemble learning to FL [82, 257]. Under certain conditions ($i_{m}<\sqrt{i_{s}}$, where $i_{m}$ denotes the number of machines and $i_{s}$ the number of samples), direct weight aggregation can perform comparably to a centralized algorithm that can access all samples in data-distributed communication [262]. Nevertheless, applying ensemble learning directly in FL is impractical because of the heavy burden of keeping all the received models on the server. KD can solve these problems and regularize the size of the global model and local learning using multi-teacher ensemble methods [268]. Recent work [141, 70, 104] presents novel FL frameworks based on ensemble distillation. FedFTG [258] does not directly broadcast the aggregated model back to each client; instead, it uses knowledge extracted from the local models to fine-tune this preliminary global model on the server, mitigating the performance degradation that follows aggregation. FedDF breaks the communication barrier between heterogeneous client models [87]. FedCVAE-KD [82] uses a lightweight knowledge distillation process to aggregate the client decoders, which generates substantially samples than FedAvg. It addresses the concerns of statistical heterogeneity and pipeline security [265] (i.e., an outside attacker who obtains the transferred data cannot train a classifier).

In short, the aggregation step in FL is essentially a model fusion technique. Selecting a reasonable model fusion method can reduce the impact of specific participants or individual data on the final model, thereby improving the generalization ability and adaptability of the global model. In future work, high-quality and scalable aggregation approaches are expected to help address a series of challenges in FL, such as client heterogeneity, non-i.i.d. data, and limited computing resources [141]. FL is also expected to show its potential in many more areas, such as NLP, recommendation systems [146], and medical image analysis [144].

6.2 Fine-tuning

Figure 7: Different methods of applying Weight Average in fine-tuning scenarios: General Fine-tuning [173]; WiSE [240]; Inter-training [181], which selects an appropriate model fine-tuned on intermediate tasks as the base model; Fusing [29], which averages models fine-tuned on intermediate (source) tasks; Model Soup [239], which averages models fine-tuned on the target task; and Ratatouille [188], which recycles multiple fine-tunings on diverse auxiliary tasks and then averages all the fine-tuned weights to obtain the final model. $T_{s}$: source task; $T$: target task; $T_{aux}$: auxiliary task.

Fine-tuning a base model, such as a pre-trained model, is an efficient approach for adapting models to downstream tasks [23, 41], and it yields better generalization and more accurate outputs with less labeled data. Compared with random initialization, a pre-trained model has already been trained on a set of task-specific data and is therefore a better standard starting point for training. Nevertheless, the average of existing fine-tuned models [29, 28] is an even better base model than the vanilla pre-trained model for fine-tuning on downstream tasks. Besides, a great deal of recent work combines WA with fine-tuning, as shown in Figure 7, such as model soup [239] and DiWA [190]. Fine-tuning improves accuracy on the target distribution, but often reduces robustness to distribution shift. WiSE-FT [240] combines the weights of the zero-shot and fine-tuned models to improve accuracy under distribution shift while retaining high accuracy on the target distribution. Local fine-tuning (Lo-fi) [237] fine-tunes each node independently without any communication and then averages the nodes. Lo-fi can also improve performance under distribution shift. Collaborative Descent fusion (ColD) [44] replaces base models with fused models that can be recycled, continually improving the pre-trained models on which they are based. ColD [44] is superior to RoBERTa [148] and even to previous multitask models. While these strategies for averaging fine-tuned models are simple, they do not take full advantage of the connections between the fine-tuned models. Therefore, training on an intermediate task before training on a target task can explore the capabilities of the base models [180, 224, 185]. Inspired by inter-training strategies [185], Rame et al. [188] fine-tune the models on diverse auxiliary tasks, which improves out-of-distribution (OOD) generalization.
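The two basic operations behind several of the methods above can be sketched as follows: a uniform average of several fine-tuned models (in the spirit of model soups), and a linear interpolation between zero-shot and fine-tuned weights (in the spirit of WiSE-FT). This is a simplified sketch, not the authors' code. It assumes all models share one architecture, represents each model as a hypothetical dictionary from parameter names to flat lists of floats, and uses made-up names and values.

```python
from typing import Dict, List

StateDict = Dict[str, List[float]]


def average_weights(models: List[StateDict]) -> StateDict:
    """Uniform weight average of models that share one architecture."""
    n = len(models)
    return {
        name: [sum(vals) / n for vals in zip(*(m[name] for m in models))]
        for name in models[0]
    }


def interpolate(zero_shot: StateDict, fine_tuned: StateDict, alpha: float) -> StateDict:
    """Linear interpolation: (1 - alpha) * zero_shot + alpha * fine_tuned."""
    return {
        name: [(1 - alpha) * z + alpha * f
               for z, f in zip(zero_shot[name], fine_tuned[name])]
        for name in zero_shot
    }


# Hypothetical toy weights for a single parameter tensor.
zero_shot = {"layer.weight": [0.0, 2.0]}
ft_a = {"layer.weight": [1.0, 4.0]}
ft_b = {"layer.weight": [3.0, 0.0]}

soup = average_weights([ft_a, ft_b])           # average of fine-tuned models
wise = interpolate(zero_shot, ft_a, alpha=0.5)  # zero-shot / fine-tuned mix
```

In the interpolation, alpha = 0 recovers the zero-shot model and alpha = 1 the fine-tuned model; intermediate values trade off target accuracy against robustness to distribution shift.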

Averaging fine-tuned models reduces the training time required to reach the target [28] and produces more accurate models that generalize better. In addition, different ways of fine-tuning (e.g., fine-tuning with frozen layers, top-layer fine-tuning, etc.) also affect the final accuracy and the behavior under distribution shift [240]. However, combining WA with fine-tuning incurs expensive overhead, which limits its use in specific applications. It may also face an explosion in the number of saved checkpoints, or catastrophic forgetting [121], especially when applied to transfer learning.

6.3 Distillation

Figure 8: Two aggregation modes of distillation. Top: in contrast to standard KD, this framework incorporates multiple teacher models for distillation. Bottom: the framework incorporates multiple student models for distillation.

Knowledge distillation (KD) [83] is an important method for ensembling multiple models, and it involves two types of models. A teacher model is a large and powerful model that is trained on large-scale data and has high predictive and expressive power. A student model is a relatively small model with fewer parameters and lower computational requirements [199, 18]. By using the knowledge of the teacher (e.g., the output probability distribution, hidden layer representations, etc.) to guide training, the student can achieve a prediction ability close to that of the large model with fewer resources and faster speed [2, 124, 119, 221]. Given that multiple teachers or students are expected to perform better than a single model [6], we divide KD into two categories according to the aggregated objects, as shown in Figure 8.

The first type of approach merges multiple teacher models and distills the student model directly, as shown in Table 9. Recent work mainly integrates the outputs of teachers (e.g., logits [49, 6, 252] or feature-based knowledge [241, 143]). Ensemble distillation (ED) [141, 157] distills the average output of multiple teachers into a student model, which can make up for the shortcomings of a single teacher model and provide more diverse and comprehensive information to the student model. FedDF [141] distills a collection of $\left|\mathcal{S}_{t}\right|$ client-teacher models into a server-student model. It averages the logit outputs $f\left(\hat{\mathbf{x}}_{t}^{k}\right)$ of the teachers, as in Eq. (45):

$$\mathbf{x}_{t,j}:=\mathbf{x}_{t,j-1}-\eta\frac{\partial\mathrm{KL}\left(\sigma\left(\frac{1}{\left|\mathcal{S}_{t}\right|}\sum_{k\in\mathcal{S}_{t}}f\left(\hat{\mathbf{x}}_{t}^{k}\right)\right),\sigma\left(f\left(\mathbf{x}_{t,j-1}\right)\right)\right)}{\partial\mathbf{x}_{t,j-1}}, \tag{45}$$

where $t$ is the communication round, KL denotes the KL divergence, and $\sigma$ is the softmax function. The ensemble part of FedDF does not affect the overall workflow of clients, and it addresses the quality loss issue of batch normalization (BN) [87]. Wu et al. [241] propose a multi-teacher adaptive distillation framework that can transfer knowledge from multiple teachers to a student without the need for source domain data. Although merging multiple teachers makes up for the shortcomings of a single teacher, some of the teachers' information may be overlooked when conflicts exist among teachers.
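The objective inside Eq. (45) can be sketched in a few lines: average the teachers' logits, apply a softmax to the averaged teacher logits and to the student logits, and measure the KL divergence between the two distributions. The sketch below computes only this loss for a single input; it takes σ to be the softmax, and the gradient step of Eq. (45) would in practice be delegated to an automatic-differentiation framework. All function names and numbers are illustrative assumptions.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one logit vector."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def ensemble_distillation_loss(teacher_logits: List[List[float]],
                               student_logits: List[float]) -> float:
    """KL between softmax(mean teacher logits) and softmax(student logits)."""
    n = len(teacher_logits)
    avg_logits = [sum(col) / n for col in zip(*teacher_logits)]
    return kl_divergence(softmax(avg_logits), softmax(student_logits))


# Two hypothetical teachers that disagree; their average is uniform.
teachers = [[2.0, 0.0], [0.0, 2.0]]
loss_match = ensemble_distillation_loss(teachers, [1.0, 1.0])
loss_mismatch = ensemble_distillation_loss(teachers, [3.0, 0.0])
```

The toy case also shows the limitation noted above: when teachers conflict, averaging their logits yields a flat target, so the individual teachers' confident predictions are no longer visible to the student.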

The other way is to use the teacher model to distill multiple students and then merge these student models. Co-distillation (CD) [6] regards each client device as a student model and treats the average of the logit outputs of the other devices as the teacher's output. However, the same training data sample must be used to synchronize the outputs of the teacher and the local student model. To solve this problem of CD, FD [104] periodically uploads the local average logit vectors to the server. Each average logit vector, together with its associated label as the current training sample, is used as the distillation regularizer for the next round of local device computation. FD improves performance and significantly reduces the number of communication rounds. However, merging multiple students also has some problems, such as a large demand for computing resources, poor interpretability, and over-dependence on the original model.
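The co-distillation target described above can be sketched as a leave-one-out average: for a given device, the teacher output is the mean of the logits produced by all other devices on the same input. The helper below is a hypothetical illustration with made-up values, not the implementation from the cited work.

```python
from typing import List


def peer_teacher_logits(all_logits: List[List[float]],
                        student_index: int) -> List[float]:
    """Average the logits of every device except the student itself."""
    peers = [logits for i, logits in enumerate(all_logits) if i != student_index]
    if not peers:
        raise ValueError("co-distillation needs at least two devices")
    n = len(peers)
    return [sum(col) / n for col in zip(*peers)]


# Hypothetical logits of three devices on the same training sample.
device_logits = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
teacher_for_device_0 = peer_teacher_logits(device_logits, student_index=0)
```

Because every device must evaluate the same sample to form this target, the sketch also makes the synchronization requirement explicit.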

Table 9: Classification of KD according to the differences of the aggregated objects

| Merge mode of distillation | Methods | Ref. | Description |
|---|---|---|---|
| Merge multiple teachers | FedDF | [141] | addresses the quality loss issue [87] of BN, break the knowledge barriers among heterogeneous client models |
| | END2 | [157] | distilling the distribution of the predictions from an ensemble instead of the average prediction |
| | AE-KD | [47] | regard the ensemble knowledge distillation as a multi-objective optimization problem |
| | FedFTG | [258] | a data-free knowledge distillation method, which relieves the issue of direct model aggregation |
| Merge multiple students | Batch Ensemble | [235] | mini-batch friendly, parallelizable within a device, minor memory overhead |
| | Hydra | [220] | improve distillation performance while capturing the uncertainty behavior of the original ensemble |
| | LatentBE | [163] | average a student with multiple subnetworks, giving a single student network with no additional inference cost |

6.4 Model Fusion on Foundation Models/LLMs

Foundation models show strong performance and emergent abilities when dealing with complex tasks. Large foundation models are characterized by their sheer size, containing billions of parameters that help them learn complex patterns in the data. In particular, with the recent emergence of new LLMs [264, 200], such as GPT-3 [17, 172], T5 [187], BERT [41], and Megatron-LM, the application of WA [256, 212, 154] to LLMs has attracted more attention. You et al. [251] propose B-tuning, which uses Bayesian learning to calculate the posterior predictive distribution and tunes the top-K pre-trained models ranked by their transferability. Zoo-tuning [203] aggregates the weights of pre-trained models with aligned channels to obtain a final model adapted to downstream tasks, which alleviates the high cost of transferring large models.

Besides, recent work [256, 120] tends to craft better frameworks and modules adapted to the application of LLMs. Izacard et al. [97] present fusion-in-decoder (FiD), a novel framework that performs evidence fusion in the decoder only, with the aim of efficiently aggregating multiple passages. Based on FiD, Ravaut et al. [191] introduce SummaFusion, which concatenates the representations of summary candidates and further explores the effectiveness of fusion in the context of text summarization. However, their results improve little because they do not filter out poor-quality candidates before applying the algorithm. In contrast, Jiang et al. [107] propose the ensemble framework LLM-BLENDER, which focuses on identifying subtle differences among the outputs of different candidates using the PairRanker algorithm, and then ranks and summarizes the candidates to achieve better performance. Huang et al. [90] introduce the Low-rank adaptations hub (LoRAHub), a framework for combining multiple LoRA modules trained on different tasks, which is designed to increase the adaptability of LLMs and reduce training costs.

Owing to its high performance and low computational cost, fine-tuning large foundation models improves robustness to distribution shift [240]. Branch-Train-Merge (BTM) [135] reduces the large amount of multi-node synchronization required for parallel LLM training. In addition, negating a task vector in task arithmetic [174] can reduce the number of toxic generations of LLMs. For example, it decreases the proportion of toxic generations from 4.8% to 0.8% in GPT-2 [94].
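The negation operation of task arithmetic can be sketched as follows, assuming the usual formulation in which a task vector is the element-wise difference between fine-tuned and pre-trained weights, and negation subtracts a scaled copy of that vector from the pre-trained weights. The dictionary representation, function names, and values are hypothetical and only illustrate the arithmetic.

```python
from typing import Dict, List

StateDict = Dict[str, List[float]]


def task_vector(pretrained: StateDict, fine_tuned: StateDict) -> StateDict:
    """Task vector: fine-tuned weights minus pre-trained weights."""
    return {
        name: [f - p for p, f in zip(pretrained[name], fine_tuned[name])]
        for name in pretrained
    }


def apply_task_vector(pretrained: StateDict, tau: StateDict,
                      scale: float) -> StateDict:
    """Add a scaled task vector; a negative scale removes the behavior."""
    return {
        name: [p + scale * t for p, t in zip(pretrained[name], tau[name])]
        for name in pretrained
    }


# Hypothetical weights: a model fine-tuned on undesirable (e.g., toxic) data.
pretrained = {"w": [1.0, 1.0]}
fine_tuned_toxic = {"w": [2.0, 0.5]}

tau = task_vector(pretrained, fine_tuned_toxic)
edited = apply_task_vector(pretrained, tau, scale=-1.0)  # negation
```

A scale of +1 would reproduce the fine-tuned model, so the same helper covers both adding and negating a task vector; in practice the scale is a hyperparameter chosen on held-out data.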

7 Conclusion

In this survey, we review deep model fusion techniques, which aim to improve the performance of the model. We propose a new categorization that groups the technologies of deep model fusion into four perspectives: “mode connectivity”, “alignment”, “weight average” and “ensemble learning”. In the first three chapters, we describe the fusion of model weights to obtain a superior final fused model. In “ensemble learning”, we focus on the fusion of the outputs of deep models, covering a wealth of available methods and a large number of ensemble frameworks. We summarize the common methods from the point of view of algorithm design and performance, and compare the differences, advantages and disadvantages of the different approaches. Finally, we discuss the applications and engineering prospects of deep model fusion technology in FL, distillation, LLMs, etc.

We not only summarize current technologies of deep model fusion, but also point out the bottlenecks and potential breakthroughs. The survey is expected to help developers improve the performance of deep model fusion technologies and to indicate promising and valuable directions. In the future, it is worth designing novel deep model fusion strategies from the perspectives of innovative aggregation patterns, better initial conditions, diverse ensemble frameworks, and others. The abundant information in the loss landscape and the potential relationships between the components of networks remain to be further exploited. In addition, better adaptive methods are expected to be applied to heterogeneous models and complex real-world scenarios, such as FL, large-scale models, and transfer learning. We also need to pay attention to practical effects in order to promote the development and application of deep model fusion technologies.

References

  • [1] Imad Afyouni, Zaher Al Aghbari, and Reshma Abdul Razack. Multi-feature, multi-modal, and multi-source social event detection: A comprehensive survey. Information Fusion, 79:279–308, 2022.
  • [2] Sungsoo Ahn, Shell Xu Hu, Andreas Damianou, Neil D Lawrence, and Zhenwen Dai. Variational information distillation for knowledge transfer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9163–9171, 2019.
  • [3] Samuel K Ainsworth, Jonathan Hayase, and Siddhartha Srinivasa. Git re-basin: Merging models modulo permutation symmetries. arXiv preprint arXiv:2209.04836, 2022.
  • [4] Aditya Kumar Akash, Sixu Li, and Nicolás García Trillos. Wasserstein barycenter-based model fusion and linear mode connectivity of neural networks. arXiv preprint arXiv:2210.06671, 2022.
  • [5] Milad I Akhlaghi and Sergey V Sukhov. Knowledge fusion in feedforward artificial neural networks. Neural Processing Letters, 48(1):257–272, 2018.
  • [6] Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormandi, George E Dahl, and Geoffrey E Hinton. Large scale distributed neural network training through online distillation. arXiv preprint arXiv:1804.03235, 2018.
  • [7] Anna Aniol, Marcin Pietron, and Jerzy Duda. Ensemble approach for natural language question answering problem. In 2019 Seventh International Symposium on Computing and Networking Workshops (CANDARW), pages 180–183. IEEE, 2019.
  • [8] Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. In International Conference on Machine Learning, pages 254–263. PMLR, 2018.
  • [9] Devansh Arpit, Huan Wang, Yingbo Zhou, and Caiming Xiong. Ensemble of averages: Improving model selection and boosting performance in domain generalization. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, 2022.
  • [10] Peter Auer, Mark Herbster, and Manfred K Warmuth. Exponentially many local minima for single neurons. Advances in neural information processing systems, 8, 1995.
  • [11] Gregory Benton, Wesley Maddox, Sanae Lotfi, and Andrew Gordon Wilson. Loss surface simplexes for mode connecting volumes and fast ensembling. In International Conference on Machine Learning, pages 769–779. PMLR, 2021.
  • [12] Frederik Benzing, Simon Schug, Robert Meier, Johannes Von Oswald, Yassir Akram, Nicolas Zucchet, Laurence Aitchison, and Angelika Steger. Random initialisations performing above chance and how to find them. arXiv preprint arXiv:2209.07509, 2022.
  • [13] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In International conference on machine learning, pages 1613–1622. PMLR, 2015.
  • [14] Johanni Brea, Berfin Simsek, Bernd Illing, and Wulfram Gerstner. Weight-space symmetry in deep networks gives rise to permutation saddles, connected by equal-loss valleys across the loss landscape. arXiv preprint arXiv:1907.02911, 2019.
  • [15] Leo Breiman. Bagging predictors. Machine learning, 24:123–140, 1996.
  • [16] Leo Breiman. Random forests. Machine learning, 45:5–32, 2001.
  • [17] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • [18] Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 535–541, 2006.
  • [19] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021.
  • [20] Ángela Casado-García and Jónathan Heras. Ensemble methods for object detection. In ECAI 2020, pages 2688–2695. IOS Press, 2020.
  • [21] Junbum Cha, Sanghyuk Chun, Kyungjae Lee, Han-Cheol Cho, Seunghyun Park, Yunsung Lee, and Sungrae Park. Swad: Domain generalization by seeking flat minima. Advances in Neural Information Processing Systems, 34:22405–22418, 2021.
  • [22] An Mei Chen, Haw-minn Lu, and Robert Hecht-Nielsen. On the geometry of feedforward neural network error surfaces. Neural computation, 5(6):910–927, 1993.
  • [23] Guanzheng Chen, Fangyu Liu, Zaiqiao Meng, and Shangsong Liang. Revisiting parameter-efficient tuning: Are we really there yet? arXiv preprint arXiv:2202.07962, 2022.
  • [24] Hong-You Chen and Wei-Lun Chao. Fedbe: Making bayesian model ensemble applicable to federated learning. arXiv preprint arXiv:2009.01974, 2020.
  • [25] Hugh Chen, Scott Lundberg, and Su-In Lee. Checkpoint ensembles: Ensemble methods from a single training process. arXiv preprint arXiv:1710.03282, 2017.
  • [26] Minghui Chen, Meirui Jiang, Qi Dou, Zehua Wang, and Xiaoxiao Li. Fedsoup: Improving generalization and personalization in federated learning via selective model interpolation, 2023.
  • [27] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial intelligence and statistics, pages 192–204. PMLR, 2015.
  • [28] Leshem Choshen, Elad Venezian, Shachar Don-Yehia, Noam Slonim, and Yoav Katz. Where to start? analyzing the potential value of intermediate models. arXiv preprint arXiv:2211.00107, 2022.
  • [29] Leshem Choshen, Elad Venezian, Noam Slonim, and Yoav Katz. Fusing finetuned models for better pretraining. arXiv preprint arXiv:2204.03044, 2022.
  • [30] KR Chowdhary. Natural language processing. Fundamentals of artificial intelligence, pages 603–649, 2020.
  • [31] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
  • [32] Alexandra Chronopoulou, Matthew E Peters, Alexander Fraser, and Jesse Dodge. Adaptersoup: Weight averaging to improve generalization of pretrained language models. arXiv preprint arXiv:2302.07027, 2023.
  • [33] Yaim Cooper. Global minima of overparameterized neural networks. SIAM Journal on Mathematics of Data Science, 3(2):676–691, 2021.
  • [34] Francesco Croce, Sylvestre-Alvise Rebuffi, Evan Shelhamer, and Sven Gowal. Seasoning model soups for robustness to adversarial and natural distribution shifts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12313–12323, June 2023.
  • [35] Gavin E Crooks. Measuring thermodynamic length. Physical Review Letters, 99(10):100602, 2007.
  • [36] Wojciech Marian Czarnecki, Simon Osindero, Razvan Pascanu, and Max Jaderberg. A deep neural network’s loss surface contains every low-dimensional pattern. arXiv preprint arXiv:1912.07559, 2019.
  • [37] Nico Daheim, Nouha Dziri, Mrinmaya Sachan, Iryna Gurevych, and Edoardo M Ponti. Elastic weight removal for faithful and abstractive dialogue generation. arXiv preprint arXiv:2303.17574, 2023.
  • [38] Zihang Dai, Hanxiao Liu, Quoc V Le, and Mingxing Tan. Coatnet: Marrying convolution and attention for all data sizes. Advances in neural information processing systems, 34:3965–3977, 2021.
  • [39] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. Advances in neural information processing systems, 27, 2014.
  • [40] Li Deng and John Platt. Ensemble deep learning for speech recognition. In Proc. interspeech, 2014.
  • [41] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [42] Nikolaos Dimitriadis, Pascal Frossard, and François Fleuret. Pareto manifold learning: Tackling multiple tasks via ensembles of single-task models. In International Conference on Machine Learning, pages 8015–8052. PMLR, 2023.
  • [43] Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. In International Conference on Machine Learning, pages 1019–1028. PMLR, 2017.
  • [44] Shachar Don-Yehiya, Elad Venezian, Colin Raffel, Noam Slonim, Yoav Katz, and Leshem Choshen. Cold fusion: Collaborative descent for distributed multitask finetuning, 2022.
  • [45] Xibin Dong, Zhiwen Yu, Wenming Cao, Yifan Shi, and Qianli Ma. A survey on ensemble learning. Frontiers of Computer Science, 14:241–258, 2020.
  • [46] Felix Draxler, Kambis Veschgini, Manfred Salmhofer, and Fred Hamprecht. Essentially no barriers in neural network energy landscape. In International conference on machine learning, pages 1309–1318. PMLR, 2018.
  • [47] Shangchen Du, Shan You, Xiaojie Li, Jianlong Wu, Fei Wang, Chen Qian, and Changshui Zhang. Agree to disagree: Adaptive ensemble knowledge distillation in gradient space. advances in neural information processing systems, 33:12345–12355, 2020.
  • [48] Long Duong, Trevor Cohn, Steven Bird, and Paul Cook. Low resource dependency parsing: Cross-lingual parameter sharing in a neural network parser. In Proceedings of the 53rd annual meeting of the Association for Computational Linguistics and the 7th international joint conference on natural language processing (volume 2: short papers), pages 845–850, 2015.
  • [49] Nikita Dvornik, Cordelia Schmid, and Julien Mairal. Diversity with cooperation: Ensemble methods for few-shot classification. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
  • [50] Rahim Entezari, Hanie Sedghi, Olga Saukh, and Behnam Neyshabur. The role of permutation invariance in linear mode connectivity of neural networks. arXiv preprint arXiv:2110.06296, 2021.
  • [51] Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach. Advances in Neural Information Processing Systems, 33:3557–3568, 2020.
  • [52] Rida T Farouki. The bernstein polynomial basis: A centennial retrospective. Computer Aided Geometric Design, 29(6):379–419, 2012.
  • [53] Chris Fifty, Ehsan Amid, Zhe Zhao, Tianhe Yu, Rohan Anil, and Chelsea Finn. Efficiently identifying task groupings for multi-task learning. Advances in Neural Information Processing Systems, 34:27503–27516, 2021.
  • [54] Stanislav Fort, Gintare Karolina Dziugaite, Mansheej Paul, Sepideh Kharaghani, Daniel M Roy, and Surya Ganguli. Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the neural tangent kernel. Advances in Neural Information Processing Systems, 33:5850–5861, 2020.
  • [55] Stanislav Fort, Huiyi Hu, and Balaji Lakshminarayanan. Deep ensembles: A loss landscape perspective. arXiv preprint arXiv:1912.02757, 2019.
  • [56] Stanislav Fort and Stanislaw Jastrzebski. Large scale structure of neural network loss landscapes. Advances in Neural Information Processing Systems, 32, 2019.
  • [57] Stanislav Fort and Adam Scherlis. The goldilocks zone: Towards better understanding of neural network loss landscapes. In Proceedings of the aaai conference on artificial intelligence, volume 33, pages 3574–3581, 2019.
  • [58] Jonathan Frankle. Revisiting “qualitatively characterizing neural network optimization problems”. arXiv preprint arXiv:2012.06898, 2020.
  • [59] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018.
  • [60] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel Roy, and Michael Carbin. Linear mode connectivity and the lottery ticket hypothesis. In International Conference on Machine Learning, pages 3259–3269. PMLR, 2020.
  • [61] C. D. Freeman and J. Bruna. Topology and geometry of half-rectified network optimization. 2016.
  • [62] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
  • [63] Rinon Gal, Or Patashnik, Haggai Maron, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. Stylegan-nada: Clip-guided domain adaptation of image generators. ACM Transactions on Graphics (TOG), 41(4):1–13, 2022.
  • [64] Wen Gao, Yonghong Tian, and Jian Wang. Digital retina: revolutionizing camera systems for the smart city. Science China Information Science, 48(8):1076–1082, 2018.
  • [65] Yingbo Gao, Christian Herold, Zijian Yang, and Hermann Ney. Revisiting checkpoint averaging for neural machine translation. arXiv preprint arXiv:2210.11803, 2022.
  • [66] Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P Vetrov, and Andrew G Wilson. Loss surfaces, mode connectivity, and fast ensembling of dnns. Advances in neural information processing systems, 31, 2018.
  • [67] Jean-Baptiste Gaya, Laure Soulier, and Ludovic Denoyer. Learning a subspace of policies for online adaptation in reinforcement learning, 2022.
  • [68] Charles Godfrey, Davis Brown, Tegan Emerson, and Henry Kvinge. On the symmetries of deep learning models and their internal representations. arXiv preprint arXiv:2205.14258, 2022.
  • [69] Jonas Gomes, Luiz Velho, and Mario Costa Sousa. Computer graphics: theory and practice. CRC Press, 2012.
  • [70] Xuan Gong, Abhishek Sharma, Srikrishna Karanam, Ziyan Wu, Terrence Chen, David Doermann, and Arun Innanje. Ensemble attention distillation for privacy-preserving federated learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15076–15086, 2021.
  • [71] Ian J Goodfellow, Oriol Vinyals, and Andrew M Saxe. Qualitatively characterizing neural network optimization problems. arXiv preprint arXiv:1412.6544, 2014.
  • [72] Akhilesh Gotmare, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. Using mode connectivity for loss landscape analysis. arXiv preprint arXiv:1806.06977, 2018.
  • [73] Frithjof Gressmann, Zach Eaton-Rosen, and Carlo Luschi. Improving neural network training in low dimensional random bases. Advances in Neural Information Processing Systems, 33:12140–12150, 2020.
  • [74] Sorin Grigorescu, Bogdan Trasnea, Tiberiu Cocias, and Gigel Macesanu. A survey of deep learning techniques for autonomous driving. Journal of Field Robotics, 37(3):362–386, 2020.
  • [75] Xiaozhe Gu, Zixun Zhang, Yuncheng Jiang, Tao Luo, Ruimao Zhang, Shuguang Cui, and Zhen Li. Hierarchical weight averaging for deep neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2023.
  • [76] Neel Guha, Ameet Talwalkar, and Virginia Smith. One-shot federated learning. arXiv preprint arXiv:1902.11175, 2019.
  • [77] Hao Guo, Jiyong Jin, and Bin Liu. Stochastic weight averaging revisited. Applied Sciences, 13(5):2935, 2023.
  • [78] Vipul Gupta, Santiago Akle Serrano, and Dennis DeCoste. Stochastic weight averaging in parallel: Large-batch training that generalizes well. arXiv preprint arXiv:2001.02312, 2020.
  • [79] Kai Han, An Xiao, Enhua Wu, Jianyuan Guo, Chunjing Xu, and Yunhe Wang. Transformer in transformer. Advances in Neural Information Processing Systems, 34:15908–15919, 2021.
  • [80] Lars Kai Hansen and Peter Salamon. Neural network ensembles. IEEE transactions on pattern analysis and machine intelligence, 12(10):993–1001, 1990.
  • [81] Robert Hecht-Nielsen. On the algebraic structure of feedforward network weight spaces. In Advanced Neural Computers, pages 129–135. Elsevier, 1990.
  • [82] Clare Elizabeth Heinbaugh, Emilio Luz-Ricca, and Huajie Shao. Data-free one-shot federated learning under very high statistical heterogeneity. In The Eleventh International Conference on Learning Representations, 2022.
  • [83] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network, 2015.
  • [84] Minh Hoang, Nghia Hoang, Bryan Kian Hsiang Low, and Carleton Kingsford. Collective model fusion for multiple black-box experts. In International Conference on Machine Learning, pages 2742–2750. PMLR, 2019.
  • [85] Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural computation, 9(1):1–42, 1997.
  • [86] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. Advances in neural information processing systems, 30, 2017.
  • [87] Kevin Hsieh, Amar Phanishayee, Onur Mutlu, and Phillip Gibbons. The non-iid data quagmire of decentralized machine learning. In International Conference on Machine Learning, pages 4387–4398. PMLR, 2020.
  • [88] Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. Measuring the effects of non-identical data distribution for federated visual classification, 2019.
  • [89] Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. Federated visual classification with real-world data distribution. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part X 16, pages 76–92. Springer, 2020.
  • [90] Chengsong Huang, Qian Liu, Bill Yuchen Lin, Tianyu Pang, Chao Du, and Min Lin. LoraHub: Efficient cross-task generalization via dynamic LoRA composition, 2023.
  • [91] Gao Huang, Yixuan Li, Geoff Pleiss, Zhuang Liu, John E Hopcroft, and Kilian Q Weinberger. Snapshot ensembles: Train 1, get m for free. arXiv preprint arXiv:1704.00109, 2017.
  • [92] Tiansheng Huang, Shiwei Liu, Li Shen, Fengxiang He, Weiwei Lin, and Dacheng Tao. Achieving personalized federated learning with sparse local models. arXiv preprint arXiv:2201.11380, 2022.
  • [93] Yongqi Huang, Peng Ye, Xiaoshui Huang, Sheng Li, Tao Chen, and Wanli Ouyang. Experts weights averaging: A new general training scheme for vision transformers. arXiv preprint arXiv:2308.06093, 2023.
  • [94] Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. arXiv preprint arXiv:2212.04089, 2022.
  • [95] Gabriel Ilharco, Mitchell Wortsman, Samir Yitzhak Gadre, Shuran Song, Hannaneh Hajishirzi, Simon Kornblith, Ali Farhadi, and Ludwig Schmidt. Patching open-vocabulary models by interpolating weights. arXiv preprint arXiv:2208.05592, 2022.
  • [96] B. Şimşek, F. Ged, A. Jacot, F. Spadaro, C. Hongler, W. Gerstner, and J. Brea. Geometry of the loss landscape in overparameterized neural networks: Symmetries and invariances. 2021.
  • [97] Gautier Izacard and Edouard Grave. Leveraging passage retrieval with generative models for open domain question answering. arXiv preprint arXiv:2007.01282, 2020.
  • [98] Alan Julian Izenman. Introduction to manifold learning. Wiley Interdisciplinary Reviews: Computational Statistics, 4(5):439–446, 2012.
  • [99] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407, 2018.
  • [100] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018.
  • [101] Prateek Jain, Sham Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Parallelizing stochastic gradient descent for least squares regression: mini-batching, averaging, and model misspecification. Journal of Machine Learning Research, 18, 2018.
  • [102] Joel Jang, Seungone Kim, Seonghyeon Ye, Doyoung Kim, Lajanugen Logeswaran, Moontae Lee, Kyungjae Lee, and Minjoon Seo. Exploring the benefits of training expert language models over instruction tuning, 2023.
  • [103] Anubhav Jangra, Sourajit Mukherjee, Adam Jatowt, Sriparna Saha, and Mohammad Hasanuzzaman. A survey on multi-modal summarization. ACM Computing Surveys, 55(13s):1–36, 2023.
  • [104] Eunjeong Jeong, Seungeun Oh, Hyesung Kim, Jihong Park, Mehdi Bennis, and Seong-Lyun Kim. Communication-efficient on-device machine learning: Federated distillation and augmentation under non-iid private data. arXiv preprint arXiv:1811.11479, 2018.
  • [105] Rahul Jha, Alex Marin, Suvamsh Shivaprasad, and Imed Zitouni. Bag of experts architectures for model reuse in conversational language understanding. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), pages 153–161, 2018.
  • [106] Divyansh Jhunjhunwala, Shiqiang Wang, and Gauri Joshi. Fedexp: Speeding up federated averaging via extrapolation. arXiv preprint arXiv:2301.09604, 2023.
  • [107] Dongfu Jiang, Xiang Ren, and Bill Yuchen Lin. LLM-Blender: Ensembling large language models with pairwise ranking and generative fusion. arXiv preprint arXiv:2306.02561, 2023.
  • [108] Zetian Jiang, Tianzhe Wang, and Junchi Yan. Unifying offline and online multi-graph matching via finding shortest paths on supergraph. IEEE transactions on pattern analysis and machine intelligence, 43(10):3648–3663, 2020.
  • [109] Xisen Jin, Xiang Ren, Daniel Preotiuc-Pietro, and Pengxiang Cheng. Dataless knowledge fusion by merging weights of language models. arXiv preprint arXiv:2212.09849, 2022.
  • [110] Alexia Jolicoeur-Martineau, Emy Gervais, Kilian Fatras, Yan Zhang, and Simon Lacoste-Julien. Population parameter averaging (papa). arXiv preprint arXiv:2304.03094, 2023.
  • [111] Keller Jordan, Hanie Sedghi, Olga Saukh, Rahim Entezari, and Behnam Neyshabur. Repair: Renormalizing permuted activations for interpolation repair. arXiv preprint arXiv:2211.08403, 2022.
  • [112] Jeevesh Juneja, Rachit Bansal, Kyunghyun Cho, João Sedoc, and Naomi Saphra. Linear connectivity reveals generalization strategies. arXiv preprint arXiv:2205.12411, 2022.
  • [113] Jean Kaddour. Stop wasting my time! Saving days of ImageNet and BERT training with latest weight averaging. arXiv preprint arXiv:2209.14981, 2022.
  • [114] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning, 14(1–2):1–210, 2021.
  • [115] Shigeki Karita, Yotaro Kubo, Michiel Adriaan Unico Bacchiani, and Llion Jones. A comparative study on neural architectures and training methods for Japanese speech recognition. arXiv preprint arXiv:2106.05111, 2021.
  • [116] Kenji Kawaguchi. Deep learning without poor local minima. Advances in neural information processing systems, 29, 2016.
  • [117] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7482–7491, 2018.
  • [118] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
  • [119] Jangho Kim, SeongUk Park, and Nojun Kwak. Paraphrasing complex network: Network compression via factor transfer. Advances in neural information processing systems, 31, 2018.
  • [120] Hiroaki Kingetsu, Kenichi Kobayashi, and Taiji Suzuki. Neural network module decomposition and recomposition. arXiv preprint arXiv:2112.13208, 2021.
  • [121] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.
  • [122] Esben L Kolsbjerg, Michael N Groves, and Bjørk Hammer. An automated nudged elastic band method. The Journal of chemical physics, 145(9), 2016.
  • [123] Peter Kontschieder, Madalina Fiterau, Antonio Criminisi, and Samuel Rota Bulo. Deep neural decision forests. In Proceedings of the IEEE international conference on computer vision, pages 1467–1475, 2015.
  • [124] Animesh Koratana, Daniel Kang, Peter Bailis, and Matei Zaharia. Lit: Learned intermediate representation training for model compression. In International Conference on Machine Learning, pages 3509–3518. PMLR, 2019.
  • [125] Rohith Kuditipudi, Xiang Wang, Holden Lee, Yi Zhang, Zhiyuan Li, Wei Hu, Rong Ge, and Sanjeev Arora. Explaining landscape connectivity of low-cost solutions for multilayer nets. Advances in neural information processing systems, 32, 2019.
  • [126] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242, 2016.
  • [127] Thanh Chi Lam, Nghia Hoang, Bryan Kian Hsiang Low, and Patrick Jaillet. Model fusion for personalized learning. In International Conference on Machine Learning, pages 5948–5958. PMLR, 2021.
  • [128] Isabelle Leang, Ganesh Sistu, Fabian Bürger, Andrei Bursuc, and Senthil Yogamani. Dynamic task weighting methods for multi-task networks in autonomous driving systems. In 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), pages 1–8. IEEE, 2020.
  • [129] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
  • [130] Spyridon Leonardos, Xiaowei Zhou, and Kostas Daniilidis. Distributed consistent data association via permutation synchronization. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2645–2652. IEEE, 2017.
  • [131] Mikhail Iu Leontev, Viktoriia Islenteva, and Sergey V Sukhov. Non-iterative knowledge fusion in deep convolutional neural networks. Neural Processing Letters, 51:1–22, 2020.
  • [132] Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes. arXiv preprint arXiv:1804.08838, 2018.
  • [133] Daliang Li and Junpu Wang. Fedmd: Heterogenous federated learning via model distillation. arXiv preprint arXiv:1910.03581, 2019.
  • [134] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. Advances in neural information processing systems, 31, 2018.
  • [135] Margaret Li, Suchin Gururangan, Tim Dettmers, Mike Lewis, Tim Althoff, Noah A Smith, and Luke Zettlemoyer. Branch-train-merge: Embarrassingly parallel training of expert language models. arXiv preprint arXiv:2208.03306, 2022.
  • [136] Tao Li, Zhehao Huang, Qinghua Tao, Yingwen Wu, and Xiaolin Huang. Trainable weight averaging: Efficient training by optimizing historical solutions. In The Eleventh International Conference on Learning Representations, 2022.
  • [137] Tao Li, Zhehao Huang, Qinghua Tao, Yingwen Wu, and Xiaolin Huang. Trainable weight averaging: A general approach for subspace training, 2023.
  • [138] Tao Li, Lei Tan, Zhehao Huang, Qinghua Tao, Yipeng Liu, and Xiaolin Huang. Low dimensional trajectory hypothesis is true: DNNs can be trained in tiny subspaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3):3411–3420, 2022.
  • [139] Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. IEEE signal processing magazine, 37(3):50–60, 2020.
  • [140] Yixuan Li, Jason Yosinski, Jeff Clune, Hod Lipson, and John Hopcroft. Convergent learning: Do different neural networks learn the same representations? arXiv preprint arXiv:1511.07543, 2015.
  • [141] Tao Lin, Lingjing Kong, Sebastian U Stich, and Martin Jaggi. Ensemble distillation for robust model fusion in federated learning. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 2351–2363. Curran Associates, Inc., 2020.
  • [142] Chang Liu, Chenfei Lou, Runzhong Wang, Alan Yuhan Xi, Li Shen, and Junchi Yan. Deep neural network fusion via graph matching with applications to model ensemble and federated learning. In International Conference on Machine Learning, pages 13857–13869. PMLR, 2022.
  • [143] Iou-Jen Liu, Jian Peng, and Alexander G. Schwing. Knowledge flow: Improve upon your teachers, 2019.
  • [144] Quande Liu, Cheng Chen, Jing Qin, Qi Dou, and Pheng-Ann Heng. Feddg: Federated domain generalization on medical image segmentation via episodic learning in continuous frequency space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1013–1023, 2021.
  • [145] Shiwei Liu, Tianlong Chen, Xiaohan Chen, Zahra Atashgahi, Lu Yin, Huanyu Kou, Li Shen, Mykola Pechenizkiy, Zhangyang Wang, and Decebal Constantin Mocanu. Sparse training via boosting pruning plasticity with neuroregeneration. Advances in Neural Information Processing Systems, 34:9908–9922, 2021.
  • [146] Shuchang Liu, Shuyuan Xu, Wenhui Yu, Zuohui Fu, Yongfeng Zhang, and Amelie Marian. Fedct: Federated collaborative transfer for recommendation. In Proceedings of the 44th international ACM SIGIR conference on research and development in information retrieval, pages 716–725, 2021.
  • [147] Yajing Liu, Yuning Lu, Hao Liu, Yaozu An, Zhuoran Xu, Zhuokun Yao, Baofeng Zhang, Zhiwei Xiong, and Chenguang Gui. Hierarchical prompt learning for multi-task learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10888–10898, 2023.
  • [148] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
  • [149] Yuchen Liu, Long Zhou, Yining Wang, Yang Zhao, Jiajun Zhang, and Chengqing Zong. A comparable study on model averaging, ensembling and reranking in NMT. In Natural Language Processing and Chinese Computing: 7th CCF International Conference, NLPCC 2018, Hohhot, China, August 26–30, 2018, Proceedings, Part II 7, pages 299–308. Springer, 2018.
  • [150] Eliane Maria Loiola, Nair Maria Maia De Abreu, Paulo Oswaldo Boaventura-Netto, Peter Hahn, and Tania Querido. A survey for the quadratic assignment problem. European journal of operational research, 176(2):657–690, 2007.
  • [151] Yihang Lou, Ling-Yu Duan, Yong Luo, Ziqian Chen, Tongliang Liu, Shiqi Wang, and Wen Gao. Towards digital retina in smart cities: A model generation, utilization and communication paradigm. In 2019 IEEE International Conference on Multimedia and Expo (ICME), pages 19–24. IEEE, 2019.
  • [152] Ekdeep Singh Lubana, Eric J Bigelow, Robert P Dick, David Krueger, and Hidenori Tanaka. Mechanistic mode connectivity. In International Conference on Machine Learning, pages 22965–23004. PMLR, 2023.
  • [153] Yong Luo, Ling-Yu Duan, Yan Bai, Tongliang Liu, Yihang Lou, and Yonggang Wen. Nonlinear multi-model reuse. In 2022 IEEE 24th International Workshop on Multimedia Signal Processing (MMSP), pages 1–6. IEEE, 2022.
  • [154] Xingtai Lv, Ning Ding, Yujia Qin, Zhiyuan Liu, and Maosong Sun. Parameter-efficient weight ensembling facilitates task-level knowledge transfer. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 270–282, 2023.
  • [155] Wesley J Maddox, Pavel Izmailov, Timur Garipov, Dmitry P Vetrov, and Andrew Gordon Wilson. A simple baseline for Bayesian uncertainty in deep learning. Advances in neural information processing systems, 32, 2019.
  • [156] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
  • [157] Andrey Malinin, Bruno Mlodozeniec, and Mark Gales. Ensemble distribution distillation. arXiv preprint arXiv:1905.00076, 2019.
  • [158] Kevis-Kokitsi Maninis, Ilija Radosavovic, and Iasonas Kokkinos. Attentive single-tasking of multiple tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1851–1860, 2019.
  • [159] Michael S Matena and Colin A Raffel. Merging models with Fisher-weighted averaging. Advances in Neural Information Processing Systems, 35:17703–17716, 2022.
  • [160] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273–1282. PMLR, 2017.
  • [161] Mehryar Mohri, Gary Sivek, and Ananda Theertha Suresh. Agnostic federated learning. In International Conference on Machine Learning, pages 4615–4625. PMLR, 2019.
  • [162] Vaishnavh Nagarajan and J Zico Kolter. Uniform convergence may be unable to explain generalization in deep learning. Advances in Neural Information Processing Systems, 32, 2019.
  • [163] Giung Nam, Hyungi Lee, Byeongho Heo, and Juho Lee. Improving ensemble distillation with weight averaging and diversifying perturbation. arXiv preprint arXiv:2206.15047, 2022.
  • [164] Kirill Neklyudov, Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variance networks: When expectation does not meet your expectations. arXiv preprint arXiv:1803.03764, 2018.
  • [165] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574–1609, 2009.
  • [166] Gergely Neu and Lorenzo Rosasco. Iterate averaging as regularization for stochastic gradient descent. In Conference On Learning Theory, pages 3222–3242. PMLR, 2018.
  • [167] Behnam Neyshabur, Hanie Sedghi, and Chiyuan Zhang. What is being transferred in transfer learning? Advances in neural information processing systems, 33:512–523, 2020.
  • [168] Quynh Nguyen. On connected sublevel sets in deep learning. In International conference on machine learning, pages 4790–4799. PMLR, 2019.
  • [169] Quynh Nguyen, Mahesh Chandra Mukkamala, and Matthias Hein. On the loss landscape of a class of deep neural networks with no bad local valleys. arXiv preprint arXiv:1809.10749, 2018.
  • [170] Takayuki Nishio and Ryo Yonetani. Client selection for federated learning with heterogeneous resources in mobile edge. In ICC 2019-2019 IEEE international conference on communications (ICC), pages 1–7. IEEE, 2019.
  • [171] Jaehoon Oh, Sangmook Kim, and Se-Young Yun. Fedbabu: Towards enhanced representation for federated image classification. arXiv preprint arXiv:2106.06042, 2021.
  • [172] OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • [173] Maxime Oquab, Leon Bottou, Ivan Laptev, and Josef Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
  • [174] Guillermo Ortiz-Jimenez, Alessandro Favero, and Pascal Frossard. Task arithmetic in the tangent space: Improved editing of pre-trained models. arXiv preprint arXiv:2305.12827, 2023.
  • [175] Niall O’Mahony, Sean Campbell, Anderson Carvalho, Suman Harapanahalli, Gustavo Velasco Hernandez, Lenka Krpalkova, Daniel Riordan, and Joseph Walsh. Deep learning vs. traditional computer vision. In Advances in Computer Vision: Proceedings of the 2019 Computer Vision Conference (CVC), Volume 1, pages 128–144. Springer, 2020.
  • [176] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2009.
  • [177] Manas Pathak, Shantanu Rane, and Bhiksha Raj. Multiparty differential privacy via aggregation of locally trained classifiers. Advances in neural information processing systems, 23, 2010.
  • [178] Fidel A Guerrero Peña, Heitor Rapela Medeiros, Thomas Dubail, Masih Aminbeidokhti, Eric Granger, and Marco Pedersoli. Re-basin via implicit Sinkhorn differentiation. arXiv preprint arXiv:2212.12042, 2022.
  • [179] Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport: With applications to data science. Foundations and Trends® in Machine Learning, 11(5-6):355–607, 2019.
  • [180] Jason Phang, Iacer Calixto, Phu Mon Htut, Yada Pruksachatkun, Haokun Liu, Clara Vania, Katharina Kann, and Samuel R Bowman. English intermediate-task training improves zero-shot cross-lingual transfer too. arXiv preprint arXiv:2005.13013, 2020.
  • [181] Jason Phang, Thibault Févry, and Samuel R Bowman. Sentence encoders on stilts: Supplementary training on intermediate labeled-data tasks. arXiv preprint arXiv:1811.01088, 2018.
  • [182] Fabrizio Pittorino, Antonio Ferraro, Gabriele Perugini, Christoph Feinauer, Carlo Baldassi, and Riccardo Zecchina. Deep networks on toroids: removing symmetries reveals the structure of flat regions in the landscape geometry. In International Conference on Machine Learning, pages 17759–17781. PMLR, 2022.
  • [183] Boris T Polyak. New stochastic approximation type procedures. Avtomat. i Telemekh., (7):98–107, 1990.
  • [184] Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM journal on control and optimization, 30(4):838–855, 1992.
  • [185] Yada Pruksachatkun, Jason Phang, Haokun Liu, Phu Mon Htut, Xiaoyi Zhang, Richard Yuanzhe Pang, Clara Vania, Katharina Kann, and Samuel R. Bowman. Intermediate-task transfer learning with pretrained models for natural language understanding: When and why does it work?, 2020.
  • [186] Yujia Qin, Cheng Qian, Jing Yi, Weize Chen, Yankai Lin, Xu Han, Zhiyuan Liu, Maosong Sun, and Jie Zhou. Exploring mode connectivity for pre-trained language models. arXiv preprint arXiv:2210.14102, 2022.
  • [187] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
  • [188] Alexandre Rame, Kartik Ahuja, Jianyu Zhang, Matthieu Cord, Leon Bottou, and David Lopez-Paz. Model ratatouille: Recycling diverse models for out-of-distribution generalization. 2023.
  • [189] Alexandre Rame, Guillaume Couairon, Mustafa Shukor, Corentin Dancette, Jean-Baptiste Gaya, Laure Soulier, and Matthieu Cord. Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards, 2023.
  • [190] Alexandre Rame, Matthieu Kirchmeyer, Thibaud Rahier, Alain Rakotomamonjy, Patrick Gallinari, and Matthieu Cord. Diverse weight averaging for out-of-distribution generalization. Advances in Neural Information Processing Systems, 2022.
  • [191] Mathieu Ravaut, Shafiq Joty, and Nancy F Chen. Towards summary candidates fusion. arXiv preprint arXiv:2210.08779, 2022.
  • [192] Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný, Sanjiv Kumar, and H. Brendan McMahan. Adaptive federated optimization, 2021.
  • [193] Lior Rokach. Ensemble-based classifiers. Artificial intelligence review, 33:1–39, 2010.
  • [194] David Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering, 1988.
  • [195] Omer Sagi and Lior Rokach. Ensemble learning: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4):e1249, 2018.
  • [196] Levent Sagun, Utku Evci, V Ugur Guney, Yann Dauphin, and Leon Bottou. Empirical analysis of the Hessian of over-parametrized neural networks. arXiv preprint arXiv:1706.04454, 2017.
  • [197] Felix Sattler, Tim Korjakow, Roman Rischke, and Wojciech Samek. Fedaux: Leveraging unlabeled auxiliary data in federated learning. IEEE Transactions on Neural Networks and Learning Systems, 2021.
  • [198] Robert E Schapire et al. A brief introduction to boosting. In Ijcai, volume 99, pages 1401–1406. Citeseer, 1999.
  • [199] Jürgen Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234–242, 1992.
  • [200] Murray Shanahan. Talking about large language models. arXiv preprint arXiv:2212.03551, 2022.
  • [201] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
  • [202] Alexander Shevchenko and Marco Mondelli. Landscape connectivity and dropout stability of SGD solutions for over-parameterized neural networks. In International Conference on Machine Learning, pages 8773–8784. PMLR, 2020.
  • [203] Yang Shu, Zhi Kou, Zhangjie Cao, Jianmin Wang, and Mingsheng Long. Zoo-tuning: Adaptive transfer from a zoo of models. In International Conference on Machine Learning, pages 9626–9637. PMLR, 2021.
  • [204] Sidak Pal Singh and Martin Jaggi. Model fusion via optimal transport. Advances in Neural Information Processing Systems, 33:22045–22055, 2020.
  • [205] Anton Sinitsin, Vsevolod Plokhotnyuk, Dmitriy Pyrkin, Sergei Popov, and Artem Babenko. Editable neural networks. arXiv preprint arXiv:2004.00345, 2020.
  • [206] Ivan Skorokhodov and Mikhail Burtsev. Loss landscape sightseeing with multi-point optimization. arXiv preprint arXiv:1910.03867, 2019.
  • [207] Joshua Smith and Michael Gashler. An investigation of how neural networks learn from the experiences of peers through periodic weight averaging. In 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 731–736. IEEE, 2017.
  • [208] Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet S Talwalkar. Federated multi-task learning. Advances in neural information processing systems, 30, 2017.
  • [209] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
  • [210] Trevor Standley, Amir Zamir, Dawn Chen, Leonidas Guibas, Jitendra Malik, and Silvio Savarese. Which tasks should be learned together in multi-task learning? In International Conference on Machine Learning, pages 9120–9132. PMLR, 2020.
  • [211] George Stoica, Daniel Bolya, Jakob Bjorner, Taylor Hearn, and Judy Hoffman. Zipit! merging models from different tasks without training. arXiv preprint arXiv:2305.03053, 2023.
  • [212] Tianxiang Sun, Zhengfu He, Qin Zhu, Xipeng Qiu, and Xuan-Jing Huang. Multitask pre-training of modular prompt for Chinese few-shot learning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11156–11172, 2023.
  • [213] Yi-Lin Sung, Linjie Li, Kevin Lin, Zhe Gan, Mohit Bansal, and Lijuan Wang. An empirical study of multimodal model merging. arXiv preprint arXiv:2304.14933, 2023.
  • [214] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
  • [215] Charlie Tan, Theodore Long, Sarah Zhao, and Rudolf Laine. Geodesic mode connectivity. 2023.
  • [216] Anke Tang, Yong Luo, Han Hu, Fengxiang He, Kehua Su, Bo Du, Yixin Chen, and Dacheng Tao. Improving heterogeneous model reuse by density estimation. arXiv preprint arXiv:2305.13871, 2023.
  • [217] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in neural information processing systems, 30, 2017.
  • [218] Norman Tatro, Pin-Yu Chen, Payel Das, Igor Melnyk, Prasanna Sattigeri, and Rongjie Lai. Optimizing mode connectivity via neuron alignment. Advances in Neural Information Processing Systems, 33:15300–15311, 2020.
  • [219] Romain Thibaux and Michael I Jordan. Hierarchical beta processes and the Indian buffet process. In Artificial intelligence and statistics, pages 564–571. PMLR, 2007.
  • [220] Linh Tran, Bastiaan S Veeling, Kevin Roth, Jakub Swiatkowski, Joshua V Dillon, Jasper Snoek, Stephan Mandt, Tim Salimans, Sebastian Nowozin, and Rodolphe Jenatton. Hydra: Preserving ensemble diversity for model distillation. arXiv preprint arXiv:2001.04694, 2020.
  • [221] Frederick Tung and Greg Mori. Similarity-preserving knowledge distillation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1365–1374, 2019.
  • [222] Thomas Uriot and Dario Izzo. Safe crossover of neural networks through neuron alignment. In Proceedings of the 2020 Genetic and Evolutionary Computation Conference, pages 435–443, 2020.
  • [223] Joachim Utans. Weight averaging for neural networks and local resampling schemes. In Proc. AAAI-96 Workshop on Integrating Multiple Learned Models. AAAI Press, pages 133–138. Citeseer, 1996.
  • [224] Tu Vu, Tong Wang, Tsendsuren Munkhdalai, Alessandro Sordoni, Adam Trischler, Andrew Mattarella-Micke, Subhransu Maji, and Mohit Iyyer. Exploring and predicting transferability across NLP tasks. arXiv preprint arXiv:2005.00770, 2020.
  • [225] Benyou Wang, Jiabin Niu, Liqun Ma, Yuhua Zhang, Lipeng Zhang, Jingfei Li, Peng Zhang, and Dawei Song. A Chinese question answering approach integrating count-based and embedding-based features. pages 934–941. Springer, 2016.
  • [226] Feng Wang, Guoyizhe Wei, Qiao Liu, Jinxiang Ou, Hairong Lv, et al. Boost neural networks by checkpoints. Advances in Neural Information Processing Systems, 34:19719–19729, 2021.
  • [227] Hongyi Wang, Mikhail Yurochkin, Yuekai Sun, Dimitris Papailiopoulos, and Yasaman Khazaeni. Federated learning with matched averaging. arXiv preprint arXiv:2002.06440, 2020.
  • [228] Huan Wang, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. Identifying generalization properties in neural networks. arXiv preprint arXiv:1809.07402, 2018.
  • [229] Ren Wang, Yuxuan Li, and Sijia Liu. Exploring diversified adversarial robustness in neural networks via robust mode connectivity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2345–2351, 2023.
  • [230] Tianzhe Wang, Zetian Jiang, and Junchi Yan. Clustering-aware multiple graph matching via decayed pairwise matching composition. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20), New York, NY, USA, pages 7–12, 2020.
  • [231] Zhenyi Wang, Xiaoyang Wang, Li Shen, Qiuling Suo, Kaiqiang Song, Dong Yu, Yan Shen, and Mingchen Gao. Meta-learning without data via Wasserstein distributionally-robust model fusion. In Uncertainty in Artificial Intelligence, pages 2045–2055. PMLR, 2022.
  • [232] Tianxin Wei, Zeming Guo, Yifan Chen, and Jingrui He. NTK-approximating MLP fusion for efficient language model fine-tuning. 2023.
  • [233] Guihua Wen, Zhi Hou, Huihui Li, Danyang Li, Lijun Jiang, and Eryang Xun. Ensemble of deep neural networks with probability-based fusion for facial expression recognition. Cognitive Computation, 9(5):597–610, 2017.
  • [234] Haitao Wen, Haoyang Cheng, Heqian Qiu, Lanxiao Wang, Lili Pan, and Hongliang Li. Optimizing mode connectivity for class incremental learning. 2023.
  • [235] Yeming Wen, Dustin Tran, and Jimmy Ba. Batchensemble: an alternative approach to efficient ensemble and lifelong learning. arXiv preprint arXiv:2002.06715, 2020.
  • [236] David H Wolpert. Stacked generalization. Neural networks, 5(2):241–259, 1992.
  • [237] Mitchell Wortsman, Suchin Gururangan, Shen Li, Ali Farhadi, Ludwig Schmidt, Michael Rabbat, and Ari S Morcos. lo-fi: distributed fine-tuning without communication. arXiv preprint arXiv:2210.11948, 2022.
  • [238] Mitchell Wortsman, Maxwell C Horton, Carlos Guestrin, Ali Farhadi, and Mohammad Rastegari. Learning neural network subspaces. In International Conference on Machine Learning, pages 11217–11227. PMLR, 2021.
  • [239] Mitchell Wortsman, Gabriel Ilharco, Samir Ya Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning, pages 23965–23998. PMLR, 2022.
  • [240] Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, et al. Robust fine-tuning of zero-shot models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7959–7971, 2022.
  • [241] Ancong Wu, Wei-Shi Zheng, Xiaowei Guo, and Jian-Huang Lai. Distilled person re-identification: Towards a more scalable system. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [242] Guile Wu and Shaogang Gong. Peer collaborative learning for online knowledge distillation. In Proceedings of the AAAI Conference on artificial intelligence, volume 35, pages 10302–10310, 2021.
  • [243] Xi-Zhu Wu, Song Liu, and Zhi-Hua Zhou. Heterogeneous model reuse via optimizing multiparty multiclass margin. In International Conference on Machine Learning, pages 6840–6849. PMLR, 2019.
  • [244] Xi-Zhu Wu, Wenkai Xu, Song Liu, and Zhi-Hua Zhou. Model reuse with reduced kernel mean embedding specification. IEEE Transactions on Knowledge and Data Engineering, 35(1):699–710, 2021.
  • [245] Yang Yang, De-Chuan Zhan, Xiang-Yu Guo, and Yuan Jiang. Modal consistency based pre-trained multi-model reuse. In Proc. IJCAI, 2017.
  • [246] Junchi Yan, Minsu Cho, Hongyuan Zha, Xiaokang Yang, and Stephen M Chu. Multi-graph matching via affinity optimization with graduated consistency regularization. IEEE transactions on pattern analysis and machine intelligence, 38(6):1228–1242, 2015.
  • [247] Junchi Yan, Shuang Yang, and Edwin R Hancock. Learning for graph matching and related combinatorial optimization problems. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, pages 4988–4996. International Joint Conferences on Artificial Intelligence Organization, 2020.
  • [248] Guandao Yang, Tianyi Zhang, Polina Kirichenko, Junwen Bai, Andrew Gordon Wilson, and Chris De Sa. SWALP: Stochastic weight averaging in low precision training. In International Conference on Machine Learning, pages 7015–7024. PMLR, 2019.
  • [249] Yang Yang, De-Chuan Zhan, Ying Fan, Yuan Jiang, and Zhi-Hua Zhou. Deep learning for fixed model reuse. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.
  • [250] Kaixuan Yao, Feilong Cao, Yee Leung, and Jiye Liang. Deep neural network compression through interpretability-based filter pruning. Pattern Recognition, 119:108056, 2021.
  • [251] Kaichao You, Yong Liu, Ziyang Zhang, Jianmin Wang, Michael I Jordan, and Mingsheng Long. Ranking and tuning pre-trained models: a new paradigm for exploiting model hubs. The Journal of Machine Learning Research, 23(1):9400–9446, 2022.
  • [252] Shan You, Chang Xu, Chao Xu, and Dacheng Tao. Learning from multiple teacher networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1285–1294, 2017.
  • [253] EungGu Yun, Hyungi Lee, Giung Nam, and Juho Lee. Traversing between modes in function space for fast ensembling. arXiv preprint arXiv:2306.11304, 2023.
  • [254] Mikhail Yurochkin, Mayank Agarwal, Soumya Ghosh, Kristjan Greenewald, Nghia Hoang, and Yasaman Khazaeni. Bayesian nonparametric federated learning of neural networks. In International conference on machine learning, pages 7252–7261. PMLR, 2019.
  • [255] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12104–12113, 2022.
  • [256] Chris Zhang, Mengye Ren, and Raquel Urtasun. Graph hypernetworks for neural architecture search. arXiv preprint arXiv:1810.05749, 2018.
  • [257] Jie Zhang, Chen Chen, Bo Li, Lingjuan Lyu, Shuang Wu, Shouhong Ding, Chunhua Shen, and Chao Wu. Dense: Data-free one-shot federated learning. Advances in Neural Information Processing Systems, 35:21414–21428, 2022.
  • [258] Lin Zhang, Li Shen, Liang Ding, Dacheng Tao, and Ling-Yu Duan. Fine-tuning global model via data-free knowledge distillation for non-iid federated learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10174–10183, 2022.
  • [259] Michael Zhang, James Lucas, Jimmy Ba, and Geoffrey E Hinton. Lookahead optimizer: k steps forward, 1 step back. Advances in neural information processing systems, 32, 2019.
  • [260] Yi-Kai Zhang, Lu Ren, Chao Yi, Qi-Wei Wang, De-Chuan Zhan, and Han-Jia Ye. Zhijian: A unifying and rapidly deployable toolbox for pre-trained model reuse. arXiv preprint arXiv:2308.09158, 2023.
  • [261] Yu Zhang and Qiang Yang. An overview of multi-task learning. National Science Review, 5(1):30–43, 2018.
  • [262] Yuchen Zhang, Martin J Wainwright, and John C Duchi. Communication-efficient algorithms for statistical optimization. Advances in neural information processing systems, 25, 2012.
  • [263] Pu Zhao, Pin-Yu Chen, Payel Das, Karthikeyan Natesan Ramamurthy, and Xue Lin. Bridging mode connectivity in loss landscapes and adversarial robustness. arXiv preprint arXiv:2005.00060, 2020.
  • [264] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, and Xiaolei Wang. A survey of large language models, 2023.
  • [265] Yanlin Zhou, George Pu, Xiyao Ma, Xiaolin Li, and Dapeng Wu. Distilled one-shot federated learning. arXiv preprint arXiv:2009.07999, 2020.
  • [266] Zhi-Hua Zhou. Learnware: on the future of machine learning. Frontiers Comput. Sci., 10(4):589–590, 2016.
  • [267] Zhi-Hua Zhou and Ji Feng. Deep forest. National science review, 6(1):74–86, 2019.
  • [268] Zhuangdi Zhu, Junyuan Hong, and Jiayu Zhou. Data-free knowledge distillation for heterogeneous federated learning. In International conference on machine learning, pages 12878–12889. PMLR, 2021.
  • [269] Max Zimmer, Christoph Spiegel, and Sebastian Pokutta. Sparse model soups: A recipe for improved pruning via model averaging. arXiv preprint arXiv:2306.16788, 2023.