Zhilong Li^{1}, Xiaohu Wu^{1}, Xiaoli Tang^{2}, Tiantian He^{3,4}, Yew-Soon Ong^{2,3,4}

Mengmeng Chen^{1}, Qiqi Liu^{5}, Qicheng Lao^{1}, Han Yu^{4}^{1}Beijing University of Posts and Telecommunications, China ^{2}College of Computing and Data Science, Nanyang Technological University, Singapore^{3}Institute of High Performance Computing, Agency for Science, Technology and Research, Singapore^{4}Centre for Frontier AI Research, Agency for Science, Technology and Research, Singapore^{5}School of Engineering, Westlake University, China

{lzl,xiaohu.wu,chenmengmeng0306,qicheng.lao}@bupt.edu.cn

{xiaoli001,asysong,han.yu}@ntu.edu.sg, he_tiantian@cfar.a-star.edu.sg

liuqiqi@westlake.edu.cn

###### Abstract

There is growing research interest in measuring the statistical heterogeneity of clients’ local datasets. Such measurements are used to estimate the suitability for collaborative training of personalized federated learning (PFL) models. Currently, these research endeavors are taking place in silos and there is a lack of a unified benchmark to provide a fair and convenient comparison among various approaches in common settings. We aim to bridge this important gap in this paper. The proposed benchmarking framework currently includes six representative approaches. Extensive experiments have been conducted to compare these approaches under five standard non-IID FL settings, providing much needed insights into which approaches are advantageous under which settings. The proposed framework offers useful guidance on the suitability of various data divergence measures in FL systems. It is beneficial for keeping related research activities on the right track in terms of: (i) designing PFL schemes, (ii) selecting appropriate data heterogeneity evaluation approaches for specific FL application scenarios, and (iii) addressing fairness issues in collaborative model training. The code is available at https://github.com/Xiaoni-61/DH-Benchmark.

## 1 Introduction

Federated learning (FL) is a promising privacy-preserving collaborative machine learning (ML) paradigm [58, 23]. Under FL, multiple FL clients train a shared model with their own datasets and upload their local model updates to a FL server for aggregation and redistribution [41]. One key challenge of FL is that the data distributions across different clients can be heterogeneous (i.e., non-independently and identically distributed (non-IID)) [65, 29, 31].This has inspired the field of personalized federated learning (PFL) [49] to build more powerful ML models, better realizing the final goal of FL.

The non-IIDness of clients’ local data entails evaluating data complementarity or similarity between clients and building a personalized ML model for each client.A FL system itself is a collaborative network of clients where a client $i$ may be complemented by other clients $j$ with different weights, which measure the data heterogeneity. The basic way that FL works is to aggregate the clients’ local model updates according to their weights, e.g., the vanilla Federated Averaging framework [42]. In the context of PFL, we believe that it is still of fundamental importance to obtain a personalized model for each client by aggregating the clients’ local model updates according to the weights.Measuring statistical heterogeneity of data is one way to understand the data complementarity and potential collaboration advantages among clients [14]. Recently, there has been growing interest in measuring data heterogeneity. These techniques are either divergence-based measures (e.g., Jensen–Shannon (JS) divergence [12, 36], $\mathcal{C}$-divergence [3, 35], hash function-based distribution sketches [34]), or model-based approaches (e.g., Shapley Value [55, 32], Hypernetworks [44, 47], cosine similarity [59]).Although important, current research endeavors take place in silos and some are even not specifically designed for PFL. A comprehensive study is needed to simultaneously compare all the schemes in common settings of PFL.

To this end, we develop a comprehensive benchmark to study the various approaches mentioned above. The main contributions of this paper are as follows.

- 1.
We summarize the six techniques (JS divergence, $\mathcal{C}$-divergence, distribution sketch-based euclidean distance, Shapley Value, Hypernetworks, cosine similarity) into a unified framework to understand their application in FL settings. The unified framework clarifies the ways of quantifying the collaboration advantages among clients and the theoretical development of using collaboration advantages or data similarity for PFL.

- 2.
We evaluate all six approaches under five standard Non-IID settings summarized in [29] acrosseight widely-adopted benchmark datasets. We assess the performance of each approach in terms ofcomputation cost, communication overhead and scalability.The results provide insights into which approaches are advantageous under which settings.

- 3.
The unified framework and the experimental results identify scenarios where the current approaches perform relatively poorly, highlighting promising future research directions for collaborative PFL.

## 2 Preliminary

### 2.1 FL Notation

Consider a FL setting with $N$ clients.Each client $i\in\{1,2,\dots,N\}$ has a dataset $\hat{D}_{i}=\{(x_{k}^{(i)},\,y_{k}^{(i)})\}_{k=1}^{m_{i}}$ consisting of $m_{i}$ samples drawn from an underlying data distribution $\mathcal{D}_{i}$, where $x_{k}^{(i)}\in\mathcal{X}$ denotes the feature while $y_{k}^{(i)}\in\mathcal{Y}$ denotes the label. The total number of samples across all clients is $m=\sum_{i=1}^{N}{m_{i}}$. Let $\beta_{i}=m_{i}/m$ denote the proportion of data samples held by client $i$, and $\beta=[\beta_{1},\dots,\beta_{N}]$ represent the quantity distribution of data samples across clients. Given a machine learning model (hypothesis) $h$ and a risk function $\boldsymbol{\ell}$, the local expected risk of client $i$ is defined as $\mathcal{L}_{i}(h)=\mathbb{E}_{(x,y)\in\mathcal{D}_{i}}{\boldsymbol{\ell}(h(x)%,y)}$and its local empirical risk is given by $\hat{\mathcal{L}}_{i}(h)=\frac{1}{m_{i}}\sum\nolimits_{k=1}^{m_{i}}{%\boldsymbol{\ell}(h(x_{k}^{(i)}),y_{k}^{(i)})}$.The goal of each client $i$ is to find a model $h$ within the hypothesis space $\mathcal{H}$ that minimizes its local expected risk, denoted as $h_{i}^{\ast}=$ $\operatorname*{arg\,min}_{h\in\mathcal{H}}{\mathcal{L}_{i}(h)}$,based on its finite local dataset $\hat{D}_{i}$.

We denote $w_{i}^{t}$ as the model parameters of client $i$ at the beginning of the training round $t$. Let $\mathbf{\alpha}_{i}^{t}=(\alpha_{i,1}^{t},\dots\alpha_{i,N}^{t})$ denote the weighted collaboration vector of client $i$ in the training round $t$, where $\sum_{j=1}^{N}{\alpha_{i,j}^{t}}=1$ and $\alpha_{i,j}^{t}\geq 0$.Except the scheme in [34], the central server performs a weighted aggregation of the model parameters from all clients for each client $i$.As a result, the model parameters of $i$ become $w_{j}^{t+1}=\sum_{j=1}^{N}{\alpha_{i,j}^{t}w_{j}^{t}}$. The weight $\alpha_{i,j}^{t}$ can be viewed as a quantification of the collaboration advantage that client $j$ brings to client $i$ in the training round $t$. Let $\alpha^{t}$ be a $N\times N$ matrix whose $i$-th row is $\alpha_{i}^{t}$. $\alpha^{t}$ defines a directed benefit graph among the $N$ clients. In the schemes of [59, 55], the value of $\alpha_{i,j}^{t}$ changes in each round of training. In the schemes of [11, 12, 3], the value of $\alpha_{i,j}^{t}$ is independent of the round $t$ and is precomputed before the FL training process starts; thus, $\alpha_{i,j}^{t}$, $\alpha_{i}^{t}$ and $\alpha^{t}$ are also simply denoted as $\alpha_{i,j}$, $\alpha_{i}$ and $\alpha$. The way that the scheme in [34] works will be introduced in Section 3.1.3.

### 2.2 Typical Non-IID Data Settings

From a probability distribution perspective, the local data distribution $P(x_{i},y_{i})$ of client $i$ can be represented by a conditional probability $P(x_{i}|y_{i})P(y_{i})$ or $P(y_{i}|x_{i})P(x_{i})$. There are three typical non-IID data settings in the FL context [49, 29, 23]: (i) label distribution skew, where $P(y_{i})$ differs among clients; (ii) feature distribution skew, where $P(x_{i})$ differs among clients; and (iii) quantity skew, where $P(x_{i},y_{i})$ is the same for all clients $i$, but the amount of data varies across clients.

The label distribution skew can be categorized into two settings:

- •
Quantity-based label imbalance: Each client is randomly allocated $k$ different label IDs [41, 30, 15, 60]. For each label, the samples are randomly and equally divided among the clients assigned to that label. We denote this setting as $\#C=k$.

- •
Distribution-based label imbalance: A Dirichlet distribution $Dir_{N}(\epsilon)$ is used [61, 53, 52]. For each label $c$, a vector $q_{c}\in\mathbb{R}^{N}$ is drawn from the distribution $Dir_{N}(\epsilon)$ and client $i$ is allocated a $q_{c,i}$ proportion of the data samples for label $c$. We denote this setting as $p_{k}\sim Dir(\epsilon)$.

The feature distribution skew mainly has two settings:

- •
Noise-based feature imbalance: The entire dataset is randomly and equally divided among $N$ clients [63]. Given a noise level $\sigma$, noises $\boldsymbol{\hat{x}}\sim Gau(\sigma\cdot i/n)$ are added to the (local) dataset of each client $i$, where $Gau(\sigma\cdot i/n)$ is a Gaussian distribution with mean 0 and variance $\sigma\cdot i/n$. We denote this setting as $\mathbf{\hat{x}}\sim Gau(\sigma)$.

- •
Real-world feature imbalance: The entire dataset is randomly and evenly partitioned among different clients [6], with each client receiving data sharing the same real-world characteristics, such as being from the same writers.

In the quantity skew setting, a vector $q\in\mathbb{R}^{N}$ is drawn from the distribution $Dir_{N}(\epsilon)$, and each client $i$ is allocated a $q_{i}$ proportion of the total data samples of a dataset. We denote this setting as $q\sim Dir(\epsilon)$.

## 3 Data Heterogeneity Evaluation Benchmark

Existing methods for evaluating data heterogeneity can be roughly grouped into two main categories: those based on statistical divergence of data distributions and those based on model performance. In the following sections, we will detail these two categories of methods and their applications in PFL.

### 3.1 Distribution Statistical Divergence-based Approaches

Methods in this category are based on the divergence, a kind of statistical distance measure. It is a function that takes two probability distributions as input and returns a numerical value quantifying the difference between them [46, 14]. In FL, the local model updates are uploaded to the central server, which performs a weighted aggregation of these updates to produce either a global model or a personalized models for each client. The divergence of data distributions among clients can guide more informed decisions on the choice of the aggregation weights [12, 36, 3] or clustering clients with similar data distributions [3, 34, 35].

Some works in this category [12, 3] have some features in common.In these methods, the weighted empirical risk $\hat{\mathcal{L}}_{\boldsymbol{\alpha_{i}}}(h)$ for client $i$ is defined as $\hat{\mathcal{L}}_{\boldsymbol{\alpha_{i}}}(h)=\sum_{j=1}^{N}{\alpha_{i,j}\hat%{\mathcal{L}}_{j}(h)}$. The model of $i$ can be learned by minimizing $\hat{\mathcal{L}}_{\boldsymbol{\alpha_{i}}}(h)$. Let $\hat{h}_{\boldsymbol{\alpha_{i}}}=\operatorname*{arg\,min}_{h\in\mathcal{H}}{%\hat{\mathcal{L}}_{\boldsymbol{\alpha_{i}}}(h)}$ be the hypothesis that minimizes the weighted empirical risk $\hat{\mathcal{L}}_{\boldsymbol{\alpha_{i}}}(h)$.The following bound is used by both [12] and [3]:

$\displaystyle\mathcal{L}_{i}(\hat{h}_{\boldsymbol{\alpha_{i}}})-\mathcal{L}_{i%}(\hat{h}_{i}^{\ast})\leqslant Q_{\gamma,1}\sqrt{\sum\nolimits_{j=1}^{N}{%\alpha_{i,j}^{2}/m_{j}}}+Q_{\gamma,2}\sum\nolimits_{j=1}^{N}\alpha_{i,j}\cdot D%(\mathcal{D}_{i},\mathcal{D}_{j})$ | (1) |

where $Q_{\gamma,1}$ and $Q_{\gamma,2}$ are parameters related to the type of divergence used in [12, 3]. Eq. (1) shows that for client $i$ the gap between $\hat{h}_{\boldsymbol{\alpha_{i}}}$ and $\hat{h}^{\ast}$ depends on the weight $\alpha_{i}$, the number of samples $m_{i}$ and the divergence $D(\mathcal{D}_{i},\mathcal{D}_{j})$ between two data distributions. The optimal model for $i$ can be obtained by choosing appropriate decision variables to minimize the right-hand side of Eq. (1). These methods are based on two statistical divergences, Jensen–Shannon Divergence and $\mathcal{C}$-Divergence.

#### 3.1.1 Jensen–Shannon Divergence

The Jensen-Shannon (JS) divergence between two distributions $P$ and $Q$ is denoted as $\text{JSD}(P,Q)$ [7]. It is based on the Kullback-Leibler (KL) divergence, denoted as $\text{KLD}(P,Q)$. For two discrete probability distributions on the same space $\mathcal{Z}$, we have $\text{KLD}(P,Q)=\sum\nolimits_{z\in\mathcal{Z}}{P(z)\log{\frac{P(z)}{Q(z)}}}$ and $\text{JSD}(P,Q)=\frac{1}{2}\text{KLD}(P,M)+\frac{1}{2}\text{KLD}(Q,M)$ where $M=\frac{1}{2}(P+Q)$ denotes the mixture distribution of $P$ and $Q$. The JS divergence has desirable properties, such as being symmetric and always having a finite value.

Application. Ding and Wang [12] use the JS divergence to estimate the Integral Probability Metrics (IPM), which are another type of statistical distance with desirable theoretical properties for analytical tractability. The IPM between two distributions $\mathcal{D}_{i}$ and $\mathcal{D}_{j}$ are denoted as $d_{\mathcal{H}}(\mathcal{D}_{i},\mathcal{D}_{j})$, which is used by Ding and Wang [12] to instantiate the divergence term $D(\mathcal{D}_{i},\mathcal{D}_{j})$ on the right-hand side of Eq. (1). Furthermore, Ding and Wang [12] provides an optimal solution $\alpha_{i}$ to the corresponding optimal problem. The resulting PFL scheme is called pFedJS.

#### 3.1.2 $\mathcal{C}$-Divergence

$\mathcal{C}$-divergence is defined as [43, 54, 4]: $D_{c}(\mathcal{D}_{i},\mathcal{D}_{j})=\max_{h\in\mathcal{H}}|\mathcal{L}_{i}(%h)-\mathcal{L}_{j}(h)|$. In the hypothesis space $\mathcal{H}$, $D_{c}(\mathcal{D}_{i},\mathcal{D}_{j})$ indicates the maximum difference between the local expected risks of two data distributions $\mathcal{D}_{i}$ and $\mathcal{D}_{j}$. Suppose $\ell$ is the 0-1 loss function. The definition of $\mathcal{C}$-divergence guarantees that its divergence value will be as close to one as possible if two distributions are distinctly different. Thus, it can be used to cluster clients into groups with similar data distributions.

Application. In the FL context, $\mathcal{C}$-divergence has been used to partition all clients into $K$ coalitions $\{\mathcal{C}_{1},\cdots,\mathcal{C}_{k}\}$, each with similar data distributions [3, 35]. The function $\ell(m(x),y)$ is also denoted as $f(x,y)$, mapping from $\mathcal{X}\times\mathcal{Y}$ to $\{0,1\}$. Bao etal. [3] transform the $\mathcal{C}$-divergence into the following form: $D_{i,j}=\max\nolimits_{f\in\mathcal{F}}\left|\Pr\nolimits_{(x,y)\in D_{i}}[f(x%,y)=1]+\Pr\nolimits_{(x,y)\in D_{j}}[f(x,y)=0]-1\right|$.A client classifier $f\in\mathcal{F}$ can be trained to predict the distribution similarity. If the estimated distance $D_{i,j}$ is approximately $100\%$, then the two distributions being compared are significantly different.Unlike in [12], $\alpha_{i}$ is a dependent variable here, and the weight $\alpha_{i,j}$ of $j$ to $i$ is predefined as the proportion of the data quantity of $j$ to the coalition’s data quantity. Specifically, for any $i\in\mathcal{C}_{k}$, $\alpha_{i,j}=\beta_{j}/\sum_{l\in\mathcal{C}_{k}}{\beta_{l}}$ if $j\in\mathcal{C}_{k}$ and $\alpha_{i,j}=0$ otherwise. Here, $D_{c}(\mathcal{D}_{i},\mathcal{D}_{j})$ is used to instantiate the divergence term $D(\mathcal{D}_{i},\mathcal{D}_{j})$ on the right-hand side of Eq. (1). Bao etal. [3] provides an efficient optimizer to optimize the coalition structure. Once $\{\mathcal{C}_{1},\cdots,\mathcal{C}_{k}\}$ is determined, $\{\mathbf{\alpha}_{i}\}_{i=1}^{N}$ will be determined accordingly.The corresponding PFL scheme is called FedCollab.

#### 3.1.3 Distribution Sketch-based Euclidean Distance

Let $\mathbf{z_{i}}=(z_{i,1},\dots,z_{i,d})$, where $i\in\{1,2\}$. The Euclidean distance of two vectors is defined as $\sqrt{\sum_{l=1}^{d}{(z_{1,l}-z_{2,l})^{2}}}$. On the other hand, efficient computation of statistical divergences of high-dimensional data is challenging. To address this, Liu etal. [34] proposes a one-pass distribution sketch algorithm to represent the client data distributions. Then, the divergence is measured by the Euclidean distance of such distribution sketches. Central to their algorithm is Locality Sensitive Hashing (LSH), which uses hash functions to map data samples from high-dimensional space to low-dimensional buckets, ensuring that samples close to each other have a high probability of having the same hash value. Building on this, the Repeated Array of Count Estimators (RACE) is adopted [9, 10, 37, 33]. Let $z=(x,y)$ denote a data sample. RACE uses $R$ LSH functions to construct a matrix $\boldsymbol{A}\in\mathbb{R}^{R\times B}$. Each LSH function $g_{i}$ maps a data vector into $B$ hash bins. For a given dataset $\hat{\mathcal{D}}$, any data sample $z\in\hat{\mathcal{D}}$ is mapped to $R$ different hash values, denoted as $\{g_{i}(z)\mid i\in[1,R]\}$. The corresponding positions in matrix $A$ are then incremented by 1. In this way, RACE can characterize the data distribution of $\hat{\mathcal{D}}_{i}$ for each client $i$ using the corresponding $A$, referred to as the client data sketch $CS_{i}$.

Application. Liu etal. [34] show that an average of all client sketches can effectively approximate the sketch $GS$ of the global data distribution across clients. The central server calculates the Euclidean distance between $GS$ and $CS_{i}$ before the FL training process starts. The inverses of these distances serve as the probabilities for selecting $K$ clients, denoted as $\mathcal{C}$ for local training in round $t$. Specifically, clients with $CS_{i}$ closer to $GS$ are prioritized for collaboration. The local updates from $\mathcal{C}$ are aggregated by the central server, resulting in a global model $w^{t+1}=\sum_{j\in\mathcal{C}}{w_{j}^{t}}$. Finally, each client builds a personalized model based on the global model and its local data. We refer to the PFL approach in [34] as RACE.

### 3.2 Model-based Approaches

#### 3.2.1 Shapley Value

Shapley Value (SV) is a classic approach to quantifying individual contributions within a group under the Cooperative Game Theory [48].Let $\mathcal{N}=\{1,\dots,N\}$. In machine learning, the data Shapley value of client $i$ is defined as $\varphi_{i}(\mathcal{N},V)=\sum\nolimits_{\mathcal{S}\subseteq\mathcal{N}-\{i%\}}{\frac{V(\mathcal{S}\cup\{i\})-V(\mathcal{S})}{\binom{N-1}{|\mathcal{S}|}}}$, where $V(\cdot)$ denotes the utility evaluation function [16, 32, 21]. In FL, the utility evaluation function $V(\mathcal{S})$ is based on the model performance achieved by the participation of $\mathcal{S}$. Let $M_{\mathcal{S}}$ denote the global FL model trained with $\mathcal{S}$ on a separate test dataset $\mathcal{D}_{\mathcal{S}}$. Then, $V(\mathcal{S})=V(M_{\mathcal{S}})=V\left(\mathcal{A}(M^{(0)},\mathcal{D}_{%\mathcal{S}})\right)$, where $\mathcal{A}$ is the learning algorithm and $M^{(0)}$ denotes the initial model [32].

Application. [55] applies the SV technique for PFL. For each client $i$, there is an $N$-dimensional relevance vector $\phi^{i,t}=[\phi_{1}^{i,t},\dots,\phi_{N}^{i,t}]$, where $\phi_{j}^{i,t}$ denotes the relevance scoreof client $j$ to $i$ in round $t$. In each round $t$, the models of clients with the top-$K$ relevance score are downloaded, forming a coalition $\mathcal{S}_{i,k}^{t}$ for client $i$; then, the SV $\varphi_{j}^{t}$ of each client $j\in\mathcal{S}_{i,k}^{t}$ is calculated and used to update the corresponding elements of $\phi^{i,t}$: $\phi_{j}^{i,t+1}=\eta\phi_{j}^{i,t}+(1-\eta)\varphi_{j}^{t}$. For client $i$, the weighted collaboration vector $\alpha_{i}$ is also set according to the SVs: $\alpha_{i,j}^{t}=\max\{\varphi_{j}^{t},\,0\}/\lVert w_{i}^{t}-w_{j}^{t}\rVert$ if $j\in\mathcal{S}_{i,k}^{t}$ and $\alpha_{i,j}^{t}=0$ otherwise, where $\lVert\cdot\rVert$ denotes the Euclidean distance. The PFL scheme in [55] is called pFedSV.

#### 3.2.2 Cosine Similarity

Let $\mathbf{z_{i}}=(z_{i,1},\dots,z_{i,d})$, where $i\in\{1,2\}$. The cosine similarity of two non-zero vectors is defined as $cos(\mathbf{z_{1}},\mathbf{z_{2}})=\frac{\mathbf{z_{1}}\cdot\mathbf{z_{2}}}{%\lVert\mathbf{z_{1}}\rVert\lVert\mathbf{z_{2}}\rVert}$, where $\mathbf{z_{1}}\cdot\mathbf{z_{2}}$ is the inner product and $\lVert\cdot\rVert$ denotes the Euclidean norm. Intuitively, the closer the two vectors are, the smaller the angle between them, and thus the cosine value will be closer to one. Let us consider the personalized models $\mathbf{h_{i}}$ and $\mathbf{h_{j}}$ of clients $i$ and $j$. Their cosine similarity is $cos(\mathbf{h_{i}},\mathbf{h_{j}})$.

Application. Intuitively, when the data distribution of two clients $i$ and $j$ are more similar, their models are more similar, and their collaboration strength (i.e., $\alpha_{i,j}$) should be larger [30, 38].The similarity between model parameters is used to guide the collaboration strength. In each training round $t$, Ye etal. [59] proposes minimizing the following function in a privacy-preserving manner while adhering to the standard FL training process: $\sum\nolimits_{i=1}^{N}{{\beta_{i}\left(\mathcal{L}_{i}\left(\sum\nolimits_{j=%1}^{N}{\alpha_{i,j}^{t}w_{j}^{t}}\right)-\frac{\lambda}{2}\sum\nolimits_{j=1}^%{N}{\alpha_{i,j}^{t}cos(w_{i}^{t},w_{j}^{t})}\right)}}$, where the decision variables are $\{w_{i}^{t}\}_{i=1}^{N}$ and $\alpha^{t}$, and $\lambda$ is a hyperparameter to balance the individual utilities and the collaboration necessity.

#### 3.2.3 Hypernetworks

Each client $i$ has a risk/loss function $\ell_{i}$: $\mathbb{R}^{N}\rightarrow\mathbb{R}_{+}$. Given a learned hypothesis $h\in$ $\mathcal{H}$, let the loss vector $\mathbf{\ell}(h)=[\ell_{1},\dots,\ell_{N}]$ represent the utility loss of the $N$ clients under the hypothesis $h$. The hypothesis $h$ is considered a Pareto solution if there is no other hypothesis $h^{\prime}$ that dominates $h$, i.e., $\nexists h^{\prime}\in\mathcal{H}$, s. t. $\forall i:\ell_{i}(h^{\prime})\leqslant\ell_{i}(h)$ and $\exists j:\ell_{j}(h^{\prime})<\ell_{j}(h)$. Let $r=(r_{1},\dots,r_{N})$ $\in\mathbb{R}^{N}$ denote a preference vector where $\sum_{k=1}^{N}{r_{k}}=1$ and $r_{k}\geq 0,\forall k\in\{1,\dots,N\}$. The hypernetwork $HN$ takes $r$ as input and outputs a Pareto solution $h$, i.e., $h\leftarrow HN(\phi,r)$, where $\phi$ denotes the parameters of the hypernetwork [44].

Application. For each client $i$, linear scalarization can be used. Cui etal. [11] determine an optimal preference vector $r_{i}^{\ast}$ to generate the hypothesis $h_{i}^{\ast}$ that minimizes the loss with the data $\hat{\mathcal{D}}_{i}$. This is expressed as $h_{i}^{\ast}=HN(\phi,r_{i}^{\ast})$, where $r_{i}^{\ast}=\operatorname*{arg\,min}\nolimits_{r}{\hat{\mathcal{L}}_{i}(HN(%\phi,r))}$.For each client $i$, the value of $r_{i}^{\ast}$ is assigned to $\alpha_{i}$. In [11], collaboration equilibrium (CE) is sought with additional considerations. We refer to the PFL scheme here as CE due to its origin in [11].

Category | Dataset | Partitioning | pFedGraph | pFedSV | pFedJS | FedCollab | RACE | CE |

Label distribution skew | MNIST | $p_{k}\sim Dir(0.5)$ | 98.90%$\pm$0.10% | 91.18%$\pm$3.84% | 98.54%$\pm$0.40% | 79.66%$\pm$3.59% | 98.90%$\pm$0.27% | 98.82%$\pm$0.15% |

#C=1 | 10.65%$\pm$0.70% | 10.24%$\pm$0.08% | 10.09%$\pm$0.00% | 10.00%$\pm$0.00% | 10.09%$\pm$0.00% | 10.30%$\pm$0.02% | ||

#C=2 | 95.60%$\pm$0.43% | 35.52%$\pm$6.90% | 10.17%$\pm$0.15% | 19.93%$\pm$0.47% | 10.17%$\pm$0.15% | 84.65%$\pm$7.36% | ||

#C=3 | 97.34%$\pm$0.81% | 52.95%$\pm$16.93% | 9.84%$\pm$0.92% | 27.96%$\pm$1.67% | 12.73%$\pm$6.26% | 93.52%$\pm$2.34% | ||

FMNIST | $p_{k}\sim Dir(0.5)$ | 87.54%$\pm$0.44% | 80.74%$\pm$2.25% | 85.34%$\pm$1.53% | 69.10%$\pm$2.01% | 85.68%$\pm$1.51% | 86.93%$\pm$0.83% | |

#C=1 | 10.67%$\pm$1.33% | 10.05%$\pm$0.09% | 10.00%$\pm$0.00% | 10.00%$\pm$0.00% | 10.00%$\pm$0.00% | 10.00%$\pm$0.00% | ||

#C=2 | 71.37%$\pm$8.10% | 26.94%$\pm$1.29% | 10.00%$\pm$0.00% | 18.81%$\pm$1.19% | 10.00%$\pm$0.00% | 58.44%$\pm$6.91% | ||

#C=3 | 80.96%$\pm$2.21% | 34.44%$\pm$5.55% | 10.00%$\pm$0.00% | 26.56%$\pm$1.03% | 25.45%$\pm$30.90% | 74.11%$\pm$1.79% | ||

CIFAR-10 | $p_{k}\sim Dir(0.5)$ | 64.69%$\pm$0.92% | 44.87%$\pm$1.48% | 45.80%$\pm$1.68% | 36.10%$\pm$0.51% | 49.32%$\pm$3.52% | 62.65%$\pm$3.00% | |

#C=1 | 10.00%$\pm$0.00% | 10.08%$\pm$0.16% | 10.00%$\pm$0.00% | 10.00%$\pm$0.00% | 10.00%$\pm$0.00% | 10.00%$\pm$0.00% | ||

#C=2 | 39.77%$\pm$7.50% | 17.53%$\pm$6.53% | 10.00%$\pm$0.00% | 17.20%$\pm$0.98% | 10.00%$\pm$0.00% | 42.87%$\pm$6.35% | ||

#C=3 | 56.00%$\pm$0.03% | 26.33%$\pm$4.99% | 10.00%$\pm$0.00% | 23.78%$\pm$0.19% | 10.00%$\pm$0.00% | 49.27%$\pm$1.03% | ||

SVHN | $p_{k}\sim Dir(0.5)$ | 86.59%$\pm$0.39% | 63.23%$\pm$6.19% | 71.11%$\pm$7.10% | 54.93%$\pm$3.96% | 74.85%$\pm$5.48% | 87.64%$\pm$2.33% | |

#C=1 | 18.37%$\pm$2.43% | 11.29%$\pm$0.00% | 6.13%$\pm$0.00% | 10.00%$\pm$0.00% | 6.13%$\pm$0.00% | 9.03%$\pm$1.55% | ||

#C=2 | 69.35%$\pm$6.51% | 27.63%$\pm$5.98% | 15.53%$\pm$4.46% | 18.59%$\pm$3.01% | 15.53%$\pm$4.46% | 52.22%$\pm$12.74% | ||

#C=3 | 70.29%$\pm$2.32% | 51.56%$\pm$4.39% | 10.95%$\pm$4.99% | 23.11%$\pm$1.88% | 10.95%$\pm$4.99% | 70.47%$\pm$1.77% | ||

adult | $p_{k}\sim Dir(0.5)$ | 80.83%$\pm$0.83% | 78.97%$\pm$3.57% | 76.40%$\pm$0.05% | 72.89%$\pm$5.35% | 77.57%$\pm$2.29% | 74.86%$\pm$5.15% | |

#C=1 | 67.58%$\pm$17.58% | 39.45%$\pm$0.00% | 23.62%$\pm$0.00% | 50.00%$\pm$0.00% | 23.62%$\pm$0.00% | 50.00%$\pm$0.00% | ||

rcv1 | $p_{k}\sim Dir(0.5)$ | 49.66%$\pm$0.82% | 61.02%$\pm$10.36% | 49.35%$\pm$1.81% | 62.00%$\pm$10.27% | 49.59%$\pm$2.06% | 79.55%$\pm$5.58% | |

#C=1 | 51.16%$\pm$0.00% | 50.28%$\pm$0.05% | 51.16%$\pm$0.00% | 50.00%$\pm$0.00% | 51.16%$\pm$0.00% | 50.00%$\pm$0.00% | ||

covtype | $p_{k}\sim Dir(0.5)$ | 50.72%$\pm$0.48% | 50.96%$\pm$0.24% | 50.00%$\pm$1.20% | 50.24%$\pm$0.00% | 51.16%$\pm$0.00% | 50.58%$\pm$0.58% | |

#C=1 | 50.00%$\pm$0.00% | 49.28%$\pm$0.48% | 48.80%$\pm$0.00% | 50.00%$\pm$0.00% | 48.80%$\pm$0.00% | 50.00%$\pm$0.01% | ||

Number of times that performs the best | 16 | 1 | 1 | 1 | 1 | 5 | ||

Feature distribution skew | MNIST | $\mathbf{\hat{x}}\sim Gau(0.1)$ | 98.78%$\pm$0.22% | 98.87%$\pm$0.25% | 99.02%$\pm$0.10% | 98.61%$\pm$0.06% | 99.11%$\pm$0.01% | 98.98%$\pm$0.06% |

FMNIST | 88.50%$\pm$0.01% | 84.13%$\pm$0.79% | 88.61%$\pm$0.00% | 87.37%$\pm$0.40% | 87.77%$\pm$0.01% | 88.43%$\pm$0.41% | ||

CIFAR-10 | 69.18%$\pm$0.80% | 62.24%$\pm$0.06% | 70.09%$\pm$0.54% | 65.82%$\pm$3.41% | 70.96%$\pm$0.17% | 67.44%$\pm$1.93% | ||

SVHN | 87.18%$\pm$2.71% | 85.65%$\pm$0.09% | 87.34%$\pm$0.57% | 86.29%$\pm$0.00% | 86.57%$\pm$0.28% | 87.61%$\pm$0.03% | ||

FEMNIST | real-world | 99.27%$\pm$0.04% | 97.80%$\pm$1.15% | 99.23%$\pm$0.00% | 98.93%$\pm$0.20% | 99.28%$\pm$0.01% | 99.26%$\pm$0.01% | |

Number of times that performs the best | 0 | 0 | 1 | 0 | 3 | 1 | ||

Quantity skew | MNIST | $q\sim Dir(0.5)$ | 99.04%$\pm$0.06% | 98.29%$\pm$0.67% | 99.12%$\pm$0.12% | 96.61%$\pm$3.43% | 99.14%$\pm$0.08% | 98.99%$\pm$0.04% |

FMNIST | 88.85%$\pm$0.29% | 84.78%$\pm$3.92% | 88.31%$\pm$0.30% | 87.10%$\pm$1.12% | 87.90%$\pm$0.28% | 88.48%$\pm$0.20% | ||

CIFAR-10 | 72.19%$\pm$0.03% | 61.68%$\pm$0.98% | 71.70%$\pm$0.40% | 62.51%$\pm$2.91% | 71.86%$\pm$0.35% | 70.12%$\pm$0.70% | ||

SVHN | 88.16%$\pm$0.17% | 76.63%$\pm$1.52% | 85.05%$\pm$0.72% | 85.14%$\pm$0.44% | 84.54%$\pm$0.26% | 88.52%$\pm$0.39% | ||

adult | 82.34%$\pm$0.04% | 82.24%$\pm$1.26% | 82.23%$\pm$0.08% | 82.15%$\pm$0.18% | 82.25%$\pm$0.29% | 82.07%$\pm$0.10% | ||

rcv1 | 51.16%$\pm$0.00% | 51.40%$\pm$0.25% | 96.31%$\pm$0.17% | 95.44%$\pm$0.31% | 95.73%$\pm$0.57% | 96.25%$\pm$0.22% | ||

covtype | 51.20%$\pm$0.00% | 51.20%$\pm$0.00% | 51.20%$\pm$0.10% | 51.20%$\pm$0.00% | 51.20%$\pm$0.00% | 51.20%$\pm$0.05% | ||

Number of times that performs the best | 4 | 1 | 2 | 1 | 2 | 1 | ||

Homogeneous partition | MNIST | IID | 98.96%$\pm$0.04% | 98.63%$\pm$0.07% | 99.06%$\pm$0.12% | 98.60%$\pm$0.00% | 99.13%$\pm$0.00% | 98.97%$\pm$0.04% |

FMNIST | 88.91%$\pm$0.03% | 82.36%$\pm$0.71% | 87.77%$\pm$0.07% | 87.30%$\pm$0.41% | 88.16%$\pm$0.04% | 88.36%$\pm$0.42% | ||

CIFAR-10 | 68.10%$\pm$0.44% | 62.18%$\pm$0.57% | 69.59%$\pm$0.57% | 63.28%$\pm$0.88% | 70.39%$\pm$0.17% | 66.54%$\pm$1.81% | ||

FEMNIST | 99.33%$\pm$0.01% | 97.64%$\pm$1.36% | 99.24%$\pm$0.03% | 99.12%$\pm$0.06% | 99.28%$\pm$0.02% | 99.25%$\pm$0.00% | ||

SVHN | 88.41%$\pm$0.14% | 85.71%$\pm$0.04% | 86.07%$\pm$0.35% | 87.25%$\pm$1.24% | 86.82%$\pm$0.23% | 87.84%$\pm$0.35% | ||

adult | 82.22%$\pm$0.29% | 81.37%$\pm$0.55% | 81.91%$\pm$1.09% | 82.53%$\pm$0.47% | 81.76%$\pm$0.39% | 83.87%$\pm$2.04% | ||

rcv1 | 51.40%$\pm$0.25% | 95.28%$\pm$0.21% | 96.15%$\pm$0.38% | 51.40%$\pm$0.25% | 95.88%$\pm$0.32% | 81.11%$\pm$29.95% | ||

covtype | 51.23%$\pm$0.06% | 51.20%$\pm$0.01% | 51.23%$\pm$0.06% | 51.25%$\pm$0.05% | 49.95%$\pm$1.25% | 51.23%$\pm$0.06% | ||

Number of times that performs the best | 4 | 0 | 2 | 0 | 2 | 2 |

## 4 Experimental Studies

To investigate the effectiveness of existing FL schemes in studying the statistical heterogeneity of data, we conducted extensive experiments on eight public datasets. These included five image datasets (i.e., MNIST [27], CIFAR-10 [26], FMNIST [57], SVHN [45], FEMNIST [6]), and three tabular datasets (i.e., adult [25], rcv1 [28], and covtype [5]). For specific experimental procedures, we followed the protocol outlined in [29]. By default, the framework involves ten clients, each performing training over 10 epochs per communication round with a batch size of 64, across a total of 50 communication rounds.We use top-1 accuracy to indicate the differences in test case accuracy of various schemes under different non-IID conditions. Unless otherwise specified, the value in the tables represent the average test case results across all clients using their personalized models. For more details on the experimental setup and datasets, please refer to Appendix A. The main insights of this paper are highlighted in italic text.

### 4.1 Analysis of the Accuracy obtained by Collaboration

The accuracy of the six methods under the five standard non-IID settings and the IID data setting is shown in Table 1. Non-IID data can degrade the FL performance to some extent. The results under the IID setting is used as a baseline to evaluate the performance of the six methods under non-IID settings.

Performance Comparison.Figure 1 summarizes the results in Table 1, with the blue ovals at the bottom illustrating which schemes are advantageous under each setting. There is no universally best scheme; the choice depends on the specific setting.In the label distribution skew setting, pFedGraph significantly outperforms other schemes on image data, while CE performs best on average for tabular data. For feature distribution skew, RACE usually achieves the highest accuracy. In the quantity skew setting, pFedGraph continues to excel on image data, while pFedJS achieves the highest accuracy on tabular data. Our observation is as follows: as illustrated by the bottom blue ovals of Figure 1, each non-IID setting has a corresponding technique that maximizes the collaboration advantages and personalized model accuracy.

The pairwise collaboration advantages of clients can be measured by the matrix $\alpha^{t}$ or $\alpha$, which defines a benefit graph of clients. In the line of research [11, 50, 56, 8], some desired properties (i.e., additional constraints) need to be maintained in the collaboration relationships of clients to address the issues with fairness, collaboration and competition [39, 40, 20]; then, the final collaboration relationships of clients will form a subgraph of the benefit graph. Currently, only the hypernetwork technique is used to estimate $\alpha^{t}$, without considering whether other techniques might be more advantageous.As identified by this paper, the matrix $\alpha^{t}$ can in fact be estimated by the pFedJS, FedCollab, pFedSV, pFedGraph, and hypernetwork techniques, where hypernetwork is used in the CE scheme. In the case of [11], the matrix $\alpha^{t}$ needs to be computed and the collaboration relationships of clients are determined before the FL training process starts; once these relationships are established, they remain fixed during the entire training process; thus, a fixed matrix $\alpha$ is required, and only the pFedJS, FedCollab, and Hypernetwork techniques can be applied. In the case of [50], both the matrix $\alpha^{t}$ and the collaboration relationships of clients can be updated in each round of FL training, as long as it does not create conflicts of interest among clients; thus, all five techniques mentioned above can be used to compute the matrix $\alpha^{t}$. Regarding the choice of proper techniques for generating the benefit graph, the better technique is the one achieving higher overall accuracy. From Table 1, we make the following observations: (i) In the case where the benefit graph (i.e., the matrix $\alpha^{t}$) can be updated in each round of FL training, it is desirable that under each non-IID setting, the matrix $\alpha^{t}$ is generated by the corresponding technique illustrated by the blue ovals at the bottom of Figure 1. (ii) In the case where the benefit graph (i.e., the matrix $\alpha$) needs to be fixed and computed before the FL training process starts, it is desirable that under each non-IID setting, the matrix $\alpha$ is generated by the corresponding technique illustrated by the blue ovals at the top of Figure 1.

Effect of different non-IID settings. It can be observed from Table 1 that: 1) for the feature distribution and quantity skew settings, existing schemes achieve accuracy close to the IID setting; 2) among all non-IID settings, the label distribution skew is the most challenging. Notably, in the quantity-based label imbalance setting, as the number of classes $C$ decreases from three to one, the degree of data heterogeneity increases, leading to a decrease in accuracy. In the future, there is still room to improve existing schemes for the label distribution skew setting.

Effect of different tasks.CIFAR-10 is the most complex dataset, and its accuracy varies significantly across different schemes and non-IID settings. For other, simpler image datasets, the accuracy differences are relatively small across all cases. Tabular data also presents significant challenges, with accuracy consistently low across all six schemes (e.g., $\leqslant 52\%$ for covtype in all cases). These observations suggest that more challenging tasks like CIFAR-10 and tabular datasets should be included in future benchmarks to better study the complementarity among clients.

Finally, as generalized in Eq. (1), the difference between the weighted empirical risk and the optimal expected risk can be well bounded when two types of divergences are applied, respectively. An interesting direction for future research is to explore whether such a bound exists for other divergence measures, such as the earth mover distance [1] and the Fréchet distance [19], which are mentioned in [34].

Computation Time | Communication Size | |||||||
---|---|---|---|---|---|---|---|---|

MNIST | CIFAR-10 | adult | rcv1 | MNIST | CIFAR-10 | adult | rcv1 | |

pFedJS | 23.69s | 124.1s | 1.591s | 30.87s | 170.1MB | 240.1MB | 20.1MB | 5770.1MB |

pFedGraph | 9.410s | 6.340s | 4.870s | 27.02s | 170MB | 240MB | 20MB | 5770MB |

pFedSV | 1258s | 5444s | 149.1s | 244.5s | 850MB | 1200MB | 100MB | 28850MB |

FedCollab | 305.9s | 1312s | 48.88s | 63.69s | 215.9MB | 304.8MB | 25.4MB | 7327.9MB |

RACE | 4.169s | 27.26s | 2.003s | 3.017s | 105.9MB | 145.2MB | 13.3MB | 3462.6MB |

CE | 391.8s | 1740s | 142.4s | 166.5s | 33330.1MB | 47020.1MB | 3490.1MB | 925820.1MB |

### 4.2 Efficiency

Schemes except for [34] compute a matrix $\alpha$ only once before the FL training process starts [11, 12, 3] or a matrix $\alpha^{t}$ in each round of FL training [59, 55] to quantify the collaboration advantages of clients. [34] performs a one-time computation of the distance/divergence between the client and global data distribution sketches before the FL training process starts.We evaluated (i) the time required to compute the matrix ($\alpha^{t}$ or $\alpha$) or the divergence in each scheme under the noise-based feature imbalance setting, and (ii) the communication overhead with the server (i.e., the amount of data communicated with the server over 50 communication rounds). Table 2 shows the experimental results.

The computational overhead of pFedSV is the highest across all datasets, followed closely by CE and FedCollab. The high computational overhead of these schemes is due to their use of neural networks to compute $\alpha$. In pFedJS, the JS divergence is computed in the joint feature and label space, whose dimension can be high; we note that the implementation in [12] considers only the label distribution skew case, where only the label space is concerned. In pFedGraph, the optimization of personalized models is performed locally. RACE and pFedJS execute simple binary operations, resulting in significantly faster computation.For the CE scheme, all clients train a Hypernetwork, resulting in a large amount of data exchanged with the server during the preparation of $\alpha$. The communication overhead for pFedGraph, pFedJS, FedCollab and RACE is similar, as they typically require uploading the local model and downloading the personalized model during training. However, pFedSV has a relatively higher communication overhead because it needs to compute the Shapley Value locally using other clients’ models.

Recently, there has been growing interest in measuring collaboration advantages to establish collaboration relationships among clients in the Internet of Things (IoT) field [64, 62, 66, 36, 2]. In these works, the JS divergence is commonly used. As pointed out in [34], divergence estimation can be difficult in high-dimensional spaces and require expensive computation [22, 1].In IoTsscenarios, clients are typically resource-constrained devices with limited hardware resources (energy, bandwidth, computation). With these considerations, Liu etal. [34] propose a one-time distribution sketch technique to estimate data distribution divergence using lightweight computation. However, in their scheme, [34] don’t need to generate the matrix $\alpha$ or $\alpha^{t}$ to quantify the collaboration advantages of clients. As shown in Table 2, using distribution sketches to measure divergence is much faster than using JS divergence. On the other hand, FL is often studied from a game theory perspective [51, 17, 24], where each client in the collaborative FL network aims to maximize its utility by collaborating with others [13]. Thus, the collaboration advantages (i.e., $\alpha$ or $\alpha^{t}$) of clients need to be well understood and quantified in a lightweight way for the IoTsapplications. For example, Lu etal. [36] consider applying FL to IoTs and assume that the collaboration advantage (i.e., the utility) is linearly proportional to the JS divergence, which may be a simplistic assumption. Thus, to make the existing technical development in literature more practical, it may be promising to study how to use the lightweight one-time distribution sketch technique to generate the matrix $\alpha$ or $\alpha^{t}$, which lays a practical foundation for the FL collaboration establishment in the IoTs field.

### 4.3 Other Experiments

Scalability. From Figure 2, it can be observed that the accuracy of schemes may vary, and in many cases, their accuracy decreases as the number of clients increases. Among the six schemes, no scheme dominates the other schemes across all cases. The sensitivity of these schemes’ performance to the number of clients implies that scalability should be an important aspect to consider when evaluating a FL scheme in the related research.

pFedGraph | pFedSV | pFedJS | FedCollab | RACE | CE | |
---|---|---|---|---|---|---|

Feature and quantity skew | 31.38% | 21.67% | 27.78% | 19.21% | 10.00% | 28.26% |

Label and feature skew | 10.51% | 20.73% | 10.00% | 11.89% | 10.00% | 10.00% |

Mixed Types of Data Skew. In reality, different types of skew may coexist. Like [29], we consider two settings: mixed label and feature skew, and mixed feature and quantity skew; please refer to Appendix B for details. The experimental results are presented in Table 3.When facing mixed types of skew, all these schemes encounter significant challenges. For example, in the mixed feature and quantity skew setting, the performance of all schemes on CIFAR-10 is poor, with accuracy not exceeding 32%, much lower than the case when there is only one type of skew, as illustrated in Table 1.It is crucial to find effective solutions for the mixed types of skew case since this case is also common in real-world scenarios.

The effects of batch size, communication rounds, and local epochs on these PFL schemes have also been evaluated through experiments. Please refer to Appendix C for detailed results and analysis.

## 5 Concluding Remarks

In this paper, we summarize the six major techniques (JS divergence, $\mathcal{C}$-divergence, distribution sketch, Shapley Value, Hypernetworks, cosine similarity) to quantify data heterogeneity in FL settings into a first-of-its-kind unified framework to understand their effects in-depth. Extensive experiments over eight popular datasets have been conducted to compare these schemes under five standard non-IID FL settings, providing much-needed insight into which schemes are advantageous under which settings. The unified framework and the experimental results identify the scenarios, under which the current schemes perform relatively poorly, and future research problems for PFL. It is useful for identifying the right techniques for quantifying the collaboration advantages among clients, guiding the related research on collaboration, fairness, and competition in FL settings. The findings suggest that lightweight FL schemes based on techniques such as the distribution sketch are worth studying in the future.

## References

- Andoni etal. [2008]Alexandr Andoni, Piotr Indyk, and Robert Krauthgamer.Earth mover distance over high-dimensional spaces.In
*Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms*, SODA’08, page 343–352, USA, 2008. Society for Industrial and Applied Mathematics. - Arisdakessian etal. [2023]Sarhad Arisdakessian, OmarAbdel Wahab, Azzam Mourad, and Hadi Otrok.Coalitional federated learning: Improving communication and training on non-iid data with selfish clients.
*IEEE Transactions on Services Computing*, 16(4):2462–2476, 2023. - Bao etal. [2023]Wenxuan Bao, Haohan Wang, Jun Wu, and Jingrui He.Optimizing the collaboration structure in cross-silo federated learning.In
*Proceedings of the 40th International Conference on Machine Learning*, volume 202 of*PMLR*, pages 1718–1736. PMLR, 23–29 Jul 2023. - Ben-David etal. [2010]Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and JenniferWortman Vaughan.A theory of learning from different domains.
*Machine learning*, 79:151–175, 2010. - Blackard and Dean [1999]JockA Blackard and DenisJ Dean.Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables.
*Computers and electronics in agriculture*, 24(3):131–151, 1999. - Caldas etal. [2018]Sebastian Caldas, Sai MeherKarthik Duddu, Peter Wu, Tian Li, Jakub Konečnỳ, HBrendan McMahan, Virginia Smith, and Ameet Talwalkar.Leaf: A benchmark for federated settings.
*arXiv preprint arXiv:1812.01097*, 2018. - Caticha [2004]Ariel Caticha.Relative entropy and inductive inference.In
*AIP Conference Proceedings*, volume 707, pages 75–96. American Institute of Physics, April 2004. - Chen etal. [2024]Mengmeng Chen, Xiaohu Wu, Xiaoli Tang, Tiantian He, Yew-Soon Ong, Qiqi Liu, Qicheng Lao, and Han Yu.Free-rider and conflict aware collaboration formation for cross-silo federated learning.In
*the Thirty-eighth Annual Conference on Neural Information Processing Systems*, pages 1–10, 2024. - Coleman etal. [2020]Benjamin Coleman, Richard Baraniuk, and Anshumali Shrivastava.Sub-linear memory sketches for near neighbor search on streaming data.In
*International Conference on Machine Learning*, pages 2089–2099. PMLR, 2020. - Coleman etal. [2022]Benjamin Coleman, Benito Geordie, LiChou, RALeo Elworth, Todd Treangen, and Anshumali Shrivastava.One-pass diversified sampling with application to terabyte-scale genomic sequence streams.In
*International Conference on Machine Learning*, pages 4202–4218. PMLR, 2022. - Cui etal. [2022]Sen Cui, Jian Liang, Weishen Pan, Kun Chen, Changshui Zhang, and Fei Wang.Collaboration equilibrium in federated learning.In
*Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining*, pages 241–251, 2022. - Ding and Wang [2022]Shu Ding and Wei Wang.Collaborative learning by detecting collaboration partners.In
*Advances in Neural Information Processing Systems*, volume35, pages 15629–15641. Curran Associates, Inc., 2022. - Donahue and Kleinberg [2021]Kate Donahue and Jon Kleinberg.Model-sharing games: Analyzing federated learning under voluntary participation.In
*Proceedings of the AAAI Conference on Artificial Intelligence*, volume35, pages 5303–5311, 2021. - Famá etal. [2024]Fernanda Famá, Charalampos Kalalas, Sandra Lagen, and Paolo Dini.Measuring data similarity for efficient federated learning: A feasibility study.In
*2024 International Conference on Computing, Networking and Communications (ICNC)*, pages 394–400, 2024.doi: 10.1109/ICNC59896.2024.10556102. - Geyer etal. [2017]RobinC Geyer, Tassilo Klein, and Moin Nabi.Differentially private federated learning: A client level perspective.
*arXiv preprint arXiv:1712.07557*, 2017. - Ghorbani and Zou [2019]Amirata Ghorbani and James Zou.Data shapley: Equitable valuation of data for machine learning.In
*International conference on machine learning*, pages 2242–2251. PMLR, 2019. - Gupta and Gupta [2023]Rajni Gupta and Juhi Gupta.Federated learning using game strategies: State-of-the-art and future trends.
*Computer Networks*, 225:109650, 2023. - He etal. [2024]Tiantian He, Yang Liu, Yew-Soon Ong, Xiaohu Wu, and Xin Luo.Polarized message-passing in graph neural networks.
*Artificial Intelligence*, 331:104129, 2024. - Heusel etal. [2017]Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter.Gans trained by a two time-scale update rule converge to a local nash equilibrium.
*Advances in neural information processing systems*, 30, 2017. - Huang etal. [2024]Chao Huang, Shuqi Ke, and Xin Liu.Duopoly business competition in cross-silo federated learning.
*IEEE Transactions on Network Science and Engineering*, 11(1):340–351, 2024. - Huang etal. [2022]Chun-Yin Huang, QiLei, and Xiaoxiao Li.Efficient medical image assessment via self-supervised learning.In
*MICCAI Workshop on Data Augmentation, Labelling, and Imperfections*, pages 102–111. Springer, 2022. - Huang etal. [2015]Jin Huang, Rui Zhang, Rajkumar Buyya, Jian Chen, and Yongwei Wu.Heads-join: Efficient earth mover’s distance similarity joins on hadoop.
*IEEE Transactions on Parallel and Distributed Systems*, 27(6):1660–1673, 2015. - Kairouz etal. [2021]Peter Kairouz, H.Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun NitinBhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, Rafael G.L. D’Oliveira, Hubert Eichner, Salim ElRouayheb, David Evans, Josh Gardner, Zachary Garrett, Adrià Gascón, Badih Ghazi, PhillipB. Gibbons, Marco Gruteser, Zaid Harchaoui, Chaoyang He, Lie He, Zhouyuan Huo, Ben Hutchinson, Justin Hsu, Martin Jaggi, Tara Javidi, Gauri Joshi, Mikhail Khodak, Jakub Konecný, Aleksandra Korolova, Farinaz Koushanfar, Sanmi Koyejo, Tancrède Lepoint, Yang Liu, Prateek Mittal, Mehryar Mohri, Richard Nock, Ayfer Özgür, Rasmus Pagh, Hang Qi, Daniel Ramage, Ramesh Raskar, Mariana Raykova, Dawn Song, Weikang Song, SebastianU. Stich, Ziteng Sun, AnandaTheertha Suresh, Florian Tramèr, Praneeth Vepakomma, Jianyu Wang, LiXiong, Zheng Xu, Qiang Yang, FelixX. Yu, Han Yu, and Sen Zhao.Advances and open problems in federated learning.
*Found. Trends Mach. Learn.*, 14(1–2):1–210, jun 2021.ISSN 1935-8237.doi: 10.1561/2200000083. - Khan etal. [2021]LatifU Khan, Walid Saad, Zhu Han, Ekram Hossain, and ChoongSeon Hong.Federated learning for internet of things: Recent advances, taxonomy, and open challenges.
*IEEE Communications Surveys & Tutorials*, 23(3):1759–1799, 2021. - Kohavi etal. [1996]Ron Kohavi etal.Scaling up the accuracy of naive-bayes classifiers: A decision-tree hybrid.In
*KDD*, volume96, pages 202–207, 1996. - Krizhevsky etal. [2009]Alex Krizhevsky, Geoffrey Hinton, etal.Learning multiple layers of features from tiny images.2009.
- LeCun etal. [1998]Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner.Gradient-based learning applied to document recognition.
*Proceedings of the IEEE*, 86(11):2278–2324, 1998. - Lewis etal. [2004]DavidD Lewis, Yiming Yang, Tony Russell-Rose, and Fan Li.Rcv1: A new benchmark collection for text categorization research.
*Journal of machine learning research*, 5(Apr):361–397, 2004. - Li etal. [2022]Qinbin Li, Yiqun Diao, Quan Chen, and Bingsheng He.Federated learning on non-iid data silos: An experimental study.In
*2022 IEEE 38th International Conference on Data Engineering (ICDE)*, pages 965–978, 2022.doi: 10.1109/ICDE53745.2022.00077. - Li etal. [2020a]Tian Li, AnitKumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith.Federated optimization in heterogeneous networks.
*Proceedings of Machine learning and systems*, 2:429–450, 2020a. - Li etal. [2020b]Xiaoxiao Li, Meirui JIANG, Xiaofei Zhang, Michael Kamp, and QiDou.Fedbn: Federated learning on non-iid features via local batch normalization.In
*International Conference on Learning Representations*, 2020b. - Liu etal. [2022a]Zelei Liu, Yuanyuan Chen, Han Yu, Yang Liu, and Lizhen Cui.Gtg-shapley: Efficient and accurate participant contribution evaluation in federated learning.
*ACM Transactions on Intelligent Systems and Technology*, 13(4), may 2022a.doi: 10.1145/3501811. - Liu etal. [2022b]Zichang Liu, Benjamin Coleman, Tianyi Zhang, and Anshumali Shrivastava.Retaining knowledge for learning with dynamic definition.
*Advances in Neural Information Processing Systems*, 35:14944–14958, 2022b. - Liu etal. [2023]Zichang Liu, Zhaozhuo Xu, Benjamin Coleman, and Anshumali Shrivastava.One-pass distribution sketch for measuring data heterogeneity in federated learning.In
*Advances in Neural Information Processing Systems*, volume36, pages 15660–15679. Curran Associates, Inc., 2023. - Long etal. [2023]Guodong Long, Ming Xie, Tao Shen, Tianyi Zhou, Xianzhi Wang, and Jing Jiang.Multi-center federated learning: clients clustering for better personalization.
*World Wide Web*, 26(1):481–500, 2023. - Lu etal. [2024]Jianfeng Lu, Yue Chen, Shuqin Cao, Longbiao Chen, Wei Wang, and Yun Xin.Leap: Optimization hierarchical federated learning on non-iid data with coalition formation game.In
*Proceedings of the 33rd International Joint Conference on Artificial Intelligence, to appear*, 2024. - Luo and Shrivastava [2018]Chen Luo and Anshumali Shrivastava.Arrays of (locality-sensitive) count estimators (ace): Anomaly detection on the edge.In
*Proceedings of the 2018 World Wide Web Conference*, WWW’18, page 1439–1448. ACM, 2018. - Luo etal. [2021]MiLuo, Fei Chen, Dapeng Hu, Yifan Zhang, Jian Liang, and Jiashi Feng.No fear of heterogeneity: Classifier calibration for federated learning with non-iid data.
*Advances in Neural Information Processing Systems*, 34:5972–5984, 2021. - Lyu etal. [2020a]Lingjuan Lyu, Xinyi Xu, Qian Wang, and Han Yu.Collaborative fairness in federated learning.
*Federated Learning: Privacy and Incentive*, pages 189–204, 2020a. - Lyu etal. [2020b]Lingjuan Lyu, Jiangshan Yu, Karthik Nandakumar, Yitong Li, Xingjun Ma, Jiong Jin, Han Yu, and KeeSiong Ng.Towards fair and privacy-preserving federated deep models.
*IEEE Transactions on Parallel and Distributed Systems*, 31(11):2524–2541, 2020b.doi: 10.1109/TPDS.2020.2996273. - McMahan etal. [2017a]Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and BlaiseAgueray Arcas.Communication-efficient learning of deep networks from decentralized data.In
*Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS’17)*, pages 1273–1282, 2017a. - McMahan etal. [2017b]Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and BlaiseAguera yArcas.Communication-efficient learning of deep networks from decentralized data.In
*Artificial intelligence and statistics*, pages 1273–1282. PMLR, 2017b. - Mohri and MuñozMedina [2012]Mehryar Mohri and Andres MuñozMedina.New analysis and algorithm for learning with drifting distributions.In
*Algorithmic Learning Theory: 23rd International Conference, ALT 2012, Lyon, France, October 29-31, 2012. Proceedings 23*, pages 124–138. Springer, 2012. - Navon etal. [2020]Aviv Navon, Aviv Shamsian, Ethan Fetaya, and Gal Chechik.Learning the pareto front with hypernetworks.In
*International Conference on Learning Representations*, 2020. - Netzer etal. [2011]Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Baolin Wu, AndrewY Ng, etal.Reading digits in natural images with unsupervised feature learning.In
*NIPS workshop on deep learning and unsupervised feature learning*, volume 2011, page7. Granada, Spain, 2011. - Rezende [2018]DaniloJimenez Rezende.Short notes on divergence measures, July 2018.
- Shamsian etal. [2021]Aviv Shamsian, Aviv Navon, Ethan Fetaya, and Gal Chechik.Personalized federated learning using hypernetworks.In
*Proceedings of the 38th International Conference on Machine Learning*, volume 139 of*PMLR*, pages 9489–9502. PMLR, 18–24 Jul 2021. - Shapley etal. [1953]LloydS Shapley etal.A value for n-person games.1953.
- Tan etal. [2023]AlysaZiying Tan, Han Yu, Lizhen Cui, and Qiang Yang.Towards personalized federated learning.
*IEEE Transactions on Neural Networks and Learning Systems*, 34(12):9587–9603, 2023. - Tan etal. [2024]Shanli Tan, Hao Cheng, Xiaohu Wu, Han Yu, Tiantian He, YewSoon Ong, Chongjun Wang, and Xiaofeng Tao.Fedcompetitors: Harmonious collaboration in federated learning with competing participants.In
*Proceedings of the AAAI Conference on Artificial Intelligence*, volume38, pages 15231–15239, 2024. - Tu etal. [2022]Xuezhen Tu, Kun Zhu, NguyenCong Luong, Dusit Niyato, Yang Zhang, and Juan Li.Incentive mechanisms for federated learning: From economic and game theoretic perspective.
*IEEE transactions on cognitive communications and networking*, 8(3):1566–1593, 2022. - Wang etal. [2019]Hongyi Wang, Mikhail Yurochkin, Yuekai Sun, Dimitris Papailiopoulos, and Yasaman Khazaeni.Federated learning with matched averaging.In
*International Conference on Learning Representations*, 2019. - Wang etal. [2020]Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and HVincent Poor.Tackling the objective inconsistency problem in heterogeneous federated optimization.
*Advances in neural information processing systems*, 33:7611–7623, 2020. - Wu and He [2020]Jun Wu and Jingrui He.Continuous transfer learning with label-informed distribution alignment.
*arXiv preprint arXiv:2006.03230*, 2020. - Wu etal. [2024]Leijie Wu, Song Guo, Yaohong Ding, Junxiao Wang, Wenchao Xu, Yufeng Zhan, and Anne-Marie Kermarrec.Rethinking personalized client collaboration in federated learning.
*IEEE Transactions on Mobile Computing*, pages 1–13, 2024.doi: 10.1109/TMC.2024.3396218. - Wu and Yu [2022]Xiaohu Wu and Han Yu.Mars-fl: Enabling competitors to collaborate in federated learning.
*IEEE Transactions on Big Data*, pages 1–11, 2022. - Xiao etal. [2017]Han Xiao, Kashif Rasul, and Roland Vollgraf.Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms.
*arXiv preprint arXiv:1708.07747*, 2017. - Yang etal. [2019]Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong.Federated machine learning: concept and applications.
*ACM Transactions on Intelligent Systems and Technology*, 10(2):12:1–12:19, 2019. - Ye etal. [2023]Rui Ye, Zhenyang Ni, Fangzhao Wu, Siheng Chen, and Yanfeng Wang.Personalized federated learning with inferred collaboration graphs.In
*Proceedings of the 40th International Conference on Machine Learning*, volume 202 of*PMLR*, pages 39801–39817. PMLR, 23–29 Jul 2023. - Yu etal. [2020]Felix Yu, AnkitSingh Rawat, Aditya Menon, and Sanjiv Kumar.Federated learning with only positive labels.In
*International Conference on Machine Learning*, pages 10946–10956. PMLR, 2020. - Yurochkin etal. [2019]Mikhail Yurochkin, Mayank Agarwal, Soumya Ghosh, Kristjan Greenewald, Nghia Hoang, and Yasaman Khazaeni.Bayesian nonparametric federated learning of neural networks.In
*International conference on machine learning*, pages 7252–7261. PMLR, 2019. - Zhang etal. [2023]Jie Zhang, Song Guo, Jingcai Guo, Deze Zeng, Jingren Zhou, and AlbertY. Zomaya.Towards data-independent knowledge transfer in model-heterogeneous federated learning.
*IEEE Transactions on Computers*, 72(10):2888–2901, 2023. - Zhang etal. [2017]Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang.Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising.
*IEEE transactions on image processing*, 26(7):3142–3155, 2017. - Zhang etal. [2024]Ning Zhang, Qian Ma, Wuxing Mao, and XuChen.Coalitional fl: Coalition formation and selection in federated learning with heterogeneous data.
*IEEE Transactions on Mobile Computing*, pages 1–15, 2024.doi: 10.1109/TMC.2024.3375325. - Zhu etal. [2021]Hangyu Zhu, Jinjin Xu, Shiqing Liu, and Yaochu Jin.Federated learning on non-iid data: A survey.
*Neurocomputing*, 465:371–390, 2021. - Zhu etal. [2024]Ran Zhu, Mingkun Yang, and Qing Wang.Shufflefl: Addressing heterogeneity in multi-device federated learning.
*Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies*, 8(2):1–34, 2024.

## Appendix A Experimental Setup Details

The statistics of the datasets is summarized in Table 4. For image datasets, we employed a CNN with two 5x5 convolutional layers followed by 2x2 max pooling, and two fully connected layers with ReLU activation [18]. For tabular datasets, we used an MLP with three hidden layers for training. We used the SGD optimizer with a learning rate of 0.1 for rcv1 and 0.01 for the other datasets, along with a momentum of 0.9.

Datasets | #training cases | #validation cases | #test cases | #features | #classes | |

Image Datasets | MNIST | 48,000 | 12,000 | 10,000 | 784 | 10 |

FMNIST | 48,000 | 12,000 | 10,000 | 784 | 10 | |

CIFAR-10 | 40,000 | 10,000 | 10,000 | 1,024 | 10 | |

SVHN | 58606 | 14651 | 26,032 | 1,024 | 10 | |

FEMNIST | 273,498 | 68,375 | 40,832 | 784 | 10 | |

Tabular Datasets | adult | 26,049 | 6,512 | 16,281 | 123 | 2 |

rcv1 | 12,146 | 3,036 | 5,060 | 47,236 | 2 | |

covtype | 348,607 | 87,152 | 145,253 | 54 | 2 |

## Appendix B Mixed Types of Skew

In the mixed label and feature skew setting, the distribution-based label imbalanced partitioning strategy and the noise-based feature imbalance strategy are applied sequentially. Specifically, the whole dataset is first divided into each client using the distribution-based label imbalanced partitioning strategy. Then, noise is added to the data of each client according to the noise-based feature imbalance strategy. As a result, both label distribution skew and feature distribution skew exist among the local data of different clients.

In the mixed feature and quantity skew setting, the quantity imbalanced partitioning strategy and the noise-based feature imbalance strategy are applied sequentially. Specifically, the whole dataset is allocated into each client by the quantity imbalanced partitioning strategy. Then, noise is added to the data of each client through the noise-based feature imbalance strategy. Consequently, both feature distribution skew and quantity skew exist among the local data of different clients.

## Appendix C FL Training Process

### C.1 Communication Efficiency

Figures 3-7 show the training curves of MNIST, FMNIST, SVHN, CIFAR-10, and rcv1 under different non-IID settings.Under the non-IID setting $\mathbf{\hat{x}}\sim Gau(0.1)$, $p\sim Dir(0.5)$, and $p_{k}\sim Dir(0.5)$, training on these datasets is generally stable. However, when the number of classes $\#C=1,2,3$, the convergence and stability of different schemes vary significantly during the training period. For example, pFedGraph is relatively more unstable compared to other schemes when $\#C=1$. In addition, when $\#C=2,3$, the accuracy of pFedSV, pFedGraph, and CE continues to improve across different datasets, indicating that these schemes require more communication rounds to achieve better performance under these conditions.

### C.2 Robustness to Local Updates

Following [29], we set the local epochs to 10, 20, 40, and 80 to test the robustness of the comparison schemes to local updates. As shown in Figures 8-10, the number of local training epochs is crucial because it directly impacts the final accuracy. Contrary to intuition, a larger number of local epochs does not always lead to higher accuracy. When $\#C=2,3$, the accuracy of pFedJS and pFedSV on the test dataset initially increases or decreases in one direction, but as the number of local epochs continues to grow, the accuracy changes in the opposite direction. In the label distribution skew setting, FedCollab may experience overfitting as the number of epochs increases. The accuracy of CE occasionally decreases slightly in the quantity skew setting. RACE and pFedGraph perform very stably across various datasets and non-IID settings. Overall, these schemes achieve good performance with 10 local epochs. Increasing the number of local epochs further provides diminishing returns in terms of cost-effectiveness.

### C.3 Batch Size

Figures 11-12 present the training curves for the six schemes with batch sizes ranging from 16 to 256. As expected, both excessively large and small batch sizes result in poor accuracy. Notably, in the pFedJS and RACE schemes of Figure 11, the accuracy is particularly low when the batch size is 256, which may be due to getting stuck in local optima or having weak generalization ability. Additionally, when the batch size is 16, the accuracy drops sharply during training, likely due to the collaborative clients remaining largely unchanged and leading to overfitting while training personalized models. Therefore, selecting an appropriate batch size is crucial when training personalized models.

## NeurIPS Paper Checklist

- 1.
Claims

Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

Answer: [Yes]

Justification: The main claims presented in the abstract and introduction of the paper align well with the contributions and scope detailed in the full manuscript. These sections effectively summarize the key points and core innovations of the study, providing a clear and concise overview that matches the in-depth discussions and findings elaborated upon in the subsequent sections.

Guidelines:

- •
The answer NA means that the abstract and introduction do not include the claims made in the paper.

- •
The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.

- •
The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.

- •
It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

- •
- 2.
Limitations

Question: Does the paper discuss the limitations of the work performed by the authors?

Answer: [Yes]

Justification: The data heterogeneity we studied and compared was conducted in the context of personalized federated learning scenarios.

Guidelines:

- •
The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.

- •
The authors are encouraged to create a separate "Limitations" section in their paper.

- •
The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.

- •
The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.

- •
The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.

- •
The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.

- •
If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.

- •
While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren’t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

- •
- 3.
Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

Answer: [Yes]

Justification: Yes, after the previous assumptions, we put these six methods under the same framework for measurement and comparison.

Guidelines:

- •
The answer NA means that the paper does not include theoretical results.

- •
All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.

- •
All assumptions should be clearly stated or referenced in the statement of any theorems.

- •
The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.

- •
Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.

- •
Theorems and Lemmas that the proof relies upon should be properly referenced.

- •
- 4.
Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?

Answer: [Yes]

Justification: Yes, we mentioned it in Experimental Studies and Experimental Setup Details.

Guidelines:

- •
The answer NA means that the paper does not include experiments.

- •
If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.

- •
If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.

- •
Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.

- •
While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example

- (a)
If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.

- (b)
If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.

- (c)
If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).

- (d)
We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

- (a)

- •
- 5.
Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

Answer: [Yes]

Justification: Our experimental code is provided as URLs in the Abstract.

Guidelines:

- •
The answer NA means that paper does not include experiments requiring code.

- •
Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.

- •
While we encourage the release of code and data, we understand that this might not be possible, so “No” is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).

- •
The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.

- •
The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.

- •
The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.

- •
At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).

- •
Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

- •
- 6.
Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

Answer: [Yes]

Justification: Yes, we mentioned it in Experimental Studies and Experimental Setup Details.

Guidelines:

- •
The answer NA means that the paper does not include experiments.

- •
The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.

- •
The full details can be provided either with the code, in appendix, or as supplemental material.

- •
- 7.
Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

Answer: [Yes]

Justification: We used error bars in Table 1, specifically calculated using seeds 1, 2, and 3.

Guidelines:

- •
The answer NA means that the paper does not include experiments.

- •
The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.

- •
The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).

- •
The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)

- •
The assumptions made should be given (e.g., Normally distributed errors).

- •
It should be clear whether the error bar is the standard deviation or the standard error of the mean.

- •
It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.

- •
For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates).

- •
If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

- •
- 8.
Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [Yes]

Justification: We mentioned the computer resources used in Appendix A.

Guidelines:

- •
The answer NA means that the paper does not include experiments.

- •
The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage.

- •
The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.

- •
The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn’t make it into the paper).

- •
- 9.
Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?

Answer: [Yes]

Justification: We have read the ethics review guidelines and have ensured that this paper fully conforms to them.

Guidelines:

- •
The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.

- •
If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.

- •
The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

- •
- 10.
Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [No]

Justification: This paper does not discuss any potential negative societal impacts of the research presented.

Guidelines:

- •
The answer NA means that there is no societal impact of the work performed.

- •
If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.

- •
Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.

- •
The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.

- •
The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.

- •
If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

- •
- 11.
Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [N/A]

Justification: Our paper poses no such risks.

Guidelines:

- •
The answer NA means that the paper poses no such risks.

- •
Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.

- •
Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.

- •
We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

- •
- 12.
Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [Yes]

Justification: Yes, in our work, we have duly cited the creators of the assets used. The referenced assets include works by creators Ding and Wang [12], Liu etal. [34], Wu etal. [55], Bao etal. [3], Ye etal. [59], Cui etal. [11] , whose contributions are acknowledged and referenced appropriately in accordance with academic standards and copyright practices.

Guidelines:

- •
The answer NA means that the paper does not use existing assets.

- •
The authors should cite the original paper that produced the code package or dataset.

- •
The authors should state which version of the asset is used and, if possible, include a URL.

- •
The name of the license (e.g., CC-BY 4.0) should be included for each asset.

- •
For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.

- •
If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.

- •
For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.

- •
If this information is not available online, the authors are encouraged to reach out to the asset’s creators.

- •
- 13.
New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

Answer: [Yes]

Justification: We provide documentation and comments in the code.

Guidelines:

- •
The answer NA means that the paper does not release new assets.

- •
Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.

- •
The paper should discuss whether and how consent was obtained from people whose asset is used.

- •
At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

- •
- 14.
Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [N/A]

Justification: There were no human subjects involved in this project.

Guidelines:

- •
The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.

- •
Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.

- •
According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

- •
- 15.
Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

Answer: [N/A]

Justification: We do not have any research with human subjects that forms a part of this work.

Guidelines:

- •
The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.

- •
Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.

- •
We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.

- •
For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.

- •