Local SGD for near-quadratic problems: Improving convergence under unconstrained noise conditions

Andrey Evgen'evich Sadchikov; Savelii Andreevich Chezhegov; Aleksandr Nikolaevich Beznosikov; Alexander Vladimirovich Gasnikov

doi:10.4213/rm10207

Authors: Sadchikov A.E.¹, Chezhegov S.A.²,¹, Beznosikov A.N.²,³,⁴, Gasnikov A.V.⁴,²,¹
Affiliations:
1. Moscow Institute of Physics and Technology (National Research University)
2. Ivannikov Institute for System Programming of the RAS
3. Sber AI Lab
4. Innopolis University
Issue: Vol 79, No 6 (2024)
Pages: 83-116
Section: Articles
URL: https://journals.rcsi.science/0042-1316/article/view/281942
DOI: https://doi.org/10.4213/rm10207
ID: 281942

Abstract

Distributed optimization plays an important role in modern large-scale machine learning and data processing systems because it makes efficient use of computational resources. One of the classical and popular approaches is Local Stochastic Gradient Descent (Local SGD), in which each device performs multiple local updates before the iterates are averaged; this is particularly useful in distributed environments, where it reduces communication bottlenecks and improves scalability. A typical feature of this method is that its convergence depends on the frequency of communications. However, for a quadratic objective function with a homogeneous data distribution over all devices, the influence of the communication frequency vanishes. As a natural consequence, subsequent studies have assumed a Lipschitz Hessian, since this indicates that the optimized function is, to a certain extent, similar to a quadratic one. To make the theory of Local SGD more complete and to unlock its potential, in this paper we abandon the Lipschitz Hessian assumption and introduce a new concept of approximate quadraticity. This assumption gives a new perspective on problems that have near-quadratic properties. In addition, existing theoretical analyses of Local SGD often assume bounded variance. We, in turn, consider an unbounded noise condition, which allows us to broaden the class of problems under study.

Bibliography: 36 titles.
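The abstract's remark that communication frequency stops mattering for a quadratic objective with homogeneous data can be checked numerically in a simplified setting. The sketch below is illustrative only and is not taken from the paper. It assumes additive, state-independent gradient noise, which is simpler than the unbounded-noise condition the paper studies. The function name `local_sgd`, the parameter `sync_every`, and all problem sizes are hypothetical choices. Because the gradient of a quadratic is linear, the average of the worker iterates follows the same recursion whether or not the workers synchronize, provided the noise samples are the same.

```python
import numpy as np

def local_sgd(A, b, x0, noise, lr, sync_every):
    """Local SGD on f(x) = 0.5 * x^T A x - b^T x with A symmetric.

    noise has shape (T, M, d): one additive gradient perturbation per
    step and per worker (an assumption made for this illustration).
    Returns the average of the M worker iterates after T steps.
    """
    T, M, d = noise.shape
    x = np.tile(x0, (M, 1))              # all workers start at the same point
    for t in range(T):
        grads = x @ A - b + noise[t]     # row m equals A x_m - b plus noise
        x = x - lr * grads               # one local step on every worker
        if (t + 1) % sync_every == 0:
            x[:] = x.mean(axis=0)        # communication round: average iterates
    return x.mean(axis=0)

rng = np.random.default_rng(0)
d, M, T = 5, 4, 60
Q = rng.standard_normal((d, d))
A = Q @ Q.T / d + np.eye(d)              # symmetric positive definite Hessian
b = rng.standard_normal(d)
x0 = np.zeros(d)
noise = 0.1 * rng.standard_normal((T, M, d))
lr = 0.5 / np.linalg.eigvalsh(A).max()   # step size below 1/L

x_every = local_sgd(A, b, x0, noise, lr, sync_every=1)   # average at every step
x_rare = local_sgd(A, b, x0, noise, lr, sync_every=20)   # average every 20 steps
gap = np.linalg.norm(x_every - x_rare)
print(gap)                               # differs only by floating-point rounding
```

With the same noise realization, the two averaged iterates agree up to rounding error. For a non-quadratic objective, or for noise whose magnitude depends on the iterate, this exact equivalence generally no longer holds, and that is the regime the near-quadratic analysis addresses.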

Keywords

distributed optimization, quadraticity, strong growth condition

Doctor of physico-mathematical sciences, Associate professor
