Local SGD for near-quadratic problems: Improving convergence under unconstrained noise conditions
- Authors: A. E. Sadchikov (1), S. A. Chezhegov (2, 1), A. N. Beznosikov (2, 3, 4), A. V. Gasnikov (4, 2, 1)
- Affiliations (numbered in order):
- Moscow Institute of Physics and Technology (National Research University)
- Ivannikov Institute for System Programming of the Russian Academy of Sciences
- Artificial Intelligence Laboratory
- Innopolis University
- Issue: Vol. 79, No. 6 (2024)
- Pages: 83-116
- Section: Articles
- URL: https://journals.rcsi.science/0042-1316/article/view/281942
- DOI: https://doi.org/10.4213/rm10207
- ID: 281942
Abstract
Distributed optimization plays an important role in modern large-scale machine learning and data processing systems by making efficient use of computational resources. One of the classical and popular approaches is Local Stochastic Gradient Descent (Local SGD), in which each device performs multiple local updates before averaging; this is particularly useful in distributed environments because it reduces communication bottlenecks and improves scalability. A typical feature of this method is its dependence on the frequency of communications. However, for a quadratic objective function with homogeneous data distribution over all devices, the influence of the communication frequency vanishes. As a natural consequence, subsequent studies assume a Lipschitz Hessian, since this indicates that the optimized function is, to a certain extent, similar to a quadratic one. In this paper, in order to make the theory of Local SGD more complete and unlock its potential, we abandon the Lipschitz Hessian assumption and introduce a new concept of approximate quadraticity. This assumption gives a new perspective on problems that have near-quadratic properties. In addition, existing theoretical analyses of Local SGD often assume bounded variance. We, in turn, consider an unbounded noise condition, which allows us to broaden the class of problems under study.
Bibliography: 36 titles.
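The remark about quadratic objectives can be checked numerically. The sketch below is not the authors' method or analysis. It is a minimal illustration under assumptions that are not in the abstract: a hypothetical 2-dimensional quadratic f(x) = 0.5 x^T A x - b^T x, identical data on every worker, and additive Gaussian gradient noise that does not depend on x (a simpler model than the unbounded noise condition studied in the paper). The function name `local_sgd_quadratic` and all parameter values are made up for illustration. Because the gradient of a quadratic is linear in x, averaging commutes with the local steps, so the averaged iterate should not depend on how often the workers synchronize when the same noise is used.

```python
import numpy as np


def local_sgd_quadratic(A, b, x0, n_workers, n_steps, sync_every, lr, noise_std, seed):
    """Local SGD on f(x) = 0.5 x^T A x - b^T x (illustrative assumption).

    Each worker takes noisy gradient steps on its own copy of x; every
    `sync_every` steps all copies are replaced by their average.
    Returns the average of the worker iterates after `n_steps` steps.
    """
    rng = np.random.default_rng(seed)
    # Homogeneous setting: every worker starts from the same point.
    xs = np.tile(x0, (n_workers, 1))
    for t in range(1, n_steps + 1):
        # Assumed noise model: additive Gaussian, independent of x.
        noise = noise_std * rng.standard_normal(xs.shape)
        # Row i holds the stochastic gradient A x_i - b + noise_i.
        grads = xs @ A.T - b + noise
        xs = xs - lr * grads
        if t % sync_every == 0:
            # Communication round: replace all copies by the average.
            xs = np.tile(xs.mean(axis=0), (n_workers, 1))
    return xs.mean(axis=0)


# Hypothetical problem data, chosen only for the demonstration.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -1.0])
x0 = np.zeros(2)

# Same seed, hence the same noise draws; only the sync period differs.
x_sync_every_step = local_sgd_quadratic(A, b, x0, 4, 200, 1, 0.1, 0.5, seed=0)
x_sync_rarely = local_sgd_quadratic(A, b, x0, 4, 200, 20, 0.1, 0.5, seed=0)
print(np.allclose(x_sync_every_step, x_sync_rarely))
```

With a shared seed the two runs draw identical noise, so the averaged iterates should agree up to floating-point rounding. For a non-quadratic objective the gradient is no longer linear in x, and this cancellation generally fails, which is where assumptions such as a Lipschitz Hessian or approximate quadraticity come in.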
About the authors
Andrey Evgenievich Sadchikov
Moscow Institute of Physics and Technology (National Research University)
Corresponding author.
Email: sadchikov.ae@phystech.edu
Savelii Andreevich Chezhegov
Ivannikov Institute for System Programming of the Russian Academy of Sciences; Moscow Institute of Physics and Technology (National Research University)
Email: chezhegov.sa@phystech.edu
Aleksandr Nikolaevich Beznosikov
Ivannikov Institute for System Programming of the Russian Academy of Sciences; Artificial Intelligence Laboratory; Innopolis University
Email: anbeznosikov@gmail.com
Alexander Vladimirovich Gasnikov
Innopolis University; Ivannikov Institute for System Programming of the Russian Academy of Sciences; Moscow Institute of Physics and Technology (National Research University)
Email: gasnikov@yandex.ru
ORCID iD: 0000-0002-7386-039X
Scopus Author ID: 15762551000
ResearcherId: L-6371-2013
Doctor of Physical and Mathematical Sciences, Associate Professor
