REAL-TIME ANALYTICS: BENEFITS, LIMITATIONS, AND TRADEOFFS

S. D. Kuznetsov, P. E. Velikhov, Q. Fu

doi:10.31857/S0132347423010053

Authors: S. D. Kuznetsov (1, 2, 3, 4), P. E. Velikhov (5), Q. Fu (6)
Affiliations:
1. Ivannikov Institute for System Programming of the Russian Academy of Sciences
2. Lomonosov Moscow State University
3. Moscow Institute of Physics and Technology
4. HSE University (National Research University Higher School of Economics)
5. TigerGraph
6. Huawei Tech Company
Issue: No. 1 (2023)
Pages: 3-31
Section: DATA ANALYSIS
URL: https://journals.rcsi.science/0132-3474/article/view/137608
DOI: https://doi.org/10.31857/S0132347423010053
EDN: https://elibrary.ru/GRYYWT
ID: 137608

Abstract

Real-time analytics is a relatively new branch of analytics. The usual “definition” of real-time analytics is that data should be analyzed as quickly as possible, using the most recent data available. This captures the essence of users' fundamental needs, but because the “definition” is so vague, it in no way amounts to a concrete requirement for the corresponding software systems. As a result, vendors of analytical data management systems and researchers classify as real-time analytics systems products that differ completely in architecture, functionality, and even timing characteristics. The aim of this article is to analyze the various approaches to providing real-time analytics, their advantages and disadvantages, and the tradeoffs that both system developers and their users inevitably have to make.
