REAL-TIME ANALYTICS: BENEFITS, LIMITATIONS, AND TRADEOFFS

Abstract

Real-time analytics is a relatively new branch of analytics. It is commonly defined as analyzing the most recent data available as quickly as possible. This definition captures the essence of what users need, but it is too vague to serve as a concrete requirement for software systems. As a result, vendors of analytical data management systems and researchers apply the label "real-time analytics" to very different systems, which differ in architecture, functionality, and even timing characteristics. The purpose of this article is to analyze the different approaches to providing real-time analytics, their advantages and disadvantages, and the tradeoffs that both system designers and their users inevitably have to make.

About the authors

S. KUZNETSOV

Ivannikov Institute for System Programming of the Russian Academy of Sciences; Moscow State University; Moscow Institute of Physics and Technology (State University); National Research University, Higher School of Economics

Email: kuzloc@ispras.ru
Moscow, Russia; Moscow, Russia; Dolgoprudny, Moscow oblast, Russia; Moscow, Russia

P. VELIKHOV

TigerGraph

Email: pavel.velikhov@tigergraph.com
Redwood City, United States

Q. Fu

Huawei Technologies Co., Ltd.

Principal contact for editorial correspondence.
Email: fqiang.fuqiang@huawei.com
Moscow, Russia



Copyright © S.D. Kuznetsov, P.E. Velikhov, Q. Fu, 2023
