基于Vue.js和SpringBoot的新能源汽车充电站管理系统外文翻译

温州商学院本科毕业设计（论文）外文翻译

毕业设计（论文）题目：
姓名	学号
指导教师	班级	19计算机本*

原文题目：《EverAnalyzer: A Self-Adjustable Big Data Management Platform Exploiting the Hadoop Ecosystem》

作者：Panagiotis K, Argyro M, Athanasios K

原文出处：Panagiotis K, Argyro M, Athanasios K, et al. EverAnalyzer: A Self-Adjustable Big Data Management Platform Exploiting the Hadoop Ecosystem[J]. Information, 2023, 14(2): 93-93.

EverAnalyzer：一个利用Hadoop生态系统的自调整大数据管理平台

摘要

大数据是一种影响当今世界的现象，每秒钟都会产生新的数据。如今的企业面临着来自日益多样化的数据以及索引、搜索和分析如此庞大的数据的重大挑战。在这种情况下，存在一些用于处理和分析大数据的框架和库。在这些框架中，Hadoop MapReduce、Mahout、Spark和MLlib似乎是最受欢迎的，尽管尚不清楚它们中的哪一个最适合并在各种数据处理和分析场景中执行。本文提出了EverAnalyzer，这是一个可自我调整的大数据管理平台，旨在通过利用所有这些框架来填补这一空白。该平台能够以流式和批量方式收集数据，利用从用户处理和分析过程中获得的元数据来收集数据。基于这些元数据，平台为用户旨在执行的数据处理/分析活动推荐了最佳框架。为了验证该平台的效率，使用了30个与各种疾病相关的不同数据集进行了大量实验。结果显示，EverAnalyzer在80%的情况下正确地提出了最佳框架，表明该平台在大多数实验中做出了最佳选择。

关键词：大数据；数据管理；数据收集；数据分析；数据处理；Hadoop；MapReduce；Spark；Mahout；MLlib

1. 简介

由于物联网（IoT）的发展和社交媒体的广泛使用，全球互联网消费有所增加。由于这一增长，积累了大量数据，这些数据在大多数情况下极难处理。根据Statista[1]的数据，全球消耗的数据总量在2020年已增至64.2泽塔字节，2021年增至79泽塔字节，预计到2025年将增至180泽塔字节以上。与此同时，Forbes[2]估计，到2025年将有超过150泽塔字节的实时数据需要分析。据Forbes报道，处理结构化数据的公司与处理非结构化数据的公司有不同的需求，超过95%的组织在管理非结构化数据集方面需要帮助。

所有这些信息都被称为大数据，它被定义为从多种来源和格式收集的大量数据[3]。许多企业收集和分析来自各种来源的数据，以便就其客户、市场需求和趋势做出更好的商业决策。出于这些目的，已经创建了各种大数据处理和分析技术，以有效地从这些大型数据集中提取信息，从而成功地评估底层数据[4]。在这些工具中，在Apache Hadoop生态系统上创建的工具是使用最广泛的[5]。Hadoop已经成为信息技术（IT）商业和学术环境中最知名的工具之一，因为它能够管理大量数据。

然而，随着现代互联网用户生成大量非结构化数据，对内存资源的需求也在增加[6]，分布式数据处理很好地满足了对内存资源增加的需求[7]。在这方面，用于数据处理分发的两个最广泛使用的工具是MapReduce[8]和Spark[9]的开源工具，它们为处理和分析大量数据提供了有效的解决方案，同时为开发人员提供了有用的功能，开发人员可以通过应用编程接口（API）轻松利用这些功能[10]。这两个工具都基于Hadoop生态系统，其中MapReduce用于并行处理集群中的数据，而Spark是另一个为集群数据处理构建的解决方案[11]。然而，Spark的主要目的是提供一种编程模型，该模型可用于受MapReduce功能约束的任何形式的大数据应用程序，同时保持容错[12]。Spark不仅是MapReduce的替代方案，而且还提供了各种实时数据处理功能。上述工具是Mahout[13]和MLlib[14]的工具的基础，它们用于使用机器学习（ML）算法进行大数据分析[15]。

本研究的目的是开发和部署EverAnalyzer，这是一个灵活的大数据管理平台，能够自动收集、预处理、处理和分析实时（即流式）和存储（即批处理）数据。尽管如此，大多数现有的大数据管理平台已经支持这样一个管道，然而，它们利用了现成的技术和工具。此外，这些平台支持执行独立任务的工具，例如单个数据处理或单个数据分析任务。因此，使用这些平台，可以利用特定的框架，这些框架有自己的优点、缺点和局限性。这个问题的解决方案是实现一个系统，该系统可以理解用于管理不同案例数据集以进行处理或分析活动的各种工具的优点和缺点，并为每个案例确定最佳工具，以执行耗时更少、效率更高的行动。EverAnalyzer正是为了弥补这一差距，提供了创新，使其系统能够自动识别哪种底层数据处理（即MapReduce或Spark）和数据分析（即Mahout或MLlib）工具最适合成功高效地处理和分析摄入的数据。系统的选择不仅受数据量的影响，还受应用于相关数据场景的先前处理和分析任务的执行速度的影响。因此，EverAnalyzer可以应用于广泛的场景，更好地帮助用户处理和分析活动，从而减少他们的总体工作量。为了验证上述所有内容，通过一项实验对该平台进行了评估，该实验评估了EverAnalyzer向用户提供关于他们希望执行的操作所使用的最佳框架的经验建议的能力。数据是从三十（30）个不同的数据集中收集的，这些数据集与医疗保健部门的各种疾病和状况有关。数据经过预处理、处理和分析，而EverAnalyzer则根据请求的处理/分析过程的最短执行时间，为最合适的框架（即处理任务分别为MapReduce或Spark，分析任务分别为Mahout或MLlib）提供了建议。收集了该框架的所有建议，并将其与所选两个工具之间执行时间最好的框架进行了比较，结果表明EverAnalyzer在80%的时间内提出了正确的建议。然而，当数据集数量增加时，这一百分比似乎单调攀升。这意味着每个执行的处理/分析任务都会训练EverAnalyzer导出更好、更具代表性的结果。因此，如果平台使用更多的数据集，预计正确答案的百分比将增加，从而将整个平台的准确性提高到80%以上。

本文的其余部分组织如下。第2节详细总结了为评估研究的有意义见解而进行的文献综述，重点是大数据及其寿命，特别是处理和分析阶段。在第3节中，对所提出的平台（EverAnalyzer）的设计和构建进行了全面分析，包括平台的目标、用户及其架构。第4节描述了EverAnalyzer生成的实验结果，第5节提供了导出结果的解释，以及如何根据研究文献对其进行解释。最后，第6节包含了本研究的结论、局限性、下一步行动和未来的研究方向；它还描述了使用EverAnalyzer的设计和实现指南进行的未来实验。

2. 文献综述

大数据被定义为从各种来源以各种格式收集的大量数据[3]。这类数据具有一些特定的特征（数据的Vs），主要指数据量（Volume，即数据大小）、多样性（Variety，即数据格式）、速度（Velocity，即数据产生率）、真实性（Veracity，即数据的可信程度）、有效性（Validity，即数据有效性）、波动性（Volatility，即数据验证时间）和价值（Value，即数据在分析方面的有用性）[3]。这些特征表明，大数据的管理具有挑战性，但如果管理得当，它可能会非常有价值。为此，公司可以使用大数据来评估和提取有关其产品和客户的重要信息。然而，由于大数据的形式和规模差异很大，分析它们有时是一项复杂而耗时的任务。与此同时，人们越来越多地使用互联网来辅助日常活动和娱乐，这导致收集的数据量逐年增加。

这导致数据可能是结构化的、半结构化的，甚至是非结构化的，这使得它们很难用传统的关系数据库管理系统（RDBMS）进行管理，而实现这些系统既昂贵又耗时[16]。结构化数据是指已知其包含的信息及其包含方式的数据。另一方面，半结构化数据缺乏关于其所包含信息的一些规范，而非结构化数据不传达关于其结构的信息。手机、传感器、全球定位系统（GPS）信号、社交媒体和其他每秒产生大量数据的来源可以收集大量这些数据[17]。因此，大数据是指从需要一些处理或分析活动的现成数据集中获得的批数据（例如，从外部系统数据库中获得的已存储数据），或从不断流式传输信息的实时来源中获得的流式传输数据（例如，从社交媒体收集的实时数据）[18]。

因此，在大数据的整个生命周期中管理大数据已成为一项极具挑战性的任务，这项任务从未停止过激发企业和研究人员的兴趣。更具体地说，大数据的利用由一个生命周期来表示，该生命周期包括多个阶段，从收集数据开始，到最终销毁数据结束[19]。图1描述了所有这些阶段，指的是：（i）收集，其中数据是从各种来源收集的，大多数时候其格式由于非结构化性质而难以处理；（ii）存储，其中摄取的数据被存储在适当的数据库中；（iii）处理，将数据预处理为标准结构，使其更容易在后续阶段进行管理；（iv）分析，其中使用各种ML方法从存储的数据中产生有意义的结果和见解；（v）利用，将提取的结果和获得的见解用于各种现实生活和测试场景；（vi）销毁，这是整个生命周期的最后一个也是最重要的阶段，因为在收集阶段，许多敏感数据可能从各种来源收集，要求数据遵守严格的协议，以确保其机密性、完整性和可用性不受损害。为此，应该强调的是，所提出的平台的目的是研究收集、存储、处理和分析这几个阶段，下文将对此进行进一步分析。

2.1. 大数据收集

大数据收集被描述为收集大量数据以进一步分析并获得有用结果的过程[20，21]。这些数据可以使用问卷调查和访谈等传统方法收集；然而，还有许多更有效的方法。网络服务、配备传感器的设备（如手机和平板电脑）以及智能交通卡只是几个例子[22]。从这些设备收集的所有数据可以是批处理的，这意味着它们被收集到预定义的大小，然后被存储在一起，以便稍后作为一组数据进行分析，也可以是流式的，指的是在收集时被分析的数据。这两种数据之间的区别在于，流式数据处理直接应用于摄入的数据，而批处理数据处理收集并预处理预定量的数据[18]。此外，如果无法为处理/分析活动收集足够的数据，则有创建合成数据的方法[23]，合成数据代表分析最有可能用于正确执行所需分析的真实数据。

已经建立了各种工具，如Sebek[24]、Hflow[25]、Honeywall[26]、Nepenthes[27]、Kojoney[28]和Capture HPC[29]，以成功收集这种不同类型和格式的数据。Kafka[30]和Flume[31]是使用最广泛的两种数据收集工具。Kafka是一种流式数据收集和处理工具，Flume主要用于管理将流式数据作为批处理数据收集的基础设施。Flafka是通过结合这两种工具创建的，能够利用Kafka和Flume将流数据保存为批处理数据[32]。

2.2. 大数据存储

大数据存储被描述为在保持数据访问可靠性和可用性的同时存储和管理大规模数据集的过程[33，34]。大数据存储对希望采用它的系统的基础设施有着重大影响。一方面，存储基础设施必须为存储服务提供可靠的空间，但另一方面，它还必须提供用于查询和分析大量数据的动态访问接口。

由于大数据的数量在不断扩大，越来越多地使用被称为数据库管理系统（DBMS）的复杂系统来存储和管理这些数据。结构化查询语言（SQL）系统和非SQL（NoSQL）系统是DBMS的两种代表性类型[35]。NoSQL系统更适合存储和管理大数据，因为SQL系统需要有组织的数据才能高效，而NoSQL系统则用于非结构化数据。为了更好地管理现有非结构化数据的各种形式，NoSQL数据库管理系统分为三个独立的核心类别，即：（i）键值存储，将数据存储为键值对的集合，其中键作为唯一标识符，键和值的范围从简单对象到复杂复合对象（例如，Redis[36]、Scalaris[37]、Tokyo Tyrant[38]、Riak[39]）；（ii）文档存储，即以文档形式存储信息的数据库（例如，SimpleDB[40]、CouchDB[41]、MongoDB[42]、Terrastore[43]）；（iii）列存储，使用表、行和列，但与关系数据库不同，同一表中的列的名称和格式可能因行而异（例如，Bigtable[44]、HBase[45]、HyperTable[46]、Cassandra[47]）。

2.3. 大数据处理

大数据处理是一组访问大量数据以提取有意义的信息用于决策支持和提供的技术[48，49]。大数据处理采用了一系列方法，如字数和字符串匹配，这些方法可以分布在庞大的处理单元集群中[50]。数据处理算法通常具有较低的算法复杂性，允许它们执行快速计算。它们易于实现，可以解释各种数据集，而由于其高速性，它们可以用于任何数据集，无论其大小。然而，直接获得的数据集（即原始数据集）通常不可能作为数据处理任务进行处理，因为在大数据的情况下，这些数据集不符合特定的结构，因为它们来源广泛。因此，在进行数据处理工作之前，大数据必须首先经过数据预处理阶段，以规范数据结构。在数据结构被规范化之后，使用优选的数据处理算法来处理数据是简单的。

同时，传统的编程范式无法有效地处理数据，因为数据通常存储在数千个商品服务器上。因此，新的并行编程方法正在数据中心部署，以提高NoSQL数据库的性能[48]。MapReduce是一种流行的大规模商品集群大数据处理编程模型，它已发展成为Hadoop生态系统的重要组成部分[48]。这种编程模型的主要优点是其简单性，允许用户轻松利用它执行大数据处理任务[51]。Pig是一个类似SQL的环境，用于对大数据执行处理任务[52]，而Hive是这种工具的另一个例子，它提供了比MapReduce更好的环境，并简化了代码开发，因为程序员不需要处理MapReduce编码的复杂性[53]。同样，已经开发了许多解决方案来解决MapReduce的差距，例如延迟的数据加载和数据重用。其中包括Starfish，这是一个基于Hadoop的框架，旨在通过使用数据生命周期分析来提高MapReduce作业的性能，也是一个适应用户需求和系统工作负载的自调整系统，无需用户配置或更改底层设置或参数[54]。Spark是MapReduce的替代方案，旨在克服磁盘I/O限制并提高以前解决方案的性能。执行内存中计算的能力是Spark的主要特点，因为它可以将数据缓存在内存中，消除了MapReduce对迭代任务的磁盘开销限制[55]。其他类似于MapReduce的编程模型包括Dryad，它是一个用于运行基于定向非循环图（DAG）的大数据应用程序的分布式执行引擎。虽然MapReduce只允许一组输入和输出数据，但Dryad允许用户使用任何数量的输入和输出资料[56]。Pregel是另一种能够处理大规模图形用于各种目的的工具，包括网络图形分析和社交网络服务[57]。最后，数据处理技术也可用于流数据。由于数据是从其来源获取的，这些技术提供了处理工作流，消除了将数据转换为批处理数据的要求[58]。此类工具的示例包括Storm[59]、Flink[60]、Spark Streaming[61]、Samza[62]、Apex[63]和Google Cloud Dataflow[64]等。

2.4. 大数据分析

大数据分析被定义为从不同来源获取数据，对其进行处理以提取相关模式和见解，并将结果分发给适当的利益相关者的过程[65，66]。数据分析分为四（4）种离散类型，指的是：（i）对“发生了什么？”问题做出回应并从原始数据中挖掘信息的描述性分析；（ii）诊断分析，报告过去，同时试图回答“为什么会发生这种情况？”；（iii）预测分析，回答未来相关问题“会发生什么？”和“为什么会发生？”；

EverAnalyzer: A Self-Adjustable Big Data Management Platform Exploiting the Hadoop Ecosystem

Abstract: Big Data is a phenomenon that affects today’s world, with new data being generated every second. Today’s enterprises face major challenges from the increasingly diverse data, as well as from indexing, searching, and analyzing such enormous amounts of data. In this context, several frameworks and libraries for processing and analyzing Big Data exist. Among those frameworks Hadoop MapReduce, Mahout, Spark, and MLlib appear to be the most popular, although it is unclear which of them best suits and performs in various data processing and analysis scenarios. This paper proposes EverAnalyzer, a self-adjustable Big Data management platform built to fill this gap by exploiting all of these frameworks. The platform is able to collect data both in a streaming and in a batch manner, utilizing the metadata obtained from its users’ processing and analytical processes applied to the collected data. Based on this metadata, the platform recommends the optimum framework for the data processing/analytical activities that the users aim to execute. To verify the platform’s efficiency, numerous experiments were carried out using 30 diverse datasets related to various diseases. The results revealed that EverAnalyzer correctly suggested the optimum framework in 80% of the cases, indicating that the platform made the best selections in the majority of the experiments.

Keywords: Big Data; data management; data collection; data analysis; data processing; Hadoop; MapReduce; Spark; Mahout; MLlib

1. Introduction

Global internet consumption has increased due to the growth of the Internet of Things (IoT) and the extensive use of social media. As a result of this rise, vast amounts of data have accumulated, which in most cases are extremely difficult to handle. According to Statista [1], the total amount of data consumed globally increased to 64.2 Zettabytes in 2020 and 79 Zettabytes in 2021, and is expected to increase to more than 180 Zettabytes by 2025. At the same time, Forbes [2] estimates that more than 150 Zettabytes of real-time data will require analysis by 2025. According to Forbes, companies dealing with structured data have different requirements than companies dealing with unstructured data, and over 95% of organizations require assistance in managing unstructured datasets.

All this information is referred to as Big Data, which is defined as massive volumes of data collected from multiple sources and formats [3]. Many businesses gather and analyze data from various sources to make better business decisions regarding their customers, market demands, and trends. For these purposes, various Big Data processing and analysis technologies have been created to efficiently extract information from these large datasets in order to successfully evaluate the underlying data [4]. Among those tools, the ones created upon the Apache Hadoop Ecosystem are the most widely used [5]. Hadoop has become one of the most well-known tools in the Information Technology (IT) business and academic environment, due to its capacity to manage huge amounts of data.

However, as modern internet users generate massive amounts of unstructured data, the need for memory resources is increasing as well [6], with distributed data processing being a good answer to the demand for increased memory resources [7]. In this regard, two of the most widely used tools for data processing distribution are the open-source tools of MapReduce [8] and Spark [9], which provide effective solutions for processing and analyzing massive amounts of data, while providing useful functions to developers who can easily exploit them via Application Programming Interfaces (APIs) [10]. Both tools are based on the Hadoop Ecosystem, where MapReduce is used to process data in a processing cluster in parallel, whereas Spark is another solution that has been built for clustered data processing [11]. However, Spark’s major purpose is to provide a programming model that can be utilized in any form of Big Data application that is constrained by the MapReduce features, while remaining error tolerant [12]. Spark is not only an alternative to MapReduce, but it also provides a variety of real-time data processing functionalities. The aforementioned tools serve as the basis for the tools of Mahout [13] and MLlib [14], which are used to perform Big Data analysis using Machine Learning (ML) algorithms [15].

The purpose of this research is to develop and deploy EverAnalyzer, a flexible Big Data management platform capable of automatically gathering, pre-processing, processing, and analyzing both real-time (i.e., streaming) and stored (i.e., batch) data. Nevertheless, most of the existing Big Data management platforms already support such a pipeline, exploiting, however, off-the-shelf technologies and tools. In addition, these platforms support tools that perform standalone tasks, such as individual data processing or individual data analysis tasks. Hence, using those platforms, specific frameworks are exploited, having their own set of benefits, shortcomings, and limitations. The solution to this problem is the implementation of a system that can comprehend the advantages and disadvantages of the various tools used to manage diverse case datasets for pursuing a processing or analytical activity and identify the optimum tool per case for performing less time-consuming and more efficient actions. EverAnalyzer comes to bridge exactly this gap, providing the innovation that enables its system to automatically recognize which of the underlying data processing (i.e., MapReduce or Spark) and data analysis (i.e., Mahout or MLlib) tools are most suitable for successfully and efficiently processing and analyzing the ingested data. The system’s choice is influenced not only by the amount of data, but also by the execution speed of prior processing and analysis tasks that have been applied on relevant data scenarios. As a result, EverAnalyzer may be applied to a wide range of scenarios, better assisting users in both processing and analytical activities, hence decreasing their overall workload. To verify all of the above, the platform was evaluated through an experiment that assesses EverAnalyzer’s capability to provide empirical suggestions to its users about the best framework to be utilized for the operations that they wish to perform. 
Data was collected from thirty (30) distinct datasets related to various diseases and conditions in the healthcare sector. The data was pre-processed, processed, and analyzed, while EverAnalyzer provided a suggestion for the most suitable framework (i.e., MapReduce or Spark for processing tasks, and Mahout or MLlib for analysis tasks, respectively) based on the shortest execution time for the requested processing/analysis process. All the framework’s suggestions were gathered and compared with the framework that had the best execution time between the two chosen tools, revealing that EverAnalyzer made a correct recommendation 80% of the time. However, when the number of datasets increased, this percentage appeared to climb monotonically. This means that each performed processing/analysis task trains EverAnalyzer to export better and more representative results. Hence, if the platform uses a larger number of datasets, it is expected that the percentage of correct answers will be increased, raising the overall platform’s accuracy to a percentage greater than 80%.
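The recommendation idea described above (choose a framework from the execution times of earlier tasks on comparably sized data, then score the suggestions against the framework that was actually fastest) can be sketched in a few lines. This is only an illustrative sketch and not the authors' implementation: the `HISTORY` records, the nearest-size heuristic, and the function names `recommend` and `accuracy` are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical log of finished tasks: (task_type, framework, rows, seconds).
# The values are made up for illustration; they are not results from the paper.
HISTORY = [
    ("processing", "MapReduce", 10_000, 42.0),
    ("processing", "Spark", 10_000, 18.5),
    ("processing", "MapReduce", 500_000, 95.0),
    ("processing", "Spark", 500_000, 120.0),
    ("analysis", "Mahout", 10_000, 60.0),
    ("analysis", "MLlib", 10_000, 25.0),
]


def recommend(task_type, rows, history=HISTORY, k=1):
    """Suggest the framework with the lowest mean runtime over its k
    recorded runs whose dataset size is closest to `rows` (assumed heuristic)."""
    by_framework = defaultdict(list)
    for kind, framework, size, seconds in history:
        if kind == task_type:
            by_framework[framework].append((abs(size - rows), seconds))
    if not by_framework:
        raise ValueError(f"no history for task type {task_type!r}")
    scores = {
        framework: mean(seconds for _, seconds in sorted(runs)[:k])
        for framework, runs in by_framework.items()
    }
    return min(scores, key=scores.get)


def accuracy(suggested, fastest):
    """Fraction of suggestions that match the framework that was really fastest."""
    return sum(s == f for s, f in zip(suggested, fastest)) / len(fastest)
```

With this toy history, `recommend("processing", 8_000)` picks Spark while `recommend("processing", 600_000)` picks MapReduce, and every newly logged run changes the scores used for later suggestions, which mirrors the claim that each executed task refines the platform's recommendations.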

The remainder of this paper is organized as follows. Section 2 offers a detailed summary of the literature review that was conducted to assess meaningful insights for the study, focusing on Big Data and its lifespan, and in particular on the processing and analysis phases. In Section 3, a thorough analysis of how the proposed platform (EverAnalyzer) is designed and built is presented, including the platform’s goals and users as well as its architecture. Section 4 depicts the experimentation results generated by EverAnalyzer, and Section 5 provides an interpretation of the exported results as well as how they can be interpreted in relation to the studied literature. Finally, Section 6 contains the study’s conclusions, limitations, next steps, and future research directions; it also describes future experiments that would be interesting to conduct using EverAnalyzer’s design and implementation guidelines.

2. Literature Review

Big Data is defined as large volumes of data collected from various sources and in various formats [3]. Such data have some specific characteristics (Vs of the data), which primarily refer to data Volume (i.e., data size), Variety (i.e., data format), Velocity (i.e., data production rate), Veracity (i.e., degree of data authenticity), Validity (i.e., data validity), Volatility (i.e., time of data validation), and Value (i.e., data usefulness in terms of analysis) [3]. These characteristics indicate that Big Data is challenging to manage, but when it is properly managed, it may be highly valuable. For this purpose, companies can use Big Data to evaluate and extract important information about their products and customers. However, due to the wide range of forms and sizes of such data, analyzing them is sometimes a complicated and time-consuming task. At the same time, people are increasingly using the Internet to help them with their everyday activities and entertainment, which causes the amount of collected data to increase year after year.

This results in data that may be structured, semi-structured, or even unstructured, making them difficult to manage with traditional Relational Database Management Systems (RDBMS), which are expensive and time-consuming to implement [16]. Structured data are data for which both the information they contain and the manner in which it is contained are known. Semi-structured data, on the other hand, lack some specifications about the information they contain, whereas unstructured data convey no information on their structure. Large amounts of these data can be collected by mobile phones, sensors, Global Positioning System (GPS) signals, social media, and other sources that generate massive amounts of data every second [17]. As a result, Big Data refers to either batch data deriving from ready-to-use datasets that require some processing or analytic activities (e.g., already stored data derived from external systems’ databases), or streaming data derived from live sources that are constantly streaming information (e.g., real-time data gathered from social media) [18].

As a result, managing Big Data throughout their lifecycle has become a very challenging task that never ceases to pique the interest of enterprises and researchers. More specifically, the utilization of Big Data is represented by a lifecycle that includes a plethora of phases, beginning with collection of the data and concluding with their final destruction [19]. Figure 1 depicts all of these phases, referring to the: (i) collection, in which data are collected from various sources, most of the time in formats that are difficult to handle due to their unstructured nature; (ii) storage, in which the ingested data are stored in the appropriate database; (iii) processing, in which data are pre-processed in a standard structure to make it easier to manage in subsequent phases; (iv) analysis, in which various ML methods are used to produce meaningful results and insights from the stored data; (v) utilization, in which the extracted results and gained insights are put to use in a variety of real-life and testing scenarios; (vi) destruction, the final and most important phase of the entire lifecycle, since many sensitive data may be collected from various sources during the collection phase, requiring the data’s compliance to a strict protocol to ensure that their confidentiality, integrity, and availability are not compromised. To this end, it should be emphasized that the suggested platform’s purpose is to investigate the phases of collection, storage, processing, and analysis, which are further analyzed below.

2.1. Big Data Collection

Big Data collection is described as the process of gathering massive amounts of data in order to further analyze them and obtain useful results [20,21]. These data can be collected using traditional methods such as questionnaires and interviews; however, there is a plethora of more effective approaches. Web services, sensor-equipped devices such as mobile phones and tablets, and smart transportation cards, are just a few examples [22]. All the data collected from these devices may be either batch, meaning that they are collected up to a predefined size and then stored all together to be analyzed later as a set of data, or streaming, referring to data that are analyzed while being collected. The distinction between those two kinds of data is that streaming data processing is applied directly to the ingested data, whereas batch data processing collects and preprocesses a predetermined quantity of data [18]. Furthermore, if it is not possible to collect enough data for a processing/analytical activity, there are methods for creating synthetic data [23], which represent the real data that an analysis would most likely use to properly execute the required analysis.
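The distinction between the two collection modes can be illustrated with a minimal sketch: the streaming path handles each record as soon as it arrives, while the batch path accumulates a predefined number of records before handing them on together. The function names, the sample `readings`, and the batch size of 2 are assumptions chosen for the example and do not describe any specific collection tool.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Record = Dict[str, Any]


def process_streaming(source: Iterable[Record],
                      handle: Callable[[Record], Any]) -> Iterator[Any]:
    """Apply `handle` to every record at the moment it is ingested."""
    for record in source:
        yield handle(record)


def process_batches(source: Iterable[Record], batch_size: int,
                    handle_batch: Callable[[List[Record]], Any]) -> Iterator[Any]:
    """Collect records up to `batch_size`, then process each group as one set."""
    batch: List[Record] = []
    for record in source:
        batch.append(record)
        if len(batch) == batch_size:
            yield handle_batch(batch)
            batch = []
    if batch:  # flush the last, possibly smaller, batch
        yield handle_batch(batch)


# Hypothetical sensor readings used only to exercise both paths.
readings = [{"sensor": "a", "value": v} for v in (1, 2, 3, 4, 5)]
streamed = list(process_streaming(readings, lambda r: r["value"] * 2))
batched = list(process_batches(readings, 2,
                               lambda b: sum(r["value"] for r in b)))
```

Here `streamed` holds one result per record, whereas `batched` holds one result per group of at most two records.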

Various tools, such as Sebek [24], Hflow [25], Honeywall [26], Nepenthes [27], Kojoney [28], and Capture-HPC [29] have been built to successfully collect such varied types and formats of data. Kafka [30] and Flume [31] are two of the most widely used data collection tools. Whereas Kafka is a streaming data collection and processing tool, Flume is primarily used to manage infrastructures for collecting streaming data as batch data. Flafka is created by combining those two tools, providing the ability to save streaming data as batch data exploiting both Kafka and Flume [32].

2.2. Big Data Storage

Big Data storage is described as the process of storing and managing large-scale datasets while maintaining data access reliability and availability [33,34]. Big Data storage has a significant impact on the infrastructure of the system that desires to adopt it. On the one hand, the storage infrastructure must provide reliable space to storage services, but on the other hand, it must also provide a dynamic access interface for querying and analyzing large amounts of data.

Because the volume of Big Data is continuously expanding, complex systems known as Database Management Systems (DBMS) are increasingly being employed to store and manage these data. Structured Query Language (SQL) systems and Non-SQL (NoSQL) systems are the two representative types of DBMSs [35]. NoSQL systems are preferable for storing and managing Big Data, since SQL systems require organized data to be efficient, whilst NoSQL systems are meant to be used for unstructured data. To better manage the variety of the forms of the existing unstructured data, NoSQL DBMSs are classified into three separate core categories, namely: (i) key-value stores that store data as a collection of key-value pairs in which a key serves as a unique identifier, with both keys and values ranging from simple objects to complex compound objects (e.g., Redis [36], Scalaris [37], Tokyo Tyrant [38], Riak [39]); (ii) document stores that are databases for storing information in the form of documents (e.g., SimpleDB [40], CouchDB [41], MongoDB [42], Terrastore [43]); (iii) column stores that use tables, rows, and columns, but unlike a relational database, the names and format of the columns can vary from row to row in the same table (e.g., Bigtable [44], HBase [45], HyperTable [46], Cassandra [47]).
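The three NoSQL categories differ mainly in the shape of the data they hold, which plain Python structures can mimic. The sketch below is only a mental model with made-up records; it does not use or reproduce the API of Redis, MongoDB, HBase, or any other system named above, and the helper `columns_of` is hypothetical.

```python
# (i) Key-value store: an opaque value looked up by a unique key.
key_value_store = {
    "user:42": "Alice",
    "session:9f3": {"user_id": 42, "ttl": 3600},  # values may be compound objects
}

# (ii) Document store: collections of self-describing documents whose
# fields may differ from one document to the next.
document_store = {
    "patients": [
        {"_id": 1, "name": "Alice", "conditions": ["asthma"]},
        {"_id": 2, "name": "Bob", "visits": [{"date": "2023-01-05"}]},
    ]
}

# (iii) Column store: rows addressed by key, where each row may carry a
# different set of columns within the same table.
column_store = {
    "row-1": {"info:name": "Alice", "info:age": 34},
    "row-2": {"info:name": "Bob", "labs:glucose": 5.4},
}


def columns_of(table, row_key):
    """Return the sorted column names present in one row of a column-store table."""
    return sorted(table[row_key])
```

Comparing `columns_of(column_store, "row-1")` with `columns_of(column_store, "row-2")` shows two rows of one table exposing different columns, which is the property that separates a column store from a relational table.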

2.3. Big Data Processing

Big Data processing is a group of techniques for accessing large amounts of data in order to extract meaningful information for decision support and provision [48,49]. Big Data processing employs a range of methods, such as wordcount and string matching, which can be distributed across vast clusters of processing units [50]. Data processing algorithms typically have low algorithmic complexity, allowing them to perform quick computations. They are simple to implement and can interpret a variety of datasets, whereas they may be used on any dataset, regardless of its size, due to their high speed. However, directly obtained datasets (i.e., raw datasets) are frequently impossible to process as a data processing task, since in the case of Big Data such datasets do not comply with a specific structure as they derive from a broad range of sources. Thus, Big Data must first go through a data pre-processing phase to normalize the data structure before going through a data processing job. After the data structure is normalized, it is then simple to process the data using the preferred data processing algorithms.
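Word count, mentioned above as a typical processing method, is the standard illustration of the map, shuffle, and reduce steps. The following single-process sketch imitates those steps in plain Python; it does not call Hadoop or Spark, and the function names are illustrative assumptions. In a real cluster the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse the grouped values of one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in sequence over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data", "Big Data analysis"])
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across many processing units, which is why such low-complexity algorithms scale to very large datasets.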

At the same time, traditional programming paradigms are incapable of handling data effectively because it is often stored on thousands of commodity servers. As a result, new parallel programming methods are being deployed in datacenters to improve the performance of NoSQL databases [48]. MapReduce is a popular programming model for Big Data processing on large-scale commodity clusters, and it has evolved as an important component of the Hadoop ecosystem [48]. The main advantage of this programming model is its simplicity, which allows its users to easily exploit it for Big Data processing tasks [51]. Pig is an SQL-like environment that is used for performing processing tasks upon Big Data [52], whereas Hive is another example of such tool that provides a better environment than MapReduce and simplifies the code development as programmers are not required to deal with the complexities of MapReduce coding [53]. Similarly, many solutions have been developed to address MapReduce’s gaps, such as delayed data loading and data reuse. Among those tools are Starfish, which is a Hadoop-based framework aiming to improve the performance of MapReduce jobs through the use of data lifecycle analytics, as well as being a self-tuning system that adapts to users’ needs and systems’ workloads without requiring users to configure or change the underlying settings or parameters [54]. Spark is an alternative to MapReduce that aims to overcome disk I/O limitations and improve the performance of prior solutions. The ability to perform in-memory computations is the main feature that distinguishes Spark, since it enables data to be cached in memory, removing the disk overhead limitation of MapReduce for iterative tasks [55]. Other programming models similar to MapReduce include Dryad, which is a distributed execution engine for running Directed Acyclic Graph-based (DAG) Big Data applications. 
While MapReduce only allows for a single set of input and output data, Dryad allows users to use any number of input and output data [56]. Pregel is another tool capable of processing large-scale graphs for a variety of purposes, including network graph analysis and social networking services [57]. Finally, data processing technologies are available for streaming data as well. As data is acquired from their source, these technologies provide processing workflows, removing the requirement to convert data to batch data [58]. Examples of such tools are Storm [59], Flink [60], Spark Streaming [61], Samza [62], Apex [63], and Google Cloud Dataflow [64], among others.

2.4. Big Data Analysis

Big Data analysis is defined as the procedure for acquiring data from diverse sources, processing them to extract relevant patterns and insights, and distributing the results to the appropriate stakeholders [65,66]. Data analysis is classified into four (4) discrete types, which refer to: (i) descriptive analytics that respond to the question “What happened?” and mines information from raw data; (ii) diagnostic analytics that report on the past while attempting to answer the question “Why did it happen?”; (iii) predictive analytics that answer future-related questions “What will happen?” and “Why will it happen?”;
