This paper proposes a solution for designing and developing seamless automation and integration of machine learning capabilities for Big Data, with the following requirements: 1) the ability to seamlessly handle and scale very large amounts of unstructured and structured data from diverse, heterogeneous sources; 2) the ability to systematically determine the steps and procedures needed to analyze Big Data datasets based on data characteristics, domain-expert input, and a data pre-processing component; 3) the ability to automatically select the most appropriate libraries and tools to compute and accelerate the machine learning computations; and 4) the ability to perform Big Data analytics with high learning performance but minimal human intervention and supervision. The overall focus is to provide a seamlessly integrated solution that can effectively analyze Big Data with high-frequency, high-dimensional features across different data characteristics and application problem domains, with high accuracy, robustness, and scalability. This paper highlights the research methodologies and activities that we propose Big Data researchers and practitioners conduct in order to develop and support seamless integration of machine learning capabilities for Big Data analytics.
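To make requirement 3 concrete, the following is a minimal, hypothetical sketch of selecting a compute backend from coarse data characteristics. The thresholds, the `DatasetProfile` fields, and the backend labels are illustrative assumptions for exposition, not the selection rules proposed in this paper.

```python
# Hypothetical sketch: pick a library/tool label from a dataset profile.
# Thresholds and backend names are assumptions, not the paper's rules.
from dataclasses import dataclass

@dataclass
class DatasetProfile:
    n_rows: int
    n_features: int
    is_structured: bool

def select_backend(profile: DatasetProfile) -> str:
    """Return a compute backend suited to the dataset's scale."""
    if profile.n_rows < 1_000_000 and profile.is_structured:
        return "scikit-learn"   # single-node, in-memory learning
    if profile.n_features > 10_000:
        return "spark-mllib"    # distributed, high-dimensional workloads
    return "dask-ml"            # out-of-core, moderate scale

print(select_backend(DatasetProfile(5_000, 20, True)))        # scikit-learn
print(select_backend(DatasetProfile(50_000_000, 100, True)))  # dask-ml
```

In a full system, such rules would be derived from the data pre-processing component and domain-expert input rather than hard-coded thresholds.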
The potential benefits of Big Data are massive, but many technical challenges must be addressed to leverage them. First, the massive samples in Big Data are typically aggregated from multiple sources, at different time points, using different technologies. Data is no longer collected in batches and processed offline; instead, it arrives continuously and must be processed online, in real time, to yield useful insight. For example, financial market data must be continuously streamed and synchronized from the opening of the market session until the market closes. As new data arrives, it must be synchronized and combined with historical market data before it can be processed. In some financial-market use cases, multiple analyses must run simultaneously over different investment periods (e.g., minutes, hours, days, months) and their results must be synchronized to compute the final aggregated output. Hence, efficient data-storage methods are needed to store and manage dynamic, massive datasets in support of such complex online data processing.
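The multi-period analysis described above can be sketched as an online aggregator that folds each arriving tick into several time windows at once. The sketch below is a simplified illustration under stated assumptions; the `Tick` and `WindowAggregator` names and the running-mean statistic are hypothetical, not components of the proposed system.

```python
# Minimal sketch of multi-window online aggregation over a market
# data stream. Names and the mean-price statistic are illustrative
# assumptions, not part of any specific library or this paper's design.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Tick:
    symbol: str
    price: float
    timestamp: int  # epoch seconds

# Window lengths in seconds: minute, hour, day.
WINDOWS = {"1min": 60, "1h": 3600, "1d": 86400}

class WindowAggregator:
    """Maintains a running mean price per (symbol, window, bucket)."""
    def __init__(self):
        # (symbol, window_name, bucket_start) -> (price_sum, count)
        self.state = defaultdict(lambda: (0.0, 0))

    def update(self, tick: Tick):
        # Each tick is folded into every window simultaneously,
        # mirroring the concurrent multi-period analyses above.
        for name, length in WINDOWS.items():
            bucket = tick.timestamp - (tick.timestamp % length)
            s, n = self.state[(tick.symbol, name, bucket)]
            self.state[(tick.symbol, name, bucket)] = (s + tick.price, n + 1)

    def mean(self, symbol: str, window: str, bucket: int) -> float:
        s, n = self.state[(symbol, window, bucket)]
        return s / n if n else float("nan")

if __name__ == "__main__":
    agg = WindowAggregator()
    for ts, price in [(0, 100.0), (30, 101.0), (90, 99.0)]:
        agg.update(Tick("ACME", price, ts))
    print(agg.mean("ACME", "1min", 0))  # 100.5: the two ticks in the first minute
    print(agg.mean("ACME", "1h", 0))    # 100.0: all three ticks in the first hour
```

A production system would back this in-memory state with the kind of efficient, dynamic data storage the paragraph calls for, so that new arrivals can be combined with historical data without reprocessing the full stream.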