Performance Evaluation of Big Data and Business Intelligence Open Source Tools: Pentaho and Jaspersoft

Victor M. Parra (1) and Malka N. Halgamuge (2)
(1) Charles Sturt University, [email protected]
(2) The University of Melbourne, [email protected]

1 ABSTRACT

Despite the recent increase in the utilisation of "Big Data" and "Business Intelligence" (BI) tools, the research carried out on the implications of their implementation and performance is relatively scarce. Analytical tools have a significant impact on the development and sustainability of a company, since the evaluation of clients' information is a critical and crucial aspect of progress towards a competitive market. Owing to the notable growth in data volumes, managing huge amounts of information has become too complex for businesses. At a certain phase in their life cycle, all corporations need to implement different and improved data processing systems in order to optimise their decision-making procedures. Enterprises use Business Intelligence results to gather the information extracted and consolidated from the indicators captured in a Business Intelligence information system. This, in turn, gives companies a marked advantage in developing activities based on predictions and in competing in the market. Nevertheless, numerous companies implement these BI systems without conducting a preliminary examination of the requirements and needs of the organisation, and without defining the benefits and goals the organisation wants to accomplish through the implementation. Companies hardly ever analyse the costs associated with deploying these systems, which inevitably leads to non-recommended implementations that are generally incomplete, complex and impractical: in short, unmaintainable even if executed. Business Intelligence applications are precisely the tools that resolve this matter for companies that require data storage and administration. This chapter examines the two most recommended open source Business Intelligence applications currently available, Pentaho and Jaspersoft, processing big data through six databases of different sizes, with a special focus on their Extract, Transform and Load (ETL) and Reporting procedures, whose performance is measured with Computer Algebra Systems (CAS). Moreover, the chapter provides a complete explanation of the structures, features, and implementation scenarios for the many academics and IT specialists who are weighing the suitability of Big Data processing and the adoption of open source Business Intelligence applications based on their own requirements.

2 INTRODUCTION

Business Intelligence software adapts the stored data of a company's clientele profile and turns it into information that forms the pool of knowledge needed to create competitive value and advantage in the market the company operates in [1]. Additionally, Business Intelligence is used to back up and improve the business with practical data, and to use the analysis of that data to continually increase the organisation's competitiveness. Part of this analysis is to provide appropriate and timely results for management, so that decisions are made on factual information and concrete evidence.
Howard Dresner, of the Gartner Group [2], was the first to coin the term Business Intelligence (BI), as a term describing a set of concepts and procedures that support decision making by using information founded upon facts. A BI system gives enough data to use and evaluate the needs and desires of customers. Additionally, it allows one to [3]: i) create information for divisions or whole areas of a corporation, ii) design a database for clients, iii) build scenarios for decision making, iv) share material among parts or sections of a corporation, v) sandbox multidimensional strategies and proposals, vi) mine, transform and process data, vii) provide an original approach to decision making, and viii) improve the quality of customer service. The advantages of systematising BI include the combination of data from numerous sources of information [4], the generation of user profiles for data management, a decreased reliance on the IT division, a reduction in the time needed to obtain information, improved analysis, and even better readiness to access real-time information in accordance with precise existing business standards. The Gartner Magic Quadrant for business intelligence platforms released in 2015 [5] emphasised the changes being made in the BI area in order to speedily expand systems that can be used by corporate operators and specialists to obtain information from the collected data. Usually, BI is understood as a group of procedures, tools and technologies used to convert data into information, and then into personal profiles of customers, which are in turn produced into valuable structured data ready to be used by the different areas of the enterprise [6]. Consequently, Big Data will assist in the growth of better methods that permit BI tools to be used to collect information, for instance to [7]: i) transform and examine large amounts of information; ii) enlarge the universe of data taken into account when making decisions, beyond the intrinsic historical data of the business, to include data from external sources; iii) make available an instant answer to the sustained delivery of real-time data from systems and from the opportunities of interconnection among equipment; iv) deal with structures of complex and varied data; and v) finally, break away from the physical restrictions of data storage and processing by using scalable solutions with high availability at a reasonable cost. This chapter offers an experimental examination contrasting the two most recommended and accessible open source BI applications at present, Pentaho and Jaspersoft, processing big data through six databases of different sizes, with particular emphasis on their ETL and Reporting procedures, whose performance is measured with Computer Algebra Systems. The purpose of this work is to analyse and assess these tools, and to outline how they increase the quality of data, which in turn helps us to recognise market conditions and make projections based on trends. Section II describes the competencies and components of both the Pentaho and Jaspersoft BI open source suites. Section III presents the computer algebra systems SageMath and Matlab. This is followed by the materials and methods (Section IV) used in the analysis and experimentation, particularly the ETL and Reporting measurements and how they were implemented.
Section V presents the results of the study: CPU time as a function of the size of the input data for the ETL and Reporting procedures of both BI open source tools, using two different Computer Algebra Systems. Section VI contains the discussion of the experimentation, and Section VII the conclusion of the study.

3 PENTAHO AND JASPERSOFT BUSINESS INTELLIGENCE OPEN SOURCES

3.1 Pentaho

The Pentaho BI project is an ongoing open source community initiative that offers companies a strong application for their business intelligence requirements. Taking advantage of the richness of open source technologies and of contributions from the open source development community, this application is able to evolve considerably more rapidly than commercial vendors' products do [8]. Consequently, Pentaho offers an option that exceeds proprietary Business Intelligence solutions in numerous areas such as architecture, standards support, functionality and simplicity of deployment. This business intelligence tool, developed under the philosophy of free software for business decision making and management, has a platform composed of different programs that meet the requirements of BI. It offers solutions for information management and analysis, including multidimensional OLAP analysis, reporting, data mining and the creation of dashboards for the user [9]. The platform has been developed in the Java programming language and has a Java-based implementation environment, making Pentaho a very flexible solution able to cover a wide range of business needs.

Pentaho defines itself as a "solution-oriented" and "process-centric" BI platform that includes all the key components required to implement process-based solutions, having been designed for this from the ground up. The solutions that Pentaho offers basically consist of an integrated analysis and reporting infrastructure combined with a business process workflow engine [11]. The platform is able to execute the necessary business rules, expressed in the form of processes and activities, and is able to present and deliver the appropriate information at the required time. Pentaho offers suitable solutions across the range of resources needed to develop and maintain the processes of BI projects, from Extract, Transform and Load with Data Integration to dashboards with the Dashboard Designer [8]. This tool has built its Business Intelligence application by incorporating several existing and well-regarded projects: Data Integration was formerly known as Kettle and indeed keeps that old name informally, and Mondrian is another Pentaho component that retains its own identity. Pentaho components:

3.1.1 Pentaho Data Integration (previously Kettle)

This is the module in charge of the ETL processes. Although Extract, Transform and Load tools are most commonly used in data warehouse applications [10], Pentaho Data Integration can also be used for other tasks:

• Transferring data among programs or databases
• Exporting data from databases to flat files
• Loading data massively into databases
• Data cleansing
• Integrating applications

Pentaho Data Integration is easy to use. Each procedure is created with a graphical tool where users specify what to do without writing an algorithm that specifies how to do it; because of this, it can be said that Pentaho Data Integration is metadata-oriented.
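Purely as an illustration of the extract, transform and load pattern that Data Integration expresses graphically (this is not PDI code, and the file names, column names and transformation rule are hypothetical), a minimal Python sketch of the three stages might look as follows:

    import csv
    import sqlite3

    # Extract: read raw rows from a flat file, assumed to have
    # "name" and "country" columns (hypothetical customers.csv).
    with open("customers.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    # Transform: normalise one field; a stand-in for a PDI transformation step.
    for row in rows:
        row["country"] = row["country"].strip().upper()

    # Load: write the cleaned rows into a database table.
    con = sqlite3.connect("warehouse.db")
    con.execute("CREATE TABLE IF NOT EXISTS customers (name TEXT, country TEXT)")
    con.executemany("INSERT INTO customers VALUES (:name, :country)", rows)
    con.commit()
    con.close()

A real PDI transformation would describe the same three steps as metadata in a .ktr file and leave the execution to the engine.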
Pentaho Data Integration can be deployed as a stand-alone product, or together with the Pentaho Suite. As an Extract, Transform and Load tool, it is the most widespread open source application available. It supports a massive collection of input and output formats, including text files, spreadsheets, and a wide range of database engines. Furthermore, the transformation capabilities of Pentaho Data Integration allow data to be manipulated with almost no restrictions. This component is one of the most commonly used ETL solutions and is well regarded in the market [9]. It has a long history, strength and robustness, which make it a highly recommended tool. It allows transformations and functions to be built in a very easy and intuitive way, as illustrated in Fig. 1. Likewise, Data Integration projects are very simple to maintain.

Fig. 1. Pentaho Data Integration interface and workflow; the ETL component allows transformations and functions to be built in a very easy and intuitive way.

3.1.2 Web Application: BI Server

Pentaho BI Server provides the end-user web server and platform. It interacts with the Business Intelligence solutions previously created with Pentaho Reporting, Pentaho Analytics, Pentaho Services, Dashboards, Pentaho Data Integration and Pentaho Data Mining. In this way, the user can navigate the data, adjusting the view of the data and the visualisation filters, and adding or removing aggregation fields [11]. Data can be represented in SVG or Flash form, as dashboard widgets, or integrated with data mining systems and web portals (portlets). In addition, with Microsoft Excel Analysis Services, dynamic data can be analysed in Microsoft Excel (using the Mondrian OLAP server). This one-hundred-percent J2EE component makes it possible to administer all BI resources [10]. The web application provides an accessible BI user interface where reports, OLAP views and dashboards are kept, as shown in Fig. 2. Furthermore, this component gives access to an administration console that permits management and supervision of both the application and its usage. A report can be designed using the Report Designer or the Report Design Wizard and then deployed on the Pentaho BI Server.

Fig. 2. Pentaho Server user interface for administering the stored BI assets, OLAP views and dashboards. An administration console also permits management and supervision of both the application and its usage.

3.1.3 Pentaho Reporting

Most organisations use reports to record, visualise and analyse results. Therefore, reports are considered a major need in Business Intelligence. Pentaho Reporting enables organisations to easily access, format, and distribute information to employees, customers, and associates. Pentaho provides access to OLAP or XML-based relational data sources, as well as offering multiple output formats such as PDF, HTML, Excel or even plain text [8]. It also allows this information to be brought to end users via the web, e-mail, corporate portals or an organisation's own applications. Pentaho Reporting allows the reporting platform to grow as needs grow. The Pentaho Report Designer is an independent tool, part of Pentaho Reporting, that simplifies the reporting process, allowing report designers to create visually rich documents quickly; it is based on Pentaho's JFreeReport project.
The report designer offers a familiar graphical environment with intuitive, easy-to-use tools and a very accurate and flexible reporting structure, giving the designer the freedom to generate reports fully tailored to their needs. This module provides a complete reporting solution, covering all the features required in any reporting scenario, as illustrated in Fig. 3. In summary, this component is the old JFreeReport [11]; it offers a reporting tool, an execution engine, a metadata tool for displaying ad-hoc reports, and a client GUI that allows ad-hoc reporting.

Fig. 3. Pentaho Reporting interface with a complete reporting solution, covering all the features required in any reporting scenario.

3.1.4 OLAP Mondrian

OLAP is the acronym for On-Line Analytical Processing. It is a technique used in the Business Intelligence arena whose aim is to expedite the querying of large amounts of data. It uses multidimensional structures (OLAP cubes) that contain summarised data from huge databases or transactional systems (OLTP). It is used in company reporting for sales, marketing, management reports, data mining and similar areas. The reason for using OLAP for queries is speed of response [8]. A relational database stores entities in separate tables if they are normalised. This arrangement is good for an OLTP system, but for complex multi-table queries it is comparatively slow. A better search structure, although worse from the operational viewpoint, is a multidimensional database. To obtain the online analytical processing (OLAP) functionality, two further applications are used: the Mondrian OLAP server, which, combined with JPivot, allows data marts to be queried; results are presented through a browser, where the user can drill down and perform the rest of the typical navigations.

Mondrian, now renamed Pentaho Analysis Services, is the OLAP engine integrated in the Pentaho open source Business Intelligence suite. Mondrian is a ROLAP engine with a cache, which places it near the concept of hybrid OLAP. ROLAP means that Mondrian itself holds no data (except in the cache); the data reside in an external database management system. It is in that database that the tables making up the multidimensional information Mondrian works with (the star schemas of our data marts, for example) reside. MOLAP, by contrast, is the name given to OLAP engines in which the data reside in a dimensional structure [12]. Mondrian is responsible for receiving dimensional queries (in the MDX language) and returning the data of a cube; only this cube is not a physical structure but a set of metadata that defines how queries dealing with dimensional concepts are mapped to SQL statements, dealing with relational concepts, that obtain the information needed to satisfy the dimensional query (a sketch of this mapping is given after the list below). Some of the advantages of this model are:

• Not having to generate static cubes, saving the cost of generating them and the memory they occupy.
• The possibility of always using the data resident in the database, so that it works with updated data. Very useful in an operational BI environment.
• Although MOLAP systems traditionally have a certain performance advantage, Mondrian's hybrid approach, the use of a cache and of aggregate tables, makes it possible to obtain very good performance without losing the advantages of the classic ROLAP model. It is very important to take advantage of the database where the tables reside.
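As a loose illustration of that dimensional-to-relational mapping (the star schema and the table and column names here are hypothetical; the real Mondrian resolves them from the cube metadata), a dimensional request such as "total sales by region" ends up as a join-and-group-by over the star schema:

    # Hypothetical star schema: fact_sales joined to dim_store.
    def rollup_sql(measure, level):
        # A dimensional query over a measure and a dimension level is
        # answered by joining the fact table to the dimension table
        # and grouping by the requested level.
        return ("SELECT d.{lvl}, SUM(f.{m}) AS total "
                "FROM fact_sales f "
                "JOIN dim_store d ON f.store_id = d.store_id "
                "GROUP BY d.{lvl}").format(lvl=level, m=measure)

    print(rollup_sql("sales_amount", "region"))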
OLAP viewer, Pentaho Analyzer: an OLAP viewer included in the Enterprise edition [13]. It is more recent and simpler to use than JPivot, as shown in Fig. 4. Its AJAX-based GUI allows great flexibility when generating OLAP views.

Fig. 4. Pentaho Analyzer interface for generating OLAP views, offering a complete reporting solution and covering all the features required in any reporting scenario.

3.1.5 Dashboards

These panels are formed by a series of graphical controls that show, in a quick and intuitive way, the main indicators of business performance, so that on a single screen users can obtain measures of the operation of the company. Currently, there is not much support for building these controls, so a lot of the work has to be done by hand; however, a series of wizards is expected soon that will allow dashboards to be developed visually, like the rest of the Pentaho applications [12]. Dashboards are a development of Pentaho. They collect information from all the components of the platform, including external applications, RSS feeds and web pages. They include content management and filtering, role-based security, and drill-down. They can be integrated into third-party applications, portals, or the Pentaho platform itself. To generate graphs they rely on JFreeChart, a library that produces the most common chart types (2D, 3D, bar, time series, Gantt, etc.), offers interfaces to different data sources, exports to PNG, JPEG and PDF, and supports servlets, JSPs, applets and client applications. All components of the Pentaho Reporting and Pentaho Analysis modules can be part of a dashboard. In Pentaho Dashboards it is very easy to incorporate a wide variety of graphs, tables and speedometers (dashboard widgets) and to integrate them with JSP portlets, where reports, charts and OLAP analyses can be viewed. Pentaho offers the option of constructing dashboards [13] through the web interface, using the Dashboard Designer, as illustrated in Fig. 5.

Fig. 5. Pentaho Dashboards interface, which offers the option of constructing dashboards through the web interface using the Dashboard Designer.

3.2 Jaspersoft

Jaspersoft is TIBCO's multi-client business intelligence platform: enterprise-class, self-service BI software that allows embedded analysis boards to be generated for large organisations. This BI tool offers pixel-perfect visualisation, standard web and mashup reporting, self-service BI, multitenancy for cloud and SaaS-based applications, a terabyte-scalable in-memory engine for Big Data analysis, and ETL and real-time integration. In addition, TIBCO Jaspersoft provides interactive reports for senior management, dashboards for executives, analytics, data exploration and discovery for analysts, and integration for data architects [14].
Jaspersoft's framework allows the user to easily integrate the various data sources available in the company, apply multidimensional analysis techniques, present the results in control panels and dynamic reports, and provide this sensitive information to top management. This open source reporting solution is one of the tools most desired by developers for embedding in any Java application that needs a reporting scheme. The reporting engine is the heart of the Jaspersoft BI solution [15], which takes a different approach from Pentaho: Jaspersoft has unified its own projects while also drawing on existing, related projects that it has not absorbed. Jasper can, for instance, take the Mondrian code, adjust it, and continue its developments on top of Mondrian. Jaspersoft components:

3.2.1 ETL

JasperETL is used for the selection and processing of the multiple data sources that configure and feed the corporate data warehouse. Jaspersoft's data integration software extracts, transforms and loads data from different sources into a data warehouse for the creation of analysis reports. The software provides more than 400 connectors to integrate a wide variety of corporate systems and legacy applications. JasperETL is, in fact, Talend Studio. Talend, unlike Kettle, has not been absorbed by Jasper; it is still an autonomous company that delivers its products independently [15]. Talend likewise has a native and intuitive user interface, although the methodology is very different: Talend is a code generator, so the outcome of an ETL design is generated Java or Perl code. The component can also compile and create Java procedures or commands. This module is better suited to programmers with a higher level of technical knowledge than Kettle requires, as shown in Fig. 6; in return, it offers a higher level of technical control.

Fig. 6. The JasperETL interface is, in fact, Talend Studio: a native and intuitive user interface, and a code generator as well.

3.2.2 Web Application: JasperServer

JasperServer is configured as a stand-alone application container that holds all of the elements described above, adding security and resource-access capabilities. JasperServer is a one-hundred-percent J2EE application that permits the management of all the BI assets [16]. The general appearance of the web application is somewhat plain without giving up any power, as illustrated in Fig. 7. Having all resources accessible from the top button bar makes it a thoroughly usable application that contains all the essential assets for BI. Its main features:

• It is the suite's report server
• Can be used stand-alone or from other applications
• Provides the entry point for reporting and data analysis
• A user interface that is very easy to use and customisable
• Can execute, schedule and distribute reports to users and groups
• Can store issued reports
• Manages shared resources

Fig. 7. The JasperServer interface, a one-hundred-percent J2EE application for administering all BI resources.

3.2.3 Reports

This component is used for the design and presentation of the dashboards and reports that carry the indicators required by the management of the organisation. The software allows data from one or more sources to be presented in 10 highly interactive format types for business users [14].
As said previously, the report engine is the core of the Jaspersoft solution; it is shown in Fig. 8. This module offers features such as: i) a report development environment (iReport, an environment based on NetBeans); ii) a metadata system which, together with ad-hoc views, is the strongest point of the component; iii) a genuinely well-executed web interface for ad-hoc views; iv) the JasperReports runtime, widely recognised and embedded in numerous applications; and v) reports that can be exported to PDF, HTML, XML, CSV, RTF, XLS and TXT. The main features are:

• It is the basic library of the project
• It is the engine that runs the reports
• Widely used in the open source world
• Embeddable in Java desktop and web applications
• Great functionality and stability

Two different types of reports can be created:

• Predefined (iReport): iReport is a working environment that offers a huge number of features [17].
• Ad hoc: this is the true strong point of the Jasper solution. The ad-hoc view editor is a well-organised and self-contained analysis tool [17]. It provides: i) a choice of diverse templates and layouts, ii) a variety of data sources, iii) validation on the fly, iv) creation of reports by dragging fields to the desired place, and v) editing of all the features of the reports.

Fig. 8. Jaspersoft Reporting interfaces for designing and presenting different dashboards and reports with diverse indicators, and for presenting data from many sources in highly interactive formats for business users.

3.2.4 OLAP

Jaspersoft's data analysis application is used to design, manage and display every type of data through OLAP or in-memory analysis, in order to detect problems, identify trends, and make faster and more accurate decisions. JasperAnalysis is used for the design and support of the OLAP cubes that complement the structure of the scorecards, providing online information research and analysis tools. Mondrian is the OLAP engine used by JasperServer, and it works together with the JasperAnalysis viewer [18]. This viewer is essentially JPivot, already mentioned in the Pentaho description, with a cosmetic layer on top, as illustrated in Fig. 9. Its main characteristics are:

• It is the ROLAP client and server application.
• Allows users to explore data well beyond the usual reporting capabilities.
• The server does all the heavy work, so the presentation layer is very light.
• It can be linked with reports, both as their origin and as their destination.

Fig. 9. The OLAP engine used by JasperServer working with the JasperAnalysis viewer, which adds a cosmetic layer on top of JPivot.

3.2.5 Dashboards

Jaspersoft's dashboard software combines data and metrics to provide a summary graphical view of information so that users can react quickly. Thanks to embedded dashboard solutions, it is possible to increase the competitiveness of applications with advanced reports and analysis. These dashboards extend their applicability as follows.

Dashboards:

• Support iFrames with URLs for external content such as pictures, RSS feeds, etc.
• Unified dashboard interfaces mix internal company information with external data sources
• The internal and external data of a dashboard report are controlled by global parameters

Embedded dashboard solutions:

• Deliver a deeper view into data through reports and embedded dashboards
• Easily mix interactive reports, charts and external web content in a single display
• Provide interactive dashboards

Two different types of dashboard can be created through the Dashboard Designer, as illustrated in Fig. 10:

• Predefined: although the predefined boards do not come with much logic [19], custom developments can be included, since it is a Java platform.
• Ad hoc: as a web editor, the Dashboard Designer has very basic and straightforward functionality.

Fig. 10. Jaspersoft Dashboard Designer, equipped with the option to choose predefined or ad-hoc dashboards.

4 COMPUTER ALGEBRA SYSTEMS

4.1 SageMath

SageMath is a computer algebra system (CAS) built on top of mathematical packages such as NumPy, SymPy, PARI/GP and Maxima, whose combined power it exposes through a common language based on Python. The notebook interface mixes code segments with graphics, text, and formulas rendered with LaTeX. SageMath is divided into a kernel that performs the calculation operations and an interface that displays the processes and interacts with the user. The CAS also has a command-line interface based on Python code, which allows calculations to be controlled interactively [20]. Python is a powerful programming language that supports object-oriented expression and efficient programming. The core of SageMath is written in Python together with an adapted form of Pyrex called Cython. These features allow SageMath to perform parallel processing [21], making use of both multi-core processors and symmetric multiprocessors. Additionally, it offers interfaces to licensed tools such as Mathematica, Magma and Maple; this important characteristic permits users to combine software and compare results and performance.

Sage is thus an open source environment for mathematical calculations that can carry out algebraic, symbolic and numerical computations. The goal of Sage is to provide a free and viable alternative to Magma, Maple, Mathematica and Matlab, all of which are powerful commercial programs. Sage serves as a symbolic calculator of arbitrary precision, but it can also perform calculations and solve problems using numerical methods. These calculations are performed through built-in algorithms and external ones such as Maxima, NTL, GAP, PARI/GP, R and Singular [21]. Sage consists not only of the program itself, which performs the calculations and can be driven from the terminal, but also incorporates a graphical user interface through any web browser. The package covers most needed functionality, for instance: i) libraries of elementary and special functions, ii) 2D and 3D plots of both functions and data, iii) data manipulation tools and facilities, iv) a toolkit for attaching GUIs to calculations, v) tools for image processing via Python and Pylab, vi) tools for displaying and analysing graphs, vii) filters for importing and exporting data, images, video, CAD, and GIS, and viii) Sage embedded in LaTeX documents [22].
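As a small illustration of this Python-based interface, the following sketch of a Sage REPL session relies on names that Sage predefines (var, integrate, factorial, the constant e) and on the Sage preparser, which maps ^ to exponentiation:

    x = var('x')
    f = integrate(x^2 * e^(-x), x)   # symbolic integration
    print(f)                         # -(x^2 + 2*x + 2)*e^(-x)
    n = factorial(100000)            # exact arbitrary-precision arithmetic
    print(n.ndigits())               # number of decimal digits of 100000!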
4.2 Matlab

Matlab is a computer algebra system (CAS) that offers an integrated environment for developing and delivering specific functionality, for instance algorithm implementation, data representation, and functions. It also offers interaction with other programming languages and with hardware devices [23], among other innovations. The Matlab suite has two companion products that extend these functionalities: Simulink, a platform for multidomain simulation, and GUIDE, a graphical user interface (GUI) builder. In addition, its capabilities can be extended with Matlab toolboxes and with Simulink block sets.

Matlab has an interpreted programming language that can be executed both interactively and through scripts (*.m files). The language supports vector and matrix operations natively, as well as general calculations and object-oriented programming. The ability to work in .NET or Java environments has been made possible by the release of the Matlab Builder tools, which contain an embedded "Application Deployment" capability that makes it feasible to use and manage Matlab's own functions as library files. It is important to note that the Matlab Component Runtime (MCR) must be installed on the same computer where the deployed program runs, to ensure the correct functioning of Matlab [24]. Conducting measurements of many kinds and providing an interface for interacting with other programming languages is one of the great versatilities provided by this CAS: Matlab can invoke subroutines, procedures or functions written in C or Fortran [25]. When this is done, a wrapper function is created that allows their data types to be translated to and from Matlab's own data types.

MATLAB also provides a number of specific solutions called toolboxes. These are very important to most MATLAB users; they are sets of MATLAB functions that extend the MATLAB environment to solve specific classes of problems, for instance:

• Signal processing
• Design of control systems
• Simulation of dynamic systems
• Identification of systems
• Neural networks, and others

Probably the most important feature of MATLAB is its extensibility. In summary, the most important features of MATLAB are:

• Programs written in a mathematical language.
• Implementation of matrices as a basic element of the language, which allows a great reduction of code, since matrix calculations do not need to be reimplemented.
• Implementation of complex arithmetic.
• A large set of specific commands, grouped in toolboxes.
• The ability to extend and adapt the language, using script files and .m functions.

5 MATERIALS AND METHODS

Measuring the execution times of an application is not an easy or trivial task. The results obtained can differ significantly between measurements and from one computer to another. The factors influencing these times include, among others, the algorithm used, the type of operating system, the number of processors and their speeds, the instruction set of each processor, the amount and speed of memory (RAM and cache), and the mathematical coprocessor.
In addition, the same algorithm run on the same computer can show different timing results; this can be due to the time being consumed by other applications, or to whether or not there is enough RAM for the execution of the application or algorithm under test. The goal of this study is to compare exclusively the ETL and Reporting processes of both BI tools, trying to obtain conclusions that are independent of the particular computer, and likewise to measure execution times as a function of the size of the input data. To accomplish this, two methodologies are used: calculating the processing time taken by the BI tools on input data of the same proportions, and calculating the number of instructions carried out by each tool.

5.1 ETL Measurement

The execution times and performance of the ETL processes of both BI tools were measured with Sage. For this, a control algorithm was implemented (C++ code, shown here in cleaned-up form):

Algorithm 1: ETL Measurement

    #include <iostream>
    using namespace std;

    int main() {
        int rbi = 0, rdb = 0, rsage = 0;
        cout << "Run BI Tool" << endl;    // 1 = Pentaho, 2 = Jaspersoft
        cin >> rbi;
        cout << "Run Database" << endl;
        cin >> rdb;
        while (rdb <= 6) {                // iterate over the six databases
            rdb = rdb + 1;
            cout << "Run Sage" << endl;   // trigger the Sage measurement
            cin >> rsage;
            if (rbi == 1) {
                cout << "Show Statistics" << endl;
                return 0;
            }
            if (rbi == 2) {
                cout << "Run Database Again" << endl;
                return rdb;
            }
        }
        return 0;
    }

To measure CPU time, Sage uses the notions of CPU time and wall time [21], the intervals that the PC dedicates solely to the BI tool. CPU time is the time devoted to the computations themselves, while wall time is the clock time between the start and the completion of the calculations. Both measurements are subject to unexpected variations. The easiest way to obtain the execution time is to prefix the command with the word time, as illustrated in Fig. 11. The time command is not very flexible, however, and additional parameters are required to calculate the exact time dedicated to a single program or application. For this purpose the cputime and walltime commands are used: cputime progressively and accurately accumulates the time the CPU has dedicated to Sage, while walltime is the traditional UNIX wall clock. The times before and after the execution of the algorithms in Sage were recorded and measured, and the differences are shown in Fig. 12.

Fig. 11. The Sage algorithm to calculate the CPU time of the ETL procedures in both BI tools, for small and large data magnitudes.

Fig. 12. The Sage algorithm with the cputime and walltime parameters to measure the ETL procedure in both BI tools.

The times used in the execution of the factorial function, together with input data of different sizes, are stored in the list of CPU times, as illustrated by the algorithm in Fig. 13.

Fig. 13. The Sage algorithm to store the lists of CPU times taken by the execution of the factorial function, together with input data of different sizes.
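The pattern behind Figs. 11-13 can be sketched in a few lines of Sage (Sage commands are Python syntax; cputime() and walltime() return seconds since the process started and, when passed an earlier reading, return the elapsed interval; the factorial call is a stand-in for the measured workload):

    cpu_times = []                        # (input size, CPU seconds), as in Fig. 13
    for k in [3, 4, 5, 6]:
        t_cpu, t_wall = cputime(), walltime()
        result = factorial(10^k)          # stand-in for processing one database
        dt_cpu, dt_wall = cputime(t_cpu), walltime(t_wall)
        cpu_times.append((10^k, dt_cpu))
        print("n = 10^%d: cpu %.3f s, wall %.3f s" % (k, dt_cpu, dt_wall))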
5.2 Reporting Measurements

The execution times and performance of the Reporting processes of both Business Intelligence tools (Pentaho and Jaspersoft) were measured with the computer algebra system Matlab. To carry out this process, a control algorithm was implemented (C++ code, shown here in cleaned-up form):

Algorithm 2: Reporting Measurement

    #include <iostream>
    using namespace std;

    int main() {
        int rbi = 0, rdb = 0, rmatlab = 0;
        cout << "Run BI Tool" << endl;     // 1 = Pentaho, 2 = Jaspersoft
        cin >> rbi;
        cout << "Run Database" << endl;
        cin >> rdb;
        while (rdb <= 6) {                 // iterate over the six databases
            rdb = rdb + 1;
            cout << "Run Matlab" << endl;  // trigger the Matlab measurement
            cin >> rmatlab;
            if (rbi == 1) {
                cout << "Build Analytics" << endl;
                return 0;
            }
            if (rbi == 2) {
                cout << "Show Error" << endl;
                return rdb;
            }
        }
        return 0;
    }

Additionally, a function written in the C language that uses the Windows High-Resolution Performance Counter was applied, in order to calculate the execution times of the Reporting processes in both BI tools, as shown in Fig. 14. In this scheme, QueryPerformanceCounter plays the role of clock() and QueryPerformanceFrequency that of CLOCKS_PER_SEC: the first call returns the current counter value and the second returns the counter frequency (in counts per second, hertz). A LARGE_INTEGER, a union type, is the clearest way to represent the 64-bit integer these calls produce.

Fig. 14. The algorithm to calculate the run time of the Reporting process in both BI tools.
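For reference, the same high-resolution pattern is available from Python's standard library: time.perf_counter() is backed by QueryPerformanceCounter on Windows, with the frequency already divided out. A minimal sketch, in which the workload function is a hypothetical stand-in for launching a report run:

    import time

    def run_reporting_job():
        # hypothetical placeholder for triggering the BI reporting process
        return sum(i * i for i in range(10**6))

    start = time.perf_counter()            # counter reading, already in seconds
    run_reporting_job()
    elapsed = time.perf_counter() - start
    print("Reporting took %.3f s" % elapsed)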
5.3 Computer System

To carry out this examination and the corresponding analysis, the BI tools (Pentaho and Jaspersoft), the CAS (Sage and Matlab) and the databases were installed and configured on a computer with the following characteristics: i) operating system: x64-based, ii) operating system version: 10.0.10240, Build 10240, iii) number of processors: 1, iv) processor: Intel(R) Core(TM) i5-3317U, v) processor speed: 1.7 GHz, vi) instruction set: CISC, vii) RAM: 12 GB, viii) RAM speed: 1600 MHz, ix) cache: SSD Express 24 GB, x) mathematical coprocessor: 80387, xi) GPU: HD 4000 on board.

5.4 Databases Analysis

Table 1 describes the key characteristics of the Excel databases used to carry out this examination. The six different databases were obtained from the UCI Machine Learning Repository [26].

Table 1. The Excel databases used for this examination.

    Database   Number of Attributes   Number of Instances   Size in MB
    DB1        21                     65,055                0.009
    DB2        26                     118,439               0.017
    DB3        35                     609,287               0.134
    DB4        40                     999,231               1.321
    DB5        51                     1,458,723             35.278
    DB6        62                     2,686,655             144.195

6 RESULTS

In this examination, the data resulting from the CPU time measurements (using two different computer algebra systems, and in relation to the size of the different databases) were acquired from the input records applied to the ETL and Reporting processes of both BI tools. The considerable fluctuations in the computational time measured for each of the processes are directly linked to factors such as the implemented algorithms, the type of operating system installed on the test computer, the number and speed of its processors, the processor's instruction set, the amount and speed of RAM and cache, and the mathematical coprocessor, among others. Similarly, on the same computer, running the same algorithm at different times can produce different results, caused by the amounts of memory allocated to other programs and their execution speeds. The data obtained for the CPU times taken by the computer during the execution of the ETL and Reporting processes, over the six databases, are detailed in Tables 2 and 3.

Similarly, the increase in processing time with growing data can be taken as a point of difference between the two applications. The first evaluation (Table 2) shows that the computation times taken by the Pentaho ETL process, as calculated by Sage, were: 8 min, 12.01 min, 21 min, 32.01 min, 39.06 min and 48.01 min. In contrast, the computation times taken by Jaspersoft, also calculated by Sage, were: 9.54 min, 19.32 min, 31.88 min, 44.73 min, 55 min and 67.69 min, processing the 0.009 MB of DB1, 0.017 MB of DB2, 0.134 MB of DB3, 1.321 MB of DB4, 35.278 MB of DB5 and 144.195 MB of DB6, correspondingly, for both BI applications. The results obtained for the CPU time used during the ETL process, processing the data of the different databases with both BI tools, are given in Table 2.

Table 2. CPU time of the ETL process used by both BI tools, and the percentage increase after processing the databases.

    Time in minutes
    Tool         Process   DB1     DB2     DB3     DB4     DB5     DB6
    Pentaho      ETL       8.00    12.01   21.00   32.01   39.06   48.01
    Jaspersoft   ETL       9.54    19.32   31.88   44.73   55.00   67.69

    Increment in the processing of data
    Jaspersoft   ETL       19.22%  60.85%  51.79%  39.75%  40.77%  40.99%

Conversely, in the second segment of the analysis, concerning the measurement of the Reporting process (Table 3), Matlab calculated the times used by the Pentaho tool as: 3.75 min, 5.35 min, 8.47 min, 12.03 min, 17.07 min and 22.60 min. In contrast, the times used in the same process by the Jaspersoft tool were: 3 min, 4.02 min, 6.05 min, 8.13 min, 11.16 min and 14.15 min, processing the 0.009 MB of DB1, 0.017 MB of DB2, 0.134 MB of DB3, 1.321 MB of DB4, 35.278 MB of DB5 and 144.195 MB of DB6, correspondingly, for both BI tools. The results obtained for the CPU time used during the Reporting process, processing the data of the different databases with both BI tools, are given in Table 3.

Table 3. CPU time of the Reporting process used by both BI tools, and the percentage increase after processing the databases.

    Time in minutes
    Tool         Process     DB1     DB2     DB3     DB4     DB5     DB6
    Pentaho      Reporting   3.75    5.35    8.47    12.03   17.07   22.60
    Jaspersoft   Reporting   3.00    4.02    6.05    8.13    11.16   14.15

    Increment in the processing of data
    Pentaho      Reporting   25%     32.99%  40%     48%     53%     59.75%

The execution-time data obtained for the ETL and Reporting processes of both BI tools, after processing the databases, are now shown and compared graphically. It can be observed in Fig. 15 that the Jaspersoft tool had a significant increase in CPU time for the ETL process, represented by 19.22%, 60.85%, 51.79%, 39.75%, 40.77% and 40.99% for processing DB1, DB2, DB3, DB4, DB5 and DB6, correspondingly. This signifies that the Pentaho tool performs better than Jaspersoft for this process.

Fig. 15. CPU time of the ETL process taken by both BI tools after processing the databases.

Then, as seen in Fig. 16, the results obtained for the Reporting process show a notable increase when executed by the Pentaho tool, expressed by 25%, 32.99%, 40%, 48%, 53% and 59.75% when processing DB1, DB2, DB3, DB4, DB5 and DB6, respectively. In this case, Jaspersoft performed better than Pentaho.

Fig. 16. CPU time of the Reporting process taken by both BI tools after processing the databases.
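The percentage increments reported in Tables 2 and 3 are simply the relative differences between the slower and the faster tool on each database; a few lines of Python reproduce the Table 2 row from the measured times (the values match the published percentages up to the rounding of the reported minutes):

    # CPU times in minutes from Table 2 (ETL process, DB1..DB6)
    pentaho    = [8.00, 12.01, 21.00, 32.01, 39.06, 48.01]
    jaspersoft = [9.54, 19.32, 31.88, 44.73, 55.00, 67.69]

    for i, (p, j) in enumerate(zip(pentaho, jaspersoft), start=1):
        increment = 100 * (j - p) / p     # Jaspersoft's increment over Pentaho
        print("DB%d: %.2f%%" % (i, increment))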
The results also showed that the CPU times of both the ETL and Reporting processes are directly related to the data size of the databases. Finally, it can be established that Pentaho showed the better performance for the ETL process, and Jaspersoft the better performance for the Reporting process.

7 DISCUSSION

Clearly, after the experimentation and analysis of the data resulting from the ETL process, the Jaspersoft BI tool showed a marked increase in the CPU time needed for data processing, on average 42.28% more than Pentaho across the six databases. The Pentaho tool demonstrated the data integration capabilities and ETL process competence presented in "Pentaho Business Analytics: An Open Source Alternative to Business Intelligence" [9] and in "Pentaho and Jaspersoft: A Comparative Study of Business Intelligence Open Source Tools Processing Big Data to Evaluate Performances" [30], with better performance. Undoubtedly, this study demonstrates that the Pentaho tool has the greater ETL processing capability, while at the same time covering the objectives and requirements of data integration, including with Big Data. Its parallel processing engine provided great performance, and these characteristics are presented in "Pentaho Reports Review" [11].

In the second segment of this research, the Pentaho BI tool showed a significant increase in the CPU time used for data processing in the Reporting process, compared with the results of the Jaspersoft tool; the difference amounted to 43.12% when processing the six databases. This part of the evaluation has shown that the Jaspersoft BI tool has the better performance in Reporting processes. This finding is consistent with other published results that support Jaspersoft's growing BI capabilities, encompassing documents based on its operational output, interactive end-user querying, and data integration and analysis, as mentioned in "Pentaho Reports Review" [11] and in the performance evaluation of Pentaho and Jaspersoft [30]. In addition, the exploration of various security features could be an attractive direction for examining and implementing Big Data projects in the future, as discussed in "Threat Analysis of Portable Hack Tools from USB Storage Devices and Protection Solutions" [27] and "Optimizing Windows Security Features to Block Malware and Hack Tools on USB Storage Devices" [29].

8 CONCLUSION

Two of the best BI applications on the market have been tested in this study: Pentaho and Jaspersoft. Both tools exhibit important features in their modules. On the one hand, Pentaho presents an ETL component that is easy to use and maintain, with great flexibility for performing transformations; it groups all information components on a single functional platform; a one-hundred-percent J2EE web application that is extensible, flexible and configurable; the Mondrian OLAP engine, widely used in Java environments; configuration management covering almost all environments, intercommunicating with other applications through web services; Reporting, an intuitive tool that lets users generate reports easily; and the Dashboard Designer for generating ad-hoc dashboards, dashboards based on SQL queries or metadata, and great autonomy through a wide variety of widgets and options.
On the other hand, Jaspersoft offers JasperETL (Talend), with native Java/Perl code generation; the services it provides are well defined, most of them supported by the same web application; a web application with a one-hundred-percent J2EE component that is expandable, flexible and configurable; the reports and the ad-hoc editing of dashboards are solved in an outstanding way; reports are fast; and ad-hoc reports have an interesting, flexible, powerful, intuitive and easy-to-use interface.

The key focus of this experimental study is the evaluation of the ETL and Reporting processes of the BI tools, measuring their performance through two computer algebra systems, Sage and Matlab. The evaluation of the ETL process produced noticeable results, showing marked increments in the CPU times used by Jaspersoft over those used by Pentaho: the Jaspersoft tool used 42.28% more time on this performance metric when processing the data of the six databases. Pentaho, meanwhile, showed during the measurement tests the use of more CPU time in the Reporting process, compared to Jaspersoft; the notable increase in the execution times used by Pentaho amounted to 43.12% on this performance metric when processing the same databases. Finally, this experimental analysis should be a convenient reference document for many researchers, as well as for those who are supporting Big Data processing decisions and the implementation of open source Business Intelligence tools from a process perspective. As future work, new experiments could be developed in the space of BI and Data Warehousing in support of organisational decision-making, taking this research as a reference.

References

1. B. List, R. M. Bruckner, K. Machaczek, J. Schiefer (2002) A Comparison of Data Warehouse Development Methodologies: Case Study of the Process Warehouse. Database and Expert Systems Applications DEXA, France, Volume 2453, pp 203-215
2. H. Dresner (1993) Business Intelligence: Competing Against Time. Paper presented at the Twelfth Annual Office Information Systems Conference, Gartner Group, London
3. S. Atre, L. T. Moss (2003) Business Intelligence Roadmap: The Complete Project Lifecycle for Decision-Support Applications. Addison Wesley, Boston
4. J. F. Gonzalez (2011) Critical Success Factors of a Business Intelligence Project. Novática 211:20-25
5. R. L. Sallam, B. Hostmann, K. Schegel, J. Tapadinhas, J. Parenteau, T. W. Oestreich (2015) Magic Quadrant for Business Intelligence and Analytics Platforms. Available http://www.gartner.com/doc/2989518/magic-quadrant-business-intelligence-analytics. Accessed 9 Nov 2016
6. Gartner, Inc (2016) IT Glossary. Available http://www.gartner.com/it-glossary/business-intelligence-bi/. Accessed 9 Nov 2016
7. R. Kune, P. K. Konugurthi, A. Agarwal, R. R. Chillarige, R. Buyya (2016) The Anatomy of Big Data Computing. Software: Practice and Experience 79-105
8. Pentaho, A Hitachi Group Company (2005-2016) Pentaho: Data Integration, Business Analytics and Big Data Leaders. Available via Pentaho Corporation http://www.pentaho.com. Accessed 10 Nov 2016
9. D. Tarnaveanu (2012) Pentaho Business Analytics: a Business Intelligence Open Source Alternative. Database Systems Journal 3:13
10. T. Kapila (2014) Pentaho BI & Integration with a Custom Java Web Application.
Available http://www.neevtech.com/blog/2014/08/13/pentaho-bi-integration-with-a-custom-java-web-application-2/. Accessed 11 Nov 2016
11. Innovent Solutions (2016) Pentaho Reports Review. Available http://www.innoventsolutions.com/pentaho-review.html. Accessed 12 Dec 2016
12. G. Pozzani (2014) OLAP Solutions using Pentaho Analysis Services. Available http://profs.sci.univr.it/~pozzani/attachments/pentaho_lect4.pdf. Accessed 12 Dec 2016
13. Sanket (2015) FusionCharts Integration in Pentaho BI Dashboards. Available http://www.fusioncharts.com/blog/2011/05/free-plugin-integrate-fusioncharts-in-pentaho-bi-dashboards/. Accessed 13 Nov 2016
14. TIBCO Jaspersoft (2016) Jaspersoft Business Intelligence Software. Available via TIBCO Software http://www.jaspersoft.com. Accessed 15 Nov 2016
15. S. Vidhya, S. Sarumathi, N. Shanthi (2014) Comparative Analysis of Diverse Collection of Big Data Analytics Tools. International Journal of Computer, Electrical, Automation, Control and Information Engineering 9:7
16. T. Olavsrud (2014) Jaspersoft Aims to Simplify Embedding Analytics and Visualizations. Available http://www.cio.com/article/2375611/business-intelligence/jaspersoft-aims-to-simplify-embedding-analytics-and-visualizations.html. Accessed 16 Dec 2016
17. S. Pochampalli (2014) Jaspersoft BI Suite Tutorials. Available http://www.jasper-bi-suite.blogspot.com.au/. Accessed 17 Dec 2016
18. J. Vinay (2013) OLAP Cubes in Jasper Server. Available http://hadoopheadon.blogspot.com.au/2013/07/setting-up-olap-cubes-in-jasper.html. Accessed 19 Nov 2016
19. Informatica Data Quality Unit (2013) Data Quality: Dashboards and Reporting. Available http://marketplace.informatica.com/solution/data_quality_dashboards_and_reporting-961. Accessed 21 Dec 2016
20. SageMath (2016) SageMath - Open-Source Mathematical Software System. Available via Sage http://www.sagemath.org. Accessed 21 Nov 2016
21. AIMS Team (2016) Sage. Available http://www.launchpad.net/~aims/+archive/ubuntu/sagemath. Accessed 21 Dec 2016
22. W. Stein (2016) The Origins of SageMath. Available http://www.wstein.org/talks/2016-06-sage-bp/bp.pdf. Accessed 28 Nov 2016
23. MathWorks (2016) MATLAB - MathWorks - MathWorks Australia. Available via MathWorks http://www.au.mathworks.com. Accessed 28 Dec 2016
24. M. S. Gockenbach (1999) A Practical Introduction to Matlab. Available http://www.math.mtu.edu/~msgocken/intro/intro.html. Accessed 28 Dec 2016
25. K. Black (2016) Matlab Tutorials. Available http://www.cyclismo.org/tutorial/matlab/. Accessed 29 Nov 2016
26. M. Lichman (2013) UCI Machine Learning Repository. Available http://www.archive.ics.uci.edu/ml. Accessed 10 Dec 2016
27. D. V. Pham, A. Syed, A. Mohammad and M. N. Halgamuge (2010) Threat Analysis of Portable Hack Tools from USB Storage Devices and Protection Solutions. International Conference on Information and Emerging Technologies 1-5
28. D. V. Pham, A. Syed and M. N. Halgamuge (2011) Universal Serial Bus Based Software Attacks and Protection Solutions. Digital Investigation 7, 3:172-184
29. D. V. Pham, M. N. Halgamuge, A. Syed and P. Mendis (2010) Optimizing Windows Security Features to Block Malware and Hack Tools on USB Storage Devices. Progress in Electromagnetics Research Symposium 350-355
30. Victor M. Parra, Ali Syed, Azeem Mohammad and Malka N. Halgamuge (2016) Pentaho and Jaspersoft: A Comparative Study of Business Intelligence Open Source Tools Processing Big Data to Evaluate Performances. International Journal of Advanced Computer Science and Applications 10.14569:1-10