Big-Data Security Management Issues

Marisa Paryasto, Andry Alamsyah, Budi Rahardjo, Kuspriyanto

Abstract—The big data phenomenon arises from the increasing amount of data collected from various sources, including the internet. Big data possesses characteristics that make it difficult to manage from a security point of view. This paper looks at the NIST risk management guidance and determines whether it is applicable to big data.

Index Terms—Big data, security, management

I. INTRODUCTION

Data analytics is the key to understanding information or knowledge about certain activities. The process of data analysis includes checking, cleaning, modelling, and transforming the data. The information gathered from this process is then used for suggestions, summaries, and decision support [5] [6].

The big data phenomenon is triggered by the rapid growth of various social network services. User-generated content is responsible for a huge volume of data to be analyzed for many purposes, from business to security. Machine-to-machine (M2M) communication and the Internet of Things also produce vast amounts of data. Data from other fields, e.g., DNA sequencing, also contribute to big data.

The implications of big data for data analytics are significant. In business management [8], data gathered from online conversations between members of a community can inform marketing strategy, supply chain management, customer relationship management, competitive advantage, and the business intelligence of a company. In informatics, a new learning system approach for artificial intelligence using big data has already been developed. In astronomy, NASA has also used big data to support research mapping star formation in the sky.

Big data has certain characteristics known as the 4V [7]: volume, variety, velocity, and value. Currently, 2.5 quintillion bytes of data are produced daily. The format (and content) of the data varies and is unstructured.
The speed of data creation is faster than the speed of analysis. The use and value of the data also vary. This creates a problem in the analysis and safeguarding of big data.

One of the biggest problems in big data is security. Some big data initiatives have failed due to unclear security controls. Thus, security is important in big data implementation. Security can be seen from three aspects: confidentiality, integrity, and availability. Confidentiality is concerned with securing access to big data. Unfortunately, the massive size of the data sources and the mixing of sources make it difficult to decide who is granted permission to access and analyze the data [2]. Who can access the derived data? Can a third party access or sell the combined data? In terms of integrity, there is no single method to guarantee the integrity of data in a variety of unstructured formats. A conventional uniform message digest may not work. As for availability, redundancy is difficult due to the size and distributed nature of big data.

NIST Special Publication 800-30 (2012) [4] is a guidance document for conducting risk assessments of information security. This guidance provides senior leaders and executives in organisations with the information needed to determine appropriate courses of action in response to identified risk. The objective of this paper is to map big data characteristics onto the steps outlined in the NIST document. (See Figure 1.) The final goal is to see whether the guidance is applicable to big data. Our roadmap is outlined in Figure 2.

II. BIG DATA

A. Definitions

Big data is essentially different from traditional relational databases in terms of requirements and architecture. Big data is often measured by the 4V (volume, variety, velocity, and value). Referring to [1], some of the fundamental differences in big data architecture are listed below.

1) Distributed architecture. Big data architecture is highly distributed, with a scale of thousands of data and processing nodes.
Big data architecture is generally highly resilient and fault tolerant because the data is horizontally partitioned, replicated, and distributed among the available data nodes.
2) Real-time, stream, and continuous computations. Data are produced in real time and in streaming fashion. Computations on these data must be done continuously (and, ideally, in real time as well).
3) Ad-hoc queries. Since the content and value of the data vary, the queries on the data also vary and are ad hoc. The queries are done on the fly.
4) Parallel and powerful programming language. The computations performed in big data are much more complex, highly parallel, and computationally intensive than those done with traditional programming or database languages.
5) Move the code. Due to the size of the data, it is easier to move the code than to move the data. This makes security control more difficult.
6) Non-relational data. The data stored in big data is mostly non-relational. The main advantage of non-relational data is that it can accommodate a large volume and variety of data.
7) Auto-tiering. In big data, it is extremely difficult to know precisely where the data is located among the available data nodes, because the hottest data blocks are tiered into higher-performance media while the coldest data is sent to lower-cost, high-capacity drives.
8) Variety of input data sources. Big data requires collecting data from many sources such as logs, endpoint devices, social media, etc. It is more difficult to determine who has access to what.

Fig. 1. Generic Risk Model (NIST)

B. Vulnerabilities

The following is a list of vulnerabilities found in big data.

1) Insecure computation. An insecure program can access sensitive data (personal profile, age, credit cards, etc.), can corrupt the data, leading to incorrect results, and can perform a denial of service on the big data solution, leading to financial loss.
2) End-point input validation/filtering.
There are two fundamental challenges in the data collection process: input validation and data filtering. The amount of data collected in big data makes it difficult to validate and filter data on the fly.
3) Granular access control. Existing big data solutions are designed for performance and scalability, with almost no security in mind.
4) Insecure data storage and communication. This includes data storage at various distributed data nodes, auto-tiering, real-time analytics and continuous computation, secure communication (among nodes, middleware, and end users), and the transactional logs of big data.
5) Privacy-preserving data mining and analytics. There are many concerns pertaining to monetizing and sharing big data analytics in terms of invasion of privacy, invasive marketing, and unintentional disclosure of sensitive information.

C. Security

Big data has complexities that most people and companies are unprepared to deal with. These complexities include the security and governance of data in general. Information governance is the capability to create an information resource that can be trusted by employees, partners, and customers, as well as government organizations [3].

Big data comes from many data sources that might have different security and governance policies. A well-defined security strategy has to be applied to any form of information management. Combining security and governance strategies requires collaboration and coordination to share responsibilities across the organizations and parties involved, so that accountability is enforced for the data being used.

A common security solution is to encrypt the data. However, different kinds of data require different forms of security protection. Applying the same kind of encryption to everything (choosing the strongest one) may result in high cost and complicated procedures. Some data-safeguarding techniques suggested in [3] are:

1) Data anonymization. The process of removing all data that can be uniquely tied to an individual.
2) Tokenization.
Protecting sensitive data by replacing it with tokens or alias values that are meaningless to unauthorized people. Data scrubbing is another term commonly used.
3) Cloud database controls. Setting up access controls to protect the database. However, this approach is very new.

III. METHODOLOGY

Our main question is whether the risk management framework offered by NIST SP 800-30 can be used for big data. Our approach is to map big data characteristics onto the steps outlined in the NIST document. For each step, NIST suggests a methodology for obtaining the data. There are three ways big data can affect the NIST framework: (1) no change, (2) the methodology is the same but the data is larger, and (3) the methodology must change. Using these, we map the content as shown in Table II.

Looking at the table, we can see that big data has an effect on the methodology, but not in a way that requires a new methodology. At most, we have to deal with larger data. Thus, the NIST SP 800-30 framework is still viable for big data.

Fig. 2. Big Data Security Roadmap

TABLE I: BIG DATA VULNERABILITY CLASSES

TABLE II: RISK ASSESSMENT ACTIVITIES NIST AND EQUIVALENT RISK SECURITY IN BIG DATA

IV. CONCLUSION

The NIST risk assessment framework described in NIST SP 800-30 [4] can be used for big data. The methodology for obtaining the data for risk assessment is still the same, although we may have to deal with larger data.

REFERENCES

[1] Jitendra Chauchan. Top 5 big data vulnerability classes, July 2013.
[2] K. Davis and D. GordonPatterson. Ethics of Big Data. O'Reilly, 2012.
[3] Judith Hurwitz, Alan Nugent, Fern Halper, and Marcia Kaufman. Big Data for Dummies. 2013.
[4] Computer Security Division, Information Technology Laboratory. Guide for conducting risk assessments. Technical report, National Institute of Standards and Technology, 2012.
[5] A. McAfee and E. Brynjolfsson. Big data: The management revolution. Harvard Business Review Magazine, October 2012.
[6] S. Sagiroglu and D. Sinanc.
Big data: A review. In International Conference on Collaboration Technologies and Systems (CTS), 2013.
[7] A. Sathi. Big Data Analytics: Disruptive Technologies for Changing the Game. MC Press, 2012.
[8] D. Zage, K. Glass, and R. Colbaugh. Improving supply chain security using big data. In International Conference on Intelligence and Security Informatics. IEEE, 2013.

Marisa Paryasto is a researcher at Bandung Institute of Technology and a lecturer at Telkom University.
Andry Alamsyah is a lecturer at Telkom University.
Budi Rahardjo is a researcher and lecturer at Bandung Institute of Technology.
Kuspriyanto is a professor and lecturer at Bandung Institute of Technology.
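
As an illustration of the tokenization technique listed among the data-safeguarding techniques in Section II-C, the following minimal Python sketch replaces sensitive values with random tokens and keeps the mapping in a vault. The class name `TokenVault`, its methods, the `tok_` prefix, the sample record, and the choice of which fields count as sensitive are all hypothetical and introduced only for this illustration; they are not taken from [3]. A real deployment would also need access control and protected, persistent storage for the vault.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory token vault (illustrative only)."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        # Reuse an existing token so that equal values map to equal tokens.
        if value in self._value_to_token:
            return self._value_to_token[value]
        # The token is random, so it reveals nothing about the value.
        token = "tok_" + secrets.token_hex(8)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        # Only a caller with access to the vault can recover the value.
        return self._token_to_value[token]


vault = TokenVault()

# Made-up sample record; the sensitive-field set is an assumption.
record = {"name": "Alice", "card": "0000-0000-0000-0000", "city": "Bandung"}
SENSITIVE_FIELDS = {"name", "card"}

safe_record = {
    key: vault.tokenize(value) if key in SENSITIVE_FIELDS else value
    for key, value in record.items()
}
```

Analysts can work on `safe_record` without seeing names or card numbers, while a party holding the vault can call `vault.detokenize(...)` to recover the original values when authorized.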