Saturday, September 21, 2019
Data Anonymization in Cloud Computing
Data Anonymization Approach for Privacy Preserving in Cloud

Saranya M

Abstract—Private data such as electronic health records and banking transactions must be shared within the cloud environment so that the data can be analyzed or mined for research purposes. Data privacy is one of the most pressing issues in big data applications, because processing large-scale sensitive data sets often requires the computation power of public cloud services. Through a technique called data anonymization, the privacy of an individual can be preserved while aggregate information is shared for mining purposes. Data anonymization hides the sensitive data items of the data owner. Bottom-up generalization transforms more specific data into less specific but semantically consistent data for privacy protection. The idea is to use data generalization from data mining to hide detailed data, rather than to discover patterns. Once the data is masked, data mining techniques can be applied without modification.

Keywords—Data Anonymization; Cloud; Bottom-Up Generalization; MapReduce; Privacy Preservation.

I. INTRODUCTION

Cloud computing refers to configuring, manipulating, and accessing applications online. It provides online data storage, infrastructure, and applications, and is a disruptive trend that has a significant impact on the current IT industry and research communities [1]. Cloud computing provides massive storage capacity and computation power by harnessing large numbers of commodity computers together. It enables users to deploy applications at low cost, without heavy investment in infrastructure. Due to privacy and security concerns, numerous potential customers are still hesitant to take advantage of the cloud [7]. Cloud computing reduces costs through optimization and increased operating and economic efficiency, and it enhances collaboration, agility, and scale by enabling a global computing model over the Internet infrastructure. However, without proper security and privacy solutions, this promising cloud computing paradigm could become a huge failure.

Cloud delivery models are classified into three: Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS). SaaS is very similar to the old thin-client model of software provision, in which clients, usually web browsers, provide the point of access to software running on servers. PaaS provides a platform on which software can be developed and deployed. IaaS comprises highly automated and scalable compute resources, complemented by cloud storage and network capability, which can be metered, self-provisioned, and made available on demand [7].

Clouds are deployed using several models, which include public, private, and hybrid clouds. A public cloud is one in which the services and infrastructure are provided off-site over the Internet. A private cloud is one in which the services and infrastructure are maintained on a private network; such clouds offer a greater level of security. A hybrid cloud combines public and private options from multiple providers.

Big data environments require clusters of servers to support the tools that process large volumes of high-velocity data in varied formats.
Clouds are deployed on pools of server, storage, and networking resources and can scale up or down as needed. Cloud computing therefore provides a cost-effective way to support big data techniques and the advanced applications that drive business value. Big data analytics is a set of advanced technologies designed to work with large volumes of data. It uses quantitative methods such as computational mathematics, machine learning, robotics, neural networks, and artificial intelligence to explore the data in the cloud.

Analyzing big data on cloud infrastructure makes sense because investments in big data analysis can be significant and drive a need for efficient, cost-effective infrastructure, and because big data combines internal and external sources as well as the data services needed to extract value from it [17].

To address the scalability problem for large-scale data sets, a widely adopted parallel data processing framework such as MapReduce is used. In the first phase, the original data sets are partitioned into groups of smaller data sets, which are anonymized in parallel to produce intermediate results. In the second phase, the intermediate results are integrated into one data set and further anonymized to achieve a consistent k-anonymous data set.

MapReduce is a programming model for processing and generating large data sets. A map function processes a key-value pair and generates a set of intermediate key-value pairs. A reduce function merges all intermediate values associated with the same intermediate key.
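To make the map/reduce contract concrete, here is a minimal Hadoop sketch in the spirit of the description above: the map function emits one intermediate key-value pair per record, keyed by the record's quasi-identifier group, and the reduce function merges all values that share a key. The class names, the CSV record layout, and the choice of the first two fields as quasi-identifiers are illustrative assumptions, not details taken from the cited papers.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class QIGroupCount {

    // Map: emit (quasi-identifier group, 1) for every input record.
    public static class GroupMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable offset, Text record, Context ctx)
                throws IOException, InterruptedException {
            // Assumed record layout: age,zipcode,...,sensitive value.
            String[] fields = record.toString().split(",");
            String qiGroup = fields[0] + "|" + fields[1]; // quasi-identifiers
            ctx.write(new Text(qiGroup), ONE);
        }
    }

    // Reduce: merge all intermediate values sharing a key by summing them.
    public static class GroupReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text qiGroup, Iterable<IntWritable> counts,
                Context ctx) throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable c : counts) {
                total += c.get();
            }
            ctx.write(qiGroup, new IntWritable(total));
        }
    }
}

Such a job computes the size of each quasi-identifier group, which is exactly the statistic a k-anonymity check needs.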
II. RELATED WORK

Ke Wang, Philip S. Yu, and Sourav Chakraborty adapt a bottom-up generalization approach that works iteratively to generalize the data. The generalized data remains useful for classification but becomes difficult to link to other sources. A hierarchical structure of generalizations specifies the generalization space, and identifying the best generalization with which to climb the hierarchy at each iteration is the key step [2].

Benjamin C. M. Fung and Ke Wang observe that privacy-preserving technology solves only some of the problems; it is equally important to identify and overcome the nontechnical difficulties faced by decision makers when deploying a privacy-preserving technology. Their concerns include the degradation of data quality, increased costs, increased complexity, and loss of valuable information. They argue that cross-disciplinary research is the key to removing these obstacles, and they urge scientists in the privacy protection field to conduct cross-disciplinary research with social scientists in sociology, psychology, and public policy studies [3].

Jiuyong Li, Jixue Liu, Muzammil Baig, and Raymond Chi-Wing Wong proposed two classification-aware data anonymization methods that combine local value suppression with global attribute generalization. The attribute generalization is determined by the data distribution rather than by the privacy requirement, and generalization levels are optimized by normalized mutual information so as to preserve classification capability [17].

Xiaokui Xiao and Yufei Tao present a technique called anatomy for publishing sensitive data sets. Anatomy releases all the quasi-identifier and sensitive values directly, in two separate tables. Combined with a grouping mechanism, this approach protects privacy and captures a large amount of correlation in the microdata. A linear-time algorithm is developed for computing anatomized tables that obey the l-diversity privacy requirement while minimizing the error of reconstructing the microdata [13].

III. PROBLEM ANALYSIS

The centralized Top-Down Specialization (TDS) approach exploits a data structure that indexes anonymous data records to improve scalability and efficiency, but overhead is incurred in maintaining the linkage structure and updating the statistical information when data sets become large. Centralized approaches therefore suffer from low efficiency and scalability when handling large-scale data sets. A distributed TDS approach has been proposed to address the anonymization problem in distributed systems, but it concentrates on privacy protection rather than scalability, and it employs information gain only, not privacy loss [1].

Indexing data structures speed up anonymization and generalization because they avoid repeatedly scanning the whole data set [15]. However, such approaches fail in parallel or distributed environments such as cloud systems, because the indexing structures are centralized. Centralized approaches cannot handle large-scale data sets well on the cloud using a single VM, even if the VM has the highest computation and storage capability.

Fung et al. proposed a TDS approach that produces an anonymized data set by exploring the search space over the data. It exploits a data structure, Taxonomy Indexed PartitionS (TIPS), which improves the efficiency of TDS; but the approach is centralized, leaving it inadequate for large data sets.

Raj H, Nathuji R, Singh A, and England P propose cache-hierarchy-aware core assignment and page-coloring-based cache partitioning to provide resource isolation and better resource management, thereby guaranteeing the security of data during processing. However, the page coloring approach degrades performance when a VM's working set does not fit in its cache partition [14].

Ke Wang and Philip S. Yu consider the following problem: a data holder needs to release a version of its data for building classification models while protecting sensitive information from linkage with external sources. They address it by adapting the iterative bottom-up generalization approach from data mining to generalize the data.

IV. METHODOLOGY

Suppression: In this method, certain values of the attributes are replaced by an asterisk (*). All or some values of a column may be replaced by *.

Generalization: In this method, individual attribute values are replaced with a broader category. For example, the value 19 of the attribute Age may be replaced by ≤ 20, and the value 23 by 20 < Age ≤ 30.
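To make the two operations concrete, the following minimal sketch, in the same Java setting as the experiments reported later, applies suppression and generalization to single values. The helper names and the age bands are assumptions chosen for illustration, not part of any cited method.

public class AnonymizationOps {

    // Suppression: the value is replaced entirely by an asterisk.
    static String suppress(String value) {
        return "*";
    }

    // Generalization: a specific age is replaced by a broader but
    // semantically consistent band (assumed bands for illustration).
    static String generalizeAge(int age) {
        if (age <= 20) return "<=20";
        if (age <= 30) return "20<Age<=30";
        if (age <= 40) return "30<Age<=40";
        return ">40";
    }

    public static void main(String[] args) {
        System.out.println(suppress("560042"));  // prints *
        System.out.println(generalizeAge(19));   // prints <=20
        System.out.println(generalizeAge(23));   // prints 20<Age<=30
    }
}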
A. Bottom-Up Generalization

Bottom-Up Generalization (BUG) is one of the efficient k-anonymization methods. Under k-anonymity, attributes are suppressed or generalized until each row is identical to at least k-1 other rows, at which point the database is said to be k-anonymous. BUG starts from the lowest anonymization level and generalizes iteratively, using the information/privacy trade-off as the search metric. Bottom-Up Generalization and a MapReduce Bottom-Up Generalization (MRBUG) driver are used. The Advanced BUG proceeds in the following steps: partition the data, run the MRBUG driver on each partition, combine the anonymization levels of the partitioned data sets, and then apply generalization to the original data set without violating k-anonymity.

Fig. 1 System architecture of the bottom-up approach

The Advanced Bottom-Up Generalization approach presented here improves the scalability and performance of BUG through two levels of parallelization, provided by MapReduce (MR) on the cloud. The first is job-level parallelization: multiple MR jobs can be executed simultaneously, making full use of the cloud infrastructure. The second is task-level parallelization: multiple mapper or reducer tasks within an MR job are executed simultaneously on data partitions. Our approach performs the following steps. First, the data sets are split into smaller data sets using several job-level MapReduce jobs, and the partitioned data sets are anonymized by the Bottom-Up Generalization driver. The intermediate anonymization levels obtained are then integrated into one, ensuring that the integrated intermediate level never violates the k-anonymity property. Finally, the driver is executed on the original data set with the merged intermediate anonymization level, producing the resultant anonymization level. The algorithm for Advanced Bottom-Up Generalization [15] is given below; in the i-th iteration it generalizes R by the best generalization Gbest:

1: while R does not satisfy the anonymity requirement do
2:   for all generalizations G do
3:     compute IP(G);
4:   end for;
5:   find the best generalization Gbest;
6:   generalize R through Gbest;
7: end while;
8: output R;

B. MapReduce

The MapReduce framework is divided into map and reduce functions. Map is a function that parcels out work to the different nodes in the distributed cluster; reduce is a function that collates the work and resolves the results into a single value.

Fig. 2 MapReduce framework

The MR framework is fault-tolerant because each node in the cluster is expected to report back periodically with status updates and completed work. If a node stays silent for longer than expected, the master node notes this and reassigns its task to other nodes. A single MR job is inadequate to accomplish the task, so a group of MR jobs is orchestrated in one MR driver. The MR framework thus consists of an MR driver and two types of jobs, IGPL Initialization and IGPL Update; the MR driver arranges the execution of the jobs.

Hadoop provides a mechanism to set global variables for the mappers and the reducers; the best generalization is passed through it into the map function of the IGPL Update job. In the bottom-up approach, the data is first initialized to its current state, and generalizations are then carried out as long as k-anonymity is not violated. That is, we climb the taxonomy tree of each attribute until the required anonymity is achieved.
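The loop's stopping condition, whether the generalized table already satisfies the anonymity requirement, can be sketched as follows. This is a minimal illustration assuming records are string arrays and quasi-identifier columns are given by index; it is not the data structure used in [15].

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KAnonymityCheck {

    // A table is k-anonymous when every combination of quasi-identifier
    // values is shared by at least k records.
    static boolean isKAnonymous(List<String[]> records, int[] qiColumns, int k) {
        Map<String, Integer> groupSizes = new HashMap<>();
        for (String[] record : records) {
            StringBuilder key = new StringBuilder();
            for (int col : qiColumns) {
                key.append(record[col]).append('|');
            }
            groupSizes.merge(key.toString(), 1, Integer::sum);
        }
        for (int size : groupSizes.values()) {
            if (size < k) {
                return false; // some row matches fewer than k-1 other rows
            }
        }
        return true;
    }
}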
V. EXPERIMENTAL EVALUATION

We explore data generalization from data mining in order to hide detailed information rather than to discover patterns and trends. Once the data has been masked, all the standard data mining techniques can be applied without modification; the data mining technique then not only discovers useful patterns but also masks the private information.

Fig. 3 Change of execution time of TDS and BUG

Fig. 3 shows the change in execution time of the TDS and BUG algorithms. We compared the execution times of TDS and BUG for EHR data sets ranging from 50 to 500 MB, keeping p = 1. We present bottom-up generalization for transforming specific data into less specific data, focusing on the key issues of quality and scalability: quality is addressed by trading off information against privacy in the bottom-up generalization, and scalability is addressed by a novel data structure that focuses the generalizations. To evaluate the efficiency and effectiveness of the BUG approach, we compare BUG with TDS. The experiments were performed in a cloud environment, with both approaches implemented in Java using the standard Hadoop MapReduce API.

VI. CONCLUSION

We studied the scalability problem of anonymizing data on the cloud for big data applications using Bottom-Up Generalization and proposed a scalable Bottom-Up Generalization approach. The BUG approach proceeds as follows: the data is first partitioned, and the driver is executed to produce intermediate results; these results are then merged into one, and generalization is applied to produce the anonymized data. The anonymization is carried out with the MR framework on the cloud. The results show that scalability and efficiency are improved significantly over existing approaches.

REFERENCES

[1] Xuyun Zhang, Laurence T. Yang, Chang Liu, and Jinjun Chen, "A Scalable Two-Phase Top-Down Specialization Approach for Data Anonymization Using MapReduce on Cloud," IEEE Trans. Parallel and Distributed Systems, vol. 25, no. 2, February 2014.
[2] K. Wang, P.S. Yu, and S. Chakraborty, "Bottom-Up Generalization: A Data Mining Solution to Privacy Protection," Proc. Fourth IEEE Int'l Conf. Data Mining (ICDM '04), 2004.
[3] B.C.M. Fung, K. Wang, R. Chen, and P.S. Yu, "Privacy-Preserving Data Publishing: A Survey of Recent Developments," ACM Comput. Surv., vol. 42, no. 4, pp. 1-53, 2010.
[4] K. LeFevre, D.J. DeWitt, and R. Ramakrishnan, "Workload-Aware Anonymization Techniques for Large-Scale Datasets," ACM Trans. Database Syst., vol. 33, no. 3, pp. 1-47, 2008.
[5] B. Fung, K. Wang, L. Wang, and P.C.K. Hung, "Privacy-Preserving Data Publishing for Cluster Analysis," Data Knowl. Eng., vol. 68, no. 6, pp. 552-575, 2009.
[6] B.C.M. Fung, K. Wang, and P.S. Yu, "Anonymizing Classification Data for Privacy Preservation," IEEE Trans. Knowledge and Data Eng., vol. 19, no. 5, pp. 711-725, May 2007.
[7] Hassan Takabi, James B.D. Joshi, and Gail-Joon Ahn, "Security and Privacy Challenges in Cloud Computing Environments," IEEE Security & Privacy, vol. 8, no. 6, pp. 24-31, 2010.
[8] K. LeFevre, D.J. DeWitt, and R. Ramakrishnan, "Incognito: Efficient Full-Domain K-Anonymity," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '05), pp. 49-60, 2005.
[9] T. Iwuchukwu and J.F. Naughton, "K-Anonymization as Spatial Indexing: Toward Scalable and Incremental Anonymization," Proc. 33rd Int'l Conf. Very Large Data Bases (VLDB '07), pp. 746-757, 2007.
[10] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Comm. ACM, vol. 51, no. 1, pp. 107-113, 2008.
[11] J. Dean and S. Ghemawat, "MapReduce: A Flexible Data Processing Tool," Comm. ACM, vol. 53, no. 1, pp. 72-77, 2010, DOI: 10.1145/1629175.1629198.
[12] Jiuyong Li, Jixue Liu, Muzammil Baig, and Raymond Chi-Wing Wong, "Information Based Data Anonymization for Classification Utility," Data Knowl. Eng., 2011.
[13] X. Xiao and Y. Tao, "Anatomy: Simple and Effective Privacy Preservation," Proc. 32nd Int'l Conf. Very Large Data Bases (VLDB '06), pp. 139-150, 2006.
[14] H. Raj, R. Nathuji, A. Singh, and P. England, "Resource Management for Isolation Enhanced Cloud Services," Proc. 2009 ACM Workshop on Cloud Computing Security, Chicago, Illinois, USA, 2009, pp. 77-84.
[15] K.R. Pandilakshmi and G. Rashitha Banu, "An Advanced Bottom Up Generalization Approach for Big Data on Cloud," vol. 03, June 2014, pp. 1054-1059.
[16] Intel, "Big Data in the Cloud: Converging Technologies."
[17] Jiuyong Li, Jixue Liu, Muzammil Baig, and Raymond Chi-Wing Wong, "Information Based Data Anonymization for Classification Utility," Data Knowl. Eng., 2011.