Big Data Preprocessing: Enabling Smart Data

  • 3h 42m
  • Diego García-Gil, Francisco Herrera, Julián Luengo, Salvador García, Sergio Ramírez-Gallego
  • Springer
  • 2020

This book offers a comprehensible overview of Big Data Preprocessing, including a formal description of each problem. It also focuses on the most relevant proposed solutions and illustrates actual implementations of algorithms that help the reader deal with these problems.

This book stresses the gap that exists between big, raw data and the quality data that businesses are demanding. Data that meets these quality requirements is called Smart Data, and preprocessing is a key step in achieving it: this is where imperfections are addressed and where integration tasks and other processes are carried out to eliminate superfluous information. The authors present the concept of Smart Data through data preprocessing in Big Data scenarios and connect it with the emerging paradigms of IoT and edge computing, where the end points generate Smart Data without completely relying on the cloud.

Finally, this book presents some novel areas of study that are attracting growing attention in Big Data preprocessing. Specifically, it considers the relation with Deep Learning (as a technique that also relies on large volumes of data), the difficulty of finding the appropriate selection and concatenation of preprocessing techniques, and some other open problems. Practitioners and data scientists who work in this field and want an introduction to preprocessing in large data volume scenarios will want to purchase this book. Researchers who work in this field and want to know which algorithms are currently implemented to help their investigations may also be interested in this book.

About the Authors

Julián Luengo received the M.S. degree in computer science and the Ph.D. from the University of Granada, Granada, Spain, in 2006 and 2011, respectively. He is currently an Assistant Professor in the Department of Computer Science and Artificial Intelligence at the University of Granada, Spain. His research interests include machine learning and data mining, data preparation in knowledge discovery and data mining, missing values, noisy data, data complexity and fuzzy systems. Dr. Luengo has received several awards and honors for his personal work or for his publications in journals and conferences, such as the IFSA-EUSFLAT 2009 Best Student Paper Award. He is on the list of Highly Cited Researchers in the area of Computer Sciences (2015-2018) (Clarivate Analytics).

Diego García-Gil received the M.Sc. degree in computer science from the University of Granada, Granada, Spain, in 2015. He is currently pursuing the Ph.D. degree with the Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain. His current research interests include machine learning, data mining, data preprocessing and Big Data.

Sergio Ramírez-Gallego received the M.Sc. degree in computer science from the University of Jaén, Jaén, Spain, in 2012. He obtained the Ph.D. degree with the Department of Computer Science and Artificial Intelligence, University of Granada, Spain in 2018. His current research interests include data mining, data preprocessing, big data, and cloud computing.

Salvador García received the B.S. and Ph.D. degrees in Computer Science from the University of Granada, Granada, Spain, in 2004 and 2008, respectively. He is currently an Associate Professor in the Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain. Dr. García has published more than 80 papers in international journals (more than 60 in Q1) and over 60 papers in international conference proceedings, with an h-index of 43 (data from Web of Science). He has organized several special sessions and workshops related to data preprocessing and evolutionary learning in conferences such as “Hybrid Intelligent Systems”, “Intelligent Systems Design and Applications” and “International Joint-Conference of Neural Networks”. He has served on the international program committees and organizing committees of several regular international conferences, including IEEE CEC, ICPR, ICDM and IJCAI. As for editorial activities, he has co-edited two special issues in international journals, is an associate editor of the journals “Information Fusion” (Elsevier), “Swarm and Evolutionary Computation” (Elsevier) and “AI Communications” (IOS Press), and is co-Editor in Chief of the international journal “Progress in Artificial Intelligence” (Springer). He is a co-author of the books entitled “Data Preprocessing in Data Mining” and “Learning from Imbalanced Data Sets”, published by Springer. His research interests include data science, data preprocessing, Big Data, evolutionary learning, Deep Learning, metaheuristics and biometrics.

Francisco Herrera (SM'15) received his M.Sc. in Mathematics in 1988 and Ph.D. in Mathematics in 1991, both from the University of Granada, Spain. He is currently a Professor in the Department of Computer Science and Artificial Intelligence at the University of Granada and Director of the DaSCI Institute (Andalusian Research Institute in Data Science and Computational Intelligence). He has been the supervisor of 44 Ph.D. students. He has published more than 400 journal papers, receiving more than 66000 citations (Google Scholar, h-index 132). He is co-author of the books "Genetic Fuzzy Systems" (World Scientific, 2001), "Data Preprocessing in Data Mining" (Springer, 2015), "The 2-tuple Linguistic Model. Computing with Words in Decision Making" (Springer, 2015), "Multilabel Classification. Problem analysis, metrics and techniques" (Springer, 2016), "Multiple Instance Learning. Foundations and Algorithms" (Springer, 2016) and "Learning from Imbalanced Data Sets" (Springer, 2018). He is currently Editor in Chief of the international journals "Information Fusion" (Elsevier) and "Progress in Artificial Intelligence" (Springer). He serves on the editorial boards of a dozen journals.

In this Book

  • Introduction
  • Big Data: Technologies and Tools
  • Smart Data
  • Dimensionality Reduction for Big Data
  • Data Reduction for Big Data
  • Imperfect Big Data
  • Big Data Discretization
  • Imbalanced Data Preprocessing for Big Data
  • Big Data Software
  • Final Thoughts: From Big Data to Smart Data