COVER STORY
We live in a dynamic age in which data science and AI have permeated every facet of society and become pivotal to decision-making. Recall the last time you relied on Google Maps to navigate chaotic metropolitan traffic and reach your destination by the optimal route, or browsed an online shopping site for an item matching the filters you had set. And how can we forget the notorious yet much-coveted chatbots like ChatGPT and Bard, loathed by course instructors but adored by students like me, which have ushered in a new era of generative AI and serve as knowledge hubs for facts and figures on virtually any subject devised to date? Whether it is the last-minute refuge ChatGPT provides minutes before an assignment deadline or a quick request for an opinion, we have all been there and availed ourselves of these chatbots.
In this fast-paced technological era, data science and AI have become buzzwords, deemed by the general populace to occupy a supreme position within the echelons of computing and the applied sciences. However, the algorithms that govern the complexity and operational prowess of data science and AI stem from classical statistical and mathematical models, modified according to the requirements of the task at hand.
Data science deals with the study and examination of large datasets, encompassing both descriptive and quantitative characteristics, in order to derive insights relevant to the scope of the analysis. Before we delve into the nitty-gritty of its technical mechanisms and procedures, it is worth examining the timeline of the field's evolution.
The origin of data science can be traced back to a paper titled The Future of Data Analysis, authored by the eminent statistician John Tukey in 1962. The paper proposed integrating existing computer programs with statistical frameworks to render insights about datasets within the scope of data analysis. It also advocated recognizing “Data Analysis” as a distinct scientific field, independent of conventional statistical frameworks, that would enable analysts and scientists to extract diversified insights from a single dataset.
Tukey further proposed novel methodologies that combined statistical tools with the critical thinking abilities of analysts, allowing for a multi-pronged examination of specific data to obtain diverse insights and trends depicted by datasets under review. He also highlighted the limitations and threats posed by “incomplete” or “messy data,” terminology now ingrained in the modern data science field.
In a nutshell, this paper served as a precursor to Tukey’s book Exploratory Data Analysis, which is regarded as a cornerstone of data science and laid the foundation for the modern field.
The term data science was coined by Peter Naur, an eminent computer scientist, in his 1974 survey Concise Survey of Computer Methods. Naur proposed that data science is a multifaceted, multidisciplinary field that leverages domain-specific knowledge relevant to a particular dataset to yield insightful conclusions. He also advocated the use of computational and statistical resources to make data collection and processing more efficient, alleviating the cumbersome nature of these tasks.
Though Naur’s understanding of the field and its applications differs significantly from the modern manifestation of data science, his paper served as a launching pad for future generations of data and computer scientists to explore the field’s enigmatic avenues.
To bridge the gap between the various aspects of data science, ranging from theoretical statistics to computer science, and to establish connections between its technical facets and industry-specific applications, the International Association for Statistical Computing was established in 1977. The association was conceived to foster the exchange of technical and domain-specific information among computer scientists, statisticians, and global policy-making entities.
In 1977, John Tukey proposed Exploratory Data Analysis (EDA), a groundbreaking framework aimed at reshaping the statistical community’s understanding of data analysis. It challenged the prevailing Confirmatory Data Analysis (CDA) approach, which focused on analyzing data solely to validate a pre-conceived hypothesis formulated by an analyst regarding a specific dataset.
In contrast, EDA adopts a holistic approach that encourages exploring all aspects and features of a dataset before formulating any hypotheses. Data preprocessing, which transforms the dataset into a standardized structure through data-cleansing procedures, also falls under the umbrella of EDA; this groundwork helps analysts formulate relevant hypotheses.
Additionally, data visualization, presenting various aspects of the data in an interactive and insightful manner, is a key component of EDA. In his book Exploratory Data Analysis, Tukey cautioned against mixing EDA and CDA during analysis, warning that such a practice could lead to overfitting, biased conclusions, and misleading insights.
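To make this concrete, the minimal sketch below shows what a first EDA pass might look like in Python using the pandas and matplotlib libraries; the file name sales.csv and its revenue column are hypothetical, and the library choice is mine rather than anything Tukey prescribed.

```python
# Minimal EDA sketch: summarize and visualize a dataset before forming any hypotheses.
# The file "sales.csv" and its "revenue" column are hypothetical placeholders.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sales.csv")

# Inspect structure and basic descriptive statistics.
df.info()
print(df.describe())

# Simple data cleansing: drop duplicate rows and rows with missing values.
df = df.drop_duplicates().dropna()

# Visualize one column's distribution to spot skew or outliers.
df["revenue"].hist(bins=30)
plt.title("Distribution of revenue")
plt.xlabel("revenue")
plt.ylabel("count")
plt.show()
```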
In 1989, the scientific community was introduced to the term Data Mining at the Knowledge Discovery in Databases (KDD) workshop, organized by Gregory Piatetsky-Shapiro. This event laid the cornerstone for the development of Big Data over the following decades. Data mining refers to the process of identifying hidden patterns and features within large datasets, often termed Big Data, through the deployment of automated algorithms. The process focuses on uncovering key aspects of the data to enable data-driven business and policy decisions.
Data mining utilizes a variety of algorithms, including classification, clustering, and regression. Classification algorithms, a subset of supervised machine learning, are used for predictive analysis. This involves a training phase, where the algorithm learns from a labeled dataset segment, and a testing phase, where it categorizes data elements into predefined classes or labels established during training.
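As a brief illustration of that training/testing structure, here is a minimal sketch using scikit-learn's bundled Iris dataset and a decision tree; the particular library and model are my choices for illustration, not something prescribed by the KDD literature.

```python
# Minimal classification sketch: learn from a labeled split, then label held-out rows.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Training phase: the model learns from the labeled portion of the dataset.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Testing phase: the model assigns the predefined class labels to unseen rows.
predictions = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, predictions))
```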
Clustering algorithms, such as the K-Means algorithm, group unlabeled datasets into clusters by analyzing the underlying trends and characteristics of the data. Unlike classification, clustering is unsupervised: there are no predefined labels to learn from, and data points are categorized purely on the basis of similarity.
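A companion sketch for clustering, again assuming scikit-learn and synthetic two-dimensional data of my own invention, shows K-Means grouping unlabeled points purely by proximity.

```python
# Minimal K-Means sketch: group unlabeled points into clusters by similarity alone.
import numpy as np
from sklearn.cluster import KMeans

# Synthetic, unlabeled 2-D data: two loose blobs centered at (0, 0) and (5, 5).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(5, 1, size=(50, 2))])

# Fit K-Means with k=2; no labels are supplied, only a cluster count.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print("cluster assignments:", kmeans.labels_[:10])
print("cluster centers:\n", kmeans.cluster_centers_)
```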
Data mining began to gain significant traction in the 1990s, marking a pivotal advancement in data analysis and decision-making processes.
The late 1990s witnessed the establishment of data science as a separate discipline encompassing a myriad of technical and academic skills. In 1997, the statistician Jeff Wu proposed renaming “statistics” to data science and relabeling statisticians as Data Scientists. This bold move reflected the prevailing trend of the era, the Dot Com boom, and the scientific community's growing interest in Big Data.
The term Big Data was officially coined by NASA-affiliated scientists in 1997 to describe datasets containing vast amounts of entries incompatible with conventional data analysis and management software. A significant portion of this Big Data was generated through the internet and websites, a trend that began gaining traction during the Dot Com surge.
The early 2000s marked significant efforts by academic and scientific communities to train aspiring data scientists. William S. Cleveland notably advocated for developing the technical expertise of data scientists and establishing data science as a distinct profession—a proposal endorsed by the National Science Board in 2005.
Academic bodies across the globe began to show keen interest in this dynamic and innovative field, initiating the publication of journals related to data science. Forerunners in this effort included the world-renowned Columbia University and think tanks like the Committee on Data for Science and Technology (CODATA).
In 2006, the deployment of an early version of Hadoop, an open-source Big Data storage and processing platform, ushered in the era of Big Data management and analytics. Hadoop's ability to process data in parallel across clusters of commodity machines makes it far more efficient at scale, and it quickly became a top choice for entities leveraging the power of Big Data. Hadoop originated from Nutch, an ambitious project aimed at indexing billions of web pages as part of a web search engine. Yahoo funded Hadoop's development and initially used it for data and web management purposes; tech leaders like Facebook soon followed, integrating Hadoop into their operations. The Apache Software Foundation, which stewards Hadoop, has continued to release updated versions, steadily enhancing the Hadoop ecosystem.
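To give a feel for the parallel model Hadoop popularized, the sketch below walks through the classic MapReduce word-count pattern on a single machine; on a real Hadoop cluster the map and reduce phases would run in parallel across many nodes, and the toy documents here are invented for illustration.

```python
# Minimal local sketch of the MapReduce pattern that Hadoop popularized.
# On a real cluster, the map and reduce phases run in parallel across many nodes.
from collections import defaultdict

documents = [
    "big data needs big tools",
    "hadoop processes big data in parallel",
]

# Map phase: each document is independently turned into (word, 1) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group the intermediate pairs by key (the word).
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)
```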
The development and commercialization of Amazon Web Services (AWS) in 2006 heralded the era of cloud computing and cloud data storage. S3 (Simple Storage Service) offers corporations secure, durable storage for Big Data, while EC2 (Elastic Compute Cloud) provides developers with flexible compute environments for testing and scaling products without added infrastructure complexity.
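For a sense of how developers interact with S3 programmatically, here is a minimal sketch using the boto3 SDK; the bucket name and file paths are hypothetical, and the calls assume AWS credentials are already configured.

```python
# Minimal S3 sketch using the boto3 SDK; bucket name and file paths are hypothetical.
# Assumes AWS credentials are configured (e.g., via environment variables or ~/.aws).
import boto3

s3 = boto3.client("s3")

# Upload a local file to the hypothetical bucket "example-analytics-bucket".
s3.upload_file("report.csv", "example-analytics-bucket", "reports/report.csv")

# Download the same object back to a local copy.
s3.download_file("example-analytics-bucket", "reports/report.csv", "report_copy.csv")
```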
In 2010, Microsoft launched its cloud-based Big Data management platform Windows Azure (now known as Microsoft Azure), offering functionalities similar to AWS.
The 2010s witnessed the term Data Scientist evolve into a catchphrase, gaining popularity within the computing community. Job listings for this designation soared, prompting Harvard Business Review—not Harvard Business School—to label Data Scientist as “the sexiest job of the 21st century.”
Advancements in AI, such as the virtual assistants Alexa, Siri, and Google Assistant, leveraged technologies like Natural Language Processing (NLP), neural networks, and deep learning, fueling rapid growth in the AI sector. The launch of ChatGPT by OpenAI in late 2022 took the world by storm, and it quickly became one of the most widely used chatbots.
Currently, a global data science arms race is underway. Tech giants like Google and Microsoft, along with numerous startups, are developing AI chatbots with billions of parameters, trained on extensive datasets. The release of DeepSeek, a chatbot developed by a China-based company, has intensified competition between global superpowers, sparking debates over AI and data science supremacy.
It is increasingly evident that the future belongs to those who can comprehend and process data from novel perspectives, as “data is the oil of the future.”