Getting Started Series

Getting Started with Data Science

Post by: Getting Started Series

Published: November 3, 2020

There is a large gap between exploratory data science and building an intelligent application that continually learns from the data it encounters to provide business value. In this ACM Select, we highlight content to ease the transition from research to production and illuminate the hurdles you may come across on your journey.

Overview

Data science: challenges and directions

First published in Communications of the ACM, Vol. 60, No. 8, July 2017.

In this overview article, Prof. Longbing Cao describes the processes of data science, its overlap with other disciplines, and the challenges present in data-driven decision making.

Data Validation

Your machine learning model can break, degrade, and exhibit unwanted behaviour in numerous ways. The primary cause is issues and irregularities in your data, and data cleaning and validation help to minimize them.

Putting Machine Learning into Production Systems

First published in ACM Queue, Vol. 17, Issue 4, October 7, 2019.

Adrian Colyer gives an overview of two papers concerned with data validation techniques and provides insight into data skew and drift, where the data you trained the model on is no longer representative of the data your system is seeing in real-world operation.
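
As a rough illustration of what checking for drift can look like, the sketch below compares a feature's training values with its live values using the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical distributions). This is one generic technique, not the method of the papers Colyer reviews; the function names and the 0.3 threshold are assumptions chosen for the example.

```python
from bisect import bisect_right


def ks_statistic(train, live):
    """Largest gap between the empirical CDFs of two samples (0.0 to 1.0)."""
    a, b = sorted(train), sorted(live)
    max_gap = 0.0
    # The gap between two step functions peaks at one of the observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


def has_drifted(train, live, threshold=0.3):
    """Flag drift when the gap exceeds a threshold (0.3 is an arbitrary choice)."""
    return ks_statistic(train, live) > threshold


same = ks_statistic([1, 2, 3, 4], [1, 2, 3, 4])   # identical samples
shifted = ks_statistic([1, 2, 3], [10, 11, 12])   # completely separated samples
```

In practice the threshold would need tuning per feature, and a statistical test with a p-value may be preferable to a fixed cut-off.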

Data Cleaning for Accurate, Fair, and Robust Models: A Big Data - AI Integration Approach

First presented at DEEM'19: Proceedings of the 3rd International Workshop on Data Management for End-to-End Machine Learning, June 2019.

Training your model on biased data results in a biased model. This paper describes methods for ensuring that your training data is accurate and free from bias.

Model Interpretability

Explanations of why a model arrived at its result help you understand whether it relied on true evidence or on the bias that widely exists in training data. Model interpretability is the ability to interpret a model's results in this way.

Techniques for interpretable machine learning

First published in Communications of the ACM, Vol. 63, No. 1, December 2019.

Interpretability can be classified as intrinsic or post-hoc, and each can be further divided into global and local. This article describes these classifications and discusses the larger goal of democratizing model explanations for end-users rather than only for researchers' intuition.
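
To make the post-hoc, global category concrete, here is a minimal sketch of permutation importance: shuffle one feature at a time and measure how much the model's error grows. It is a generic technique used for illustration, not code from the article; `black_box`, the toy data, and the function signature are all hypothetical.

```python
import random


def permutation_importance(predict, rows, targets, n_features, seed=0):
    """Increase in mean squared error when each feature column is shuffled."""
    rng = random.Random(seed)

    def mse(data):
        return sum((predict(r) - t) ** 2 for r, t in zip(data, targets)) / len(data)

    baseline = mse(rows)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the target
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(mse(shuffled) - baseline)
    return importances


def black_box(row):
    """Hypothetical opaque model: in truth it only uses feature 0."""
    return 3.0 * row[0]


rows = [[float(i), float(i % 3)] for i in range(20)]
targets = [3.0 * r[0] for r in rows]
scores = permutation_importance(black_box, rows, targets, n_features=2)
```

A feature the model ignores scores zero, while a feature it relies on scores above zero. The explanation is global because it summarizes behaviour over the whole dataset rather than for a single prediction.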

Bias

Algorithms increasingly help organize all aspects of our personal and professional lives, but you must take care that pre-existing societal bias does not seep into your models as they make real-world decisions.

Algorithms, Platforms, and Ethnic Bias

First published in Communications of the ACM, Vol. 62, No. 11, November 2019.

In this article, Martin Kenney, a Distinguished Professor at UC Davis, describes types of bias, how they arise from training data, choosing and interpreting models to minimize bias, and the fine line between accuracy and fairness that a data scientist must walk.
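
One simple way to put a number on group-level bias in a model's decisions is the demographic parity gap: the difference in positive-prediction rates between groups. The sketch below is an illustrative assumption, not a metric taken from Kenney's article; the function names, group labels, and data are made up for the example.

```python
from collections import defaultdict


def positive_rates(predictions, groups):
    """Share of positive (1) predictions within each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    return {g: positives[g] / totals[g] for g in totals}


def demographic_parity_gap(predictions, groups):
    """Difference between the highest and lowest group positive rates."""
    rates = positive_rates(predictions, groups)
    return max(rates.values()) - min(rates.values())


# Made-up predictions for two hypothetical groups, "a" and "b".
preds = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
gap = demographic_parity_gap(preds, groups)
```

A gap of zero means every group receives positive predictions at the same rate. Forcing the gap to zero can reduce accuracy, which is one form of the accuracy and fairness trade-off mentioned above.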

Putting It All Together: A Case Study of AI Bots

A Decade of Social Bot Detection

First published in Communications of the ACM, Vol. 63, No. 10, October 2020.

To generate business value, your model will need to be operationalised as part of a broader system. However, such systems aren’t always used for good. In this article, social media researcher Stefano Cresci looks at the influx of AI ‘bots’, how they impact people’s online interactions, and approaches to combat them.

Prabhav Agrawal

Prabhav Agrawal is a Machine Learning Engineer in Facebook AI’s Speech team. He has 5+ years of experience researching and creating AI-powered products across leading companies such as Apple and Microsoft. At Apple, he led the efforts for creating Siri Voice experiences across devices including iPhone, Apple Watch and HomePod as part of the Text-to-Speech team. At Microsoft, he focused on improving search relevance and infrastructure for Bing Search, and also co-led the project for including Dictation as part of MS Office and Windows, starting from a hackathon prototype. Prabhav earned his Master's in Computer Science from the University of California San Diego and his Bachelor's in Electrical Engineering from the Indian Institute of Technology Delhi.

Sophie Watson

Sophie is a Data Scientist at Red Hat, where she helps customers use machine learning to solve business problems in the hybrid cloud.

Kashyap Tumkur

Kashyap Tumkur is a Software Engineer at Verily Life Sciences, the healthcare and life sciences arm of Alphabet. Previously, he was a graduate student researcher at the Department of Bioinformatics at the University of California, San Diego, where he earned his Master’s in Computer Science with a specialization in Artificial Intelligence. Kashyap seeks to promote equitable applications of computing and broader access to computing opportunities, and is a member of the ACM’s Future of Computing Academy and US Technology Policy Committee, and a Global Shaper, Oakland Hub, of the World Economic Forum.
