Data science has propelled us toward better decision making in numerous fields, including science and technology, healthcare, and manufacturing. This ACM Selects highlights several resources on data science fundamentals and best practices, and compares frameworks and tools to help you apply data science in your field.
This is the second installment in our data science series; the first Selects article was published here.
We invite you to consider participating in ACM’s activities on these topics, whether through our professional community, global policy activities, ongoing work in professional ethics, or our chapters, SIGs, local meetups, and conferences.
We value your feedback and look forward to your guidance on how we can continue to improve ACM Selects together. Your suggestions and opinions on how we can do better are welcome by email at selects-feedback@acm.org.
Best Practices in the Field
Rules of Machine Learning
Martin Zinkevich, a Research Scientist at Google, lays out best practices for developing data science and machine learning systems in production. It is a great collection of rules of thumb, heuristics, and pitfalls that can help bring more structure and clarity to building such systems.
Data Science and Prediction
The article emphasizes the importance of predictive modeling in data science: prediction makes new knowledge actionable for decision making, rather than merely explaining past events. It then argues that becoming a good data scientist requires an integrated skill set spanning mathematics, machine learning, and software engineering, along with strong problem formulation and problem-solving skills.
Tools and Technologies Involved
Applied Linear Algebra Methods for Data Science
Data science often relies on access to vast amounts of data. This data is likely to be high-dimensional and may contain information that is not relevant to the question at hand. This paper introduces efficient algorithms for reducing the dimensionality of data. Often called ‘feature engineering’ techniques, these algorithms are a critical stage of any data science workflow.
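To make dimensionality reduction concrete, here is a minimal sketch of one widely used technique: principal component analysis (PCA) computed with a singular value decomposition in NumPy. It is an illustration under our own assumptions and not necessarily one of the algorithms covered in the paper; the function name pca_reduce and the synthetic data set are hypothetical.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project the rows of X onto their top principal components.

    Returns the reduced data, the principal directions, and the share
    of total variance captured by each kept component.
    """
    X = np.asarray(X, dtype=float)
    # PCA is defined on mean-centered data.
    X_centered = X - X.mean(axis=0)
    # Thin SVD: rows of Vt are the principal directions, ordered by
    # decreasing singular value.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Squared singular values are proportional to the variance
    # explained along each direction.
    explained_variance = (S ** 2) / (X.shape[0] - 1)
    ratio = explained_variance[:n_components] / explained_variance.sum()
    return X_centered @ components.T, components, ratio


# Hypothetical data: 200 samples in 5 dimensions that really depend on
# only 2 latent factors, plus a small amount of noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 5))

Z, components, ratio = pca_reduce(X, n_components=2)
print("reduced shape:", Z.shape)
print("variance retained:", ratio.sum())
```

Because the synthetic data has only two underlying factors, two components should retain almost all of the variance. On real data, the retained variance is a useful guide to how many dimensions can be dropped safely.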
Python vs R for Data Science
Python and R are the two leading languages used for carrying out data science. Want to understand the differences between the two, so you can figure out where to focus your attention? This article is for you!
Scikit-learn: Machine Learning Without Learning the Machinery
Scikit-learn is a Python library that provides the tooling and frameworks to build data science pipelines. From transforming data to training models, Scikit-learn’s modular approach makes it simple to compare a range of techniques on your data set.
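As a rough sketch of that modular approach, the example below chains a transformation step and a model into a pipeline and compares three classifiers with cross-validation. The choice of the bundled iris data set, the three models, and their settings are our own assumptions for illustration; they are not taken from the article.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Small example data set that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Each candidate is a pipeline: optional scaling followed by a model.
# Swapping a step in or out does not change the rest of the code.
candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "k_nearest_neighbors": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=5)
    ),
    "decision_tree": make_pipeline(DecisionTreeClassifier(random_state=0)),
}

# 5-fold cross-validation: the scaler is refit on each training fold,
# so no information leaks from the held-out fold.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}

for name, score in scores.items():
    print(f"{name}: mean accuracy {score:.3f}")
```

Keeping the scaler inside the pipeline, rather than scaling the whole data set up front, is what lets cross-validation evaluate each technique fairly.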
Research and Coursework for Deep Dive
The Data Science Life Cycle: A Disciplined Approach to Advancing Data Science as a Science
In this article, Victoria Stodden, an Associate Professor at the University of Southern California, motivates the interdisciplinarity and scope of data science as a discipline. Stodden proposes an intellectual framework, the Data Science Life Cycle, to describe the steps and processes involved, and highlights the coursework needed to build skills in each of those components.
Computing Competencies for Undergraduate Data Science Curricula
This report from the ACM Data Science Task Force lays out the topics any comprehensive undergraduate data science course should cover. If you want to identify gaps in your knowledge, or figure out what to learn next, this report is a great place to start.