Data Science

This guide supports the research and teaching needs of the School of Data Science at UVA.

Getting Started and Further Learning

General Data Science Resources and Research Workflows

UVA Library Research Data Services Workshops
Learn about R, Python, Git/GitHub, and more.
UVA Research Computing Learning Portal
Research Computing's short courses, workshops, and tutorials on a variety of data science topics and programming environments.
Coursera Data Science Courses
Extensive course options to learn data science tools and techniques. (Yes, you can audit the courses for free, although Coursera has done a good job of hiding this option.)

O'Reilly for Higher Education

O’Reilly includes tech and business content from more than 250 publishers, along with videos, case studies, expert-curated learning paths, and self-assessments.
From the drop-down menu, choose 'Not listed? Click here.' Then enter your UVA email address to access content.
Instructors: If you are using books on O’Reilly as course textbooks, please be aware that O’Reilly can and does pull titles mid-semester without notice to the Library.

Download options: Content can only be downloaded through the mobile app
Download format: In-app
Sage Research Methods Online

Sage Research Methods is a methods library with more than 1000 books, reference works, journal articles, and instructional videos by world-leading academics from across the social sciences, including the largest collection of qualitative methods books available online from any scholarly publisher.

Download options: Chapter/content section downloads*
Download format: PDF
Special notes: *Use the horizontal scroll bar to find the "PDF" button after "Embed"
LinkedIn Learning - For Students and Faculty
LinkedIn Learning is a leading online learning company that helps anyone learn business, software, technology and creative skills to achieve personal and professional goals. Members have access to the lynda.com video library of engaging, top-quality courses taught by recognized industry experts.

TAPoR
Curated lists represent the commonly used or widely respected groups of digital tools. The primary focus is digital humanities, but the site is very useful overall.
Data Science Resources
This repo provides anyone interested in learning data science a wealth of open source, industry-best learning materials and learning tracks.
Flowing Data
Blog/newsletter about data visualization. By Nathan Yau.

The Art of Data Science by Roger Peng; Elizabeth Matsui
ISBN: 9781365061462

Publication Date: 2016-06-08

This book describes, simply and in general terms, the process of analyzing data. The authors have extensive experience both managing data analysts and conducting their own data analyses, and have carefully observed what produces coherent results and what fails to produce useful insights into data. This book is a distillation of their experience in a format that is applicable to both practitioners and managers in data science.
Elements of Data Analytic Style by Jeff Leek
Publication Date: 2015

Data analysis is at least as much art as it is science. This book is focused on the details of data analysis that sometimes fall through the cracks in traditional statistics classes and textbooks.
Computing Skills for Biologists by Stefano Allesina; Madlen Wilmes
ISBN: 9780691183961

Publication Date: 2019-01-15

A concise introduction to key computing skills for biologists.
While biological data continues to grow exponentially in size and quality, many of today's biologists are not trained adequately in the computing skills necessary for leveraging this information deluge. In Computing Skills for Biologists, Stefano Allesina and Madlen Wilmes present a valuable toolbox for the effective analysis of biological data. Based on the authors' experiences teaching scientific computing at the University of Chicago, this textbook emphasizes the automation of repetitive tasks and the construction of pipelines for data organization, analysis, visualization, and publication. Stressing practice rather than theory, the book's examples and exercises are drawn from actual biological data and solve cogent problems spanning the entire breadth of biological disciplines, including ecology, genetics, microbiology, and molecular biology. Beginners will benefit from the many examples explained step-by-step, while more seasoned researchers will learn how to combine tools to make biological data analysis robust and reproducible. The book uses free software and code that can be run on any platform. Computing Skills for Biologists is ideal for scientists wanting to improve their technical skills and instructors looking to teach the main computing tools essential for biology research in the twenty-first century.
Excellent resource for acquiring comprehensive computing skills
Both novice and experienced scientists will increase efficiency by building automated and reproducible pipelines for biological data analysis
Code examples based on published data spanning the breadth of biological disciplines
Detailed solutions provided for exercises in each chapter
Extensive companion website
R for Data Science by Garrett Grolemund; Hadley Wickham
ISBN: 9781491910399

Publication Date: 2017-01-10

Learn how to use R to turn raw data into insight, knowledge, and understanding. This book introduces you to R, RStudio, and the tidyverse, a collection of R packages designed to work together to make data science fast, fluent, and fun. Suitable for readers with no previous programming experience, R for Data Science is designed to get you doing data science as quickly as possible. Authors Hadley Wickham and Garrett Grolemund guide you through the steps of importing, wrangling, exploring, and modeling your data and communicating the results. You'll get a complete, big-picture understanding of the data science cycle, along with basic tools you need to manage the details. Each section of the book is paired with exercises to help you practice what you've learned along the way. You'll learn how to:
Wrangle: transform your datasets into a form convenient for analysis
Program: learn powerful R tools for solving data problems with greater clarity and ease
Explore: examine your data, generate hypotheses, and quickly test them
Model: provide a low-dimensional summary that captures true "signals" in your dataset
Communicate: learn R Markdown for integrating prose, code, and results

Getting Up to Speed with Math and Statistical Analysis

Getting up to speed in math (the top links are all recommended by SDS faculty):

Calculus in Context
An introduction to Calculus, recommended by SDS Faculty. The book is focused on "calculus as it is used in contemporary science."
Mathematics for Machine Learning
Recommended by SDS faculty. "The book is not intended to cover advanced machine learning techniques because there are already plenty of books doing this. Instead, we aim to provide the necessary mathematical skills to read those other books."
From Algorithms to Z-Scores: Probabilistic and Statistical Modeling in Computer Science
"The materials here form a textbook for a course in mathematical probability and statistics for computer science students. (It would work fine for general students too.)" This resource is the open (and older) version of the textbook Probability and Statistics for Data Science: Math + R + Data. The latter book is recommended by SDS faculty.
Coursera: Mathematics for Machine Learning Specialization
"For a lot of higher level courses in Machine Learning and Data Science, you find you need to freshen up on the basics in mathematics - stuff you may have studied before in school or university, but which was taught in another context, or not very intuitively, such that you struggle to relate it to how it’s used in Computer Science. This specialization aims to bridge that gap, getting you up to speed in the underlying mathematics, building an intuitive understanding, and relating it to Machine Learning and Data Science." This specialization includes courses in linear algebra, multivariate calculus, and Principal Component Analysis (PCA).
Coursera: Statistical Inference
"This course presents the fundamentals of inference in a practical approach for getting things done. After taking this course, students will understand the broad directions of statistical inference and use this information for making informed choices in analyzing data."
O'Reilly: Essential Math for Data Science Playlist
"Learn fundamental linear algebra, calculus, probability, and statistics using Python—vital skills for data science—with resources from Hadrien Jean." Note: you must create an O'Reilly account in order to access content. Visit O'Reilly, from the drop-down menu, choose: 'Not listed? Click here.' Then enter UVA email address to access content.

Review what you know:

Linear Algebra Review and Reference
Review of linear algebra from Stanford CS229 Machine Learning course.
Review of Matrix multiplication, Diagonal matrices, Inverse matrix.
From TAMU MATH 304-505: Linear Algebra course.

More recommended resources for math, statistics, and data analysis:

UCLA’s IDRE's Data Analysis Examples
This is a collection of examples using R, SAS, SPSS, and Stata, illustrating the application of different statistical analysis techniques.
Practical Data Science for Stats
Preprints focusing on the practical side of data science workflows and statistical analysis. Curated by Jennifer Bryan and Hadley Wickham.
Statistical Modeling, Causal Inference, and Social Science
Andrew Gelman's blog about statistics.
Simply Statistics
By Rafa Irizarry, Roger Peng, and Jeff Leek.
R-Bloggers
A digest of R news and tutorials contributed by R users.
Mad (Data) Scientist
Norman Matloff’s personal blog on R and data science.

An Introduction to Statistical Learning by Gareth James; Daniela Witten; Trevor Hastie; Robert Tibshirani
ISBN: 9781071614174

Publication Date: 2021-07-30

An Introduction to Statistical Learning provides an accessible overview of the field of statistical learning, an essential toolset for making sense of the vast and complex data sets that have emerged in fields ranging from biology to finance to marketing to astrophysics in the past twenty years. This book presents some of the most important modeling and prediction techniques, along with relevant applications. Topics include linear regression, classification, resampling methods, shrinkage approaches, tree-based methods, support vector machines, clustering, deep learning, survival analysis, multiple testing, and more. Color graphics and real-world examples are used to illustrate the methods presented. Since the goal of this textbook is to facilitate the use of these statistical learning techniques by practitioners in science, industry, and other fields, each chapter contains a tutorial on implementing the analyses and methods presented in R, an extremely popular open source statistical software platform. Two of the authors co-wrote The Elements of Statistical Learning (Hastie, Tibshirani and Friedman, 2nd edition 2009), a popular reference book for statistics and machine learning researchers. An Introduction to Statistical Learning covers many of the same topics, but at a level accessible to a much broader audience. This book is targeted at statisticians and non-statisticians alike who wish to use cutting-edge statistical learning techniques to analyze their data. The text assumes only a previous course in linear regression and no knowledge of matrix algebra. This Second Edition features new chapters on deep learning, survival analysis, and multiple testing, as well as expanded treatments of naïve Bayes, generalized linear models, Bayesian additive regression trees, and matrix completion. R code has been updated throughout to ensure compatibility.
Introduction to Probability by Joseph K. Blitzstein; Jessica Hwang
ISBN: 9781466575578

Publication Date: 2014-07-24

Developed from celebrated Harvard statistics lectures, Introduction to Probability provides essential language and tools for understanding statistics, randomness, and uncertainty. The book explores a wide variety of applications and examples, ranging from coincidences and paradoxes to Google PageRank and Markov chain Monte Carlo (MCMC). Each chapter ends with a section showing how to perform relevant simulations and calculations in R, a free statistical software environment.
Probability by Jim Pitman
ISBN: 9780387979748

Publication Date: 1999-05-21

This is a text for a one-quarter or one-semester course in probability, aimed at students who have done a year of calculus. The book is organised so a student can learn the fundamental ideas of probability from the first three chapters without reliance on calculus. Later chapters develop these ideas further using calculus tools. The book contains more than the usual number of examples worked out in detail. The most valuable thing for students to learn from a course like this is how to pick up a probability problem in a new setting and relate it to the standard body of theory. The more they see this happen in class, and the more they do it themselves in exercises, the better. The style of the text is deliberately informal. My experience is that students learn more from intuitive explanations, diagrams, and examples than they do from theorems and proofs. So the emphasis is on problem solving rather than theory.

Getting Up to Speed with Coding

Getting up to speed in programming (the top links are all used by specific courses/bootcamps at UVA):

Surfing the Data Pipeline with Python
Written by Jon Kropko, a professor at SDS. "The pipeline refers to all of the steps needed to go from raw, messy, original data to data that is ready for any kind of analysis. In the real world, data is almost never ready to be analyzed without a great deal of work to prepare the data first. The goal of this book is to make this huge part of the job easier, faster, less frustrating, and more enjoyable for you. The techniques we will discuss are not the only ways to accomplish a task, but they represent fast and straightforward ways to do the work."
Programming for Data Science Bootcamp
Written by Raf Alvarado, professor at SDS. "This text is designed to provide all the content necessary to take the Programming for Data Science Bootcamps course at UVA’s School of Data Science."
The Coder's Apprentice: Learning Programming with Python 3
"The Coder's Apprentice is a course book, written by Pieter Spronck, that is aimed at teaching Python 3 to students and teenagers who are completely new to programming. Contrary to many of the other books that teach Python programming, this book assumes no previous knowledge of programming on the part of the students, and contains numerous exercises that allow students to train their programming skills." Used by UVA's Intro to Programming CS courses.

More recommended programming and coding resources:

Coursera: Python for Everybody Specialization
"Learn to program and analyze data with Python. Develop programs to gather, clean, analyze, and visualize data."
The Python Tutorial
The tutorial from the Python Software Foundation.
SQLite Python Tutorial

Python Data Science Handbook by Jake VanderPlas
ISBN: 9781491912058

Publication Date: 2016-12-10

For many researchers, Python is a first-class tool mainly because of its libraries for storing, manipulating, and gaining insight from data. Several resources exist for individual pieces of this data science stack, but only with the Python Data Science Handbook do you get them all: IPython, NumPy, Pandas, Matplotlib, Scikit-Learn, and other related tools. Working scientists and data crunchers familiar with reading and writing Python code will find this comprehensive desk reference ideal for tackling day-to-day issues: manipulating, transforming, and cleaning data; visualizing different types of data; and using data to build statistical or machine learning models. Quite simply, this is the must-have reference for scientific computing in Python. With this handbook, you'll learn how to use:
IPython and Jupyter: provide computational environments for data scientists using Python
NumPy: includes the ndarray for efficient storage and manipulation of dense data arrays in Python
Pandas: features the DataFrame for efficient storage and manipulation of labeled/columnar data in Python
Matplotlib: includes capabilities for a flexible range of data visualizations in Python
Scikit-Learn: for efficient and clean Python implementations of the most important and established machine learning algorithms
Learn Python 3 the Hard Way by Zed A. Shaw
ISBN: 9780134692883

Publication Date: 2017-06-27

You Will Learn Python 3! Zed Shaw has perfected the world's best system for learning Python 3. Follow it and you will succeed, just like the millions of beginners Zed has taught to date! You bring the discipline, commitment, and persistence; the author supplies everything else. In Learn Python 3 the Hard Way, you'll learn Python by working through 52 brilliantly crafted exercises. Read them. Type their code precisely. (No copying and pasting!) Fix your mistakes. Watch the programs run. As you do, you'll learn how a computer works; what good programs look like; and how to read, write, and think about code. Zed then teaches you even more in 5+ hours of video where he shows you how to break, fix, and debug your code, live, as he's doing the exercises.
Install a complete Python environment
Organize and write code
Fix and break code
Basic mathematics
Variables
Strings and text
Interact with users
Work with files
Looping and logic
Data structures using lists and dictionaries
Program design
Object-oriented programming
Inheritance and composition
Modules, classes, and objects
Python packaging
Automated testing
Basic game development
Basic web development
It'll be hard at first. But soon, you'll just get it, and that will feel great! This course will reward you for every minute you put into it. Soon, you'll know one of the world's most powerful, popular programming languages. You'll be a Python programmer.
This Book Is Perfect For
Total beginners with zero programming experience
Junior developers who know one or two languages
Returning professionals who haven't written code in years
Seasoned professionals looking for a fast, simple, crash course in Python 3
Automate the Boring Stuff with Python by Albert Sweigart
ISBN: 9781593275990

Publication Date: 2015-04-14

If you've ever spent hours renaming files or updating hundreds of spreadsheet cells, you know how tedious tasks like these can be. But what if you could have your computer do them for you? In Automate the Boring Stuff with Python, you'll learn how to use Python to write programs that do in minutes what would take you hours to do by hand, no prior programming experience required. Once you've mastered the basics of programming, you'll create Python programs that effortlessly perform useful and impressive feats of automation to:
Search for text in a file or across multiple files
Create, update, move, and rename files and folders
Search the Web and download online content
Update and format data in Excel spreadsheets of any size
Split, merge, watermark, and encrypt PDFs
Send reminder emails and text notifications
Fill out online forms
Step-by-step instructions walk you through each program, and practice projects at the end of each chapter challenge you to improve those programs and u