George Mason University | University Libraries

Learn Python for Data

Resources to learn and use the Open Source Programming Environment Python for Data Science.

First

Almost all tutorials on doing Data Science, Statistical Analysis, or Machine Learning in Python assume that you already know how to use Python.
Some also assume knowledge of the pandas library, so be sure to check before you start.

  • NumPy
  • SciPy
  • pandas
  • statsmodels
  • scikit-learn

"R is a language dedicated to statistics. Python is a general-purpose language with statistics modules. R has more statistical analysis features than Python, and specialized syntaxes. However, when it comes to building complex analysis pipelines that mix statistics with e.g. image analysis, text mining, or control of a physical experiment, the richness of Python is an invaluable asset." - Gaël Varoquaux

Statistics

These are resources for using Python to do basic descriptive and inferential statistics as used by academic researchers and statisticians. For additional information on statistical modeling, see the materials on Machine Learning. 
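As a rough illustration of what "descriptive and inferential statistics" looks like in code, here is a minimal sketch that uses only Python's built-in statistics module, so it runs without installing anything. The data, the helper names describe and welch_t, and the choice of Welch's t statistic are assumptions made for this example; in practice, libraries such as SciPy and statsmodels offer ready-made and far more complete versions of these calculations, including p-values.

```python
import math
import statistics

# Hypothetical sample data: scores for two small, independent groups.
group_a = [72, 85, 90, 66, 78, 88, 95, 70]
group_b = [65, 70, 74, 60, 68, 72, 80, 63]


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


def welch_t(a, b):
    """Compute Welch's t statistic for two independent samples.

    Only the test statistic is computed here; a p-value would also
    require the t distribution, which SciPy provides.
    """
    standard_error = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


summary_a = describe(group_a)
summary_b = describe(group_b)
t_stat = welch_t(group_a, group_b)

print("Group A:", summary_a)
print("Group B:", summary_b)
print("Welch's t statistic:", round(t_stat, 3))
```

The descriptive step summarizes each group on its own, while the inferential step asks whether the difference between the group means is large relative to the variability in the samples.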

Python for Biostatistics

Popular Machine Learning Libraries

Scikit-Learn

Website: https://scikit-learn.org/

  • Excellent machine learning library containing a huge catalogue of models and algorithms.
  • Supports classification, regression, clustering, data cleaning, and feature engineering.
  • Used by both machine learning beginners and experts.
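To give a sense of the workflow these points describe, the following is a minimal sketch of a classification task, assuming scikit-learn is installed. The choice of the bundled iris dataset, logistic regression, a 25% test split, and the variable names are all assumptions made for illustration, not a recommendation from the resources listed here.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a small example dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows to estimate performance on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Chain a preprocessing step (feature scaling) with a classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# score() reports mean accuracy for classifiers.
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Most scikit-learn estimators follow the same fit and score (or predict) pattern, so swapping LogisticRegression for another classifier usually requires changing only one line.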

Recommended: Scikit-Learn Coding Examples & A Gentle Introduction to Scikit-Learn

MOOC Course Materials: Scikit-learn Course by the developers

Videos: Introduction to Machine Learning with SciKit-learn (DataSchool) - Free registration or watch on YouTube

From Scikit-Learn: 

TensorFlow

Website: https://www.tensorflow.org/

  • Open source Python library for deep learning and neural networks from Google.
  • More complex models, a complicated base syntax, and a steeper learning curve.
  • Supports transfer learning and access to pre-trained models through TensorFlow Hub.

Recommended: TensorFlow Coding Examples

Keras

Website: https://keras.io/

  • Deep learning API that interfaces with TensorFlow.
  • Offers a much more Pythonic syntax that makes programming deep neural networks easier in TensorFlow.
  • Extremely popular—second only to Scikit-Learn, according to Keras.

Recommended: Keras Coding Examples

PyTorch

Website: https://pytorch.org/

  • Open source Python library for deep learning and neural networks, originally developed by Facebook (now Meta).
  • Allows for building complex models and has a learning curve, but the syntax is more Pythonic than base TensorFlow.
  • Supports access to pre-trained models, extensions, and modules via the PyTorch Ecosystem.

Recommended: Practical Deep Learning for Coders Course by fast.ai (Free!)

Books

Data Science

These books focus on data management, and sometimes analysis. 

Has sections on NumPy, pandas, and Matplotlib. Covers data management and exploration (not statistical modeling or testing). See the Jupyter notebooks that make up the entire book at the author's GitHub repository.

Introduction to exploratory data analysis, as tends to be done in the social and health sciences. Covers NumPy, pandas, SciPy, Matplotlib, and some statsmodels for regression and time series. 

Code is provided in both R and Python and is available on GitHub. Assumes familiarity with Python and statistics. Covers exploratory data analysis, sampling distributions, significance testing, regression, classification, and both supervised and unsupervised learning. Uses scikit-learn, supplemented by statsmodels. 

Machine Learning, including Deep Learning

These books cover TensorFlow and Keras. 

Does not require prior knowledge of machine learning. Offers both hands-on experience with machine learning as well as the concepts behind the algorithms, how to use them, and how to avoid common pitfalls. Covers classification and regression, data pre-processing, applications of machine learning, and neural networks. 

For people who are comfortable with Python and machine learning and need a quick reference for the code to use. Covers loading and wrangling data, preparing different data types, and analyses from linear regression through neural networks.

Other Tutorials