George Mason University | University Libraries

Software: Learn Python for Data

Resources for learning and using Python, an open-source programming language, for data science.

To get up and running with Python, you need three elements:

  1. Python
  2. Packages
  3. An Interface

Different tutorials suggest or illustrate specific setups. Either choose a setup and then pick a matching tutorial, or choose a tutorial and follow its setup instructions. Many tutorials include step-by-step instructions.

In the Cloud

  • Anaconda Cloud - Free basic plan allows for Jupyter Notebook or IDLE console.
  • Python Anywhere by Anaconda - Free basic plan just allows the IDLE console. $5/month+ for Jupyter Notebook support.
  • Google Colab - Uses a version of Jupyter Notebook only. Requires a Google account.
  • GitHub Codespaces - Free plan limits storage and core hours. Uses Jupyter Notebook.
  • Kaggle Code - Uses Kaggle Notebooks (similar to Jupyter Notebooks). Can use without logging in.
  • Replit - Free basic plan allows for 3 public projects and limited resources. Use scripts or the console in a clean interface.

Installing Python

1. Python

You can download and install plain Python from Python.org, or choose a distribution that bundles additional packages and interfaces (Steps 2 and 3) by default. You can also install those pieces yourself later. The most popular options include:

  • Anaconda - Comes with all the most popular packages and Jupyter Notebook, so you can avoid the two steps below.
  • Miniconda - A minimal version of Anaconda with the conda package manager
  • WinPython - Portable Python for Windows Computers for scientific purposes
  • Python - The basic, no frills Python, using the Python Package Index
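
Whichever option you choose, it can help to confirm which Python you are actually running. The following is a minimal sketch using only the standard library; the function name `describe_python` is made up for this illustration.

```python
import platform
import sys


def describe_python():
    """Return a short summary of the interpreter running this code."""
    return {
        "version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "executable": sys.executable,  # path to the running Python program
    }


if __name__ == "__main__":
    for key, value in describe_python().items():
        print(f"{key}: {value}")
```

If the executable path points somewhere unexpected, you may have more than one Python installed.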

Installing Packages

2. Packages

Python comes with a base set of functions, but most functionality is provided through packages (each containing many functions). Each package focuses on a different kind of task. Packages may also be called libraries, although a library can be a collection of several packages.
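
As a small illustration of the difference, the sketch below uses two built-in functions and then imports `statistics`, a module that ships with Python, for functions the base set does not provide. The sample values are made up for the example.

```python
import statistics  # part of Python's standard library, but must be imported

values = [2, 4, 4, 4, 5, 5, 7, 9]

# Built-in functions are available without importing anything.
total = sum(values)
largest = max(values)

# Functions that live in a module are reached through the module's name.
average = statistics.mean(values)
spread = statistics.pstdev(values)  # population standard deviation

print(total, largest, average, spread)
```

Third-party packages such as NumPy or Pandas are used the same way, but they must be installed before they can be imported.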

The Anaconda distribution (above) comes with the most commonly used data science packages already installed, so you can skip this step. Commonly used packages include:

  • NumPy - for numerical computation in arrays
  • SciPy - for scientific computation
  • Pandas - for representing tabular data
  • StatsModels - for statistical analysis
  • SciKit-Learn - for machine learning

See the Visualization and Data Collection tabs for lists of relevant packages.
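
To see which of the packages listed above are already available, one option is to ask Python whether each can be found. This is a sketch, not the only approach; the helper name `find_missing` is made up here. Note that the import name can differ from the package name (scikit-learn is imported as `sklearn`).

```python
from importlib import util

# Display name -> name used in an import statement.
IMPORT_NAMES = {
    "NumPy": "numpy",
    "SciPy": "scipy",
    "Pandas": "pandas",
    "StatsModels": "statsmodels",
    "SciKit-Learn": "sklearn",
}


def find_missing(import_names):
    """Return the display names of packages Python cannot locate."""
    return [
        label
        for label, module in import_names.items()
        if util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = find_missing(IMPORT_NAMES)
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All listed packages are installed.")
```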

 

To install packages

There are two common package managers: pip and conda (the package manager included with Anaconda and Miniconda). If you installed the Anaconda or Miniconda distribution, follow the instructions on the conda tab. Otherwise, or if conda doesn't work, follow the instructions on the pip tab, since pip is included with most Python installations.
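
If you are not sure which kind of installation you have, the sketch below makes a rough guess. It relies on a heuristic, namely that conda environments normally contain a folder called `conda-meta`, so treat the answer as a hint and not a guarantee. The function names are made up for this example.

```python
import os
import sys


def looks_like_conda(prefix=sys.prefix):
    """Heuristic: conda environments normally contain a 'conda-meta' folder."""
    return os.path.isdir(os.path.join(prefix, "conda-meta"))


def suggested_installer(prefix=sys.prefix):
    """Suggest 'conda' for conda-style installations, otherwise 'pip'."""
    return "conda" if looks_like_conda(prefix) else "pip"


if __name__ == "__main__":
    print("Suggested package manager:", suggested_installer())
```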

See the pip and conda tabs above for more information

Installing Packages with pip

First, open your terminal. On Windows, this is usually Command Prompt; on macOS and Linux, it is usually Terminal. Assuming Python was added to your system's PATH during installation, you can install packages with pip's install command like this:

pip install numpy
pip install pandas
pip install scikit-learn
pip install scipy
pip install matplotlib
pip install seaborn

Sometimes it is recommended to add the --upgrade flag to your pip install command. This (1) ensures you are installing the latest version of the package, or (2) updates an already installed package to a newer version:

pip install --upgrade numpy
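
Before upgrading, it can be useful to check which version, if any, is already installed. The sketch below uses the standard library's `importlib.metadata` (available in Python 3.8 and later); the helper name `installed_version` is made up for this example.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version as a string, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name in ["numpy", "pandas", "scikit-learn"]:
        version = installed_version(name)
        print(f"{name}: {version if version else 'not installed'}")
```

Here the name to pass is the one used with pip (for example `scikit-learn`), not the import name.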

When you have several packages to install at once, such as the six data science packages in our example, installing them one at a time is tedious. To speed this up, you can list all the packages in a single install command:

pip install --upgrade numpy pandas scikit-learn scipy matplotlib seaborn
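
If the pip command is not found, or you have several Pythons installed, a common workaround is to run pip through a specific Python with `python -m pip`. The sketch below builds that command for the Python that is running the script; it only prints the command unless you uncomment the last lines. The function name `pip_install_command` is made up for this example.

```python
import sys


def pip_install_command(packages, upgrade=False):
    """Build a 'python -m pip install ...' command as a list of arguments."""
    command = [sys.executable, "-m", "pip", "install"]
    if upgrade:
        command.append("--upgrade")
    command.extend(packages)
    return command


PACKAGES = ["numpy", "pandas", "scikit-learn", "scipy", "matplotlib", "seaborn"]

if __name__ == "__main__":
    print(" ".join(pip_install_command(PACKAGES, upgrade=True)))
    # To actually run the install, uncomment the two lines below:
    # import subprocess
    # subprocess.run(pip_install_command(PACKAGES, upgrade=True), check=True)
```

Using `sys.executable` means the packages are installed for the same Python that runs the script, which avoids relying on the PATH setting.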

For a full list of pip commands, reference the pip documentation.

Installing Packages With Conda

Installing packages with Conda is very similar to pip, though with some caveats. First, we will call conda from the command line to initiate an install:

conda install numpy

However, the default conda channel does not carry every package, so it is often better to name a specific channel, such as the community-maintained conda-forge, when installing with conda:

conda install -c conda-forge numpy 

We can tell conda we want to prioritize the conda-forge channel this way:

conda config --add channels conda-forge
conda config --set channel_priority strict

# Now we can simply call:
conda install <name_of_package>

We can also install packages in bulk:

conda install numpy pandas scikit-learn scipy matplotlib seaborn

Reference the conda documentation for more commands and advanced features.

Installing an Interface / Environment

3. An Interface

Some Python distributions come with one or more of these, so check to see if it is already installed. These are just the most popular, but there are others.