Introduction

Python is an attractive option for data analysis: it's open source and cross-platform, excellent for file and data manipulation, and it has many scientific and numerical computation libraries. The "big win" is that you can potentially use Python as your only language for data analysis (i.e. read in your data files, clean up the data, perform various statistical tests, and output graphs and other visualizations).

If you go this route, you'll want to be familiar with this stack:

- IPython, an interactive version of Python

- Notebook (optional), a web-based environment for IPython that combines markdown, code, and output

- Numpy, the primary scientific computing library for Python

- Pandas, a data analysis library for Python

- Matplotlib, a 2D plotting library for graphs and visualizations

Typically, you'll use Python to clean up your data and format it into structures that Numpy and Pandas can work with directly. Matplotlib is used to generate graphs if needed.
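The workflow just described can be sketched roughly as follows. This is a minimal illustration, not a recommended recipe: the raw data, the column names (participant, condition, time_ms), and the helper names clean_rows, to_frame, and plot_means are all made up for this example.

```python
import csv
import io

import numpy as np
import pandas as pd

# Hypothetical raw data with stray whitespace, a blank line, and a bad value.
RAW = """participant, condition, time_ms
 p1, A, 512
p2 , B, 430

p3, A, n/a
p4, B, 478
"""


def clean_rows(text):
    """Plain-Python cleanup: strip whitespace in each cell, skip blank lines."""
    reader = csv.reader(io.StringIO(text))
    rows = [[cell.strip() for cell in row] for row in reader if row]
    return rows[0], rows[1:]  # header, body


def to_frame(header, body):
    """Load the cleaned rows into a Pandas DataFrame with a numeric column."""
    df = pd.DataFrame(body, columns=header)
    # Entries that cannot be parsed as numbers (e.g. "n/a") become NaN.
    df["time_ms"] = pd.to_numeric(df["time_ms"], errors="coerce")
    return df


def plot_means(means, path="means.png"):
    """Optional: save a bar chart of the per-condition means with Matplotlib."""
    import matplotlib
    matplotlib.use("Agg")  # write to a file without needing a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    means.plot(kind="bar", ax=ax)
    ax.set_ylabel("mean time (ms)")
    fig.savefig(path)
    plt.close(fig)


header, body = clean_rows(RAW)
df = to_frame(header, body)

# Pandas: per-condition mean (NaN values are skipped by default).
means = df.groupby("condition")["time_ms"].mean()

# Numpy: overall mean of the underlying array, ignoring NaN.
overall = np.nanmean(df["time_ms"].to_numpy())

print(means)
print(overall)
```

Calling plot_means(means) afterwards would write the chart to means.png in the current directory; it is left out of the main flow because graphs are only needed some of the time.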

Notebook is optional, but it is a great way to package documentation, code, and results together in a single file, which makes it well suited to sharing results.

Resources

In addition to the home pages of the respective packages (above), I've found the following resources to be really helpful:

Wes McKinney, Python for Data Analysis. Written by the author of Pandas, it includes a high-level introduction to Python and Numpy for data analysis. Cheap, available electronically or in print, and very, very well-written.

YouTube Video: Wes McKinney talking about Python for Data Analysis

YouTube Video: Alfred Essa, Introduction to Python for Data Analysis (very short)

Getting Started

Each of the packages mentioned above can be downloaded and installed separately, but it's much simpler to download a single distribution that contains everything.

Enthought Canopy (formerly known as the Enthought Python Distribution, or EPD) - what I use

Anaconda - no experience

Canopy, for instance, is a single file download that bundles the appropriate versions of IPython and all the libraries, and configures your environment. It also includes a package manager to install other libraries, keep things up to date, etc.

Note that Canopy is a commercial product, but it has a free academic license. Make sure to sign up using your @uwaterloo.ca email address.
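Once a distribution is installed, a quick check along these lines can confirm that the stack is importable from your Python. It is only a sketch: the helper name check_stack is invented here, and it assumes each package exposes a __version__ attribute (falling back to "unknown" if not).

```python
import importlib

# Import names of the packages in the stack described above.
PACKAGES = ["IPython", "numpy", "pandas", "matplotlib"]


def check_stack(names=PACKAGES):
    """Return {package: version string, or None if it cannot be imported}."""
    report = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = None
        else:
            report[name] = getattr(module, "__version__", "unknown")
    return report


for name, version in check_stack().items():
    print(f"{name:12s} {version or 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED can then be added through the distribution's package manager.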

Samples

See Wes McKinney's blog

-- JeffAvery - 2013-07-05

Topic revision: r1 - 2013-07-05 - JeffAvery
 