Numpy array with dimensions
dimarray is a package to handle numpy arrays with labelled dimensions and axes. Inspired by pandas, it includes advanced alignment and reshaping features as well as missing-value (NaN) handling.
The main difference from pandas is that it is generalized to N dimensions and behaves more like a numpy array. The axes do not have fixed names (‘index’, ‘columns’, etc.) but are given meaningful names by the user (e.g. ‘time’, ‘items’, ‘lon’). This is especially useful for high-dimensional problems such as sensitivity analyses.
A natural I/O format for such an array is netCDF, which is common in geophysics and supports metadata. netCDF I/O in dimarray relies on the netCDF4 package.
dimarray is distributed under a 3-clause (“Simplified” or “New”) BSD license. Parts of basemap which have BSD compatible licenses are included. See the LICENSE file, which is distributed with the dimarray package, for details.
A ``DimArray`` can be defined just like a numpy array, with additional information about its dimensions, which can be provided via its axes and dims parameters:
>>> from dimarray import DimArray
>>> a = DimArray([[1.,2,3], [4,5,6]], axes=[['a', 'b'], [1950, 1960, 1970]], dims=['variable', 'time'])
>>> a
dimarray: 6 non-null elements (0 null)
0 / variable (2): 'a' to 'b'
1 / time (3): 1950 to 1970
array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.]])
Indexing now works on axis labels:
>>> a['b', 1970]
6.0
Indexing can also be done à la numpy, via integer position:
>>> a.ix[0, -1]
3.0
Basic numpy transformations are also available:
>>> a.mean(axis='time')
dimarray: 2 non-null elements (0 null)
0 / variable (2): 'a' to 'b'
array([ 2.,  5.])
A DimArray can be exported to pandas for pretty printing:
>>> a.to_pandas()
time      1950  1960  1970
variable
a            1     2     3
b            4     5     6
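To make the label-based operations above more concrete, here is a minimal plain-numpy sketch of the same three lookups. It is not how dimarray is implemented; the helper names `locate` and `mean_over` are hypothetical and only illustrate the idea of translating labels and dimension names into integer positions.

```python
import numpy as np

# Plain-numpy stand-in for the DimArray `a` above (illustration only).
values = np.array([[1., 2., 3.], [4., 5., 6.]])
dims = ['variable', 'time']
axes = [['a', 'b'], [1950, 1960, 1970]]

def locate(labels):
    """Translate one label per dimension into a tuple of integer positions."""
    return tuple(list(ax).index(lab) for ax, lab in zip(axes, labels))

def mean_over(dim_name):
    """Average over the dimension called `dim_name` (hypothetical helper)."""
    return values.mean(axis=dims.index(dim_name))

by_label = values[locate(['b', 1970])]  # same element as a['b', 1970]
by_position = values[0, -1]             # same element as a.ix[0, -1]
time_mean = mean_over('time')           # same values as a.mean(axis='time')
```

A labelled array saves you from keeping `dims` and `axes` in sync with `values` by hand, which is where most of the bookkeeping errors in the plain-numpy version tend to come from.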
- python 2.7
- numpy (tested with 1.7, 1.8, 1.9, 1.10.1)
- netCDF4 (tested with 1.0.8, 1.2.1) (netCDF archiving) (see notes below)
- matplotlib 1.1 (plotting)
- pandas 0.11 (interface with pandas)
- cartopy 0.11 (dimarray.geo.crs)
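Before installing, it can help to check which of the packages listed above are already importable. The snippet below is a small sketch, not part of dimarray; the function name `check_dependencies` is made up for illustration, and it only tests that each module imports, not that its version matches the ones listed.

```python
import importlib

# Optional dependencies and the feature each one enables (from the list above).
OPTIONAL = {
    "netCDF4": "netCDF archiving",
    "matplotlib": "plotting",
    "pandas": "interface with pandas",
    "cartopy": "dimarray.geo.crs",
}

def check_dependencies(required=("numpy",), optional=OPTIONAL):
    """Return {module_name: True/False} depending on whether it can be imported."""
    status = {}
    for name in list(required) + list(optional):
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status

status = check_dependencies()
for name in sorted(status):
    feature = OPTIONAL.get(name, "required")
    print("{:<12} {:<8} ({})".format(name, "found" if status[name] else "missing", feature))
```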
Download the latest version from GitHub and extract the archive. Then, from the dimarray directory, type (possibly preceded by sudo):
python setup.py install
Alternatively, you can use pip to download and install the version from PyPI (which may be slightly out of date):
pip install dimarray
Notes on installing netCDF4
- On Ubuntu, using apt-get is the easiest way (as indicated at https://github.com/Unidata/netcdf4-python/blob/master/.travis.yml):
sudo apt-get install libhdf5-serial-dev netcdf-bin libnetcdf-dev
- On Windows, binaries are available: http://www.unidata.ucar.edu/software/netcdf/docs/winbin.html
- From source. Installing the netCDF4 python module from source can be cumbersome, because it depends on the netCDF4 and (especially) HDF5 C libraries, which need to be compiled with specific flags (http://unidata.github.io/netcdf4-python). Detailed information on Ubuntu: https://code.google.com/p/netcdf4-python/wiki/UbuntuInstall
All suggestions for improvement or direct contributions are very welcome. You can ask a question or start a discussion on the mailing list or open an issue on github for precise requests. See links.
The ecosystem of labelled arrays
A brief overview of the various array packages available. The listing is chronological. dimarray is strongly inspired by the pandas and larry packages.
numpy provides the basic array object, transformations and so on. It does not include axis labels and has limited support for missing values (NaNs). An extension, numpy.ma, adds a mask attribute and skips masked values (such as NaNs) in transformations.
larry pioneered labelled arrays. It skips NaNs in transforms and comes with a wealth of built-in methods. It is very computationally efficient thanks to its use of bottleneck. For now it does not support naming dimensions.
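As a short illustration of the numpy behaviour described above, the sketch below contrasts a plain mean, the NaN-aware `np.nanmean`, and a masked array built with `np.ma.masked_invalid`. The variable names are arbitrary.

```python
import numpy as np

x = np.array([1.0, np.nan, 3.0])

# A plain reduction propagates the NaN.
plain_mean = x.mean()

# NaN-aware variant: ignores NaN entries.
nan_mean = np.nanmean(x)

# numpy.ma: mask the invalid entries, then reduce over the unmasked ones.
masked = np.ma.masked_invalid(x)
masked_mean = masked.mean()
```

Here `plain_mean` is NaN, while `nan_mean` and `masked_mean` both average only the two valid values.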
pandas is an excellent package for low-dimensional data analysis, supporting many I/O formats and axis alignment (or “reindexing”) in binary operations. It is mostly limited to 2 dimensions (DataFrame), or up to 4 dimensions (Panel, Panel4D).
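The axis alignment mentioned above can be seen with two pandas Series whose indexes only partly overlap. This is a minimal sketch using standard pandas behaviour; the data are made up.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=[1950, 1960, 1970])
s2 = pd.Series([10.0, 20.0, 30.0], index=[1960, 1970, 1980])

# Binary operations align on index labels, not on position.
# Labels present in only one operand yield NaN in the result.
total = s1 + s2
```

The result covers the union of both indexes: the values for 1960 and 1970 are summed, while 1950 and 1980 are NaN because each appears in only one Series.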
iris looks like a very powerful package to manipulate geospatial data with metadata, netCDF I/O, grid transforms, etc., but it is quite a jump from numpy’s ndarray in terms of syntax and requires a bit of learning.
dimarray, like iris, considers dimension names as a fundamental property of an array, and as such supports the netCDF I/O format. It makes use of dimension names in binary operations (broadcasting), transforms and indexing. It includes some of the nice features of pandas (e.g. axis alignment, optional NaN skipping) but extends them to N dimensions, with a behaviour closer to a numpy array. Some geo features are planned (weighted mean for latitude, indexing modulo 360 for longitude, basic regridding) but dimarray should remain broad in scope.
xray has many similarities with dimarray, but has a stronger focus on following the Common Data Model, i.e. its primary object is the Dataset, whereas dimarray is centered on the multidimensional array, with the Dataset being more of an extension (a dictionary of DimArrays). Moreover, xray is tightly integrated with pandas (it uses pandas Index as coordinates), whereas dimarray simply provides methods to exchange data to and from pandas objects, but does not actually rely on pandas. This makes dimarray plus its dependencies somewhat more lightweight than xray. In terms of speed (indexing, etc.), things are comparable, since all computationally intensive operations in dimarray rely on fast numpy. xray is a very thorough project (e.g. unit tests) with a committed author, so if your primary focus is to work with netCDF, access openDAP data and so on, you may want to go that way. Try it out!
spacegrids is a promising new package with a focus on geospatial grids. It intends to streamline a number of operations such as differentiation, integration and regridding by proposing an algebra between arrays and axes (grids). It also includes a project management utility for netCDF files.
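To illustrate what using dimension names in binary operations can mean, here is a plain-numpy sketch of name-based broadcasting. It is an assumption-laden illustration rather than dimarray's implementation: the function `add_by_name` is hypothetical, and it only lines up dimensions by name, ignoring the alignment of axis values that dimarray also performs.

```python
import numpy as np

def add_by_name(x, x_dims, y, y_dims):
    """Add two arrays after matching their axes by dimension name.

    Returns the sum and the list of dimension names of the result.
    Hypothetical helper: axis values are assumed to be identical already.
    """
    x_dims, y_dims = list(x_dims), list(y_dims)
    # Result dimensions: those of x, then any dimension found only in y.
    out_dims = x_dims + [d for d in y_dims if d not in x_dims]

    def expand(arr, dims):
        # Put the existing axes in the order they take in out_dims ...
        present = [d for d in out_dims if d in dims]
        arr = np.transpose(arr, [dims.index(d) for d in present])
        # ... then give every missing dimension a length-1 axis.
        shape = [arr.shape[present.index(d)] if d in present else 1
                 for d in out_dims]
        return arr.reshape(shape)

    return expand(np.asarray(x), x_dims) + expand(np.asarray(y), y_dims), out_dims

data = np.array([[1., 2., 3.], [4., 5., 6.]])  # dims: variable, time
offset = np.array([100., 200.])                # dims: variable

# Positional broadcasting (data + offset) would fail here: shapes (2, 3) and (2,).
result, result_dims = add_by_name(data, ['variable', 'time'], offset, ['variable'])
```

Because the match is made on names, the order of the dimensions in each operand does not matter, which is the main convenience over numpy's purely positional broadcasting rules.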