What is HDF5?
HDF5 (Hierarchical Data Format version 5) is a binary container format, with an accompanying library, designed for storing large scientific datasets with complex structure. A single file holds heterogeneous data (arrays, tables, images, metadata) in a file-system-like hierarchy of groups, which act like directories, and datasets, which hold typed arrays. HDF5 supports multi-dimensional arrays of up to 32 dimensions, chunked storage for partial I/O, compression through filters (gzip and szip in the core library; LZF and Blosc through add-on filters), and parallel I/O over MPI on HPC systems.
HDF5 is a de facto standard in scientific computing: it is used by NASA for satellite data, by CERN for particle physics, in genomics research (10X Genomics), in machine learning (Keras/TensorFlow model storage), in climate modeling, and in medical imaging. It is popular in Python (h5py, PyTables), MATLAB, R, and Julia for handling datasets that do not fit in memory, and it is a good fit when you need efficient random access to terabyte-scale arrays.
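To make the group/dataset model and random access concrete, here is a minimal sketch using h5py. The file name, the group path `experiment/run_001`, and the dataset name `temperature` are made up for illustration; only the h5py calls themselves come from the library.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file name and layout, chosen only for illustration.
path = os.path.join(tempfile.mkdtemp(), "example.h5")

# Write: groups behave like directories, datasets like files holding arrays.
with h5py.File(path, "w") as f:
    # Intermediate groups ("experiment") are created automatically.
    run = f.create_group("experiment/run_001")
    run.create_dataset("temperature", data=np.arange(1000, dtype="float64"))

# Read: slicing a dataset pulls only the requested elements from disk.
with h5py.File(path, "r") as f:
    dset = f["experiment/run_001/temperature"]  # path-style addressing
    shape = dset.shape
    window = dset[100:105]                      # NumPy array with 5 values

print(shape, window)
```

The dataset object returned by `f[...]` is a handle, not the data itself; nothing is loaded into memory until it is sliced, which is what makes this pattern usable on arrays far larger than RAM.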
History
HDF5 was developed at the National Center for Supercomputing Applications (NCSA) to handle large scientific datasets with complex structures. Since 2006 it has been maintained by The HDF Group, a non-profit organization spun off from NCSA.
Key Milestones
- 1988: Original HDF format, the predecessor of HDF4 and HDF5, created at NCSA
- 1998: HDF5 1.0 released
- 2006: NASA adopts HDF5 for Earth science data
- 2015: Keras released, using HDF5 to store model weights
- 2016: HDF5 1.10 released, adding single-writer/multiple-reader (SWMR) access and virtual datasets
- Present: Scientific computing standard
Key Features
Core Capabilities
- Hierarchical: Group/dataset structure
- Multi-Dimensional: Up to 32 dimensions
- Chunking: Partial array access
- Compression: gzip and szip built in; LZF and Blosc via add-on filters
- Metadata: Rich attribute storage
- Parallel I/O: MPI support
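Several of the capabilities listed above (chunking, compression, attribute metadata) can be combined on a single dataset. The following is a sketch with h5py under assumed values: the dataset name `images`, the chunk shape, the gzip level, and the attribute names are illustrative choices, not recommendations.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "chunked.h5")  # hypothetical file name

with h5py.File(path, "w") as f:
    images = f.create_dataset(
        "images",
        shape=(100, 64, 64),
        dtype="uint8",
        chunks=(1, 64, 64),    # one image per chunk
        compression="gzip",    # built-in deflate filter
        compression_opts=4,    # gzip level, 0-9
    )
    images[10] = np.full((64, 64), 7, dtype="uint8")

    # Attributes are small named values attached to a group or dataset.
    images.attrs["units"] = "counts"
    images.attrs["pixel_size_um"] = 6.5

with h5py.File(path, "r") as f:
    images = f["images"]
    one_image = images[10]     # only the chunk holding image 10 is decompressed
    chunk_shape = images.chunks
    codec = images.compression
    units = images.attrs["units"]

print(chunk_shape, codec, units, one_image.shape)
```

Chunk shape should follow the expected access pattern: with one image per chunk, reading a single image touches one chunk, whereas reading one pixel across all 100 images would touch every chunk.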
Common Use Cases
- Research: Scientific datasets
- Machine Learning: Neural network weights
- Aerospace: NASA satellite data
- Genomics: DNA sequencing data
Advantages
- Handles massive datasets efficiently
- Complex hierarchical structures
- Partial data access (chunking)
- Excellent compression support
- Rich metadata capabilities
- Parallel I/O for HPC
- Wide scientific software support
Disadvantages
- Complex format specification
- Steep learning curve
- Files can be corrupted if not closed properly (for example, when a process crashes mid-write)
- Not human-readable
- Overkill for simple data
- Requires specialized libraries
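The risk from improper closure can be reduced by keeping files open only inside a context manager, so the file is closed even when an exception is raised. The sketch below assumes h5py; the helper `append_rows` and the dataset name `log` are hypothetical names introduced for this example.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "safe.h5")  # hypothetical file name


def append_rows(path, rows):
    """Append rows to a resizable dataset, closing the file even on errors."""
    with h5py.File(path, "a") as f:  # "a": read/write, create if missing
        if "log" not in f:
            f.create_dataset(
                "log",
                shape=(0, rows.shape[1]),
                maxshape=(None, rows.shape[1]),  # first axis can grow
                chunks=True,                     # resizable datasets must be chunked
                dtype=rows.dtype,
            )
        dset = f["log"]
        start = dset.shape[0]
        dset.resize(start + rows.shape[0], axis=0)
        dset[start:] = rows
        f.flush()  # ask the library to write buffered data out
    # Leaving the with-block closes the file, with or without an exception.


append_rows(path, np.ones((3, 2)))
append_rows(path, np.zeros((2, 2)))

with h5py.File(path, "r") as f:
    total_rows = f["log"].shape[0]

print(total_rows)
```

This does not protect against a hard crash or power loss in the middle of a write; it only ensures that ordinary Python errors do not leave the file open.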
Technical Information
Format Specifications
| Specification | Details |
|---|---|
| File Extension | .hdf5, .h5, .hdf |
| MIME Type | application/x-hdf5 |
| Max Dimensions | 32 dimensions |
| Compression | gzip, szip (built in); LZF, Blosc (add-on filters) |
| Parallel I/O | MPI support |
| License | BSD-style (free) |
Common Tools
- Python: h5py, PyTables, Keras (model storage)
- MATLAB: Built-in HDF5 support
- R: rhdf5, hdf5r packages
- Tools: HDFView (GUI), h5dump, h5ls
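When the command-line tools are not at hand, a listing roughly similar to what `h5ls` shows in recursive mode can be produced from Python. This is a sketch using h5py's `visititems`; the function `list_contents` and the sample file layout are hypothetical, and the output format does not match `h5ls` exactly.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "tree.h5")  # hypothetical file name

# Build a small sample file to inspect.
with h5py.File(path, "w") as f:
    f.create_dataset("raw/counts", data=np.zeros((4, 3), dtype="int32"))
    f.create_group("results")


def list_contents(path):
    """Return (name, kind, shape) for every object below the root group."""
    entries = []

    def visit(name, obj):
        if isinstance(obj, h5py.Dataset):
            entries.append((name, "dataset", obj.shape))
        else:
            entries.append((name, "group", None))

    with h5py.File(path, "r") as f:
        f.visititems(visit)  # walks the hierarchy recursively
    return entries


entries = list_contents(path)
for name, kind, shape in entries:
    print(name, kind, shape)
```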