What is Parquet?
Apache Parquet is a columnar storage format that organizes data by column rather than by row, making it highly efficient for analytical queries that touch only specific columns. Unlike CSV or JSON, which store data row by row, Parquet groups all values of a column together, enabling better compression and faster scans.
Used extensively in big data ecosystems (Spark, Hive, Presto, AWS Athena), Parquet typically achieves compression on the order of 10x compared to CSV and lets queries skip irrelevant columns entirely. It is the de facto standard for data lakes and analytics pipelines, supporting complex nested data structures and schema evolution.
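To make the column-oriented access concrete, here is a minimal sketch using pandas (backed by PyArrow); the DataFrame contents, file name, and column names are illustrative, not part of any standard example:

```python
import pandas as pd

# Illustrative data; the column names are made up for this example.
df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "country": ["US", "DE", "JP"],
    "revenue": [19.99, 5.00, 42.50],
})

# Write the whole table as Parquet (requires pyarrow or fastparquet).
df.to_parquet("events.parquet", index=False)

# Read back only the columns a query actually needs; the other columns
# are never deserialized, which is the core benefit of columnar storage.
subset = pd.read_parquet("events.parquet", columns=["country", "revenue"])
print(subset)
```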
History
Parquet was created by Twitter and Cloudera to bring the efficient columnar storage techniques described in Google's Dremel paper to the Hadoop ecosystem as an open standard.
Key Milestones
- 2013: Project initiated by Twitter and Cloudera; Parquet 1.0 released the same year
- 2015: Graduated to an Apache top-level project
- 2018: Wide cloud platform adoption
- Format 2.x: New encodings, data page format, and additional compression codecs
- Present: Industry standard for data lakes
Key Features
Core Capabilities
- Columnar Storage: Column-oriented layout
- Excellent Compression: typically around 10x smaller than CSV, sometimes far more for repetitive data
- Column Pruning: Read only needed columns
- Predicate Pushdown: Skip irrelevant data (see the sketch after this list)
- Schema Evolution: Add columns over time
- Nested Structures: Complex data types
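The pruning and pushdown capabilities above can be exercised directly from PyArrow; the sketch below assumes the illustrative events.parquet file from the earlier example:

```python
import pyarrow.parquet as pq

# Column pruning: only the listed columns are read from disk.
# Predicate pushdown: row groups whose footer statistics cannot satisfy
# the filter are skipped without being decoded.
table = pq.read_table(
    "events.parquet",
    columns=["country", "revenue"],       # prune to two columns
    filters=[("revenue", ">", 10.0)],     # push the predicate down
)
print(table.to_pandas())
```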
Common Use Cases
- Data Analytics: OLAP, business intelligence
- Data Lakes: S3, Azure Data Lake
- Machine Learning: Training data storage
- Big Data: Spark, Hive, Presto (see the sketch below)
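For the Spark use case, a minimal PySpark sketch (assuming a local Spark installation and the same illustrative events.parquet file; in a real data lake the path would usually be an s3:// or abfss:// URI):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

# Spark pushes the column selection and filter down into the Parquet scan.
events = spark.read.parquet("events.parquet")

result = (
    events
    .select("country", "revenue")
    .where(F.col("revenue") > 10.0)
    .groupBy("country")
    .agg(F.sum("revenue").alias("total_revenue"))
)
result.show()

spark.stop()
```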
Advantages
- Exceptional compression ratios
- Fast analytical queries
- Column-level operations
- Storage cost savings
- Wide ecosystem support
- Schema evolution support
- Cloud-native format
Disadvantages
- Not human-readable
- Slower for row-based operations
- Write overhead for small updates
- Requires specialized tools
- Not ideal for transactional data
- Learning curve
Technical Information
Format Specifications
| Specification | Details |
|---|---|
| File Extension | .parquet |
| MIME Type | application/vnd.apache.parquet |
| Format Type | Columnar data |
| Compression | Snappy, GZIP, Zstd, LZ4, Brotli, LZO |
| Schema | Self-describing |
| License | Apache 2.0 |
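The compression codec is chosen at write time, per file or even per column, rather than being fixed by the format; a short PyArrow sketch (table contents are illustrative) showing two of the options from the table above:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative table.
table = pa.table({
    "sensor_id": [1, 1, 2, 2],
    "reading": [0.1, 0.2, 0.1, 0.3],
})

# Snappy is the common default (fast, moderate ratio); Zstd usually
# trades a little CPU for noticeably smaller files.
pq.write_table(table, "readings_snappy.parquet", compression="snappy")
pq.write_table(table, "readings_zstd.parquet", compression="zstd")
```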
Common Tools
- Processing: Apache Spark, Apache Hive, Presto
- Cloud: AWS Athena, BigQuery, Snowflake
- Python: PyArrow, Pandas, Dask
- Viewers: parquet-tools, DBeaver
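Because Parquet files are self-describing, the schema and footer metadata can also be inspected directly from Python without a dedicated viewer; the file name below is carried over from the earlier examples:

```python
import pyarrow.parquet as pq

# The schema, row-group layout, and per-column statistics live in the
# file footer and can be read without scanning the data itself.
print(pq.read_schema("events.parquet"))

meta = pq.read_metadata("events.parquet")
print(meta.num_rows, meta.num_row_groups, meta.created_by)
```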