What is ORC?
ORC (Optimized Row Columnar) is a columnar storage format designed for Hadoop workloads. It stores data in columns rather than rows, so analytical queries can read only the columns they need. ORC includes built-in lightweight indexes with per-column min/max statistics, optional bloom filters, and aggressive compression (ZLIB, Snappy, LZO, and others). ORC files are split into stripes (64 MB by default) for parallel processing.
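To make the column-oriented layout concrete, here is a minimal sketch using pyarrow (assuming pyarrow with ORC support is installed; the file and column names are made up): it writes a small table to ORC and then reads back only two of its columns, so the remaining column never has to be decoded.

```python
# Minimal sketch of columnar reads with pyarrow; file and column names are hypothetical.
import pyarrow as pa
import pyarrow.orc as orc

# Build a small in-memory table and write it out as an ORC file.
table = pa.table({
    "user_id": [1, 2, 3],
    "country": ["US", "DE", "JP"],
    "revenue": [10.5, 7.25, 3.0],
})
orc.write_table(table, "events.orc")

# Column projection: only the requested columns are decoded from the file.
subset = orc.ORCFile("events.orc").read(columns=["country", "revenue"])
print(subset.to_pydict())
```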
ORC is widely used in Apache Hive data warehouses, Apache Spark analytics, Presto/Trino queries, and cloud data lakes (AWS, Azure, GCP). It is the default format for many Hive tables because of its strong performance on large datasets, and it typically outperforms Parquet for Hive-based workloads. Major technology companies (Facebook, Netflix, LinkedIn) use it for petabyte-scale data warehousing and analytics.
History
Facebook and Hortonworks developed ORC to overcome limitations of RCFile and provide optimized columnar storage for Hive and Hadoop analytics.
Key Milestones
- 2013: ORC introduced in Hive 0.11
- 2015: Became Apache top-level project
- 2016: ORC v1 specification finalized
- 2018: ACID transactions support
- 2020: Cloud data lake adoption
- Present: Hadoop ecosystem standard
Key Features
Core Capabilities
- Columnar Storage: Data laid out column by column within each stripe
- Compression: Files commonly many times smaller than the equivalent raw text
- Predicate Pushdown: Skip data that cannot match a query's filters (see the example after this list)
- Indexes: Min/max stats, bloom filters
- ACID: Transaction support
- Fast Queries: Optimized for analytics
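As a rough illustration of the compression, index, and predicate pushdown items above, here is a hedged PySpark sketch. Paths and column names are hypothetical; the writer options shown are the ones Spark's ORC data source documents, but verify them against your Spark version.

```python
# PySpark sketch: write ORC with compression and bloom filters, then read with a
# filter that ORC can push down to skip non-matching data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "US", 10.5), (2, "DE", 7.25), (3, "JP", 3.0)],
    ["user_id", "country", "revenue"],
)

# Compression codec and bloom-filter columns are passed as writer options.
(df.write
   .option("compression", "zlib")
   .option("orc.bloom.filter.columns", "country")
   .mode("overwrite")
   .orc("/tmp/events_orc"))

# The equality predicate can be checked against ORC's min/max statistics and
# bloom filters, so stripes that cannot contain 'DE' are skipped.
matches = spark.read.orc("/tmp/events_orc").filter("country = 'DE'")
matches.show()
```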
Common Use Cases
- Data Warehouses: Hive table storage
- Analytics: Spark/Presto queries
- Data Lakes: Cloud storage such as S3 and ADLS (see the sketch after this list)
- Big Data: Petabyte-scale storage
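For the data lake and analytics cases, a common pattern is to query ORC files directly where they sit in object storage. A sketch with PySpark follows; the bucket, path, and column names are hypothetical, and s3a access assumes hadoop-aws and credentials are already configured on the cluster.

```python
# Hypothetical sketch: aggregating ORC data stored in a cloud data lake.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orc-datalake").getOrCreate()

# Read ORC files directly from object storage (bucket and layout are made up).
events = spark.read.orc("s3a://example-datalake/warehouse/events/")

daily = (events
         .where(F.col("event_date") == "2024-01-01")  # hypothetical column
         .groupBy("country")
         .agg(F.sum("revenue").alias("total_revenue")))
daily.show()
```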
Advantages
- High compression ratios
- Fast analytical query performance
- Built-in indexes and statistics
- ACID transaction support
- Optimized for Hive
- Excellent predicate pushdown
- Production-proven at scale
Disadvantages
- Slower writes than row-oriented formats
- Less ecosystem support than Parquet
- Binary format (not human-readable; see the inspection sketch after this list)
- Primarily Hadoop-focused
- Requires schema knowledge
- Complex file format
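Because the format is binary, even peeking at a file's contents requires a reader library; the upside is that the schema travels with the data. A small inspection sketch with pyarrow (the file name is hypothetical):

```python
# Inspect an ORC file's embedded metadata with pyarrow.
import pyarrow.orc as orc

f = orc.ORCFile("events.orc")
print(f.schema)    # column names and types stored in the file footer
print(f.nrows)     # total row count from file-level metadata
print(f.nstripes)  # number of stripes in the file
```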
Technical Information
Format Specifications
| Specification | Details |
|---|---|
| File Extension | .orc |
| MIME Type | application/octet-stream |
| Storage | Columnar |
| Compression | ZLIB, Snappy, LZO, LZ4, ZSTD |
| Stripe Size | Default 64 MB |
| ACID | Full transaction support (via Hive transactional tables) |
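Several of the specifications above are writer-side settings. Here is a sketch of setting them from Python, assuming a recent pyarrow where orc.write_table exposes compression and stripe_size keyword arguments; treat the exact parameter names as assumptions to check against your installed version.

```python
# Sketch of tuning writer settings; verify the keyword arguments against your
# pyarrow version, as older releases may not accept them.
import pyarrow as pa
import pyarrow.orc as orc

table = pa.table({"user_id": list(range(1_000)), "country": ["US"] * 1_000})

orc.write_table(
    table,
    "tuned.orc",
    compression="ZLIB",            # one of the codecs listed in the table above
    stripe_size=64 * 1024 * 1024,  # 64 MB, the default stripe size
)
```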
Common Tools
- Hadoop: Apache Hive, Apache Spark, Presto/Trino
- Cloud: AWS Athena, Azure Synapse, Google BigQuery
- Tools: orc-tools CLI, Apache Drill
- Languages: Java, Python (pyorc), C++