What is ORC?

ORC (Optimized Row Columnar) is a self-describing, columnar storage format designed for Hadoop workloads. Because data is laid out by column rather than by row, analytical queries can read only the columns they need. ORC includes built-in indexes, optional bloom filters, per-column min/max statistics, and strong compression (ZLIB, Snappy, LZO, LZ4, ZSTD). ORC files are divided into stripes (64 MB by default) that can be processed in parallel.
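
As a minimal sketch of the access pattern this enables, the snippet below writes a small table to ORC and reads back a single column using pyarrow's pyarrow.orc module (available in recent releases; the file and column names here are illustrative):

    import pyarrow as pa
    import pyarrow.orc as orc

    # Build a small in-memory table; ORC stores each column contiguously.
    table = pa.table({
        "user_id": [1, 2, 3, 4],
        "country": ["US", "DE", "US", "JP"],
        "spend": [10.5, 3.2, 99.0, 7.7],
    })

    # Write the table out as an ORC file.
    orc.write_table(table, "events.orc")

    # A columnar read: only the 'spend' column is deserialized;
    # the bytes for the other columns are skipped.
    spend_only = orc.read_table("events.orc", columns=["spend"])
    print(spend_only)

Selecting columns at read time is where the columnar layout pays off: the reader touches only the byte ranges belonging to the requested columns.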

ORC is widely used in Apache Hive data warehouses, Apache Spark analytics, Presto/Trino queries, and cloud data lakes (AWS, Azure, GCP). It is the default format for many Hive tables because it performs well on large datasets, and it often outperforms Parquet in Hive-based workloads. Major tech companies, including Facebook, Netflix, and LinkedIn, have used it for petabyte-scale data warehousing and analytics.

Did you know? Depending on how compressible the data is, ORC files can be 10-100x smaller than the equivalent raw text files.

History

Facebook and Hortonworks developed ORC to overcome limitations of RCFile and provide optimized columnar storage for Hive and Hadoop analytics.

Key Milestones

  • 2013: ORC introduced in Hive 0.11
  • 2015: Became Apache top-level project
  • 2016: ORC v1 specification finalized
  • 2018: ACID transaction support
  • 2020: Cloud data lake adoption
  • Present: Hadoop ecosystem standard

Key Features

Core Capabilities

  • Columnar Storage: data laid out column by column
  • Compression: files often an order of magnitude smaller than raw text
  • Predicate Pushdown: stripes and row groups that cannot match a filter are skipped
  • Indexes: per-column min/max statistics and optional bloom filters (see the sketch after this list)
  • ACID: transaction support via Hive transactional tables
  • Fast Queries: optimized for read-heavy analytics
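
To make the index features concrete, here is a hedged sketch that writes an ORC file with a bloom filter on one column and then inspects file-level metadata with pyarrow; the bloom_filter_columns parameter and the property names follow recent pyarrow releases and may differ elsewhere:

    import pyarrow as pa
    import pyarrow.orc as orc

    table = pa.table({
        "user_id": list(range(10_000)),
        "country": ["US", "DE", "JP", "FR"] * 2_500,
    })

    # Ask the writer for a bloom filter on 'user_id' so point lookups
    # can rule out stripes without scanning them.
    orc.write_table(table, "indexed.orc", bloom_filter_columns=["user_id"])

    # Inspect the footer metadata that readers use for data skipping.
    f = orc.ORCFile("indexed.orc")
    print(f.nrows)        # total row count
    print(f.nstripes)     # number of stripes in the file
    print(f.compression)  # codec recorded in the file
    print(f.schema)       # the file's self-described schema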

Common Use Cases

  • Data Warehouses: Hive table storage
  • Analytics: Spark/Presto queries (see the PySpark sketch below)
  • Data Lakes: Cloud storage (S3, ADLS)
  • Big Data: Petabyte-scale storage
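
For the analytics case, a hedged PySpark sketch (the path, schema, and values are illustrative; the orc reader/writer methods are standard Spark API):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("orc-demo").getOrCreate()

    # Write a DataFrame as ORC; Spark manages stripes and compression.
    df = spark.createDataFrame(
        [(1, "US", 10.5), (2, "DE", 3.2), (3, "US", 99.0)],
        ["user_id", "country", "spend"],
    )
    df.write.mode("overwrite").orc("/tmp/events_orc")

    # Column pruning and the filter are pushed down into the ORC reader,
    # so non-matching stripes and unused columns are skipped.
    us_spend = (
        spark.read.orc("/tmp/events_orc")
        .where("country = 'US'")
        .select("spend")
    )
    us_spend.show()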

Advantages

  • Strong compression ratios
  • Fast analytical query performance
  • Built-in indexes and statistics
  • ACID transaction support (with Hive)
  • Tight Hive integration
  • Effective predicate pushdown
  • Proven in production at petabyte scale

Disadvantages

  • Slower writes than row-oriented formats
  • Smaller ecosystem than Parquet
  • Binary format (not human-readable)
  • Primarily Hadoop-focused
  • Schema must be defined up front
  • Relatively complex file format

Technical Information

Format Specifications

Specification     Details
File Extension    .orc
MIME Type         application/octet-stream
Storage           Columnar
Compression       ZLIB, Snappy, LZO, LZ4, ZSTD
Stripe Size       64 MB by default
ACID              Supported via Hive transactional tables
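
The writer-side knobs in this table can be set explicitly at write time; as a sketch, assuming recent pyarrow (parameter names differ in other writers):

    import pyarrow as pa
    import pyarrow.orc as orc

    table = pa.table({"id": list(range(1_000_000))})

    # Choose the codec and stripe size when writing.
    orc.write_table(
        table,
        "tuned.orc",
        compression="zstd",            # also: 'zlib', 'snappy', 'lz4', 'uncompressed'
        stripe_size=64 * 1024 * 1024,  # 64 MB, the format's usual default
    )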

Common Tools

  • Hadoop ecosystem: Apache Hive, Apache Spark, Presto/Trino
  • Cloud: AWS Athena, Azure Synapse, Google BigQuery
  • Tools: orc-tools CLI, Apache Drill
  • Languages: Java, Python (pyorc, pyarrow), C++ (see the pyorc sketch below)
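
Since pyorc is listed above, here is a minimal round trip with its Writer/Reader API, following the library's documented usage (the schema string and file name are illustrative):

    import pyorc

    # Write rows with an explicit ORC type description.
    with open("demo.orc", "wb") as data:
        with pyorc.Writer(data, "struct<name:string,score:double>") as writer:
            writer.write(("alice", 98.5))
            writer.write(("bob", 72.0))

    # Read the rows back as tuples.
    with open("demo.orc", "rb") as data:
        reader = pyorc.Reader(data)
        for row in reader:
            print(row)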