What is ORC?

ORC (Optimized Row Columnar) is a self-describing, columnar storage format designed for Hadoop workloads. Because data is laid out by column rather than by row, analytical queries can read only the columns they need. ORC includes built-in indexes, optional bloom filters, per-column min/max statistics, and strong compression (ZLIB, Snappy, LZO, LZ4, ZSTD). ORC files are divided into stripes (64 MB by default) that can be processed in parallel.
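
As a minimal sketch of the access pattern this enables, the snippet below writes a small table to ORC and reads back a single column using pyarrow's pyarrow.orc module (available in recent releases; the file and column names here are illustrative):

    import pyarrow as pa
    import pyarrow.orc as orc

    # Build a small in-memory table; ORC stores each column contiguously.
    table = pa.table({
        "user_id": [1, 2, 3, 4],
        "country": ["US", "DE", "US", "JP"],
        "spend": [10.5, 3.2, 99.0, 7.7],
    })

    # Write the table out as an ORC file.
    orc.write_table(table, "events.orc")

    # A columnar read: only the 'spend' column is deserialized;
    # the bytes for the other columns are skipped.
    spend_only = orc.read_table("events.orc", columns=["spend"])
    print(spend_only)

Selecting columns at read time is where the columnar layout pays off: the reader touches only the byte ranges belonging to the requested columns.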

ORC is widely used in Apache Hive data warehouses, Apache Spark analytics, Presto/Trino queries, and cloud data lakes (AWS, Azure, GCP). It is the default format for many Hive tables because it performs well on large datasets, and it often outperforms Parquet in Hive-based workloads. Major tech companies, including Facebook, Netflix, and LinkedIn, have used it for petabyte-scale data warehousing and analytics.

Did you know? Depending on how compressible the data is, ORC files can be 10-100x smaller than the equivalent raw text files.

History

Facebook and Hortonworks developed ORC to overcome limitations of RCFile and provide optimized columnar storage for Hive and Hadoop analytics.

Key Milestones

  • 2013: ORC introduced in Hive 0.11
  • 2015: Became Apache top-level project
  • 2016: ORC v1 specification finalized
  • 2018: ACID transaction support
  • 2020: Cloud data lake adoption
  • Present: Hadoop ecosystem standard

Key Features

Core Capabilities

  • Columnar Storage: data laid out column by column
  • Compression: files often an order of magnitude smaller than raw text
  • Predicate Pushdown: stripes and row groups that cannot match a filter are skipped
  • Indexes: per-column min/max statistics and optional bloom filters (see the sketch after this list)
  • ACID: transaction support via Hive transactional tables
  • Fast Queries: optimized for read-heavy analytics
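
To make the index features concrete, here is a hedged sketch that writes an ORC file with a bloom filter on one column and then inspects file-level metadata with pyarrow; the bloom_filter_columns parameter and the property names follow recent pyarrow releases and may differ elsewhere:

    import pyarrow as pa
    import pyarrow.orc as orc

    table = pa.table({
        "user_id": list(range(10_000)),
        "country": ["US", "DE", "JP", "FR"] * 2_500,
    })

    # Ask the writer for a bloom filter on 'user_id' so point lookups
    # can rule out stripes without scanning them.
    orc.write_table(table, "indexed.orc", bloom_filter_columns=["user_id"])

    # Inspect the footer metadata that readers use for data skipping.
    f = orc.ORCFile("indexed.orc")
    print(f.nrows)        # total row count
    print(f.nstripes)     # number of stripes in the file
    print(f.compression)  # codec recorded in the file
    print(f.schema)       # the file's self-described schema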

Common Use Cases

  • Data Warehouses: Hive table storage
  • Analytics: Spark/Presto queries (see the PySpark sketch below)
  • Data Lakes: Cloud storage (S3, ADLS)
  • Big Data: Petabyte-scale storage
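
For the analytics case, a hedged PySpark sketch (the path, schema, and values are illustrative; the orc reader/writer methods are standard Spark API):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("orc-demo").getOrCreate()

    # Write a DataFrame as ORC; Spark manages stripes and compression.
    df = spark.createDataFrame(
        [(1, "US", 10.5), (2, "DE", 3.2), (3, "US", 99.0)],
        ["user_id", "country", "spend"],
    )
    df.write.mode("overwrite").orc("/tmp/events_orc")

    # Column pruning and the filter are pushed down into the ORC reader,
    # so non-matching stripes and unused columns are skipped.
    us_spend = (
        spark.read.orc("/tmp/events_orc")
        .where("country = 'US'")
        .select("spend")
    )
    us_spend.show()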

Advantages

  • Strong compression ratios
  • Fast analytical query performance
  • Built-in indexes and statistics
  • ACID transaction support (with Hive)
  • Tight Hive integration
  • Effective predicate pushdown
  • Proven in production at petabyte scale

Disadvantages

  • Slower writes than row-oriented formats
  • Smaller ecosystem than Parquet
  • Binary format (not human-readable)
  • Primarily Hadoop-focused
  • Schema must be defined up front
  • Relatively complex file format

Technical Information

Format Specifications

Specification     Details
File Extension    .orc
MIME Type         application/octet-stream
Storage           Columnar
Compression       ZLIB, Snappy, LZO, LZ4, ZSTD
Stripe Size       64 MB by default
ACID              Supported via Hive transactional tables
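
The writer-side knobs in this table can be set explicitly at write time; as a sketch, assuming recent pyarrow (parameter names differ in other writers):

    import pyarrow as pa
    import pyarrow.orc as orc

    table = pa.table({"id": list(range(1_000_000))})

    # Choose the codec and stripe size when writing.
    orc.write_table(
        table,
        "tuned.orc",
        compression="zstd",            # also: 'zlib', 'snappy', 'lz4', 'uncompressed'
        stripe_size=64 * 1024 * 1024,  # 64 MB, the format's usual default
    )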

Common Tools

  • Hadoop ecosystem: Apache Hive, Apache Spark, Presto/Trino
  • Cloud: AWS Athena, Azure Synapse, Google BigQuery
  • Tools: orc-tools CLI, Apache Drill
  • Languages: Java, Python (pyorc, pyarrow), C++ (see the pyorc sketch below)
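
Since pyorc is listed above, here is a minimal round trip with its Writer/Reader API, following the library's documented usage (the schema string and file name are illustrative):

    import pyorc

    # Write rows with an explicit ORC type description.
    with open("demo.orc", "wb") as data:
        with pyorc.Writer(data, "struct<name:string,score:double>") as writer:
            writer.write(("alice", 98.5))
            writer.write(("bob", 72.0))

    # Read the rows back as tuples.
    with open("demo.orc", "rb") as data:
        reader = pyorc.Reader(data)
        for row in reader:
            print(row)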