
Stop Using CSVs for Storage – Here Are the Top 5 Alternatives


CSVs cost you time, disk space, and money. Here are five alternatives every data scientist must know.

Dario Radečić

Sep 20, 2021

7 min read

Photo by jaikishan patel on Unsplash

Everyone and their grandmother knows what a CSV file is. But is it the optimal way to store data? Heck no. It’s probably the worst storage format if you don’t plan to view or edit data on the fly.

If you’re storing large volumes of data, opting for CSVs will cost you both time and money.

Today you’ll learn about five CSV alternatives. Each provides an advantage, either in read/write time or in file size. Some are even better in all areas.

Let’s set up the environment before going over the file formats.


Getting started – Environment setup

You’ll need a couple of libraries to follow along. The best practice is to install them inside a virtual environment, so that’s exactly what you’ll do. The following code snippet creates a new virtual environment through Anaconda and installs every required library:

conda create --name file_formats python=3.8
conda activate file_formats
conda install -c conda-forge numpy pandas fastavro pyarrow feather-format jupyter jupyterlab

Once the installation finishes, you can execute the following command to start a JupyterLab session:

jupyter lab

The next step is to import the libraries and create an arbitrary dataset. You’ll make one with 5 columns and 10M rows:

import numpy as np
import pandas as pd
import feather
import pickle
import pyarrow as pa
import pyarrow.orc as orc
from fastavro import writer, reader, parse_schema

# Seed the generator so the dataset is reproducible
np.random.seed(42)

# 10M rows x 5 columns of random floats
df_size = 10_000_000
df = pd.DataFrame({
    'a': np.random.rand(df_size),
    'b': np.random.rand(df_size),
    'c': np.random.rand(df_size),
    'd': np.random.rand(df_size),
    'e': np.random.rand(df_size)
})
df.head()

Here’s how it looks:

Image 1 – Random dummy dataset (image by author)

You now have everything needed to start experimenting with different data formats. Let’s cover ORC first.


ORC

ORC stands for Optimized Row Columnar. It’s a data format optimized for reads and writes in Hive. As Hive is painfully slow, folks at Hortonworks developed the ORC file format to speed it up.

In Python, you can use the read_orc() function from Pandas to read ORC files. Unfortunately, Pandas has no counterpart for writing ORC files, so you’ll have to use PyArrow.

Here’s an example of writing Pandas DataFrames:

table = pa.Table.from_pandas(df, preserve_index=False)
orc.write_table(table, '10M.orc')

And here’s the command for reading ORC files:

df = pd.read_orc('10M.orc')
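The article doesn’t verify the round trip, but a quick sanity check is cheap. Here’s a minimal sketch, assuming the write snippet above has already produced 10M.orc:

# Read the file back and confirm it matches the original DataFrame
df_orc = pd.read_orc('10M.orc')
pd.testing.assert_frame_equal(df, df_orc)  # raises AssertionError on any mismatch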

You can learn more about ORC here:

Stop Using CSVs for Storage – This File Format is Faster and Lighter


Avro

Avro is an open-source project that provides data serialization and exchange services for Apache Hadoop. It stores a JSON-like schema alongside the data, so the correct data types are known in advance. That’s where the compression comes from: knowing the types up front allows a compact binary encoding.

Avro has an API for every major programming language, but it doesn’t support Pandas by default.

Here’s the set of commands for saving a Pandas DataFrame to an Avro file:

# 1. Define the schema
schema = {
    'doc': 'Float data',
    'name': 'Data',
    'namespace': 'data',
    'type': 'record',
    'fields': [
        {'name': 'a', 'type': 'float'},
        {'name': 'b', 'type': 'float'},
        {'name': 'c', 'type': 'float'},
        {'name': 'd', 'type': 'float'},
        {'name': 'e', 'type': 'float'},
    ]
}
parsed_schema = parse_schema(schema)

# 2. Convert pd.DataFrame to records - list of dictionaries
records = df.to_dict('records')

# 3. Write to Avro file
with open('10M.avro', 'wb') as out:
    writer(out, parsed_schema, records)
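One detail worth flagging: Avro’s 'float' is a 32-bit type, while the DataFrame holds 64-bit float64 columns, so this schema silently halves the precision (and, as you’ll see in the size comparison below, the file size). A minimal variant of the field list that preserves full precision would declare 'double' instead:

# Avro's 64-bit 'double' type keeps the original float64 precision
fields_double = [{'name': col, 'type': 'double'} for col in ['a', 'b', 'c', 'd', 'e']]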

Reading Avro files is no picnic either:

# 1. List to store the records
avro_records = []

# 2. Read the Avro file
with open('10M.avro', 'rb') as fo:
    avro_reader = reader(fo)
    for record in avro_reader:
        avro_records.append(record)

# 3. Convert to pd.DataFrame
df = pd.DataFrame(avro_records)

You can learn more about Avro here:

CSV Files for Storage? Absolutely Not. Use Apache Avro Instead


Parquet

Apache Parquet is a data storage format designed for efficiency. Its columnar storage architecture lets you quickly skip data that isn’t relevant, so both queries and aggregations run faster, which translates into hardware savings.

The best news is – Pandas has full support for Parquet files.

Here’s the command for writing a Pandas DataFrame to a Parquet file:

df.to_parquet('10M.parquet')

And here’s the equivalent for reading:

df = pd.read_parquet('10M.parquet')
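Pandas’ to_parquet() also exposes engine and compression parameters. A minimal sketch of the knobs you can turn (the specific codecs shown here are illustrative, not what the benchmark below used):

# Explicit engine and codec; snappy is pyarrow's default
df.to_parquet('10M_snappy.parquet', engine='pyarrow', compression='snappy')

# Heavier compression trades write speed for a smaller file
df.to_parquet('10M_gzip.parquet', engine='pyarrow', compression='gzip')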

You can learn more about Parquet here:

CSV Files for Storage? No Thanks. There’s a Better Option


Pickle

You can use the pickle module to serialize objects and save them to a file. Likewise, you can deserialize the file to load the objects back when needed. Pickle has one major advantage over the other formats: you can use it to store any Python object. One of its most widely used applications is saving machine learning models after training completes.

The biggest downside is that Pickle is Python-specific, so cross-language support isn’t guaranteed. It could be a deal-breaker for any project requiring data communication between Python and R, for example.

Here’s how to write a Pandas DataFrame to a Pickle file:

with open('10M.pkl', 'wb') as f:
    pickle.dump(df, f)

You’ll only have to change the file mode when reading a Pickle file:

with open('10M.pkl', 'rb') as f:
    df = pickle.load(f)
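If you’d rather skip the manual file handling, Pandas wraps the same mechanism in one-liners equivalent to the snippets above:

# Pandas' built-in pickle wrappers
df.to_pickle('10M.pkl')
df = pd.read_pickle('10M.pkl')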

You can learn more about Pickle here:

Stop Using CSVs for Storage – Pickle is an 80 Times Faster Alternative


Feather

Feather is a data format for storing data frames. It’s designed around a simple premise – to push data frames in and out of memory as efficiently as possible. It was initially designed for fast communication between Python and R, but you’re not limited to this use case.

You can use the feather library to work with Feather files in Python. It’s currently the fastest option available.

Here’s the command for saving Pandas DataFrames to a Feather file:

feather.write_dataframe(df, '10M.feather')

And here’s the command for reading:

df = feather.read_dataframe('10M.feather')
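Pandas also ships Feather support (backed by pyarrow), so you can stay within a single API. These calls are equivalent to the feather-format ones above:

# Pandas' built-in Feather wrappers
df.to_feather('10M.feather')
df = pd.read_feather('10M.feather')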

You can learn more about Feather here:

Stop Using CSVs for Storage – This File Format Is 150 Times Faster


Comparison time – Which data format should you use?

Many highly optimized file formats are useless if you need to change or even view the data on the fly. If that’s not the case, you should generally avoid CSVs.

Below is a write-time comparison between CSV and every other data format mentioned. The goal was to save the previously created 10Mx5 dataset locally:

Image 2 – Write time comparison in seconds (CSV: 34.7; ORC: 9.66; Avro: 9.58; Parquet: 2.06; Pickle: 0.5; Feather: 0.304) (image by author)
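The article doesn’t show the timing harness, so here’s a minimal sketch of how you could reproduce the write benchmark yourself (single runs are noisy; average several for stable numbers):

import time

def timed(label, fn):
    """Run fn once and print the elapsed wall-clock time."""
    start = time.perf_counter()
    fn()
    print(f'{label}: {time.perf_counter() - start:.2f} s')

timed('CSV', lambda: df.to_csv('10M.csv', index=False))
timed('Parquet', lambda: df.to_parquet('10M.parquet'))
timed('Feather', lambda: feather.write_dataframe(df, '10M.feather'))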

The differences are astronomical. Feather is roughly 115 times faster than CSV at storing an identical dataset. Even if you opt for something more widely compatible, such as Parquet, writes are still about 17 times faster.

Let’s talk about the read times next. The goal is to compare how much time it takes to read identical datasets in different formats:

Image 3 – Read time comparison in seconds (CSV: 3.83; ORC: 3.02; Avro: 27.6; Parquet: 1.23; Pickle: 0.193; Feather: 1.16) (image by author)

CSVs aren’t so terrible here. Apache Avro is the absolute worst due to the record-by-record parsing it requires. Pickle is the fastest, so it looks like the most promising option if you’re working only in Python.

And finally, let’s compare the file sizes on the disk:

Image 4 – File size comparison in MB (CSV: 963.5; ORC: 379.5; Avro: 200.3; Parquet: 401.4; Pickle: 400; Feather: 400.1) (image by author)

Things don’t look good for CSVs. The file size reduction ranges from 2.4x to 4.8x, depending on the format. Note that Avro’s lead comes largely from the schema declaring 32-bit 'float' values, which roughly halves the storage compared to the float64 columns the other formats preserve.
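To check the sizes on your own machine, a minimal sketch (assuming all the files were written to the current directory):

import os

# Report each file's size in MB (1 MB = 10**6 bytes, matching the chart)
for fname in ['10M.csv', '10M.orc', '10M.avro', '10M.parquet', '10M.pkl', '10M.feather']:
    print(f'{fname}: {os.path.getsize(fname) / 1e6:.1f} MB')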

To summarize, if you store gigabytes of data daily, choosing the correct file format is crucial. If you’re working only in Python, you can’t go wrong with Pickle. If you need something a bit more versatile, go with any other mentioned format.


What are your thoughts on these CSV alternatives? Which one(s) do you use when viewing and editing data on the fly isn’t required? Let me know in the comments below.
