In the last article, I talked about our process of finding the best way to load SSTable data onto the GPU for data analytics. We looked at various ways to convert Cassandra data into a format usable by RAPIDS and decided to create sstable-to-arrow, a custom implementation that parses SSTables and writes them out in the Arrow format. In this post, we will take a closer look at sstable-to-arrow: its capabilities, its limitations, and how to use it in analytical use cases.
sstable-to-arrow is written in C++17. It uses Kaitai Struct, a library that lets us specify the layout of SSTable files in a declarative fashion. The Kaitai Struct compiler then compiles these declarations into C++ classes that can be included in the source code to parse SSTables into in-memory data. sstable-to-arrow then converts each column of the table into an Arrow array. The resulting Arrow data can be sent to any client, where it can be converted to a cuDF and used for GPU analysis.
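To give a feel for what "declarative" means here, a Kaitai Struct declaration is a YAML document describing the binary layout. The fragment below is a simplified, hypothetical illustration of the format (field names and structure are invented for the example, not taken from the project's actual .ksy files):

```yaml
# Illustrative Kaitai Struct (.ksy) sketch, not the project's real spec
meta:
  id: sstable_data
  endian: be          # SSTable integers are big-endian
seq:
  - id: partitions
    type: partition
    repeat: eos       # read partitions until end of stream
types:
  partition:
    seq:
      - id: key_length
        type: u2      # 2-byte unsigned length prefix
      - id: key
        size: key_length
```

The Kaitai Struct compiler turns a declaration like this into a generated C++ parser class, which is what sstable-to-arrow includes in its build.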
sstable-to-arrow can only read one SSTable at a time. To process multiple SSTables, the client must create a cuDF from each SSTable and merge them on the GPU using last-write-wins semantics.
sstable-to-arrow exposes Cassandra's internal timestamps and tombstone markers so that this merge can be performed at the cuDF layer.
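To make the merge semantics concrete, here is a minimal CPU-side sketch of last-write-wins merging in plain Python. The real pipeline would do this with cuDF on the GPU, and the row layout here is a simplification: each "SSTable" is just a list of (key, value, timestamp) rows, with a tombstone modeled as value=None.

```python
# Minimal sketch of last-write-wins merging across multiple SSTables.
# The row with the highest write timestamp for a key wins; a newer
# tombstone (value=None) deletes the row.

def merge_last_write_wins(*sstables):
    latest = {}  # key -> (timestamp, value)
    for sstable in sstables:
        for key, value, ts in sstable:
            if key not in latest or ts > latest[key][0]:
                latest[key] = (ts, value)
    # Drop keys whose winning entry is a tombstone.
    return {k: v for k, (ts, v) in latest.items() if v is not None}

older = [("a", 1, 100), ("b", 2, 100)]
newer = [("a", 10, 200), ("b", None, 300)]  # "b" deleted by a newer tombstone
print(merge_last_write_wins(older, newer))  # {'a': 10}
```

This is why the exposed timestamps and tombstones matter: without them, the client would have no way to decide which copy of a row is current.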
Some metadata, including the names of the partition key and clustering columns, cannot be inferred from SSTable files at all, because the schema is stored separately in Cassandra's system tables.
Cassandra buffers writes in memtables and commitlogs before flushing them to SSTables, so analytics performed using only sstable-to-arrow may be stale rather than real-time.
Currently, the parser only supports SSTables written by Cassandra OSS 3.11.
The system currently scans entire SSTables rather than reading specific partitions; supporting predicate pushdown would require additional work.
The following CQL types are not supported: counter, frozen, and user-defined types.
varints can only store up to 8 bytes. Attempting to read a table containing a larger varint will crash.
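A quick Python illustration (not code from the project) of why an 8-byte limit matters: any value that needs more than 8 bytes cannot fit in a 64-bit integer, which is the widest native integer type available here.

```python
# A signed 64-bit integer holds at most 8 bytes of varint payload.
INT64_MAX = 2**63 - 1

def bytes_needed(n):
    # Bytes required to represent a non-negative n as a signed
    # two's-complement integer (one extra bit for the sign).
    return (n.bit_length() + 8) // 8

small = 10**18   # fits in 8 bytes, below INT64_MAX
large = 10**30   # needs more than 8 bytes, overflows int64

assert bytes_needed(small) <= 8 and small <= INT64_MAX
assert bytes_needed(large) > 8 and large > INT64_MAX
```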
The parser can only read tables with up to 64 columns.
The parser loads each SSTable into memory, so currently it cannot handle large SSTables that exceed the machine's memory capacity.
Decimals are converted to 8-byte floating-point values, because neither C++ nor Arrow has native support for arbitrary-precision integers or decimals comparable to Java's BigInteger and BigDecimal classes. This means that operations on decimal columns will use floating-point arithmetic, which may be imprecise.
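The practical effect is easy to demonstrate in Python (illustration only): sums that are exact under decimal arithmetic become only approximate once the values are 8-byte floats.

```python
from decimal import Decimal

# Exact under decimal arithmetic:
assert Decimal("0.1") + Decimal("0.2") == Decimal("0.3")

# Approximate under 8-byte (double) floating point, which is what a
# decimal column becomes after conversion:
assert 0.1 + 0.2 != 0.3
assert abs((0.1 + 0.2) - 0.3) < 1e-15  # close, but not exact
```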
Sets are treated as lists because Arrow has no equivalent for sets.
Roadmap and future developments
The ultimate goal of this project is to include some form of SSTable-reading functionality in the RAPIDS ecosystem, similar to cudf.read_csv. Performance is also an evolving area, and I'm currently investigating how to parse SSTables with more parallelism to get the most out of the GPU. I'm also working on addressing the limitations above, in particular expanding support for more CQL types and enabling the program to handle large datasets.
How to use sstable-to-arrow
You can run sstable-to-arrow with Docker.
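For example (the image name is an assumption based on the Docker Hub alpha release mentioned at the end of this post, and may change between releases):

```bash
# Run the sstable-to-arrow server, publishing its port to the host.
# The image name is an assumption, not taken from the article.
docker run --rm -it -p 9143:9143 datastaxlabs/sstable-to-arrow
```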
This will listen for connections on port 9143. It waits for the client to send a message first, and then sends the data in the following format:
The number of Arrow tables being transferred, as an 8-byte big-endian unsigned integer.
Then, for each table:
Its size in bytes, as an 8-byte big-endian unsigned integer.
The contents of the table, in the Arrow IPC stream format.
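The framing above can be parsed with a few lines of standard-library Python. The sketch below simulates the byte stream in memory; in a real client you would read from a socket connected to port 9143 and decode each frame's Arrow IPC bytes (e.g. with pyarrow's IPC stream reader, which is an assumption here, not part of the article).

```python
import io
import struct

def read_frames(stream):
    """Parse the sstable-to-arrow framing: an 8-byte big-endian table
    count, then, per table, an 8-byte big-endian size followed by that
    many bytes of Arrow IPC stream data."""
    (num_tables,) = struct.unpack(">Q", stream.read(8))
    frames = []
    for _ in range(num_tables):
        (size,) = struct.unpack(">Q", stream.read(8))
        frames.append(stream.read(size))  # raw Arrow IPC stream bytes
    return frames

# Simulate a server payload containing two dummy "tables":
payload = struct.pack(">Q", 2)
for body in (b"table-one", b"table-two!"):
    payload += struct.pack(">Q", len(body)) + body

frames = read_frames(io.BytesIO(payload))
print([len(f) for f in frames])  # [9, 10]
```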
Then you can use any client to read the data from that port. To get started with the sample Python client on a system without CUDA support, follow these steps:
If your system supports CUDA, it is recommended to create a conda environment with the following command instead. You will also need to pass the -x flag when starting the sstable-to-arrow server to convert all types not supported by cuDF into hexadecimal strings.
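A RAPIDS conda environment along those lines might look like the following (the channel names follow the RAPIDS install documentation, but the package versions are assumptions, not taken from the article):

```bash
# Create an environment with cuDF from the RAPIDS channels.
# Version pins are illustrative only.
conda create -n rapids -c rapidsai -c nvidia -c conda-forge \
    cudf python=3.8 cudatoolkit=11.2
conda activate rapids
```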
To experiment with other datasets, you will need the raw SSTable files on your machine. You can download sample IoT data from this Google Drive folder, generate IoT data using the generate-data script in the repository, or create tables manually using CQL and the Cassandra Docker image (see the Cassandra Quick Start for more information). Make sure to use a Docker volume to share the SSTable files with the container:
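For example, mounting a local directory of SSTable files into the container might look like this (the image name, mount paths, and argument convention are assumptions for illustration):

```bash
# Mount the local SSTable directory into the container and point
# sstable-to-arrow at it. Paths and image name are placeholders.
docker run --rm -it -p 9143:9143 \
    -v /path/to/sstable-dir:/data \
    datastaxlabs/sstable-to-arrow /data
```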
You can also pass the -h flag to get information about other options. If you wish to build the project from source, follow the steps in the GitHub repository.
SSTable to Parquet
sstable-to-arrow can also save SSTable data as Parquet files, a common format for storing columnar data. Note that this mode does not support deduplication yet; it simply writes the SSTable and all of its metadata to the given Parquet file.
You can enable this by passing the -p flag, followed by the path where you want to store the Parquet file:
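For example (the image name and paths are placeholders; only the -p flag itself comes from the article):

```bash
# Write the SSTable out as Parquet instead of serving Arrow data.
# Image name and paths are assumptions for illustration.
docker run --rm -it \
    -v /path/to/sstable-dir:/data \
    -v /path/to/output:/out \
    datastaxlabs/sstable-to-arrow -p /out/data.parquet /data
```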
sstable-to-arrow is an early but promising approach to making Cassandra data available for GPU-based analytics. The project is available on GitHub and can be accessed as an alpha release via Docker Hub.
About the author
Alex Cai is a 2021 intern at DataStax and a member of the Harvard Class of 2025. He is passionate about computers, software, and cognitive science, and in his spare time he enjoys reading, studying linguistics, and playing with his cat.
Reviewing Editor: Guo Ting