Data Structure Design in CSV.jl
I'm currently working through a refactor of CSV.jl internals and once again am faced with some tricky questions surrounding which data structures to use. In csv reading, like many things, there are trade-offs between performance/efficiency and convenience: we want to read delimited files as fast as possible, but we also want the most natural, standard data structures. Some of the complexity in these decisions comes from the csv spec (or lack thereof!) itself, but to fully understand that, let's talk through the primary task of csv reading.
The aim of CSV.jl is to take plain text data--columns separated by a delimiter, and rows separated by newlines--parse the text values into native typed values, and output the data in columnar data structures, like a native Julia `Vector`. Unfortunately, csv files don't contain any metadata to tell us how many rows or columns a file has, what data types the columns will be, or whether they will contain null values. This lack of metadata means we have to use tricks like guessing the # of rows in a file, "detecting" data types, and being flexible in case we detect one type and need to "promote" to a different data type later. The complexities are compounded when considering multi-threaded parsing, because you either have to play the synchronization dance between threads, or give each thread a chunk of the file and "merge" their results afterwards (the current CSV.jl approach).
Given these processing considerations, here are some of the data structure questions I'm currently pondering:
`Vector{Union{T, Missing}}` vs. `SentinelVector{T}`: Base Julia includes an optimization that allows `Array`s with isbits Union elements to be stored inline, which means they can be extremely efficient when working with, for example, `Union{Float64, Missing}`. I've been prototyping a new SentinelArrays.jl package that allows wrapping any `AbstractArray` and specifying a special "sentinel" value of the array's element type that should return a different "special value", like `missing`. For `Float64`, for example, we can use a non-standard `NaN` bit pattern as a sentinel for `missing`, and this works quite well: we're basically working with a plain `Vector{Float64}` in terms of storage, but get the advantage of representing `Union{Float64, Missing}` through our sentinel.
The pros of `Vector{Union{T, Missing}}` are that it's built into Base Julia, standard, and much of the data ecosystem has become familiar with the type and integrated it well. The cons are that it takes up a little extra space (1 extra byte per element, signaling whether an element holds a `Float64` or a `missing`), and that there currently isn't a way to "truncate" the array, like `convert(Vector{Float64}, A::Vector{Union{Float64, Missing}})`, without copying data. This is desirable because while parsing, we need to assume `Union{T, Missing}` in case missing values are encountered, but we might finish parsing and know that there were in fact none, in which case we'd like to just return a `Vector{Float64}`. With a `SentinelVector`, this is trivial: given no sentinels were used while parsing, we can just "unwrap" the underlying array and return that. The disadvantage of SentinelArrays is just that they're non-standard, though the Julia ecosystem has evolved pretty well to rely on the general `AbstractArray` interface instead of needing specific array types.
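To make the sentinel idea concrete, here's a minimal sketch (hypothetical; the actual SentinelArrays.jl API and internals differ). It reserves one non-standard `NaN` bit pattern in a plain `Vector{Float64}` to stand in for `missing`, and shows why "unwrapping" back to `Vector{Float64}` is free:

```julia
# A toy sentinel vector: storage is a plain Vector{Float64}, but one
# reserved NaN bit pattern is reported as `missing` on access.
struct SentinelVec <: AbstractVector{Union{Float64, Missing}}
    data::Vector{Float64}
    sentinel::UInt64   # raw bit pattern reserved to mean `missing`
end

Base.size(v::SentinelVec) = size(v.data)

function Base.getindex(v::SentinelVec, i::Int)
    x = v.data[i]
    # compare raw bits: NaNs with different payloads are distinct here,
    # so ordinary NaN data values don't collide with the sentinel
    reinterpret(UInt64, x) == v.sentinel ? missing : x
end

# if parsing finished without ever writing a sentinel, "truncating" to
# Vector{Float64} is just returning the backing array -- no copy needed
unwrap(v::SentinelVec) = v.data
```

A sentinel like `0x7ffa_aaaa_aaaa_aaaa` is one choice of quiet-NaN payload that ordinary floating-point computation is very unlikely to produce.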
The 2nd question is: how to return/represent String columns? String columns are a bit of an odd duck with respect to parsing because you're not really "parsing" anything, just noting the start/end positions of each cell in the csv file. And indeed, one representation of a String column is just that: a custom array type that holds on to the original file byte buffer, where each "element" is just the byte position and length of a cell. Upon indexing, the `String` can be fully materialized, but otherwise this lazy-materializing structure provides a lot of efficiency. The disadvantage of this approach is needing to hold on to the original file buffer, which has caused confusion for users in the past when they try to modify/delete the file after parsing and get errors saying the file is still in use. Another disadvantage is trying to support the full `AbstractArray` interface with this lazy structure, particularly with regards to mutating operations (`push!`, `append!`, etc.). The WeakRefStrings.jl package provides a structure that can mostly be used for this kind of task, but there are a few internal mismatches with how the positions/lengths are represented. The alternative of just materializing a full `Vector{String}` has the advantage of being extremely standard and easy to work with, but is expensive: each string cell in the file must be copied/allocated, even if it never ends up getting used.
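A rough sketch of the lazy representation might look like the following (hypothetical names and layout; WeakRefStrings.jl and CSV.jl represent positions/lengths differently). Each element is a (position, length) pair into the original file buffer, and a `String` is only allocated when a cell is actually indexed:

```julia
# A toy lazy string column: keeps the file bytes alive and materializes
# a String per cell only on demand.
struct LazyStringVec <: AbstractVector{String}
    buf::Vector{UInt8}                # original file bytes, kept alive
    poslens::Vector{Tuple{Int, Int}}  # (start index, length) per cell
end

Base.size(v::LazyStringVec) = size(v.poslens)

function Base.getindex(v::LazyStringVec, i::Int)
    pos, len = v.poslens[i]
    # the allocation/copy happens here, not at parse time
    String(v.buf[pos:pos + len - 1])
end

# full materialization: trades memory for independence from `buf`,
# after which the original file can be safely modified or deleted
materialize(v::LazyStringVec) = String[v[i] for i in eachindex(v)]
```

The "file still in use" confusion mentioned above comes directly from `buf`: as long as the lazy column is reachable, the file buffer backing it must stay alive.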
The 3rd data structure question involves multi-threaded parsing: how to best return/represent columns when multiple threads are involved? As noted earlier, CSV.jl currently chunks up large files and lets each thread process a chunk separately: detecting types, guessing rows, the full parsing task. After each thread has finished, a "merge" operation is performed where columns are promoted together and recoded to ensure each has a consistent type. CSV.jl currently defines its own `CSV.Column2` type (admittedly not the greatest name 😜) that "chains" each thread's column together into a single "column". This seems to work pretty well in practice, with iteration still being extremely efficient, but there's a gotcha if you try to do:
```julia
for i = 1:length(column)
    x = column[i]
    # do stuff with x
end
```
That is, linear `getindex` operations are slower (`O(log N)` slow) because each lookup has to determine which underlying chained array `i` belongs to.
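A sketch of the chaining idea (hypothetical; not the actual `CSV.Column2` implementation) shows where that `O(log N)` comes from: storing cumulative chunk lengths lets `getindex` find the right chunk with a binary search:

```julia
# A toy chained vector: several per-thread chunks presented as one column.
struct ChainedVec{T} <: AbstractVector{T}
    chunks::Vector{Vector{T}}
    offsets::Vector{Int}   # cumulative lengths, e.g. [3, 7, 9]
end

ChainedVec(chunks::Vector{Vector{T}}) where {T} =
    ChainedVec{T}(chunks, cumsum(length.(chunks)))

Base.size(v::ChainedVec) = (isempty(v.offsets) ? 0 : v.offsets[end],)

function Base.getindex(v::ChainedVec, i::Int)
    # binary search over cumulative lengths: the log-time step that makes
    # random access slower than indexing a plain Vector
    j = searchsortedfirst(v.offsets, i)
    prev = j == 1 ? 0 : v.offsets[j - 1]
    v.chunks[j][i - prev]
end
```

Iterating chunk-by-chunk (or via `eachindex`-style iteration that remembers its position) avoids the repeated search, which is why plain iteration stays fast while `column[i]` in a `1:length(column)` loop pays the logarithmic cost each time.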
The most obvious alternative is to just `vcat` all the thread columns together, but my worry there is that we'd essentially double the memory required for parsing, if only for the brief operation of appending the columns together.
Anyway, these are some of the questions I've been sleeping on for the past few days; I don't have firm commitments to one solution over another, but with the other internal refactoring, I thought it would be good to think through the ideals and see if we can improve things somehow. My ulterior motive in writing this all up in a blog post is to hopefully elicit some discussion/ideas around ways we can accomplish some of the things I've discussed here. As always, feel free to ping me on Twitter or in the #data channel of the julialang Slack to chat.