Decoding a Hosted CSV file in Elixir

Peter Brown
Published in Agilion · Apr 21, 2017

I’m currently working on a project that requires us to migrate a legacy database to a Heroku Postgres database using large CSV files. Due to the size of the files and Heroku’s limitations, we cannot read the files from the local file system.

When we started to think about ways to import the data, my first thought was to host the files on S3 and just run a mix task that streamed the file to a CSV decoder. It turns out that’s not as straightforward as I was hoping.

Our requirements:

  1. CSV files must be hosted remotely and accessible over the web to Heroku.
  2. CSV files may be very large, so they cannot be downloaded to the Heroku filesystem or read into memory.

Streaming CSV Data

Normally if I was importing CSV data in Elixir, I would use a combo of File.stream!/3 and CSV.decode/2 to do something like this:
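
A sketch of that combo, assuming the csv hex package and a placeholder file path:

"data.csv"
|> File.stream!()
|> CSV.decode()
|> Enum.each(&IO.inspect/1)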

File.stream! is awesome except it only works with local files. Since we can’t use the local filesystem to store these CSV files, I started researching how I could stream a remote file to the CSV decoder.

I wanted to be able to pass a URL to a function and have it behave exactly like File.stream!.

In other words, I wanted something like this:
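
Something like this, where RemoteCSV.stream/1 is just a hypothetical name for the function we are after:

# RemoteCSV.stream/1 doesn't exist yet; the URL is a placeholder
"https://example-bucket.s3.amazonaws.com/data.csv"
|> RemoteCSV.stream()
|> CSV.decode()
|> Enum.each(&IO.inspect/1)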

Streaming Resources

I stumbled across this outdated streaming example from Avdi Grimm that used Stream.resource/3.

Stream.resource is a bit of a Swiss army knife for creating streams in that it allows you to emit a sequence of values from any resource and do whatever you want with those values. It feels very similar to doing a map reduce where you use an accumulator to pass state from call to call.

Stream.resource takes three functions as arguments. The first function is called once at the beginning, the second function is called recursively with an accumulator until it returns {:halt, <any_value>}, and the third function is called once at the end, receiving <any_value>.

The example provided by the Elixir docs shows how you might use this to stream from a file:
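
Stream.resource(
  fn -> File.open!("sample") end,
  fn file ->
    case IO.read(file, :line) do
      # emit the line (wrapped in a list) and pass the file handle along
      data when is_binary(data) -> {[data], file}
      # :eof (or an error) halts the stream; the file goes to the cleanup function
      _ -> {:halt, file}
    end
  end,
  fn file -> File.close(file) end
)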

Stream.resource/3 example from https://hexdocs.pm/elixir/File.html#stream!/3

In the example above, the first (“start”) function opens the file and returns its IO device. The second function reads a single line from the file and returns a tuple containing the line to be emitted (wrapped in a list) and the accumulator (the IO device) that gets passed to the next call. When it reaches the end of the file, it returns {:halt, file}, passing file to the third function, which closes the file.

Reading the CSV

With everything above in mind, we need to be able to replace File.open!/1 with something that supports a hosted file that can be read over HTTP(S). Avdi’s example used Hackney, and after looking at some alternatives like HTTPoison’s async functionality, Hackney seemed like the simplest approach.

Hackney has a poorly documented :hackney.stream_body/1 function that allows us to stream chunks from a remote HTTP resource. It works in a similar way to IO.read/2, but instead of reading one line at a time, it reads chunks of data like Stream.chunk/4.

In order to work with stream_body, we need to first open a connection to a remote file with :hackney.get() and pass a reference to that connection to stream_body.

Here’s a simplified example of using get() with stream_body() to stream a single chunk:
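
A sketch with a placeholder URL (the empty list and empty binary are hackney’s headers and payload arguments):

{:ok, _status, _headers, ref} = :hackney.get("https://example.com/data.csv", [], "", [])

# each call to stream_body/1 returns the next chunk of the response body
{:ok, chunk} = :hackney.stream_body(ref)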

One problem with using stream_body() is that the data streamed with each call is unlikely to end in a complete CSV row. Since the chunking is based on byte-size, most times it will return a few rows and end with a partial row. This is a problem because the CSV decoder requires each value streamed to it to be a complete CSV row.

To get around this problem, I am splitting out the last line of the data and passing that to the next function call as a “partial row”.

So, for example, if we had a chunk that looked like this:

data = "1,2,3,4\n5,6,7,8\n9,10"

I’m splitting apart the complete rows from the partial row, emitting just the complete rows, and sending the partial back with the accumulator:

rows = ["1,2,3,4", "5,6,7,8"]
partial_row = "9,10"

The partial row is passed along with the accumulator to the next call so it can be prepended to the next chunk, like so:

next_chunk = ",11,12\n13,14,15,16\n17"
partial_row = "9,10"
data = "9,10,11,12\n13,14,15,16\n17"

Rinse, repeat.
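
The split itself is small. Here’s a sketch (split_rows is my own name for the helper, and it assumes rows are newline-delimited):

split_rows = fn data ->
  # everything before the last newline is complete; the tail is the partial row
  [partial | rows] = data |> String.split("\n") |> Enum.reverse()
  {Enum.reverse(rows), partial}
end

split_rows.("1,2,3,4\n5,6,7,8\n9,10")
#=> {["1,2,3,4", "5,6,7,8"], "9,10"}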

All together now

The following is all the code I needed for streaming large CSV files. It will let us host the files on S3 or some other hosting provider and read them with an Elixir script running on Heroku.
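
Here’s a sketch of the whole thing put together. The module and function names (RemoteCSV, begin_download, and so on) are my own, and it assumes the hackney and csv hex packages:

defmodule RemoteCSV do
  # Streams a hosted file over HTTP(S), emitting one complete row at a time
  def stream(url) do
    Stream.resource(
      fn -> begin_download(url) end,
      &continue_download/1,
      &finish_download/1
    )
  end

  # "start": open the request; the accumulator is {request_ref, partial_row}
  defp begin_download(url) do
    {:ok, _status, _headers, ref} = :hackney.get(url)
    {ref, ""}
  end

  # the body was exhausted on the previous call, so halt
  defp continue_download({ref, :done}), do: {:halt, {ref, :done}}

  defp continue_download({ref, partial}) do
    case :hackney.stream_body(ref) do
      {:ok, chunk} ->
        # prepend the carried-over partial row, emit the complete rows,
        # and carry the new trailing partial into the next call
        {rows, new_partial} = split_rows(partial <> chunk)
        {rows, {ref, new_partial}}

      :done when partial == "" ->
        {:halt, {ref, :done}}

      :done ->
        # emit whatever is left as the final row, then halt on the next call
        {[partial], {ref, :done}}
    end
  end

  # "after": close the connection
  defp finish_download({ref, _partial}), do: :hackney.close(ref)

  # split a binary into complete rows plus the trailing partial row
  defp split_rows(data) do
    [partial | rows] = data |> String.split("\n") |> Enum.reverse()
    {Enum.reverse(rows), partial}
  end
end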

I haven’t worked with files very much in Elixir so this was a fun exercise. I experimented with reading large files to see how far I could push it.

Here’s an example that takes a 300MB CSV file and grabs just the first 30 rows. It’s fun to watch because it’s fast and uses almost no memory: it streams only what it needs and then stops. There’s no need to download the whole file:
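
# uses the RemoteCSV sketch above; the URL is a placeholder
"https://example-bucket.s3.amazonaws.com/huge.csv"
|> RemoteCSV.stream()
|> CSV.decode()
|> Enum.take(30)
|> IO.inspect()

Enum.take/2 halts the stream after 30 rows, which runs Stream.resource’s cleanup function and closes the hackney connection, so the rest of the file is never fetched.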
