Column safety #144

@maxigit

Is your feature request related to a problem? Please describe.
At the moment, the type of a DataFrame doesn't provide any information about the columns it holds (names and/or types), and there is no way to verify at compile time that a column exists and has the right type.

I've been using Haskell and R to do forecasting, and I am planning to rewrite some of the R parts in either Python (using pandas DataFrames) or Haskell. My experience so far has been that R is brilliant for prototyping: the lack of type safety wasn't a problem at first, as I could get instant feedback on what was wrong. However, finding problems later on (I mean years later; I still use R scripts that I wrote in 2012) was trickier, mainly because I couldn't know which columns were in a dataframe.

A typical example would be a script that reads 2 or 3 CSV files and joins them, then crashes because, for some reason, one of the CSV files was empty and had no columns. Later in the pipeline a column is missing, and there is no easy way to find where it should have come from.

Unless I am missing something, the DataFrame in this package will have the same issue. It does give us some type safety, but not the kind needed to solve this problem.

On the other hand, plain Haskell with a vector or list of records provides this safety.
A join between two vectors of records can, for example, yield values of type These a b.
However, it quickly becomes cumbersome, and it is easy to run into performance issues.
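
To make the These idea concrete, here is a minimal sketch of an outer join over plain lists of records. The These type is defined locally as a stand-in for the one in the these package, and the Order and Price records and the outerJoin function are hypothetical names invented for the illustration. It assumes keys are unique on each side.

```haskell
-- Local stand-in for the These type from the `these` package.
data These a b = This a | That b | These a b
  deriving (Eq, Show)

-- Hypothetical record types, for illustration only.
data Order = Order { orderSku :: String, quantity :: Int }
  deriving (Eq, Show)

data Price = Price { priceSku :: String, price :: Double }
  deriving (Eq, Show)

-- Full outer join on a key. A row found on one side only is kept as
-- This or That, so a missing match shows up in the type of the result.
-- Assumes keys are unique on each side. Quadratic in the input sizes.
outerJoin :: Eq k => (a -> k) -> (b -> k) -> [a] -> [b] -> [These a b]
outerJoin keyA keyB xs ys =
  [ maybe (This x) (These x) (lookup (keyA x) ysByKey) | x <- xs ]
    ++ [ That y | y <- ys, keyB y `notElem` map keyA xs ]
  where
    ysByKey = [ (keyB y, y) | y <- ys ]
```

If one input is empty, every row comes back as This or That, so the caller has to handle the missing side explicitly instead of losing a column silently. The list-based lookup also shows the performance concern: a realistic version would need an indexed structure.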

(I even created a package, metamorphosis, which uses Template Haskell to merge records into a new one, but that is also a bit heavy.)

Describe the solution you'd like
I don't have a fully fledged solution, but I've thought about this problem quite a bit and I have a few ideas.
Most of them involve being able to convert an untyped DataFrame into something that carries more information.

For example, the types of the columns could be added as a (type-level) list of types, as in DataFrame '[Int, Double, Double].
In a similar way, the names could be added using Symbols, as in DataFrame '[ '("quantity", Int), '("price", Double)].
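
As a rough sketch of what the named variant could look like, the code below wraps an untyped frame in a newtype with a phantom schema and checks the column names once at the boundary. RawFrame, TypedFrame, KnownColumns and checkSchema are all hypothetical names, not part of this package, and the check only covers column names, not the types of the cells.

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeOperators #-}

import Data.Kind (Type)
import Data.Proxy (Proxy (..))
import GHC.TypeLits (KnownSymbol, Symbol, symbolVal)

-- Stand-in for an untyped frame: column names with cells kept as strings.
newtype RawFrame = RawFrame [(String, [String])]

-- The same data tagged with a type-level schema (a phantom parameter).
newtype TypedFrame (cols :: [(Symbol, Type)]) = TypedFrame RawFrame

-- Reflects the column names of a schema down to the value level.
class KnownColumns (cols :: [(Symbol, Type)]) where
  columnNames :: Proxy cols -> [String]

instance KnownColumns '[] where
  columnNames _ = []

instance (KnownSymbol name, KnownColumns rest)
      => KnownColumns ('(name, ty) ': rest) where
  columnNames _ =
    symbolVal (Proxy :: Proxy name) : columnNames (Proxy :: Proxy rest)

-- Checks once that every column named in the schema is present.
-- Only names are verified in this sketch, not cell types.
checkSchema :: forall cols. KnownColumns cols
            => RawFrame -> Maybe (TypedFrame cols)
checkSchema raw@(RawFrame columns)
  | all (`elem` map fst columns) (columnNames (Proxy :: Proxy cols)) =
      Just (TypedFrame raw)
  | otherwise = Nothing

-- Example schema matching the one written above.
type Sales = '[ '("quantity", Int), '("price", Double)]
```

With this shape, an empty CSV would be rejected by checkSchema at the point where it is read, rather than further down the pipeline.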

Another idea would be to add HasField constraints. It would then look like (HasField "quantity" d Int, HasField "price" d Double, d ~ DataFrame) => d.
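
For comparison, here is how such constraints already work at the row level with HasField from GHC.Records. The Sale record and the lineTotal function are hypothetical. Whether equivalent instances could be provided for a whole DataFrame is exactly the open question, so this only shows the shape of the constraint.

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE TypeApplications #-}

import GHC.Records (HasField (..))

-- Works for any row type that exposes these two fields with these types.
lineTotal :: (HasField "quantity" r Int, HasField "price" r Double)
          => r -> Double
lineTotal row =
  fromIntegral (getField @"quantity" row) * getField @"price" row

-- Hypothetical record. GHC solves the HasField constraints for it
-- automatically because the field selectors are in scope.
data Sale = Sale { sku :: String, quantity :: Int, price :: Double }

totals :: [Sale] -> [Double]
totals = map lineTotal
```

Calling lineTotal on a record without a price field is a compile-time error, which is the kind of feedback the request is after.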

The most promising would be to have something similar to justified-containers, where you can be given a "proof" that the DataFrame has the columns you require.
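
A minimal sketch of that style, assuming a toy frame of Double columns: a rank-2 continuation ties a phantom type to one particular frame, and a Column value acts as evidence that the name was found in that frame. RawFrame, Frame, Column, withFrame, column, values and revenue are all hypothetical names. In a real module the Frame and Column constructors would not be exported.

```haskell
{-# LANGUAGE RankNTypes #-}

-- Stand-in for an untyped frame holding only Double columns.
newtype RawFrame = RawFrame [(String, [Double])]

-- A frame tagged with a phantom type that identifies this one frame.
newtype Frame ph = Frame RawFrame

-- Evidence that a column with this name exists in the frame tagged `ph`.
newtype Column ph = Column String

-- The rank-2 type keeps `ph` from escaping, so a proof obtained for
-- one frame cannot be used with another.
withFrame :: RawFrame -> (forall ph. Frame ph -> r) -> r
withFrame raw k = k (Frame raw)

-- The only place where a lookup can fail.
column :: String -> Frame ph -> Maybe (Column ph)
column name (Frame (RawFrame cols))
  | name `elem` map fst cols = Just (Column name)
  | otherwise = Nothing

-- Total access: holding a Column means the lookup cannot fail.
values :: Column ph -> Frame ph -> [Double]
values (Column name) (Frame (RawFrame cols)) =
  case lookup name cols of
    Just vs -> vs
    Nothing -> error "unreachable: a Column is only built by `column`"

-- Example: all checks happen up front, then the computation is total.
revenue :: RawFrame -> Maybe Double
revenue raw = withFrame raw (\frame -> do
  q <- column "quantity" frame
  p <- column "price" frame
  pure (sum (zipWith (*) (values q frame) (values p frame))))
```

The appeal is that the existing untyped DataFrame stays as it is, and the proofs are layered on top of it.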

Describe alternatives you've considered
Being able to parametrize a DataFrame by a record, or convert it to something indexed by one
(maybe like the Frames package). One of the problems here is that when joining two DataFrames, we need a third record (the union of the two records being joined). This could be made easier if Haskell supported row polymorphism. Maybe an alternative would be to add something to GHC to support it.
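
Without row polymorphism, the "third record" can be approximated by computing the result schema with a closed type family. The sketch below uses plain concatenation of column-name lists rather than a true union, and joinFrames just pastes columns side by side as a placeholder for a real keyed join. Frame, Append, joinFrames, KnownNames and schemaOf are hypothetical names, and nothing here checks that the stored data matches the declared schema.

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE TypeOperators #-}

import Data.Proxy (Proxy (..))
import GHC.TypeLits (KnownSymbol, Symbol, symbolVal)

-- Type-level list concatenation, used as the schema of a joined frame.
type family Append (xs :: [Symbol]) (ys :: [Symbol]) :: [Symbol] where
  Append '[]       ys = ys
  Append (x ': xs) ys = x ': Append xs ys

-- A frame whose column names are tracked in its type.
newtype Frame (cols :: [Symbol]) = Frame [(String, [String])]

-- Placeholder for a join: pastes the columns side by side. The point
-- is the result type, which the compiler derives from both inputs.
joinFrames :: Frame xs -> Frame ys -> Frame (Append xs ys)
joinFrames (Frame left) (Frame right) = Frame (left ++ right)

-- Reflects a schema down to a list of strings.
class KnownNames (cols :: [Symbol]) where
  names :: Proxy cols -> [String]

instance KnownNames '[] where
  names _ = []

instance (KnownSymbol name, KnownNames rest) => KnownNames (name ': rest) where
  names _ = symbolVal (Proxy :: Proxy name) : names (Proxy :: Proxy rest)

-- Reads the schema from the type only, ignoring the stored data.
schemaOf :: forall cols. KnownNames cols => Frame cols -> [String]
schemaOf _ = names (Proxy :: Proxy cols)

-- Example schemas.
type Orders = '["sku", "quantity"]
type Prices = '["price"]
```

A proper union would also need to deduplicate the join key and reject clashing names, which is where this approach starts to get heavy.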
