Column safety

**Is your feature request related to a problem? Please describe.**
At the moment, the type of a DataFrame doesn't provide any information about the columns it holds (names and/or types), and there is no way to verify at compile time that a column exists and has the right type.

I've been using Haskell and R to do forecasting, and I am planning to rewrite some of the R bits in either Python (using pandas DataFrames) or Haskell. My experience so far has been that R is brilliant for prototyping: the lack of type safety wasn't a problem at first, as I could get instant feedback on what was wrong. However, finding problems later on (I mean years later; I still use R scripts that I wrote in 2012) was trickier, mainly because I couldn't know which columns were in a data frame.

A typical example would be a script that reads 2 or 3 CSV files, joins them, and then crashes because, for some reason, one of the CSV files was empty and had no columns. Later in the pipeline one of the columns has been lost, and there is no easy way to find where that column should have come from.
 
Unless I am missing something, the DataFrame in this package will have the same issue. We do have some type safety, but not the kind needed to solve this issue.

On the other hand, plain Haskell with vectors or lists of records provides this safety.
A join between two vectors of records can, for example, produce a `These a b` for each key.
However, it quickly becomes cumbersome, and it is easy to run into performance issues.
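To make the record-based approach concrete, here is a minimal sketch of an outer join that yields a `These` per key. The `These` type is defined locally to keep the example self-contained (it mirrors the one from the `these` package), and `Order`, `Price`, `keyBy` and `outerJoin` are hypothetical names invented for illustration.

```haskell
import qualified Data.Map.Strict as Map

-- Local copy of the usual These type: left only, right only, or both.
data These a b = This a | That b | These a b
  deriving (Eq, Show)

-- Hypothetical record types standing in for rows of two CSV files.
data Order = Order { orderSku :: String, quantity :: Int }
  deriving (Eq, Show)

data Price = Price { priceSku :: String, price :: Double }
  deriving (Eq, Show)

-- Index a list of records by a key (later duplicates win).
keyBy :: Ord k => (a -> k) -> [a] -> Map.Map k a
keyBy f = Map.fromList . map (\x -> (f x, x))

-- Full outer join: every key from either side appears in the result.
outerJoin :: Ord k => Map.Map k a -> Map.Map k b -> Map.Map k (These a b)
outerJoin l r = Map.unionWith combine (Map.map This l) (Map.map That r)
  where
    combine (This a) (That b) = These a b
    combine x _ = x -- not reachable: the left map only holds This, the right only That

orders :: [Order]
orders = [Order "A" 2, Order "B" 1]

prices :: [Price]
prices = [Price "A" 9.5, Price "C" 3.0]

-- "A" is matched on both sides, "B" has no price, "C" has no order.
joined :: Map.Map String (These Order Price)
joined = outerJoin (keyBy orderSku orders) (keyBy priceSku prices)
```

The type of `joined` forces every consumer to decide what to do with unmatched rows, which is exactly the safety that is lost with an untyped frame, at the cost of writing this kind of plumbing for every join.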

(I even created a package, `metamorphosis`, which uses Template Haskell to merge records into a new one, but that is also a bit heavy.)

**Describe the solution you'd like**
I don't have a fully fledged solution, but I've thought about this problem quite a bit and I have a few ideas.
Most of them involve being able to convert an untyped DataFrame into something that carries more information.

For example, the types of the columns could be added as a (type-level) list of types, as in `DataFrame '[Int, Double, Double]`.
In a similar way, the names could be added using symbols, as in `DataFrame '[ '("quantity", Int), '("price", Double)]`.
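A rough sketch of how a type-level schema could be checked when converting from an untyped frame is shown below. Everything here is an assumption made for illustration: `Untyped` is a stand-in that only records column names and a runtime type tag, and `Typed`, `KnownSchema` and `toTyped` are hypothetical names, not part of this package.

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeOperators #-}

import Data.Kind (Type)
import Data.Proxy (Proxy (..))
import GHC.TypeLits (KnownSymbol, Symbol, symbolVal)

-- Stand-in for an untyped frame: column names with a runtime type tag.
data ColType = IntCol | DoubleCol
  deriving (Eq, Show)

type Untyped = [(String, ColType)]

-- The same data, tagged with the schema it has been checked against.
newtype Typed (cols :: [(Symbol, Type)]) = Typed Untyped
  deriving (Eq, Show)

-- Map a Haskell type to its runtime tag.
class KnownType (t :: Type) where
  colType :: Proxy t -> ColType

instance KnownType Int where colType _ = IntCol
instance KnownType Double where colType _ = DoubleCol

-- Bring a type-level schema down to the value level.
class KnownSchema (cols :: [(Symbol, Type)]) where
  schema :: Proxy cols -> [(String, ColType)]

instance KnownSchema '[] where
  schema _ = []

instance (KnownSymbol n, KnownType t, KnownSchema rest)
    => KnownSchema ('(n, t) ': rest) where
  schema _ =
    (symbolVal (Proxy :: Proxy n), colType (Proxy :: Proxy t))
      : schema (Proxy :: Proxy rest)

-- The only runtime check: after this, the schema lives in the type.
toTyped :: forall cols. KnownSchema cols => Untyped -> Either String (Typed cols)
toTyped df =
  case filter (`notElem` df) (schema (Proxy :: Proxy cols)) of
    [] -> Right (Typed df)
    missing -> Left ("missing or mistyped columns: " ++ show missing)

type Sales = '[ '("quantity", Int), '("price", Double)]

goodFrame, badFrame :: Either String (Typed Sales)
goodFrame = toTyped [("quantity", IntCol), ("price", DoubleCol)]
badFrame = toTyped [("quantity", IntCol)] -- "price" is missing, so this is a Left
```

With something along these lines, the empty-CSV case would fail at the `toTyped` call right after reading the file, rather than somewhere later in the pipeline.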

Another idea would be to add `HasField` constraints. It would then look like `(HasField "quantity" d Int, HasField "price" d Double, d ~ DataFrame) => d`.
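As an illustration of the constraint style only, the sketch below uses `HasField` from `GHC.Records` with an ordinary record of columns. The `Sales` record and the `revenue` function are hypothetical; how a real DataFrame type would provide `HasField` instances is left open.

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE TypeApplications #-}

import GHC.Records (HasField (..))

-- Hypothetical column-oriented record: one list per column.
data Sales = Sales
  { quantity :: [Int]
  , price :: [Double]
  }

-- Works for any type d that has a "quantity" column of Ints and a
-- "price" column of Doubles; nothing else about d is assumed.
revenue
  :: (HasField "quantity" d [Int], HasField "price" d [Double])
  => d -> Double
revenue d =
  sum (zipWith (\q p -> fromIntegral q * p)
         (getField @"quantity" d)
         (getField @"price" d))

exampleRevenue :: Double
exampleRevenue = revenue (Sales [2, 3] [1.5, 2.0])
```

Calling `revenue` on a type that lacks a `price` field is rejected by the compiler, which is the guarantee being asked for here.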

The most promising would be to have something similar to justified-containers, where you can be given a "proof" that the DataFrame has the columns you require.
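A minimal sketch of that idea, in the spirit of justified-containers but not using its API: a phantom type parameter ties a column "proof" to the frame it was checked against. The frame is simplified to an association list of `Double` columns, and `Frame`, `Col`, `withFrame`, `column` and `values` are hypothetical names.

```haskell
{-# LANGUAGE RankNTypes #-}

import Data.Maybe (fromMaybe)

-- In a real module the constructors would not be exported, so a Col
-- could only be obtained through 'column'.
newtype Frame ph = Frame [(String, [Double])]
newtype Col ph = Col String

-- Introduce a fresh phantom type for this particular frame.
withFrame :: [(String, [Double])] -> (forall ph. Frame ph -> r) -> r
withFrame cols k = k (Frame cols)

-- The single runtime check: succeed only if the column exists.
column :: String -> Frame ph -> Maybe (Col ph)
column name (Frame cols) = Col name <$ lookup name cols

-- Total lookup: a Col ph can only come from a successful check on
-- a frame with the same phantom type.
values :: Col ph -> Frame ph -> [Double]
values (Col name) (Frame cols) =
  fromMaybe (error "impossible: proof for a missing column") (lookup name cols)

-- Check for the column once, then use it without further Maybe handling.
totalPrice :: [(String, [Double])] -> Maybe Double
totalPrice raw =
  withFrame raw (\df -> do
    priceCol <- column "price" df
    pure (sum (values priceCol df)))
```

The rank-2 type of `withFrame` keeps a `Col` from being used with a different frame, so the missing-column failure can only surface at the `column` call.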

**Describe alternatives you've considered**
Being able to parametrize a DataFrame by a record, or convert it to something indexed by one
(maybe like the Frames package). One of the problems here is that when joining two DataFrames, we need a third record (the union of the two records being joined). This could be made easier if Haskell supported row polymorphism. Maybe an alternative would be to add something to GHC to support it.

