
A very simple, strongly typed, scala framework for tabular data. A collection of tuples. A strongly typed scala csv reader/parser wrapping opencsv.

mfulgo/product-collections

 
 


##product-collections


product-collections is a Scala collection designed to hold Tuples.

Use product-collections to manipulate tabular data while

  • retaining type safety.
  • writing idiomatic Scala.

product-collections

  • is minimalistic.
  • marries two existing Scala constructs, Products and Collections, in the obvious way.

product-collections has a very neat and typesafe CSV reader/parser: CsvParser[String,Int].parseFile("sample.csv")

I wrote product-collections in response to the data requirements of an internal project. I found the alternatives

  • too complex.
  • too heavy.
  • too academic.
  • insufficiently type safe.

A product-collection can be assembled either row by row or column by column. Data can be extracted either row by row or column by column.

Scaladoc

View the Scaladoc. The Scaladoc packages contain examples and REPL sessions.

The Scaladoc on GitHub is preferred to a locally generated variant: I've used a hacked version of Scala to generate it. If you want a local copy you can clone the gh-pages branch.

Repl Session

This document contains fragments of a REPL session which may not be entirely consistent. The full REPL session is available. You can reproduce it by pasting the REPL source found in the doc directory.

###Using CollSeq

####Creating a CollSeq

Let the compiler infer the appropriate implementation:

scala> CollSeq(("A",2,3.1),("B",3,4.0),("C",4,5.2))
res1: org.catch22.collections.immutable.CollSeq3[String,Int,Double] = 
CollSeq((A,2,3.1),
        (B,3,4.0),
        (C,4,5.2))

Notice that the correct types are inferred for each column. Consistent Tuple length is guaranteed by the compiler. You can't have a CollSeq comprising mixed Product2 and Product3 types for example.

####Extracting columns:

A CollSeqN is also a ProductN (essentially a Tuple). To extract a column:

scala> CollSeq(("A",2,3.1),("B",3,4.0),("C",4,5.2))
res0: org.catch22.collections.immutable.CollSeq3[String,Int,Double] = 
CollSeq((A,2,3.1),
        (B,3,4.0),
        (C,4,5.2))

scala> res0._1
res1: Seq[String] = List(A, B, C)
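For comparison, here is a minimal sketch of what column extraction amounts to over an ordinary Vector of tuples, using only the standard library. `ColumnSketch` and its field names are invented for illustration and are not part of product-collections.

```scala
object ColumnSketch {
  // Plain standard-library rows; no product-collections types involved.
  val rows: Vector[(String, Int, Double)] =
    Vector(("A", 2, 3.1), ("B", 3, 4.0), ("C", 4, 5.2))

  // One column at a time: project the field you want out of every row.
  val firstColumn: Vector[String] = rows.map(_._1)

  // All three columns at once: unzip3 splits the rows into three typed sequences.
  val (names, counts, weights) = rows.unzip3
}
```

Each column keeps its own element type, which is the property that `_1`, `_2` and so on expose directly on a CollSeq.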

####Extract a row

CollSeq is an IndexedSeq so you can extract a row in the normal manner:

scala> res1(1)
res4: Product3[java.lang.String,Int,Int] = (B,3,4)

####Add a column

You can use the flatZip method to add a column:

scala> res1.flatZip(res1._2.map(_ *2))
res14: org.catch22.collections.immutable.CollSeq4[String,Int,Double,Int] = 
CollSeq((A,2,3.1,4),
        (B,3,4.0,6),
        (C,4,5.2,8))
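The idea behind flatZip can be sketched with plain tuples: zip the rows with the new column, then flatten the nested pair into a wider tuple. `FlatZipSketch.flatZip3` is a hypothetical helper written for this README, not the library's implementation, and it inherits the truncating behaviour of the standard `zip`. Whether the library's flatZip also truncates is not stated here.

```scala
object FlatZipSketch {
  // Append one column to three-column rows.
  // zip stops at the shorter input, so surplus rows or column values are dropped.
  def flatZip3[A, B, C, D](rows: Seq[(A, B, C)], column: Seq[D]): Seq[(A, B, C, D)] =
    rows.zip(column).map { case ((a, b, c), d) => (a, b, c, d) }
}
```

For example, `FlatZipSketch.flatZip3(Seq(("A", 2, 3.1)), Seq(4))` produces a single four-element row with the element types String, Int, Double and Int.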

####Access the row 'above'

Using Scala's sliding method you can access the preceding n rows. Here we calculate the difference between consecutive values in the 4th column:

scala> res14._4.sliding(2).toList.map(z=>z(1)-z(0))
res21: List[Int] = List(2, 2)

Append the result:

scala> res14.flatZip(0::res21)
res22: org.catch22.collections.immutable.CollSeq5[java.lang.String,Int,Int,Int,Int] = 
(A,2,3,4,0)
(B,3,4,6,2)
(C,4,5,8,2)
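The row-difference step can be sketched on its own with standard collections. `RowDiffSketch` and its method names are illustrative assumptions. The pattern match guards against a window that holds fewer than two elements, which `sliding(2)` yields for a one-element input.

```scala
object RowDiffSketch {
  // Difference between each value and the one before it.
  // A window shorter than two elements is skipped instead of indexing past its end.
  def differences(xs: Seq[Int]): List[Int] =
    xs.sliding(2).collect { case Seq(prev, next) => next - prev }.toList

  // Pad with a leading 0 so the result has one entry per input row
  // and can be appended as a column.
  def paddedDifferences(xs: Seq[Int]): List[Int] =
    if (xs.isEmpty) Nil else 0 :: differences(xs)
}
```

The padding matters because a difference column has one entry fewer than the column it was computed from.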

####Splice columns together

This uses the implicit conversions in the collections package object.

scala> CollSeq((1,2,3),(2,3,4),(3,4,5))
res0: org.catch22.collections.immutable.CollSeq3[Int,Int,Int] = 
CollSeq((1,2,3),
        (2,3,4),
        (3,4,5))

scala> res0._3 flatZip res0._1 flatZip res0._2
res2: org.catch22.collections.immutable.CollSeq3[Int,Int,Int] = 
CollSeq((3,1,2),
        (4,2,3),
        (5,3,4))

####Map

Map and similar methods (where possible) produce another CollSeq:

scala> CollSeq((3,1,2),
     |             (4,2,3),
     |             (5,3,4))
res0: org.catch22.collections.immutable.CollSeq3[Int,Int,Int] = 
CollSeq((3,1,2),
        (4,2,3),
        (5,3,4))

scala> res0.map(t=>(t._1+1,t._2-1,t._3.toDouble))
res1: org.catch22.collections.immutable.CollSeq3[Int,Int,Double] = 
CollSeq((4,0,2.0),
        (5,1,3.0),
        (6,2,4.0))

####Lookup a row

You can look up rows by constructing a Map keyed on one of the columns:

scala> val data= CollSeq(("Zesa",10,20),
     | ("Eskom",5,11),
     | ("Sars",16,13))
data: org.catch22.collections.immutable.CollSeq3[String,Int,Int] = 
CollSeq((Zesa,10,20),
        (Eskom,5,11),
        (Sars,16,13))

scala> val lookup= data._1.zip(data).toMap
lookup: scala.collection.immutable.Map[String,Product3[String,Int,Int]] = 
Map(Zesa -> (Zesa,10,20), Eskom -> (Eskom,5,11), Sars -> (Sars,16,13))

scala> lookup("Sars")
res0: Product3[String,Int,Int] = (Sars,16,13)
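The same keyed lookup can be sketched over a plain Vector of tuples. `LookupSketch` is an illustrative name. Two properties of the standard `toMap` are worth keeping in mind: if a key occurs more than once the later row wins, and `get` returns an Option where `apply` would throw for a missing key.

```scala
object LookupSketch {
  val data: Vector[(String, Int, Int)] =
    Vector(("Zesa", 10, 20), ("Eskom", 5, 11), ("Sars", 16, 13))

  // Key every row by its first column.
  val byName: Map[String, (String, Int, Int)] =
    data.map(row => row._1 -> row).toMap

  // Safe lookup: None for an unknown key instead of an exception.
  def find(name: String): Option[(String, Int, Int)] = byName.get(name)
}
```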

###I/O

The CsvParser class (and its concrete sub-classes) allows you to easily read CollSeqs from the filesystem.

####Construct a Parser

scala> val parser=CsvParser[String,Int,Int,Int]
parser: org.catch22.collections.io.CsvParser4[String,Int,Int,Int] = org.catch22.collections.io.CsvParser4@1203c6e

####Read and Parse a file

scala> parser.parseFile("abil.csv",hasHeader=true,delimiter="\t")
res2: org.catch22.collections.immutable.CollSeq4[String,Int,Int,Int] = 
CollSeq((30-APR-12,3885,3922,3859),
        (02-MAY-12,3880,3915,3857),
        (03-MAY-12,3920,3948,3874),
        (04-MAY-12,3909,3952,3885),
        (07-MAY-12,3853,3900,3825),
        (08-MAY-12,3770,3851,3755),
        (09-MAY-12,3700,3782,3666),
        (10-MAY-12,3732,3745,3658),
        (11-MAY-12,3760,3765,3703),
        (14-MAY-12,3660,3750,3655),
        (15-MAY-12,3650,3685,3627),
        (16-MAY-12,3661,3663,3555),
        (17-MAY-12,3620,3690,3600),
        (18-MAY-12,3545,3595,3542),
        (21-MAY-12,3602,3608,3546),
        (22-MAY-12,3650,3675,3615),
        (23-MAY-12,3566,3655,3566),
        (24-MAY-12,3632,3645,3586),
        (25-MAY-12,3610,3665,3583),
        (28-MAY-12,3591,3647,3582),
     ...

####Read and parse a java.io.Reader

scala> val stringData="""10,20,"hello"
     |                   |20,30,"world"""".stripMargin
stringData: String = 
10,20,"hello"
20,30,"world"

scala> CsvParser[Int,Int,String].parse(new java.io.StringReader(stringData))
res6: org.catch22.collections.immutable.CollSeq3[Int,Int,String] = 
CollSeq((10,20,hello),
        (20,30,world))

####Parsing additional types

To parse additional types (like dates) simply provide a converter as an implicit parameter. See the examples.
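To illustrate how an implicit converter per field type can drive typed parsing, here is a self-contained sketch of the pattern. Everything in it is an assumption made for illustration: `FieldConverter`, `parseRow2` and the ISO date format are not taken from the library, whose own converter trait and date handling may differ.

```scala
import java.time.LocalDate

object ConverterSketch {
  // Hypothetical type class: one instance per field type the parser can read.
  trait FieldConverter[T] { def convert(raw: String): T }

  implicit val stringField: FieldConverter[String] = new FieldConverter[String] {
    def convert(raw: String): String = raw
  }
  implicit val intField: FieldConverter[Int] = new FieldConverter[Int] {
    def convert(raw: String): Int = raw.trim.toInt
  }
  // Supporting an additional type means supplying one more implicit instance.
  // Assumes ISO dates such as 2012-04-30.
  implicit val dateField: FieldConverter[LocalDate] = new FieldConverter[LocalDate] {
    def convert(raw: String): LocalDate = LocalDate.parse(raw.trim)
  }

  // The compiler selects the converters from the requested column types.
  def parseRow2[A, B](fields: Seq[String])(
      implicit a: FieldConverter[A], b: FieldConverter[B]): (A, B) =
    (a.convert(fields(0)), b.convert(fields(1)))
}
```

A call such as `ConverterSketch.parseRow2[LocalDate, Int](Seq("2012-04-30", "3885"))` compiles only if a converter exists for every requested column type, which is where the type safety comes from.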

####Field parse errors

To recover from field parse errors, provide a converter from String to Option[T]. See the examples. Update: as of 0.4.4.4-SNAPSHOT, converters to Option[Int/Double/Boolean] are provided as standard.

###Examples

####Read Stock prices and calculate moving average

An example REPL session. Let's read some stock prices and calculate the 5 period moving average:

scala> import java.util.Date
import java.util.Date

scala> implicit val dmy = new DateConverter("dd-MMM-yy")  // tell the parser how to read your dates
dmy: org.catch22.collections.io.DateConverter = org.catch22.collections.io.DateConverter@26d606

scala> val p=CsvParser[Date,Int,Int,Int,Int]  //Date, close, High, Low, Volume
p: org.catch22.collections.io.CsvParser5[java.util.Date,Int,Int,Int,Int] = org.catch22.collections.io.CsvParser5@1584d9

scala> val prices=p.parseFile("abil.csv", hasHeader=true, delimiter="\t")
prices: org.catch22.collections.immutable.CollSeq5[java.util.Date,Int,Int,Int,Int] = 
(Mon Apr 30 00:00:00 AST 2012,3885,3922,3859,4296459)
(Wed May 02 00:00:00 AST 2012,3880,3915,3857,3127464)
(Thu May 03 00:00:00 AST 2012,3920,3948,3874,3080823)
(Fri May 04 00:00:00 AST 2012,3909,3952,3885,2313354)
(Mon ....

scala> val ma= prices._2.sliding(5).toList.map(_.mean)
ma: List[Double] = List(3889.4, 3866.4, 3830.4, 3792.8, 3763.0, 3724.4, 3700.4, 3692.6, 3670.2, 3627.2, 3615.6, 3615.6, 3596.6, 3599.0, 3612.0, 3609.8, 3605.6, 3611.0, 3611.0, 3606.0, 3614.2, 3612.4, 3629.0, 3634.6, 3659.4, 3661.0, 3657.2, 3645.2, 3628.4, 3616.4, 3632.8, 3668.8, 3702.6, 3745.4, 3781.0, 3779.6, 3755.4, 3727.4, 3689.4, 3650.2, 3638.8, 3641.8, 3648.2, 3663.2, 3671.0, 3649.4, 3624.4, 3595.0, 3559.0, 3518.0, 3505.8, 3495.8, 3505.8, 3531.2, 3570.8, 3589.0, 3613.0, 3620.8, 3624.4, 3635.4, 3661.0, 3667.0, 3686.6, 3703.6, 3720.0, 3722.4, 3692.4, 3619.0, 3553.4, 3473.4, 3413.2, 3400.0, 3422.8, 3427.4, 3433.6, 3434.0, 3425.6, 3403.8, 3396.6, 3388.6, 3376.0, 3353.6, 3318.6, 3291.8, 3260.6, 3240.0, 3225.0, 3226.0, 3218.2, 3232.2, 3219.6, 3226.0, 3234.0, 3251.0, 3271.0, 3312.4, 3341....

scala> prices._1.drop(5).zip(ma) //moving average zipped with date
res0: Seq[(java.util.Date, Double)] = List((Tue May 08 00:00:00 AST 2012,3889.4), (Wed May 09 00:00:00 AST 2012,3866.4), (Thu May 10 00:00:00 AST 2012,3830.4), (Fri May 11 00:00:00 AST 2012,3792.8), (Mon May 14 00:00:00 AST 2012,3763.0), (Tue May 15 00:00:00 AST 2012,3724.4), (Wed May 16 00:00:00 AST 2012,3700.4), (Thu May 17 00:00:00 AST 2012,3692.6), (Fri May 18 00:00:00 AST 2012,3670.2), (Mon May 21 00:00:00 AST 2012,3627.2), (Tue May 22 00:00:00 AST 2012,3615.6), (Wed May 23 00:00:00 AST 2012,3615.6), (Thu May 24 00:00:00 AST 2012,3596.6), (Fri May 25 00:00:00 AST 2012,3599.0), (Mon May 28 00:00:00 AST 2012,3612.0), (Tue May 29 00:00:00 AST 2012,3609.8), (Wed May 30 00:00:00 AST 2012,3605.6), (Thu May 31 00:00:00 AST 2012,3611.0), (Fri Jun 01 00:00:00 AST 2012,3611.0), (Mon Jun 04 0...
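The moving-average step can also be sketched without the library, using `sum` and a division in place of `mean`. `MovingAverageSketch` and its method names are illustrative assumptions. This sketch pairs each average with the last row of its window, which means dropping `window - 1` keys. Whether you want that alignment or a one-row lag is a modelling choice.

```scala
object MovingAverageSketch {
  // Simple moving average over full windows only.
  def movingAverage(values: Seq[Int], window: Int): List[Double] = {
    require(window > 0, "window must be positive")
    values.sliding(window)
      .filter(_.size == window) // ignore the short window produced by a short input
      .map(w => w.sum.toDouble / window)
      .toList
  }

  // Pair each average with the key (for example a date) of the last row in its window.
  def alignWithLast[K](keys: Seq[K], values: Seq[Int], window: Int): List[(K, Double)] =
    keys.drop(window - 1).toList.zip(movingAverage(values, window))
}
```

With the first six closing prices shown above (3885, 3880, 3920, 3909, 3853, 3770) and a window of 5, the sketch yields two averages, matching the first two entries of `ma`.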

####Example: read csv that has field parse errors

Note: this converter is now provided as standard in the distribution.

scala> import scala.util.Try
import scala.util.Try

scala> implicit object optionIntConverter extends GeneralConverter[Option[Int]]{
 | def convert(x:String)=Try(x.trim.toInt).toOption
 | }

defined module optionIntConverter

scala> CsvParser[String,Option[Int]].parseFile("badly-formed.csv")
res3: org.catch22.collections.immutable.CollSeq2[String,Option[Int]] = 
CollSeq((Jan,Some(10)),
        (Feb,None),
        (Mar,Some(25)))

Inserting standard converters for String => Option[Int] and other numeric types is under consideration.
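To see the recovery behaviour in isolation, the following sketch applies the same `Try(...).toOption` idea to raw lines without using the library. `OptionFieldSketch`, `intOrNone` and `parseLine` are hypothetical names, and the naive comma split is an assumption that ignores quoting, which a real CSV parser handles.

```scala
import scala.util.Try

object OptionFieldSketch {
  // A malformed number becomes None instead of throwing NumberFormatException.
  def intOrNone(raw: String): Option[Int] = Try(raw.trim.toInt).toOption

  // Naive two-field split; a missing second field also maps to None.
  def parseLine(line: String): (String, Option[Int]) = {
    val parts = line.split(",", -1).toList
    (parts.head, parts.lift(1).flatMap(intOrNone))
  }
}
```

A bad field costs one None in one cell, and the rest of the row and file still parse.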

#####(Contrived) Example: calculate an aircraft's moment in in-lb

scala> val aircraftLoading=CollSeq(("Row1",86,214),("Row4",168,314),("FwdCargo",204,378)) //Flight Station, Mass kg, Arm in
aircraftLoading: org.catch22.collections.immutable.CollSeq3[java.lang.String,Int,Int] = 
(Row1,86,214)
(Row4,168,314)
(FwdCargo,204,378)

scala> val pounds = aircraftLoading._2.map(_ * 2.2)  //convert kg -> lb
pounds: Seq[Double] = List(189.20000000000002, 369.6, 448.8)

scala> val moment = pounds.zip(aircraftLoading._3).map(x=>x._1*x._2)
moment: Seq[Double] = List(40488.8, 116054.40000000001, 169646.4)

scala> moment.sum
res1: Double = 326189.6

###Architecture

#####CollSeq

CollSeq is a wrapper around IndexedSeq[Product]. CollSeq also implements Product itself.

#####CollSeqN

CollSeqN are concrete implementations of CollSeq. They extend IndexedSeq[ProductN[T1,..,TN]] and implement ProductN. CollSeqN has only one novel method: flatZip (s:Seq[A]): CollSeqN+1[T1,..TN,A]

#####CsvParser

CsvParser is a simple CSV reader/parser that returns a CollSeqN. There are concrete parsers implemented for each arity. The actual gruntwork is done by opencsv.

#####Implicit Conversions

Seq[Product1[T]] => CollSeq1[T]  
Seq[Product2[T1,T2]] => CollSeq2[T1,T2]
Seq[T] => CollSeq1[T]

The methods introduced are few: flatZip and _1 ... _N.
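The shape described above can be sketched for arity 2. This is a deliberately simplified stand-in and not the library's code: `Table2` is a hypothetical name, it wraps a Vector where the real classes extend IndexedSeq, and its flatZip returns a plain Vector of triples where the library returns the next CollSeqN.

```scala
object ArchitectureSketch {
  // A two-column table that is itself a Product2 whose elements are the columns.
  final class Table2[A, B](val rows: Vector[(A, B)]) extends Product2[Seq[A], Seq[B]] {
    def _1: Seq[A] = rows.map(_._1) // first column
    def _2: Seq[B] = rows.map(_._2) // second column
    def canEqual(that: Any): Boolean = that.isInstanceOf[Table2[_, _]]

    // Row access by index, as an IndexedSeq would provide.
    def apply(i: Int): (A, B) = rows(i)

    // Widen every row by one field taken from the new column.
    def flatZip[C](column: Seq[C]): Vector[(A, B, C)] =
      rows.zip(column).map { case ((a, b), c) => (a, b, c) }
  }
}
```

Because the table is a Product, `productArity` reports the number of columns and not the number of rows.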

###Status

Stableish. The API has been stable since v0.0.1-SNAPSHOT. But no guarantees.

###Future

In no particular order:

  • Quantify how a Map of Tuples might be useful.
  • A proper stats implementation, preferably as a library dependency.
  • Missing values, NAs etc.
  • How to incorporate classes that implement ProductN (future case classes).
  • Column access by named method (using macros?)

###Include in your project

You can use an unmanaged jar: Scala-2.10 or Scala-2.9.2

####SBT

Add the following to your build.sbt file:

resolvers += "org.catch22" at "http://marklister.github.io/product-collections/"

libraryDependencies += "org.catch22" %% "product-collections" % "0.0.4.4-SNAPSHOT"

###Build

 git clone git://github.com/marklister/product-collections.git
 cd product-collections
 sbt
 > compile
 > test
 > console

###Build Dependencies

product-collections relies heavily on sbt-boilerplate. sbt-boilerplate is a cleverly designed yet simple code generating sbt-plugin.

product-collections uses a modified version of sbt-boilerplate. Depending on whether the modifications have been accepted upstream, the project will include either a binary dependency on the original sbt-boilerplate or a source dependency on my modified copy.

At present (and until my copy stabilizes) expect the source dependency. Sbt should clone and build sbt-boilerplate transparently.

It is likely that later versions will require scala 2.10+ to build although generating a 2.9.x binary will still be possible. This is due to the use of Twitter's util-eval in sbt-boilerplate. A JSR223 based solution using Scala 2.11 is also under investigation. Thanks Johannes Rudolph and thanks Twitter!

###Runtime Dependencies

###Sample Projects

See product-collections-example. Note the example is only 25 lines of code; it loads stock prices from csv and plots these prices against the 250 period moving average.

###Pull Requests

Pull requests are welcome. Please keep in mind the KISS character if you extend the project. Feel free to discuss your ideas on the issue tracker.

####Scala 2.11

Scala 2.11 should re-introduce case classes as ProductNs. This, along with macros, suggests that product-collections may in future allow accessing columns by name.

Please use the Github issue tracker to ask questions, discuss pull requests etc.

###Licence

Two clause BSD Licence.

###Alternatives

Shapeless

HLists are similar in concept. Shapeless allows one to abstract over arity.

Saddle

Backed by arrays. Heavily specialized. Matrix operations.
