Scalable algorithms for data mining. (I am shifting this project to feluca and refactoring it there, so this project is deprecated.)
dami is written in Java. Our goal is to build algorithms that can handle hundreds of millions of records on a PC with limited memory.
Currently we have:
- utility: buffered vector pool for dataset IO; a high-performance, simple text parser (more tests needed)
- classification: SGD for logistic regression
- recommendation: SlopeOne, SVD, RSVD, item-neighborhood SVD (see movielens_converter.py)
- significance testing: swap randomization
- graph: PageRank (a minimal sketch follows this list)
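Since the list above only names the algorithms, here is a minimal power-iteration PageRank in plain Java as a taste of what the graph module computes. This is an illustrative sketch, not dami's implementation, and all names in it are hypothetical:

```java
import java.util.Arrays;

/**
 * Minimal power-iteration PageRank over an adjacency-list graph.
 * Illustrative sketch only; not dami's actual code.
 */
public class PageRankSketch {
    public static double[] pagerank(int[][] outLinks, double damping, int iterations) {
        int n = outLinks.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        for (int iter = 0; iter < iterations; iter++) {
            double[] next = new double[n];
            double danglingMass = 0.0; // rank held by nodes with no out-links
            for (int u = 0; u < n; u++) {
                if (outLinks[u].length == 0) {
                    danglingMass += rank[u];
                } else {
                    double share = rank[u] / outLinks[u].length;
                    for (int v : outLinks[u]) next[v] += share;
                }
            }
            // Teleport term plus an even redistribution of dangling mass.
            for (int v = 0; v < n; v++) {
                next[v] = (1.0 - damping) / n + damping * (next[v] + danglingMass / n);
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Tiny 3-node example: 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1.
        int[][] graph = { {1}, {2}, {0, 1} };
        System.out.println(Arrays.toString(pagerank(graph, 0.85, 50)));
    }
}
```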
Future:
- similarity: SimHash
2012/10/22 Release Notes:
- L1- and L2-regularized logistic regression (update rule sketched below)
- memory cost estimation
- simple command-line integration for LR
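Since the release notes only name the feature, here is a minimal sketch of one SGD step for L2-regularized logistic regression on a sparse example. The names below (weights, eta, lambda) are my assumptions, not dami's API, and an L1 variant would instead truncate weights toward zero:

```java
/**
 * One SGD step for L2-regularized logistic regression on a sparse example.
 * Illustrative sketch only; not dami's actual update code.
 */
public class LrSgdSketch {
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Updates weights in place given a sparse example (ids, values, label in {0,1}). */
    static void sgdStep(double[] weights, int[] ids, double[] vals, int label,
                        double eta, double lambda) {
        double z = 0.0;
        for (int k = 0; k < ids.length; k++) z += weights[ids[k]] * vals[k];
        double grad = sigmoid(z) - label;  // dLoss/dz for log loss
        for (int k = 0; k < ids.length; k++) {
            int j = ids[k];
            // The lambda term is the L2 penalty; L1 would truncate instead.
            weights[j] -= eta * (grad * vals[k] + lambda * weights[j]);
        }
    }

    public static void main(String[] args) {
        double[] w = new double[8];
        // One positive example with features 1:1.0 and 3:0.5.
        sgdStep(w, new int[]{1, 3}, new double[]{1.0, 0.5}, 1, 0.1, 1e-4);
        System.out.println(java.util.Arrays.toString(w));
    }
}
```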
2012/7/22 Release Notes:
- Asynchronous vector buffer for dataset IO
- High-performance, simple text parser (handles digit-related characters only; see the sketch below)
- Small refactoring
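To show why restricting the parser to digit-related characters makes it cheap, here is a hand-rolled sketch that turns lines like "3:0.5 7:1.25" into parallel arrays without String.split or parseFloat allocations. The input format and all names are my assumptions, not dami's actual parser:

```java
/**
 * Minimal parser for "id:value" pairs built from digits, '.', ':' and ' '.
 * A sketch of the idea, not dami's parser.
 */
public class DigitParserSketch {
    /** Parses pairs from a char array into parallel arrays; returns the pair count. */
    static int parse(char[] line, int len, int[] ids, double[] vals) {
        int count = 0, i = 0;
        while (i < len) {
            int id = 0;
            while (i < len && line[i] != ':') id = id * 10 + (line[i++] - '0');
            i++; // skip ':'
            double value = 0.0, scale = 1.0;
            boolean fraction = false;
            while (i < len && line[i] != ' ') {
                char c = line[i++];
                if (c == '.') { fraction = true; continue; }
                value = value * 10 + (c - '0');
                if (fraction) scale *= 10;
            }
            i++; // skip ' '
            ids[count] = id;
            vals[count++] = value / scale;
        }
        return count;
    }

    public static void main(String[] args) {
        char[] line = "3:0.5 7:1.25".toCharArray();
        int[] ids = new int[16];
        double[] vals = new double[16];
        int n = parse(line, line.length, ids, vals);
        for (int k = 0; k < n; k++) System.out.println(ids[k] + " -> " + vals[k]);
    }
}
```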
2012/7/12 Release Notes:
- Code refactoring for recommendation and IO
- To compute RMSE for recommendation, first see movielens_convert.py for converting and/or splitting the movielens data, then see CFDataConverter and TestSVD. (A reference RMSE computation is sketched below.)
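For reference, RMSE is the standard root-mean-square error between predicted and actual ratings; below is a self-contained Java sketch of the definition, not code taken from TestSVD:

```java
/**
 * Standard RMSE over parallel arrays of predicted and actual ratings.
 * Reference sketch; TestSVD's evaluation may differ in details.
 */
public class RmseSketch {
    static double rmse(double[] predicted, double[] actual) {
        double sumSq = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            double diff = predicted[i] - actual[i];
            sumSq += diff * diff;
        }
        return Math.sqrt(sumSq / predicted.length);
    }

    public static void main(String[] args) {
        // Two test ratings: prints sqrt(0.25 / 2) ~ 0.3536.
        System.out.println(rmse(new double[]{3.5, 4.0}, new double[]{4.0, 4.0}));
    }
}
```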
To achieve computational efficiency and good memory utilization, we adopt two techniques:
1. Using the "id" directly as an array index when fetching data.
2. Keeping only the model in memory, while the data itself is converted to raw bytes for IO.
So it is highly recommended that you use contiguous ids with these algorithms :)
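A compressed illustration of both points, with names of my own choosing (nothing below is dami's actual code): the model lives in an id-indexed array, and examples round-trip through a compact binary encoding instead of text.

```java
import java.io.*;

/**
 * Sketch of the two techniques above: the model is a plain array indexed
 * by id, and examples are stored as raw bytes instead of text for fast IO.
 */
public class IdIndexedIoSketch {
    public static void main(String[] args) throws IOException {
        float[] model = new float[1000]; // model[id] = weight; ids must be small and contiguous

        // Write one sparse example (id:value pairs) as raw bytes.
        File file = File.createTempFile("dami-example", ".bin");
        file.deleteOnExit();
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            out.writeInt(2);               // number of pairs
            out.writeInt(3); out.writeFloat(0.5f);
            out.writeInt(7); out.writeFloat(1.25f);
        }

        // Read it back; each id indexes straight into the model array.
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            int pairs = in.readInt();
            for (int k = 0; k < pairs; k++) {
                int id = in.readInt();
                float value = in.readFloat();
                model[id] += 0.1f * value; // e.g. one toy update step
            }
        }
        System.out.println(model[3] + " " + model[7]);
    }
}
```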
My Chinese blog: http://blog.csdn.net/lgnlgn
E-mail: gnliang10 [at] 126.com