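This list is based mainly on the links collected by josephmisiti. Language coverage focuses on R, Python, MATLAB, and others. A lot of related material has also been compiled on Reddit.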
Contents
Some useful books
The following is a list of free, open source books on machine learning, statistics, data-mining, etc.
Machine-Learning / Data Mining
- An Introduction To Statistical Learning - Book + R Code
- Elements of Statistical Learning - Book
- Probabilistic Programming & Bayesian Methods for Hackers - Book + IPython Notebooks
- Think Bayes - Book + Python Code
- Information Theory, Inference, and Learning Algorithms
- Gaussian Processes for Machine Learning
- Data Intensive Text Processing w/ MapReduce
- Reinforcement Learning: An Introduction
- Mining Massive Datasets
- A First Encounter with Machine Learning
- Pattern Recognition and Machine Learning
- Machine Learning & Bayesian Reasoning
- Introduction to Machine Learning
- A Probabilistic Theory of Pattern Recognition
- Introduction to Information Retrieval
- Forecasting: principles and practice
- Practical Artificial Intelligence Programming in Java
- Introduction to Machine Learning
- Reinforcement Learning
- Machine Learning
- A Quest for AI
- Introduction to Applied Bayesian Statistics and Estimation for Social Scientists
- Bayesian Modeling, Inference and Prediction
Natural Language Processing
Probability & Statistics
- Think Stats - Book + Python Code
- From Algorithms to Z-Scores - Book
- The Art of R Programming - Book (Not Finished)
- All of Statistics
- Introduction to statistical thought
- Basic Probability Theory
- Introduction to probability
- Principle of Uncertainty
- Probability & Statistics Cookbook
- Advanced Data Analysis From An Elementary Point of View
Linear Algebra
- Linear Algebra Done Wrong
- Linear Algebra, Theory, and Applications
- Convex Optimization
- Applied Numerical Computing
- Applied Numerical Linear Algebra
A curated list of awesome machine learning frameworks, libraries and software (by language). Inspired by awesome-php. Other awesome lists can be found in the awesome-awesomeness list.
Java
Natural Language Processing
- [CoreNLP](http://nlp.stanford.edu/software/corenlp.shtml) - Stanford CoreNLP provides a set of natural language analysis tools which can take raw English language text input and give the base forms of words.
- [Stanford Parser](http://nlp.stanford.edu/software/lex-parser.shtml) - A natural language parser is a program that works out the grammatical structure of sentences.
- [Stanford POS Tagger](http://nlp.stanford.edu/software/tagger.shtml) - A Part-Of-Speech Tagger (POS Tagger).
- [Stanford Named Entity Recognizer](http://nlp.stanford.edu/software/CRF-NER.shtml) - Stanford NER is a Java implementation of a Named Entity Recognizer.
- [Stanford Word Segmenter](http://nlp.stanford.edu/software/segmenter.shtml) - Tokenization of raw text is a standard pre-processing step for many NLP tasks.
- Tregex, Tsurgeon and Semgrex - Tregex is a utility for matching patterns in trees, based on tree relationships and regular expression matches on nodes (the name is short for “tree regular expressions”).
- Stanford Phrasal: A Phrase-Based Translation System - Stanford Phrasal is a state-of-the-art statistical phrase-based machine translation system, written in Java.
- Stanford English Tokenizer - A tokenizer divides text into a sequence of tokens, which roughly correspond to “words”.
- Stanford Tokens Regex - A framework for defining regular expressions over sequences of tokens.
- Stanford Temporal Tagger - SUTime is a library for recognizing and normalizing time expressions.
- Stanford SPIED - Learning entities from unlabeled text starting with seed sets using patterns in an iterative fashion
- Stanford Topic Modeling Toolbox - Topic modeling tools to social scientists and others who wish to perform analysis on datasets
- Twitter Text Java - A Java implementation of Twitter’s text processing library
- MALLET - A Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
- OpenNLP - a machine learning based toolkit for the processing of natural language text.
- LingPipe - A tool kit for processing text using computational linguistics.
- ClearTK - ClearTK provides a framework for developing statistical natural language processing (NLP) components in Java and is built on top of Apache UIMA.
- Apache cTAKES - Apache clinical Text Analysis and Knowledge Extraction System (cTAKES) is an open-source natural language processing system for information extraction from electronic medical record clinical free-text.
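Several entries above describe tokenization, the step that splits raw text into a sequence of tokens roughly corresponding to words. The sketch below illustrates the idea with Python's standard `re` module. It is not how any of the listed Stanford tools work internally; the pattern and the `tokenize` helper are assumptions made for illustration, and the pattern only handles plain ASCII apostrophes.

```python
import re

# Hypothetical pattern (an assumption, not taken from any listed tool):
# a run of word characters with an optional internal apostrophe part,
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def tokenize(text):
    """Split raw text into a list of word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("Don't panic: it's only text."))
```

Real tokenizers deal with many more cases (URLs, abbreviations, languages without whitespace between words), which is why dedicated libraries such as the ones listed here exist.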
General-Purpose Machine Learning
- JSAT - Numerous machine learning algorithms for classification, regression, and clustering.
- MLlib in Apache Spark - Distributed machine learning library in Spark
- Mahout - Distributed machine learning
- Stanford Classifier - A classifier is a machine learning tool that will take data items and place them into one of k classes.
- Weka - Weka is a collection of machine learning algorithms for data mining tasks
- Meka - An open source implementation of methods for multi-label classification and evaluation (extension to Weka).
- ORYX - Simple real-time large-scale machine learning infrastructure.
- H2O - ML engine that supports distributed learning on data stored in HDFS.
- WalnutiQ - object oriented model of the human brain
- ELKI - Java toolkit for data mining. (unsupervised: clustering, outlier detection etc.)
- Neuroph - Neuroph is lightweight Java neural network framework
- java-deeplearning - Distributed Deep Learning Platform for Java, Clojure, Scala
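The list above describes a classifier as a tool that takes data items and places them into one of k classes. As a minimal sketch of that idea, the following uses a nearest-centroid rule in plain Python. This technique is chosen only because it is short; it is not the method used by the Stanford Classifier or any other library listed here, and the function names and toy data are hypothetical.

```python
from collections import defaultdict


def fit_centroids(points, labels):
    """Return the mean feature vector (centroid) of each class."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    return {
        label: [sum(column) / len(members) for column in zip(*members)]
        for label, members in groups.items()
    }


def classify(centroids, point):
    """Assign the point to the class whose centroid is closest."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(centroids, key=lambda label: squared_distance(centroids[label], point))


# Toy training data (made up for illustration): two classes in two dimensions.
train_x = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 7.5)]
train_y = ["small", "small", "large", "large"]

model = fit_centroids(train_x, train_y)
print(classify(model, (2.0, 1.0)))
print(classify(model, (7.0, 9.0)))
```

With k classes the model stores k centroids, so prediction costs one distance computation per class.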
Speech Recognition
- CMU Sphinx - Open-source toolkit for speech recognition, built on a speech recognition library written purely in Java.
Data Analysis / Data Visualization
- Hadoop - Hadoop/HDFS
- Spark - Spark is a fast and general engine for large-scale data processing.
- Impala - Real-time Query for Hadoop
Julia
General-Purpose Machine Learning
- PGM - A Julia framework for probabilistic graphical models.
- DA - Julia package for Regularized Discriminant Analysis
- Regression - Algorithms for regression analysis (e.g. linear regression and logistic regression)
- Local Regression - Local regression, so smooooth!
- Naive Bayes - Simple Naive Bayes implementation in Julia
- Mixed Models - A Julia package for fitting (statistical) mixed-effects models
- Simple MCMC - Basic MCMC sampler implemented in Julia
- Distance - Julia module for Distance evaluation
- Decision Tree - Decision Tree Classifier and Regressor
- Neural - A neural network in Julia
- MCMC - MCMC tools for Julia
- GLM - Generalized linear models in Julia
- Online Learning
- GLMNet - Julia wrapper for fitting Lasso/ElasticNet GLM models using glmnet
- Clustering - Basic functions for clustering data: k-means, dp-means, etc.
- SVM - SVMs for Julia
- Kernel Density - Kernel density estimators for Julia
- Dimensionality Reduction - Methods for dimensionality reduction
- NMF - A Julia package for non-negative matrix factorization
- ANN - Julia artificial neural networks
Natural Language Processing
- Topic Models - TopicModels for Julia
- Text Analysis - Julia package for text analysis
Data Analysis / Data Visualization
- Graph Layout - Graph layout algorithms in pure Julia
- Data Frames Meta - Metaprogramming tools for DataFrames
- Julia Data - library for working with tabular data in Julia
- Data Read - Read files from Stata, SAS, and SPSS
- Hypothesis Tests - Hypothesis tests for Julia
- Gadfly - Crafty statistical graphics for Julia.
- Stats - Statistical tests for Julia
- RDataSets - Julia package for loading many of the data sets available in R
- DataFrames - library for working with tabular data in Julia
- Distributions - A Julia package for probability distributions and associated functions.
- Data Arrays - Data structures that allow missing values
- Time Series - Time series toolkit for Julia
- Sampling - Basic sampling algorithms for Julia
Misc Stuff / Presentations
- DSP - Digital Signal Processing (filtering, periodograms, spectrograms, window functions).
- JuliaCon Presentations - Presentations for JuliaCon
- SignalProcessing - Signal Processing tools for Julia
- Images - An image library for Julia
Lua
General-Purpose Machine Learning
- Torch7
- cephes - Cephes mathematical functions library, wrapped for Torch. Provides and wraps the 180+ special mathematical functions from the Cephes mathematical library, developed by Stephen L. Moshier. It is used, among many other places, at the heart of SciPy.
- graph - Graph package for Torch
- randomkit - Numpy’s randomkit, wrapped for Torch
- signal - A signal processing toolbox for Torch-7. FFT, DCT, Hilbert, cepstrums, stft
- nn - Neural Network package for Torch
- nngraph - This package provides graphical computation for nn library in Torch7.
- nnx - A completely unstable and experimental package that extends Torch’s builtin nn library
- optim - An optimization library for Torch. SGD, Adagrad, Conjugate-Gradient, LBFGS, RProp and more.
- unsup - A package for unsupervised learning in Torch. Provides modules that are compatible with nn (LinearPsd, ConvPsd, AutoEncoder, …), and self-contained algorithms (k-means, PCA).
- manifold - A package to manipulate manifolds
- svm - Torch-SVM library
- lbfgs - FFI Wrapper for liblbfgs
- vowpalwabbit - An old vowpalwabbit interface to torch.
- OpenGM - OpenGM is a C++ library for graphical modeling, and inference. The Lua bindings provide a simple way of describing graphs, from Lua, and then optimizing them with OpenGM.
- spaghetti - Spaghetti (sparse linear) module for torch7 by @MichaelMathieu
- LuaSHKit - A Lua wrapper around the locality-sensitive hashing library SHKit
- kernel smoothing - KNN, kernel-weighted average, local linear regression smoothers
- cutorch - Torch CUDA Implementation
- cunn - Torch CUDA Neural Network Implementation
- imgraph - An image/graph library for Torch. This package provides routines to construct graphs on images, segment them, build trees out of them, and convert them back to images.
- videograph - A video/graph library for Torch. This package provides routines to construct graphs on videos, segment them, build trees out of them, and convert them back to videos.
- saliency - code and tools around integral images. A library for finding interest points based on fast integral histograms.
- stitch - allows us to use hugin to stitch images and apply same stitching to a video sequence
- sfm - A bundle adjustment/structure from motion package
- fex - A package for feature extraction in Torch. Provides SIFT and dSIFT modules.
- OverFeat - A state-of-the-art generic dense feature extractor
- Numeric Lua
- Lunatic Python
- SciLua
- Lua - Numerical Algorithms
- Lunum
Demos and Scripts
- Core torch7 demos repository.
- linear-regression, logistic-regression
- face detector (training and detection as separate demos)
- mst-based-segmenter
- train-a-digit-classifier
- train-autoencoder
- optical flow demo
- train-on-housenumbers
- train-on-cifar
- tracking with deep nets
- kinect demo
- filter-bank visualization
- saliency-networks
- Training a Convnet for the Galaxy-Zoo Kaggle challenge (CUDA demo)
- Music Tagging - Music Tagging scripts for torch7
- torch-datasets - Scripts to load several popular datasets including:
- BSR 500
- CIFAR-10
- COIL
- Street View House Numbers
- MNIST
- NORB
- Atari2600 - Scripts to generate a dataset with static frames from the Arcade Learning Environment
Matlab
Computer Vision
- Contourlets - MATLAB source code that implements the contourlet transform and its utility functions.
- Shearlets - MATLAB code for shearlet transform
- Curvelets - The Curvelet transform is a higher dimensional generalization of the Wavelet transform designed to represent images at different scales and different angles.
- Bandlets - MATLAB code for bandlet transform
Natural Language Processing
- NLP - An NLP library for Matlab
General-Purpose Machine Learning
- Training a deep autoencoder or a classifier on MNIST digits - Code for training a deep autoencoder or a classifier on MNIST digits. [DEEP LEARNING]
- t-Distributed Stochastic Neighbor Embedding - t-Distributed Stochastic Neighbor Embedding (t-SNE) is a (prize-winning) technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets.
- Spider - The spider is intended to be a complete object-oriented environment for machine learning in Matlab.
- LibSVM - A Library for Support Vector Machines
- LibLinear - A Library for Large Linear Classification
- Machine Learning Module - Class on machine learning with PDFs, lectures, and code
- Caffe - A deep learning framework developed with cleanliness, readability, and speed in mind.
- Pattern Recognition Toolbox - A complete object-oriented environment for machine learning in Matlab.
- Pattern Recognition and Machine Learning - This package contains the matlab implementation of the algorithms described in the book Pattern Recognition and Machine Learning by C. Bishop.
Data Analysis / Data Visualization
- matlab_bgl - MatlabBGL is a Matlab package for working with graphs.
- gaimc - Efficient pure-Matlab implementations of graph algorithms to complement MatlabBGL’s mex functions.
Python
Computer Vision
- SimpleCV - An open source computer vision framework that gives access to several high-powered computer vision libraries, such as OpenCV. Written in Python and runs on Mac, Windows, and Ubuntu Linux.
Natural Language Processing
- NLTK - A leading platform for building Python programs to work with human language data.
- Pattern - A web mining module for the Python programming language. It has tools for natural language processing, machine learning, among others.
- Quepy - A python framework to transform natural language questions to queries in a database query language
- TextBlob - Providing a consistent API for diving into common natural language processing (NLP) tasks. Stands on the giant shoulders of NLTK and Pattern, and plays nicely with both.
- YAlign - A sentence aligner, a friendly tool for extracting parallel sentences from comparable corpora.
- jieba - Chinese Words Segmentation Utilities.
- SnowNLP - A library for processing Chinese text.
- loso - Another Chinese segmentation library.
- genius - A Chinese word segmenter based on conditional random fields.
- nut - Natural language Understanding Toolkit
- Rosetta - Text processing tools and wrappers (e.g. Vowpal Wabbit)
General-Purpose Machine Learning
- Bayesian Methods for Hackers - Book/iPython notebooks on Probabilistic Programming in Python
- Featureforge - A set of tools for creating and testing machine learning features, with a scikit-learn compatible API
- MLlib in Apache Spark - Distributed machine learning library in Spark
- scikit-learn - A Python module for machine learning built on top of SciPy.
- SimpleAI - Python implementation of many of the artificial intelligence algorithms described in the book “Artificial Intelligence, a Modern Approach”. It focuses on providing an easy to use, well documented and tested library.
- astroML - Machine Learning and Data Mining for Astronomy.
- graphlab-create - A library with various machine learning models (regression, clustering, recommender systems, graph analytics, etc.) implemented on top of a disk-backed DataFrame.
- BigML - A library that contacts external servers.
- pattern - Web mining module for Python.
- NuPIC - Numenta Platform for Intelligent Computing.
- Pylearn2 - A Machine Learning library based on Theano.
- hebel - GPU-Accelerated Deep Learning Library in Python.
- gensim - Topic Modelling for Humans.
- PyBrain - Another Python Machine Learning Library.
- Crab - A flexible, fast recommender engine.
- python-recsys - A Python library for implementing a Recommender System.
- Think Bayes - Book on Bayesian Analysis
- Restricted Boltzmann Machines - Restricted Boltzmann Machines in Python. [DEEP LEARNING]
- Bolt - Bolt Online Learning Toolbox
- CoverTree - Python implementation of cover trees, near-drop-in replacement for scipy.spatial.kdtree
- nilearn - Machine learning for NeuroImaging in Python
- Shogun - The Shogun Machine Learning Toolbox
- Pyevolve - Genetic algorithm framework.
- Caffe - A deep learning framework developed with cleanliness, readability, and speed in mind.
- breze - Theano based library for deep and recurrent neural networks
- pyhsmm - library for approximate unsupervised inference in Bayesian Hidden Markov Models (HMMs) and explicit-duration Hidden semi-Markov Models (HSMMs), focusing on the Bayesian Nonparametric extensions, the HDP-HMM and HDP-HSMM, mostly with weak-limit approximations.
- mrjob - A library that lets Python programs run on Hadoop.
- SKLL - A wrapper around scikit-learn that makes it simpler to conduct experiments.
- neurolab - https://code.google.com/p/neurolab/
- Spearmint - Spearmint is a package to perform Bayesian optimization according to the algorithms outlined in the paper: Practical Bayesian Optimization of Machine Learning Algorithms. Jasper Snoek, Hugo Larochelle and Ryan P. Adams. Advances in Neural Information Processing Systems, 2012.
Data Analysis / Data Visualization
- SciPy - A Python-based ecosystem of open-source software for mathematics, science, and engineering.
- NumPy - A fundamental package for scientific computing with Python.
- Numba - Python JIT (just in time) compiler to LLVM aimed at scientific Python by the developers of Cython and NumPy.
- NetworkX - A high-productivity software for complex networks.
- Pandas - A library providing high-performance, easy-to-use data structures and data analysis tools.
- Open Mining - Business Intelligence (BI) in Python (Pandas web interface)
- PyMC - Markov Chain Monte Carlo sampling toolkit.
- zipline - A Pythonic algorithmic trading library.
- PyDy - Short for Python Dynamics, used to assist with workflow in the modeling of dynamic motion based around NumPy, SciPy, IPython, and matplotlib.
- SymPy - A Python library for symbolic mathematics.
- statsmodels - Statistical modeling and econometrics in Python.
- astropy - A community Python library for Astronomy.
- matplotlib - A Python 2D plotting library.
- bokeh - Interactive Web Plotting for Python.
- plotly - Collaborative web plotting for Python and matplotlib.
- vincent - A Python to Vega translator.
- d3py - A plotting library for Python, based on D3.js.
- ggplot - Same API as ggplot2 for R.
- Kartograph.py - Rendering beautiful SVG maps in Python.
- pygal - A Python SVG Charts Creator.
- PyQtGraph - A pure-python graphics and GUI library built on PyQt4 / PySide and NumPy.
- pycascading
- Petrel - Tools for writing, submitting, debugging, and monitoring Storm topologies in pure Python.
- Blaze - NumPy and Pandas interface to Big Data.
- emcee - The Python ensemble sampling toolkit for affine-invariant MCMC.
- windML - A Python Framework for Wind Energy Analysis and Prediction
- vispy - GPU-based high-performance interactive OpenGL 2D/3D data visualization library
Misc Scripts / iPython Notebooks / Codebases
- pattern_classification
- thinking stats 2
- hyperopt
- numpic
- 2012-paper-diginorm
- ipython-notebooks
- decision-weights
- Sarah Palin LDA - Topic Modeling the Sarah Palin emails.
- Diffusion Segmentation - A collection of image segmentation algorithms based on diffusion methods
- Scipy Tutorials - SciPy tutorials. This is outdated, check out scipy-lecture-notes
- Crab - A recommendation engine library for Python
- BayesPy - Bayesian Inference Tools in Python
- scikit-learn tutorials - Series of notebooks for learning scikit-learn
- sentiment-analyzer - Tweets Sentiment Analyzer
- sentiment_classifier - Sentiment classifier using word sense disambiguation.
- group-lasso - Some experiments with the coordinate descent algorithm used in the (Sparse) Group Lasso model
- jProcessing - Kanji / Hiragana / Katakana to Romaji Converter. Edict Dictionary & parallel sentences Search. Sentence Similarity between two JP Sentences. Sentiment Analysis of Japanese Text. Run Cabocha (ISO-8859-1 configured) in Python.
- mne-python-notebooks - IPython notebooks for EEG/MEG data processing using mne-python
- pandas cookbook - Recipes for using Python’s pandas library
- climin - Optimization library focused on machine learning, pythonic implementations of gradient descent, LBFGS, rmsprop, adadelta and others
- Allen Downey’s Data Science Course - Code for Data Science at Olin College, Spring 2014.
- Allen Downey’s Think Bayes Code - Code repository for Think Bayes.
- Allen Downey’s Think Complexity Code - Code for Allen Downey’s book Think Complexity.
- Allen Downey’s Think OS Code - Text and supporting code for Think OS: A Brief Introduction to Operating Systems.
Kaggle Competition Source Code
- wiki challenge - An implementation of Dell Zhang’s solution to Wikipedia’s Participation Challenge on Kaggle
- kaggle insults - Kaggle Submission for “Detecting Insults in Social Commentary”
- kaggle_acquire-valued-shoppers-challenge - Code for the Kaggle acquire valued shoppers challenge
- kaggle-cifar - Code for the CIFAR-10 competition at Kaggle, uses cuda-convnet
- kaggle-blackbox - Deep learning made easy
- kaggle-accelerometer - Code for Accelerometer Biometric Competition at Kaggle
- kaggle-advertised-salaries - Predicting job salaries from ads - a Kaggle competition
- kaggle amazon - Amazon access control challenge
- kaggle-bestbuy_big - Code for the Best Buy competition at Kaggle
- kaggle-bestbuy_small
- Kaggle Dogs vs. Cats - Code for Kaggle Dogs vs. Cats competition
- Kaggle Galaxy Challenge - Winning solution for the Galaxy Challenge on Kaggle
- Kaggle Gender - A Kaggle competition: discriminate gender based on handwriting
- Kaggle Merck - Merck challenge at Kaggle
- Kaggle Stackoverflow - Predicting closed questions on Stack Overflow
- wine-quality - Predicting wine quality
Ruby
Natural Language Processing
- Treat - Text REtrieval and Annotation Toolkit, definitely the most comprehensive toolkit I’ve encountered so far for Ruby
- Ruby Linguistics - Linguistics is a framework for building linguistic utilities for Ruby objects in any language. It includes a generic language-independent front end, a module for mapping language codes into language names, and a module which contains various English-language utilities.
- Stemmer - Expose libstemmer_c to Ruby
- Ruby Wordnet - This library is a Ruby interface to WordNet
- Raspell - raspell is an interface binding to the Aspell spell checker for Ruby
- UEA Stemmer - Ruby port of UEALite Stemmer - a conservative stemmer for search and indexing
- Twitter-text-rb - A library that does auto linking and extraction of usernames, lists and hashtags in tweets
General-Purpose Machine Learning
- Ruby Machine Learning - Some Machine Learning algorithms, implemented in Ruby
- Machine Learning Ruby
- jRuby Mahout - JRuby Mahout is a gem that unleashes the power of Apache Mahout in the world of JRuby.
- CardMagic-Classifier - A general classifier module to allow Bayesian and other types of classifications.
- Neural Networks and Deep Learning - Code samples for my book “Neural Networks and Deep Learning” [DEEP LEARNING]
Data Analysis / Data Visualization
- rsruby - Ruby - R bridge
- data-visualization-ruby - Source code and supporting content for my Ruby Manor presentation on Data Visualisation with Ruby
- ruby-plot - gnuplot wrapper for Ruby, especially for plotting ROC curves into SVG files
- plot-rb - A plotting library in Ruby built on top of Vega and D3.
- scruffy - A beautiful graphing toolkit for Ruby
- SciRuby
- Glean - A data management tool for humans
- Bioruby
- Arel
Misc
- Big Data For Chimps
- Listof - Community based data collection, packed in gem. Get list of pretty much anything (stop words, countries, non words) in txt, json or hash. Demo/Search for a list
R
General-Purpose Machine Learning
- h2o - A framework for fast, parallel, and distributed machine learning algorithms at scale – Deeplearning, Random forests, GBM, KMeans, PCA, GLM
- Clever Algorithms For Machine Learning
- Machine Learning For Hackers
- nnet - nnet: Feed-forward Neural Networks and Multinomial Log-Linear Models
- rpart - rpart: Recursive Partitioning and Regression Trees
- randomForest - randomForest: Breiman and Cutler’s random forests for classification and regression
- lasso2 - lasso2: L1 constrained estimation aka ‘lasso’
- gbm - gbm: Generalized Boosted Regression Models
- e1071 - e1071: Misc Functions of the Department of Statistics (e1071), TU Wien
- tgp - tgp: Bayesian treed Gaussian process models
- rgp - rgp: R genetic programming framework
- arules - arules: Mining Association Rules and Frequent Itemsets
- frbs - frbs: Fuzzy Rule-based Systems for Classification and Regression Tasks
- rattle - rattle: Graphical user interface for data mining in R
- ahaz - ahaz: Regularization for semiparametric additive hazards regression
- bigrf - bigrf: Big Random Forests: Classification and Regression Forests for Large Data Sets
- bigRR - bigRR: Generalized Ridge Regression (with special advantage for p >> n cases)
- bmrm - bmrm: Bundle Methods for Regularized Risk Minimization Package
- Boruta - Boruta: A wrapper algorithm for all-relevant feature selection
- bst - bst: Gradient Boosting
- C50 - C50: C5.0 Decision Trees and Rule-Based Models
- caret - caret: Classification and Regression Training
- CORElearn - CORElearn: Classification, regression, feature evaluation and ordinal evaluation
- CoxBoost - CoxBoost: Cox models by likelihood based boosting for a single survival endpoint or competing risks
- Cubist - Cubist: Rule- and Instance-Based Regression Modeling
- earth - earth: Multivariate Adaptive Regression Spline Models
- elasticnet - elasticnet: Elastic-Net for Sparse Estimation and Sparse PCA
- ElemStatLearn - ElemStatLearn: Data sets, functions and examples from the book: “The Elements of Statistical Learning, Data Mining, Inference, and Prediction” by Trevor Hastie, Robert Tibshirani and Jerome Friedman
- evtree - evtree: Evolutionary Learning of Globally Optimal Trees
- GAMBoost - GAMBoost: Generalized linear and additive models by likelihood based boosting
- gamboostLSS - gamboostLSS: Boosting Methods for GAMLSS
- glmnet - glmnet: Lasso and elastic-net regularized generalized linear models
- glmpath - glmpath: L1 Regularization Path for Generalized Linear Models and Cox Proportional Hazards Model
- GMMBoost - GMMBoost: Likelihood-based Boosting for Generalized mixed models
- grplasso - grplasso: Fitting user specified models with Group Lasso penalty
- grpreg - grpreg: Regularization paths for regression models with grouped covariates
- hda - hda: Heteroscedastic Discriminant Analysis
- ipred - ipred: Improved Predictors
- kernlab - kernlab: Kernel-based Machine Learning Lab
- klaR - klaR: Classification and visualization
- lars - lars: Least Angle Regression, Lasso and Forward Stagewise
- LiblineaR - LiblineaR: Linear Predictive Models Based On The Liblinear C/C++ Library
- LogicReg - LogicReg: Logic Regression
- maptree - maptree: Mapping, pruning, and graphing tree models
- mboost - mboost: Model-Based Boosting
- mvpart - mvpart: Multivariate partitioning
- ncvreg - ncvreg: Regularization paths for SCAD- and MCP-penalized regression models
- oblique.tree - oblique.tree: Oblique Trees for Classification Data
- pamr - pamr: Pam: prediction analysis for microarrays
- party - party: A Laboratory for Recursive Partytioning
- partykit - partykit: A Toolkit for Recursive Partytioning
- penalized - penalized: L1 (lasso and fused lasso) and L2 (ridge) penalized estimation in GLMs and in the Cox model
- penalizedLDA - penalizedLDA: Penalized classification using Fisher’s linear discriminant
- penalizedSVM - penalizedSVM: Feature Selection SVM using penalty functions
- quantregForest - quantregForest: Quantile Regression Forests
- randomForestSRC - randomForestSRC: Random Forests for Survival, Regression and Classification (RF-SRC)
- rda - rda: Shrunken Centroids Regularized Discriminant Analysis
- rdetools - rdetools: Relevant Dimension Estimation (RDE) in Feature Spaces
- REEMtree - REEMtree: Regression Trees with Random Effects for Longitudinal (Panel) Data
- relaxo - relaxo: Relaxed Lasso
- rgenoud - rgenoud: R version of GENetic Optimization Using Derivatives
- Rmalschains - Rmalschains: Continuous Optimization using Memetic Algorithms with Local Search Chains (MA-LS-Chains) in R
- rminer - rminer: Simpler use of data mining methods (e.g. NN and SVM) in classification and regression
- ROCR - ROCR: Visualizing the performance of scoring classifiers
- RoughSets - RoughSets: Data Analysis Using Rough Set and Fuzzy Rough Set Theories
- RPMM - RPMM: Recursively Partitioned Mixture Model
- RSNNS - RSNNS: Neural Networks in R using the Stuttgart Neural Network Simulator (SNNS)
- RWeka - RWeka: R/Weka interface
- RXshrink - RXshrink: Maximum Likelihood Shrinkage via Generalized Ridge or Least Angle Regression
- sda - sda: Shrinkage Discriminant Analysis and CAT Score Variable Selection
- SDDA - SDDA: Stepwise Diagonal Discriminant Analysis
- svmpath - svmpath: svmpath: the SVM Path algorithm
- tree - tree: Classification and regression trees
- varSelRF - varSelRF: Variable selection using random forests
- caret - Unified interface to ~150 ML algorithms in R.
- SuperLearner and subsemble - Multi-algorithm ensemble learning packages.
- Introduction to Statistical Learning
Data Analysis / Data Visualization
- Learning Statistics Using R
- ggplot2 - A data visualization package based on the grammar of graphics.
Scala
Natural Language Processing
- ScalaNLP - ScalaNLP is a suite of machine learning and numerical computing libraries.
- Breeze - Breeze is a numerical processing library for Scala.
- Chalk - Chalk is a natural language processing library.
- FACTORIE - FACTORIE is a toolkit for deployable probabilistic modeling, implemented as a software library in Scala. It provides its users with a succinct language for creating relational factor graphs, estimating parameters and performing inference.
Data Analysis / Data Visualization
- MLlib in Apache Spark - Distributed machine learning library in Spark
- Scalding - A Scala API for Cascading
- Summing Bird - Streaming MapReduce with Scalding and Storm
- Algebird - Abstract Algebra for Scala
- xerial - Data management utilities for Scala
- simmer - Reduce your data. A unix filter for algebird-powered aggregation.
- PredictionIO - PredictionIO, a machine learning server for software developers and data engineers.
- BIDMat - CPU and GPU-accelerated matrix library intended to support large-scale exploratory data analysis.
General-Purpose Machine Learning
- Conjecture - Scalable Machine Learning in Scalding
- brushfire - decision trees for scalding
- ganitha - scalding powered machine learning
- adam - A genomics processing engine and specialized file format built using Apache Avro, Apache Spark and Parquet. Apache 2 licensed.
- bioscala - Bioinformatics for the Scala programming language
- BIDMach - CPU and GPU-accelerated Machine Learning Library.
- Figaro - a Scala library for constructing probabilistic models.
- h2o-sparkling - H2O and Spark interoperability.