September 14, 2016 at 8:52 am #23653
See details in the link http://www.kdnuggets.com/2016/09/poll-algorithms-used-data-scientists.html
The latest KDnuggets Poll asked:
"Which methods/algorithms did you use in the past 12 months for an actual Data Science-related application?"
Here are the results, based on 844 voters. The top 10 algorithms and their share of voters are:
[Image: top algorithms used]
The average respondent used 8.1 algorithms, a big increase over the similar 2011 poll.
Comparing with the 2011 poll, "Algorithms for data analysis / data mining", we note that the top methods are still Regression, Clustering, Decision Trees/Rules, and Visualization. The biggest relative increases, measured by (pct2016 / pct2011 − 1), are for:
- Boosting, up 40% to 32.8% share in 2016 from 23.5% share in 2011
- Text Mining, up 30% to 35.9% from 27.7%
- Visualization, up 27% to 48.7% from 38.3%
- Time series/Sequence analysis, up 25% to 37.0% from 29.6%
- Anomaly/Deviation detection, up 19% to 19.5% from 16.4%
- Ensemble methods, up 19% to 33.6% from 28.3%
- SVM, up 18% to 33.6% from 28.6%
- Regression, up 16% to 67.1% from 57.9%
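The relative-increase formula above is simple arithmetic; a quick sketch (function name is mine, figures taken from the list above) shows how the rounded percentages fall out:

```python
def relative_change(pct_2016, pct_2011):
    """Relative increase (pct2016 / pct2011 - 1), as a fraction."""
    return pct_2016 / pct_2011 - 1

# Boosting: 23.5% share in 2011 -> 32.8% share in 2016
print(round(relative_change(32.8, 23.5) * 100))  # -> 40

# Text Mining: 27.7% -> 35.9%
print(round(relative_change(35.9, 27.7) * 100))  # -> 30
```

Note this is relative change in share, not the change in percentage points, which is why Boosting can be "up 40%" while gaining only about 9 points of share.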
The most popular among the new options added in 2016 are:
- K-nearest neighbors, 46% share
- PCA, 43%
- Random Forests, 38%
- Optimization, 24%
- Neural networks – Deep Learning, 19%
- Singular Value Decomposition, 16%
The biggest declines are for:
- Association rules, down 47% to 15.3% from 28.6%
- Uplift modeling, down 36% to 3.1% from 4.8% (that is a surprise, given the strong results published)
- Factor Analysis, down 24% to 14.2% from 18.6%
- Survival Analysis, down 15% to 7.9% from 9.3%
[Image: Algorithm use bias by Employment]
(and there is more… see the original link)
I am really surprised that regression ranks so high. Maybe that speaks to the large number of projects that are not big data, or to simpler problems that are solved with a single-line solution. (poke, poke 😉 …).
I was also surprised that K-Nearest Neighbors has risen in popularity. Pros: it is very local and non-linear, and it uses all the training vectors, not just support vectors. Cons: the curse of dimensionality; it is tough to scale with the number of input dimensions without heavily adding noise. I should check the literature to see whether its use is concentrated in certain problem domains, or whether there are new methods for indexing, dimensionality reduction, or weighting.
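To make the "uses all the vectors" point concrete, here is a minimal brute-force KNN classifier in plain Python (toy data and function names are my own illustration, not from the poll). Every prediction scans the full training set, which is exactly the scaling concern above:

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    train: list of (feature_vector, label) pairs.
    Brute force: sorts all training points by distance to the query.
    """
    neighbors = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy 2-D data: two well-separated clusters
train = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
         ((1.0, 1.0), "b"), ((0.9, 1.1), "b"), ((1.1, 0.9), "b")]
print(knn_predict(train, (0.15, 0.1)))  # -> a
print(knn_predict(train, (1.0, 0.95)))  # -> b
```

In practice people mitigate the dimensionality and scan costs with tree or graph indexes (KD-trees, ball trees, approximate nearest neighbors) or by reducing dimensions first, e.g. with PCA.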
What are your thoughts? Respectful debate is always invited…