Saturday, April 9, 2016

SAS improved over the last 10 years, including offering a better sort, hash tables,

http://www.datasciencecentral.com/forum/topics/which-one-is-best-r-sas-or-python-for-data-science

Which one is best: R, SAS or Python, for data science?

This is of course the wrong question. I use R because I'm more familiar with it than with SAS or Python, and I use it mostly for graphics and visualization. Though things have changed, I consider R mostly a tool for ad-hoc analysis or EDA (exploratory data analysis) rather than a component of enterprise analytic applications or production code running in batch mode or accessed via APIs. Is there an enterprise version of R? Also, R used to be limited by the amount of RAM; I'm not sure how easy it is to get around this limitation. RHadoop is R for Hadoop. I suppose that's a possible solution for big data, though I'm not familiar with the product.
Picture from Kunal Jain's blog
I used SAS a while back, and I know it has significantly improved over the last 10 years, including offering a better sort, hash tables, and very fast SAS for really big data. If your client uses SAS, SAS is a great option. You also get support with SAS, more than with R.
My favorite would be Python, but since I code my own applications (as opposed to working with a team), I still use Perl for its automated memory allocation, nice string processing features (though many languages now do as well as Perl with NLP and regular expressions), and high flexibility. Clear, scalable, transportable code is more important than the choice of the language. But I definitely like programming (and scripting) languages more than R or SAS, because I develop proprietary techniques and don't like black boxes (you never know when they don't work or what kind of data makes them fail, which is not an issue if you write your own code). Also, speed of execution (fast C versus relatively slow Perl, R or Python) is not a big issue anymore with big data, as most of the computer time is spent not in running algorithms (if the algorithms are well optimized) but in data transfers.
There are also many other tools for data mining, for instance RapidMiner or Mahout (Java code for machine learning). What about Excel? I actually use Perl to summarize data (big data processing), R for graphics, and Excel as the top layer.
What about you?


Replies to This Discussion

The Data Science function is not fully formal yet in many large F100 / F1000 enterprises. A few roles are starting to appear in both the business and IT domains of Data Science. Hence, expecting to find a comprehensive enterprise-class Data Science stack will be really hard.
There's a lot of hype and glamour surrounding Big Data, where both commercial and enterprise Hadoop stacks have been deployed and are playing a successful role. Even the Big 3 (IBM, Microsoft & Oracle) are working on their flavors of Hadoop, either organically or through partnerships. These Hadoop stacks provide conveniences for languages, connectors, APIs, storage management, DR, high availability, high performance, real-time vs batch processing SLAs, etc. But there's little or no emphasis on analytics / statistical modules bundled with them. The few exceptions are Mahout and MapR, but these are mostly standards-based open source extensions.
As an Enterprise Architect, I prefer R (RStudio) for EDA, with a more collaborative context on Git / SVN, primarily for Data Scientists / Statisticians / Data Modellers. Even though I have yet to do this, we are actively looking into cloud-based Hadoop stacks that have native support for running R at large scale on Big Data sets. From an analytics consumption perspective, we are exploring Tableau (which can really augment R graphs well) and our current landscape (MicroStrategy & Microsoft BI - Fancy Excel).
Recently I discovered the language Julia (http://julialang.org) and it looks quite promising. Does anyone have any experience with Julia? What is your impression?
Thanks,
  
Here's an interesting comment. The full (very long) version can be found here. Not sure who wrote this, but it's not me.
Over the past two years, my scientific computing toolbox has been steadily homogenizing. Around 2010 or 2011, my toolbox looked something like this:
  • Ruby for text processing and miscellaneous scripting;
  • Ruby on Rails/JavaScript for web development;
  • Python/Numpy (mostly) and MATLAB (occasionally) for numerical computing;
  • MATLAB for neuroimaging data analysis;
  • R for statistical analysis;
  • R for plotting and visualization;
  • Occasional excursions into other languages/environments for other stuff.
In 2013, my toolbox looks like this:
  • Python for text processing and miscellaneous scripting;
  • Ruby on Rails/JavaScript for web development, except for an occasional date with Django or Flask (Python frameworks);
  • Python (NumPy/SciPy) for numerical computing;
  • Python (Neurosynth, NiPy etc.) for neuroimaging data analysis;
  • Python (NumPy/SciPy/pandas/statsmodels) for statistical analysis;
  • Python (MatPlotLib) for plotting and visualization, except for web-based visualizations (JavaScript/d3.js);
  • Python (scikit-learn) for machine learning;
  • Excursions into other languages have dropped markedly.
I want to agree with Vincent's quote: a lot of heavy-duty processing tools have been developed for Python recently, and there are significant gains to be made from a single development environment in terms of language familiarity and ease of transferring data types. Heck, I was installing the Python scipy library the other day, and to get the install to run correctly I had to first install a number of Fortran development libraries.
SAS will probably always have a place in legacy systems and entrenched analytics platforms, but freshly developed analytics platforms and toolkits are likely to be picking up steam with Python. Especially since Python can also be used for general-purpose programming and even web development, it's no wonder that so many data analysts are picking up Python skills.
Another entry in the language wars. It's not going to end, and there is no one best language. This is a helpful article that has 330 reader comments as of this morning and spans from Fortran to the aspirant Julia:
No one language is supreme. Use what you need to use to get the job done. If you are a manager in an enterprise, do an analysis, and select the tools that best facilitate meeting your goals.
A thought I always keep in mind is that the time it takes to learn a new tool is time I cannot be productive in the tools I know.
What about non-language tools like KNIME? Very useful and very awesome.
As with Perl for text manipulation and pulling files, there is supreme utility in Bash scripts. It's not just Python v R v SAS. You have to have a skill stack comprising many languages. I can't imagine a shop or individual that uses only Python, or only any other single language.
For the aspiring analysts that might be reading this and trying to make decisions about what to spend their time learning, the most important decision is the platform...go with Linux/Unix.
Another tool you may want to look at is Revolution R; it's a supported version of R that can handle larger data sets than standard R and RStudio. I also highly recommend Enthought Canopy for Python development in a Windows environment -- it too is a supported tool. Python is so much easier to manage in a Linux environment than in Windows, but Enthought makes it much easier in Windows. These are products that you'll pay for (there are free full versions available for academics and students).
Can I consider myself a data scientist if I don't use Python, JavaScript, Hadoop, Julia, C++?
Seems like these days employers want computer science grads who build websites/software developers rather than statistical, machine learning, data mining practitioners.
Clancy, there are many data scientists that do not work for an employer. Many employers (at least, their recruiters) have a very narrow vision of what data science is. Many highly successful data scientists are not employees, not even consultants, and have none of these skills (Python, JavaScript, Hadoop, Julia, C++) though they blend computer science, machine learning, statistics, software engineering, product development, computational marketing, and business hacking.
Might be a little late to add, but this adds a little more data-driven context to this topic. I performed a search on Google Trends for the following keywords: "r data science", "python data science" and "sas data science". The growth trends for R and Python are similar.
SAS programmer here.  I used it on small data in clinical trials.  I think SAS is into big data too with Grid computing.  There is a "for dummies" book on that subject published in 2013.  It is a booklet and a quick read.  Authors are Tim Bates, Tom Keefer, and Steven Sober.  Given this, SAS can do all the layers you mention in one product.  


https://normaldeviate.wordpress.com/2013/01/19/bootstrapping-and-subsampling-p

https://www.datacamp.com/community/tutorials/functions-in-r-a-tutorial

http://www.r-bloggers.com/an-introduction-to-sas-for-r-programmers/art-i/



Call R And Python From Base SAS | R-bloggers

www.r-bloggers.com/call-r-and-python-from-base-sas/

R bloggers
May 4, 2015 - Now, engineers at SAS have shared a method of calling R, Python and ... The first step is to install a Java class (shared on Github under an ... in R), which can then be used in a subsequent SAS PROC. ... A short while ago there was a discussion on linkedin about the use of SAS versus R for the enterprise.

DATA ANALYSIS: SAS vs. Python for data analysis

www.sasanalysis.com/2014/03/sas-vs-python-for-data-analysis.html

Mar 27, 2014 - SAS vs Python for data analysis. To perform data analysis efficiently, .... For SAS users - this might be a PROC GENMOD - which is already ...

Comparison with SAS — pandas 0.18.0 documentation

pandas.pydata.org/pandas-docs/stable/comparison_with_sas.html

Disk vs Memory; Data Interop ... Enter search terms or a module, class or function name. ... it is often convenient to specify it as a python dictionary, where the keys are the column ... SAS provides PROC IMPORT to read csv data into a data set.

Python Pandas vs SAS: Head to head data analysis (Part 1 ...

www.richardafolabi.com/.../python-pandas-vs-sas-head-to-head-data-ana...

Python Pandas vs SAS: Head to head data analysis (Part 1). October 16, 2015; 0 Comments; by @RichardAfolabi .... proc print data=work.newsalesemps;. run; ...

Tip: How to execute a Python script in SAS® Enterp... - SAS ...

https://communities.sas.com/t5/SAS.../Tip...Python...SAS.../223761

Running Python scripts within SAS Enterprise Miner enables you to use open source packages alongside the statistical, data mining and machine. ... This tip uses the Java class SASJavaExec.java and digitsdata_17_train.csv ... proc import.

Python - sasCommunity.org

www.sascommunity.org/planet/blog/category/python/

R, Python, and SAS: Getting Started with Linear Regression ...... *In SAS proc sql; select avg(weight) as Median from (select e.weight from class e, class d

SAS Vs. R Vs. Python - Which Analytics Tool Should I Learn?

www.analyticsvidhya.com › Big data

Mar 27, 2014 - SAS is easy to learn and provides easy option (PROC SQL) for .... I have seen this tool used in some university analytics classes along with R ...

Learning R Has Really Made Me Appreciate SAS ...

randyzwitch.com › Data Science

Jul 25, 2012 - proc summary data= hs0; var _numeric_; class prgtype; output .... Now that I know Python (and R), I can no longer use SAS at all, just too ...

Comparison of data analysis packages: R, Matlab, SciPy ...

https://brenocon.com/.../comparison-of-data-analysis-packages-r-matlab-...

Feb 23, 2009 - Python “immature”: matplotlib, numpy, and scipy are all separate libraries that ... SAS people complain about poor graphing capabilities. R vs. ...... now SAS Component Language), and proc SQL and proc IML are different from ..... (I took my first programming class in 1977, and have programmed in more ...

2015 SAS vs. R Survey Results - Burtch Works

www.burtchworks.com/2015/05/21/2015-sas-vs-r-survey-results/

May 21, 2015 - 2015 SAS vs. ... SAS, but I'm taking an online R class now to stay current. .... Python can be used as both an analytic tool and a programming language ... It comes with a Proc R that let you run R code within the SAS language.


Call R and Python from base SAS


May 4, 2015
By 

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Since 2009, it has been possible to call R from SAS programs. However, this integration requires IML, an add-on matrix-object language for SAS which isn't available with all SAS installations and is separate from the standard SAS PROC execution model.
Now, engineers at SAS have shared a method of calling R, Python and other open-source tools using the Java connectivity provided in base SAS. The first step is to install a Java class (shared on GitHub under an Apache license), SASJavaExec.jar. Then, you can use the SAS Java Object in the DATA step to call out to a separately authored R or Python script. You should write the script to generate output in CSV format (using, say, write.table in R), which can then be used in a subsequent SAS PROC. Here's the example given for calling an R script:
[Image not preserved: SAS code calling the R script]
This example calls the R script digitsdata_svm.R, which calls the ksvm function from the kernlab package to perform a Support Vector Machine analysis. The R script in turn creates an output file, predict_test_R.csv, of classifications from the SVM.
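The general pattern described above (call out to a separately authored script, let it write a CSV file, then read that file back in the calling program) can be sketched in Python. This is only an illustration of the hand-off under stated assumptions, not the SASJavaExec interface: the script body, the file names model_script.py and predictions.csv, and the function run_script_and_read_csv are all hypothetical.

```python
import csv
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for a separately authored analysis script.
# It writes its results as CSV to the path given as its first argument.
SCRIPT = """
import csv, sys
with open(sys.argv[1], "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "predicted_class"])
    writer.writerows([[1, "a"], [2, "b"]])
"""

def run_script_and_read_csv(workdir):
    """Run the external script, then load the CSV file it produced."""
    script_path = Path(workdir) / "model_script.py"   # hypothetical name
    out_path = Path(workdir) / "predictions.csv"      # hypothetical name
    script_path.write_text(SCRIPT)
    # check=True raises an error if the external script exits with a failure code.
    subprocess.run([sys.executable, str(script_path), str(out_path)], check=True)
    with open(out_path, newline="") as f:
        return list(csv.DictReader(f))

with tempfile.TemporaryDirectory() as tmp:
    predictions = run_script_and_read_csv(tmp)
```

The calling program never sees the script's in-memory objects; the CSV file is the whole contract between the two, which is why the script must be written to produce it.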
You can find more details on how to call R and Python from the SAS data step at the link below.
SAS Support Communities: Open Source Integration Using the Base SAS Java Object (via Patrick Hall)


An Introduction to SAS for R Programmers

April 4, 2013
By 
(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Joseph Rickert
Life decisions are usually much too complicated to be attributed to any single cause, but one important reason that I am here at Revolution today is that I ignored suggestions from well-meaning faculty back in graduate school to work more in SAS rather than doing everything in R. There was a heavy emphasis on SAS then: the faculty were worried about us getting jobs. This was before the rise of the data scientist, and the corporate model my professors had in mind was: PhD statisticians do statistics and everyone else writes SAS code. I would not be surprised if this is still the prevailing model in traditional Statistics programs. My bet is there are statisticians everywhere who have yet to come to grips with the concept of a “data scientist”.
Anyway, because of the great cosmic balance, or the bad karma that comes from ignoring well-intentioned advice, and the fact that there are quite a few companies out there that want to convert their SAS code to R, I occasionally get to look at SAS code. In the process of interviewing candidates for this kind of work, it struck me that many people coming to data science through the programming or machine learning routes have some R knowledge, as well as experience with Java, Python and C++, but have never worked with SAS.
To this group I offer the following very brief “Introduction to SAS for R Programmers”. So what is SAS exactly? Originally, SAS stood for “Statistical Analysis System”. Indeed, towards the beginning of his invaluable book, “R for SAS and SPSS Users”, Bob Muenchen characterizes SAS as a system for statistical computation that has five main components:
  1. A data management system for reading, transforming and organizing data (The Data Step)
  2. A large number of procedures (PROCs) for statistical analysis and graphics
  3. The Output Delivery System for extracting output from PROCs and customizing printed output
  4. A macro language for programming in the data step and calling PROCs
  5. The Interactive Matrix programming language (IML) for developing new algorithms
SAS is not a single programming language. It is an entire ecosystem of products (not all seamlessly integrated) that contains at least two languages! While becoming a competent SAS programmer clearly requires mastering an impressive number of skills, quite a bit can be accomplished in SAS with a basic knowledge of the Data Step and the more common procedures (PROCs) in the base and Stat packages. Moreover, as it turns out, these two foundational components of SAS are the very two things that an R programmer is likely to find most strange about SAS.
There is really only one data structure in SAS: a file with rows of observations and columns of variables that always gets processed by means of an implied loop. A Data Step “program” starts with the first row of a SAS file, executes all of the code it encounters until it comes to a run; statement, then looks at the second row of the file and runs through the code again. The Data Step proceeds sequentially through the entire file in this fashion. An excellent presentation from Steven J. First illustrates the process nicely. See slides 36 through 45 for an example of SAS code with a very clear PowerPoint animation of how this all works. It is true that SAS programmers can work with arrays, but this is actually a computational sleight of hand. Arrays are actually special columns in a data set.
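For readers who think in general-purpose languages, the implied loop can be written out explicitly. The following Python sketch is only an analogy, assuming a table held as a list of dictionaries; the names data_step and add_bmi and the sample columns are hypothetical, and real Data Step behavior (retained variables, explicit OUTPUT statements, and so on) is richer than this.

```python
# Hypothetical input "file": one dictionary per observation (row).
rows = [
    {"height_cm": 150.0, "weight_kg": 45.0},
    {"height_cm": 180.0, "weight_kg": 81.0},
]

def data_step(input_rows, step):
    """Run `step` once per observation, in file order, collecting output rows."""
    output = []
    for row in input_rows:        # the loop that SAS leaves implicit
        new_row = dict(row)       # copy so the input table is not mutated
        step(new_row)             # the statements written before `run;`
        output.append(new_row)    # analogous to the output at the end of each pass
    return output

def add_bmi(row):
    """Statements applied to a single observation at a time."""
    row["bmi"] = row["weight_kg"] / (row["height_cm"] / 100.0) ** 2

result = data_step(rows, add_bmi)
```

The point of the analogy is that the programmer writes only the body (add_bmi here); the iteration over rows belongs to the system.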
R programmers are used to an interactive computational experience. Within a session, at any point in time the objects that resulted from a previous computation are available as inputs to the next calculation. There is always a sense of moving forward. If you didn’t compute something as part of the last function you ran, just write another function and compute it now. In SAS, however, one uses the various PROCs to conjure the results in a methodical, premeditated way. For example, something like the following code would run a simple regression in SAS, sending the results to the console.
proc reg data = myData;
model Y = X;
run;
However, if you wanted to have the fitted values and residuals available for a further computation, you would have to rerun the regression specifying an output file and the keywords for computing the fitted values and residuals.
proc reg DATA = myData;
MODEL Y = X / stb clb;
OUTPUT OUT=OUTREG P=PREDICT R=RESID;
run;
Kathy Welch, a statistical consultant at the University of Michigan, provides a very clear example of this linear way of working.
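By way of contrast, here is a minimal Python sketch of the interactive style described earlier, in which the fit is an object whose fitted values and residuals stay available for the next computation without rerunning anything. It assumes a simple one-predictor least-squares model; the function name fit_simple_regression and the sample data are hypothetical, and this is not a substitute for PROC REG or R's lm.

```python
def fit_simple_regression(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    fitted = [intercept + slope * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    # Everything is returned, so nothing has to be requested in advance.
    return {
        "intercept": intercept,
        "slope": slope,
        "fitted": fitted,
        "residuals": residuals,
    }

# Hypothetical sample data.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

fit = fit_simple_regression(x, y)

# A later, unplanned computation reuses the stored residuals directly.
largest_abs_residual = max(abs(r) for r in fit["residuals"])
```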
Most SAS programming probably gets done by writing SAS macros. Look at Bob Muenchen’s book (or this article) for practical examples of R functions to replace SAS macros. For more advanced work, the SAS/Tool Kit (yet another add-on) allows SAS programmers to write custom procedures. But, from an R programmer’s perspective, probably the most exciting SAS product is the IML System, which provides the ability to call R from within an IML procedure. The documentation provides an example of transferring data stored in SAS/IML vectors to R, running a model in R, and then importing the results back into SAS/IML vectors.
Actually, if you are an R programmer, all you might really want to do is import data from SAS to R. There are at least five ways to do this using functions from various open source R libraries. (Note that some of these methods require preparation steps to be done in SAS.) The document “An Introduction to S and The Hmisc and Design Libraries” on CRAN is also helpful. However, I recommend using the rxImport function in the RevoScaleR package that ships with Revolution R Enterprise.
Importing a SAS file with rxImport looks like this:
rxImport(inData=data,outFile="sasFileName")
Not only is it a one-step process that does not require having SAS installed on your system, but it reads .sas7bdat files directly into Revolution Analytics' .xdf file format. You can easily work with SAS files that are too large to fit into memory. Once in .xdf file format, the data can be worked on with RevoScaleR’s parallel external memory algorithms (PEMAs) or written to .csv files or data frames.
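The idea behind working on data that does not fit into memory can be illustrated with a small Python sketch: read a fixed number of rows at a time and keep only running totals. This is a generic chunked computation under stated assumptions, not RevoScaleR's PEMA implementation and not the .xdf format; the function name chunked_mean, the column name x, and the tiny in-memory sample standing in for a large file are all hypothetical.

```python
import csv
import io
from itertools import islice

def chunked_mean(lines, column, chunk_size=2):
    """Mean of one CSV column, holding at most `chunk_size` rows in memory."""
    reader = csv.DictReader(lines)
    total = 0.0
    count = 0
    while True:
        chunk = list(islice(reader, chunk_size))  # next block of rows
        if not chunk:
            break
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    return total / count if count else float("nan")

# Hypothetical stand-in for a file too large to load at once.
sample = io.StringIO("x\n1\n2\n3\n4\n5\n")
mean_x = chunked_mean(sample, "x")
```

Only the running sum and count survive between chunks, so memory use depends on the chunk size rather than on the size of the file.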
