Until recently, large datasets were stored in and processed by relational databases such as Oracle. Data warehouses and data mining were the tools of choice. Over the last few years, many datasets have become too large, or have changed too quickly, to be processed by these tools. This new type of dataset, which exceeds the processing capacity of the standard tools, is referred to as 'Big Data.'
Big Data is generally characterized by three terms: "volume", "velocity" and "variety". Volume refers to the amount of data; these datasets are typically terabytes or larger. A related consideration is whether the data is static, such as genomic data, or nonstatic, such as streaming data. Velocity refers to the rate at which data is generated. For example, next-generation sequencing instruments can generate up to 2 terabytes of data per day. Finally, variety refers to the different sources and types of data. Streaming logs, web clicks, and instrument feeds can produce significantly different types of data that cannot easily be stored or processed in a relational database.
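To put the velocity figure in perspective, the short Python sketch below converts a daily data volume into a sustained transfer rate. It assumes decimal units (1 terabyte = 10^12 bytes, 1 megabyte = 10^6 bytes) and a steady output over 24 hours; the function name is illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60
BYTES_PER_TB = 10**12   # assumption: decimal terabyte
BYTES_PER_MB = 10**6    # assumption: decimal megabyte


def sustained_rate_mb_per_s(tb_per_day):
    """Convert a daily volume in terabytes to an average rate in MB/s."""
    return tb_per_day * BYTES_PER_TB / SECONDS_PER_DAY / BYTES_PER_MB


# The 2 TB/day sequencing example from the text
print(f"{sustained_rate_mb_per_s(2):.1f} MB/s")  # prints "23.1 MB/s"
```

Under these assumptions, an instrument producing 2 terabytes per day must be matched by storage and processing that can absorb roughly 23 megabytes every second, around the clock.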
Big Data Tools
'Analytics' refers to the mining of data in order to find meaningful patterns. Hadoop, developed largely at Yahoo beginning in 2006 and now an Apache project, is currently one of the most popular Big Data processing systems. At the simplest level, Hadoop consists of a parallel, fault-tolerant, distributed filesystem called HDFS and a data processing system based on Google's MapReduce technology. Statistical software packages, such as SAS and R, can integrate with Hadoop to enhance the processing.
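The MapReduce model can be illustrated with a minimal, single-machine Python sketch of the classic word-count example. This is not Hadoop's API: the function names and the sample input are hypothetical, and a real Hadoop job would distribute the map and reduce steps across many nodes reading from HDFS.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    # Hypothetical input; in Hadoop these would be records read from HDFS
    lines = ["big data tools", "big data analytics"]
    print(word_count(lines))
```

Because each call to map_phase depends only on its own record, and each call to reduce_phase only on its own key, both steps can run in parallel on separate machines, which is what makes the model suitable for very large datasets.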
SAS is a collection of software solutions that support:
- Information retrieval and data management
- Report writing and graphics
- Statistical analysis, econometrics and data mining
- Business planning, forecasting, and decision support
- Operations research and project management
- Quality improvement
- Applications development
- Data warehousing (extract, transform, load)
- Platform-independent and remote computing
SAS is available in the Student Mall lab, Gateway Lab 2302 and Gateway Lab 2305, GITC 2315A and GITC 2315B, Library Information Commons, and CAPE.
For more general information on SAS, visit the SAS homepage.
SPSS supports the following statistical procedures:
- Descriptive statistics: cross tabulation, frequencies, descriptives, explore, descriptive ratio statistics.
- Bivariate statistics: means, t-test, ANOVA, correlation (bivariate, partial, distances), non-parametric tests.
- Prediction for numerical outcomes: linear regression.
- Prediction for identifying groups: factor analysis, cluster analysis (two-step, k-means, hierarchical), discriminant analysis.
SPSS is available in Library room 1047, Student Mall room 36, Student Mall room 37, Student Mall room 39, Student Mall room 40, GITC 2302, GITC 2305, GITC 2315A, GITC 2315B, GITC 2315C, and GITC 2400.
SPSS is available for download on the IST downloads page.
For more information on SPSS, visit the SPSS homepage.
The R Project for Statistical Computing
R is a free software programming language and environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and macOS. The R environment features:
- An effective data handling and storage facility
- A suite of operators for calculations on arrays, in particular matrices
- A large, coherent, integrated collection of intermediate tools for data analysis
- Graphical facilities for data analysis and display either on-screen or on hardcopy
- A well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities
R is available on all Linux AFS clients, e.g., the public-access oslN.njit.edu machines (N = 1, 2, ..., 31 and N = 50, 51, ..., 84) in GITC 2315C and 2400.
For more information on R, visit the R Project's homepage.
Note: An excellent review of Big Data and Analytics is available as a free download from O'Reilly: http://oreilly.com/data/radarreports/big-data-now-2012.csp
Datasets of up to a few terabytes are not technically considered 'Big Data,' but accommodations may be possible for research groups that need to store them. Please contact email@example.com for more information.