Tuesday, April 12, 2016


http://www.databasejournal.com/features/db2/analytics-yes-big-data-no.html

Analytics Yes; Big Data No!

Some companies have been slow to acquire big data applications. They discovered that modern hardware platforms and database management systems were more than adequate for most of their business analytics needs. These needs share several common attributes, including analytics run against operational systems, with the analytics logic and engine close to the source data. This meant that companies could avoid complex, high-volume data movement and extract-transform-load (ETL) strategies while executing queries against existing, well-tuned databases.
In this article we introduce the concepts of strategic and tactical analytics, and how best to support these methods with your current IT infrastructure.

Business Intelligence and Analytics

The phrase “business intelligence and analytics” has its roots in that most basic of analytical tools, the spreadsheet. Organizations have used these for decades to store various categories of data and execute calculations to predict trends and analyze changes in data. Popular terms such as roll-up, aggregation, and cubes found their beginnings in functions available in spreadsheet software.
Basic data processing gave way to an explosion in information technology. Statisticians developed new mathematical methods and functions, and hardware and software tools grew in speed and power. On-line analytical processing (OLAP) came of age as analysts used these sophisticated tools to process large volumes of data in short periods of time. Eventually many of the most useful methods and practices were lumped together into a single phrase: business intelligence analytics.
As analytics advanced, its most common use was customer base segmentation: which types of customers were buying what, in what areas, and at what times? By analyzing historical sales data and aggregating it by customer type and geography, companies could now predict customer buying habits. This allowed companies to set prices and provide the best combination of products and services to maximize profitability.
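As a minimal sketch of this kind of segmentation (the column names and sample figures below are hypothetical), aggregating historical sales by customer type and geography might look like this in Python with pandas:

```python
import pandas as pd

# Hypothetical historical sales data; column names are illustrative only.
sales = pd.DataFrame({
    "customer_type": ["retail", "retail", "business", "business", "retail"],
    "region":        ["west",   "east",   "west",     "east",     "west"],
    "quarter":       ["2015Q4", "2015Q4", "2015Q4",   "2016Q1",   "2016Q1"],
    "amount":        [120.0,    80.0,     450.0,      600.0,      95.0],
})

# Segment the customer base: total, average, and count of purchases
# by customer type and geography.
segments = (
    sales.groupby(["customer_type", "region"])["amount"]
         .agg(["sum", "mean", "count"])
         .rename(columns={"sum": "total_sales",
                          "mean": "avg_sale",
                          "count": "num_purchases"})
)
print(segments)
```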
In larger companies, IT organizations needed to process a growing volume of data and do a lot of machine-intensive analysis. However, accessing operational data in this way tended to either lock information or use CPU and other resources to the detriment of operational systems. To prevent analytics from negatively impacting these systems (especially customer-facing systems), data was typically copied to non-operational platforms such as data warehouses and data marts. Once copied, analytical software could then execute against these data copies, preventing performance issues on the original, critical systems.
This trend spawned a multitude of analytical silos within the company, and much effort was spent copying data, sometimes several times, across the enterprise. Now, with a growing user base, a huge expansion in data volume, and a need for fast results, the analytics space has split into two camps: strategic analysis against large data stores, and tactical analysis in real time.

Strategic Analytics

Strategic, or high-level, analysis has moved in the direction of big data. Large amounts of data arriving at high velocity necessitated implementation of large data stores (big data) and hybrid hardware and software tools implementing massively parallel data movement and processing. Analysts typically use these solutions to execute long-running, complex queries against a database partitioned by multiple dimensions, allowing for different "slicing" and aggregation to achieve the desired results. The data is typically historical rather than real-time, and the results are examined and probed at leisure.
Examples of strategic analysis include:
  • Period-to-period sales analysis. Analysts compare year-over-year product sales data to determine purchasing patterns. This can be used to predict inventory shortages and customer buying habits, and to identify geographic areas where new warehouses or stores could be built.
  • Data mining. Here, analysts display a summary of data by multiple dimensions, such as purchases of a set of similar products. Then they "drill down" into the data by requesting further aggregations, such as purchases by store, by date, or by salesperson. The reverse operation is called "roll up", where the analyst requests that the data be summarized at a higher level, such as by geographic region or by department (see the sketch after this list).
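A rough illustration of drill-down and roll-up, assuming purchases are kept in a single flat table (all table and column names here are hypothetical):

```python
import pandas as pd

# Hypothetical purchase records; names are illustrative only.
purchases = pd.DataFrame({
    "region":      ["west", "west", "east", "east", "west"],
    "store":       ["W1",   "W2",   "E1",   "E1",   "W1"],
    "salesperson": ["ann",  "bob",  "cid",  "cid",  "ann"],
    "amount":      [100.0,  250.0,  75.0,   300.0,  50.0],
})

# Start with a high-level summary by region.
by_region = purchases.groupby("region")["amount"].sum()

# "Drill down": request a finer aggregation, e.g. region -> store.
by_store = purchases.groupby(["region", "store"])["amount"].sum()

# Drill down further: region -> store -> salesperson.
by_person = purchases.groupby(["region", "store", "salesperson"])["amount"].sum()

# "Roll up": re-summarize the detailed view at a higher level again.
rolled_up = by_person.groupby(level="region").sum()

print(by_region, by_store, by_person, rolled_up, sep="\n\n")
```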

Operational or Tactical Analytics

Contrast the above with tactical, or operational, analysis. Here, analysis is executed against a smaller, usually real-time data source, and the results are used by the requesting application to complete a transaction. For example, consider analyzing a customer purchase in real time: the purchase request triggers an analytical engine that compares that purchase against the customer's historical purchase patterns to detect possible fraud. Such in-transaction analytics requires speed, since the transaction is real-time.
Implementation of tactical analysis requires speed and minimal use of resources. Tactical analytics software is usually executed as part of a transaction, against current operational data. The software uses a configuration file to determine which analytics processes to run at what times and against what categories of transactions. Management can then control which analytics are executed on a real-time basis, perhaps reacting to news of possible fraud, or even reducing the occurrence of analytics due to system overloads or resource outages.
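One way such a configuration-driven approach might be sketched in Python (the config format, check names, and thresholds are all illustrative assumptions, not a description of any particular product):

```python
# Minimal sketch of configuration-driven, in-transaction analytics.
# The config format, thresholds, and function names are hypothetical.
CONFIG = {
    "purchase": ["velocity_check", "deviation_check"],  # checks per txn type
    "refund":   [],                                      # analytics disabled
}

def velocity_check(txn, history):
    # Flag customers making many purchases within a short window.
    recent = [h for h in history if txn["ts"] - h["ts"] < 3600]
    return len(recent) > 5

def deviation_check(txn, history):
    # Flag purchases far above the customer's historical average.
    if not history:
        return False
    avg = sum(h["amount"] for h in history) / len(history)
    return txn["amount"] > 3 * avg

CHECKS = {"velocity_check": velocity_check, "deviation_check": deviation_check}

def score_transaction(txn, history):
    # Run only the checks the configuration enables for this txn type,
    # so staff can dial analytics up or down on a real-time basis.
    enabled = CONFIG.get(txn["type"], [])
    return any(CHECKS[name](txn, history) for name in enabled)

txn = {"type": "purchase", "amount": 900.0, "ts": 10_000}
history = [{"amount": 40.0, "ts": 9_000}, {"amount": 55.0, "ts": 8_500}]
print(score_transaction(txn, history))  # True: 900 far exceeds 3x the average
```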
Copying data to a separate hardware platform and running the analytics there requires too much time and processing power to meet a real-time transaction service level. The requirement to reduce data movement means hosting the analytics solutions as close to the data as is feasible, possibly even integrating it with on-line transaction processing (OLTP) within operational systems.

Enter the Data Warehouse

Some data warehouses (DWs) are purpose-built as analytics databases. The DW most likely already contains the most useful dimension tables for warehouse processing, and usually contains data from important operational systems. The warehouse infrastructure is also home to many data movement processes such as extract, transform, load (ETL), data staging areas, operational data stores, data marts, and so forth.
Initial analytics starts with the fact and dimension tables in the data warehouse. However, this data is usually not real-time due to the large amount of data transformation and cleansing required when extracting from operational systems. A nightly batch load process is typical; consequently, analytics in the warehouse is not done on real-time data.
It may be possible to do tactical analytics in the staging area, since this data is closest to its origin in the operational systems. This presupposes that the warehouse data staging process executes frequently and gathers internally consistent data. For example, customer purchase transactions can be extracted hourly, as long as the transactions are complete and reference customers, products, and services that are already defined in the DW. However, it is rare that the warehouse can be custom-tuned to act on near real-time operational data.
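As a sketch of what an hourly, referentially consistent staging extract might look like (the table and column names are hypothetical, and the SQL is generic):

```python
import sqlite3

# Hypothetical staging extract: pull only completed purchases whose
# customer and product references already exist in the warehouse
# dimension tables, so the hourly load stays internally consistent.
EXTRACT_SQL = """
SELECT p.purchase_id, p.customer_id, p.product_id, p.amount, p.purchase_ts
FROM   staging_purchases p
WHERE  p.status = 'COMPLETE'
  AND  EXISTS (SELECT 1 FROM dim_customer c
               WHERE c.customer_id = p.customer_id)
  AND  EXISTS (SELECT 1 FROM dim_product d
               WHERE d.product_id = p.product_id)
  AND  p.purchase_ts >= :window_start
"""

def extract_hourly(conn: sqlite3.Connection, window_start: str):
    # Returns only rows that are safe to load into the fact table this hour.
    return conn.execute(EXTRACT_SQL, {"window_start": window_start}).fetchall()
```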

But What about Big Data?

Regrettably, growth of the DW in scope and volume may preclude its use for tactical analytics. Its data is rarely real-time, as the time spent in data movement, transformation, and storage creates unacceptable delays.
The data warehouse of today is best used for strategic analytics, and the IT organization should prepare itself for the inevitable integration of the current enterprise data warehouse environment with one or more big data solutions. These solutions usually consist of a hybrid hardware and software solution, commonly called an appliance.
The appliance contains a large data store with high-speed access to the data using some combination of proprietary data storage, massively parallel data movement, and query parallelism. For example, customer purchase transactions may be stored on hundreds of disks within a disk array. The appliance accepts analytical queries, parses them, then splits each query into hundreds of subqueries, one for each disk. The queries then execute simultaneously, with the appliance merging the results of each query and returning the final result to the requestor.
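A toy sketch of that scatter/gather pattern (real appliances implement it in proprietary hardware and software; the partition layout and the query here are stand-ins):

```python
from concurrent.futures import ThreadPoolExecutor

# Toy scatter/gather: the data is spread across many partitions
# (standing in for the appliance's hundreds of disks).
partitions = [list(range(i, 1000, 8)) for i in range(8)]

def subquery(partition):
    # Each partition answers the same query over its own slice of data;
    # a simple filtered sum stands in for a parsed subquery.
    return sum(x for x in partition if x % 3 == 0)

# "Split" the query into one subquery per partition and run them
# simultaneously.
with ThreadPoolExecutor(max_workers=8) as pool:
    partial_results = list(pool.map(subquery, partitions))

# The "appliance" merges the per-partition results into one answer.
final_result = sum(partial_results)
print(final_result == sum(x for x in range(1000) if x % 3 == 0))  # True
```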

Summary

Business intelligence analytics today is a combination of tactical and strategic analytics: tactical analytics executed against real-time data for in-transaction use, and strategic analysis executing complex, multi-dimensional queries against large data stores.
Tactical analytics are usually implemented first. As the company gains insight into its data, it begins to store historical information in a data warehouse. Query results can then be used either to justify upsizing the current platform or to justify a separate analytical environment. As your business analytics footprint grows, off-load analytics that are not required in real time to a separate platform reserved for strategic analytics. As your strategic analytics needs grow, begin planning for a big data solution that will be integrated into your enterprise data warehouse environment.
By Lockwood Lyon



By Peter Fretty, March 24, 2016, 08:53 PDT
True, data is rarely real-time. However, that is rarely the point when we look to data to help identify trends and find anomalies. I would rather have a breadth of data resources available so that I can make more informed decisions, especially as the organization has the opportunity to iterate with deeper questioning. Peter Fretty, IDG blogger for SAS Big Data Forum.


Dear Hiring Manager:

I would love to apply my experience with on-line analytical processing (OLAP) to work on SAS® Fraud Management, which provides real-time scoring of all transactions for fast, accurate fraud detection.


SAS EFM analysis is likely executed against a real-time data source, with the results used by the requesting application to complete a transaction. Consider, for example, analyzing a customer purchase in real time: the purchase request triggers an analytical engine that compares that purchase against the customer's historical purchase patterns to detect possible fraud. Such in-transaction analytics requires speed, since the transaction is real-time.
The requirement to reduce data movement means hosting the analytics solutions as close to the data as is feasible, possibly even integrating it with on-line transaction processing (OLTP) within operational systems.

As shown in my resume, at Providian I worked with three different database systems, in both distributed and data warehouse environments. With the distributed database system, we focused on high throughput of small queries that change data across the database, while providing simplified horizontal scaling and maintaining data integrity.

With the centralized data warehouse, we focused on the speed of complex analytical queries.

I also worked on ETL using a great deal of Unix shell scripting, and on SAS application development for the marketing and credit departments.

With those experiences and skills, I can pick up SAS EFM quickly and start contributing to your team's projects.

Thank you for your consideration.

Gary Gao





Providian Financial Corporation
(One of the leading credit card issuers in the US, sold to Washington Mutual in 2005)
Marketing Data Team Manager
January 1998 – January 2004

Managed a marketing database team of 15 analysts and programmers with major accomplishments in the following areas:

(I) Worked with the marketing department to design major marketing campaigns based on data modeling, for example:
• Constructed a data warehouse and front-end tools to allow data mining and segmentation analysis.
• Employed a dataset with historical trade line credit card information from Equifax, which included account characteristics such as age, credit line, and utilization, and used their one-year default rate to model future rates of defaults and their contribution to portfolio losses over time.
• Mapped out business logic and algorithms for various aggressive growth strategies involving new accounts, balance transfers, and outstanding balances, as well as strategies aimed at retention (countering attrition) and account development.

(II) Managed back-end database systems and development of SAS programs:
• Built and developed a team of DBAs, UNIX programmers, and SAS programmers, and migrated the marketing department's customer database from Sybase to Oracle. Our system had to interact with existing tools such as Total System Services.
• Supervised the weekly and monthly updating and maintenance of data load operations.
• Supervised SAS programmers coding the business logic and algorithms mentioned in the prior section, adjusting variables such as balance transfers, minimum payments, and interest rates to increase customers' motivation to build and access their credit.

(III) Managed a separate team of SAS programmers dedicated to Providian's efforts to acquire creditworthy consumers, including identifying and profiling high-risk customers.



What is the difference between a distributed database system and a centralized data warehouse?

(Answers for non-software engineers)

Andrew Boring, Former Public Cloud Product Manager:
To begin with, let's break this up into the key terms:
  • Distributed vs. centralized
  • Database vs. data warehouse
(We can have centralized database systems and distributed data warehouse systems as well.)


Let's start with the difference between a database and a data warehouse.

A database system involves tables of data (somewhat like an Excel spreadsheet: each table consists of columns and rows). Each table contains different but related data, and the database system maps information from one table to the other. For example, one table in your database may have first and last names, while a second table has zip codes. Instead of storing a zip code next to each name, the first table references a particular zip code in the second table. When you ask the database for the data, it returns the first and last name of the person you asked for and pulls that person's zip code from the other table.
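If you want to see that in actual code, here is a toy version using Python's built-in sqlite3 module (the table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY,
                         first_name TEXT, last_name TEXT, zip_id INTEGER);
    CREATE TABLE zips   (id INTEGER PRIMARY KEY, zip_code TEXT);

    INSERT INTO zips   VALUES (1, '94105'), (2, '10001');
    INSERT INTO people VALUES (1, 'Ada', 'Lovelace', 1),
                              (2, 'Alan', 'Turing', 2);
""")

# The database maps a person to a zip code by reference, rather than
# storing the zip code next to every name.
row = conn.execute("""
    SELECT p.first_name, p.last_name, z.zip_code
    FROM   people p JOIN zips z ON z.id = p.zip_id
    WHERE  p.last_name = ?
""", ("Lovelace",)).fetchone()

print(row)  # ('Ada', 'Lovelace', '94105')
```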

A data warehouse is a complex business system that includes a database for storing data, plus a means of pulling in data from outside sources and doing something with it before storing it. For example, maybe you have a business system that forecasts manufacturing based on sales data. You have a custom manufacturing application running internally, but all your customer data is kept in a third-party application like Salesforce.com, and your actual sales info is kept in a billing/records system. Your data warehouse will import the customer information from Salesforce, then import the billing information, and then do something to organize and process that data before storing it in its database. Then, when the manufacturing application needs it, it can connect to the data warehouse rather than pulling separate info from the billing system or Salesforce.com. The data warehouse offers a central repository of data for a number of applications, simplifying development and integration when a new application is needed.
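A toy version of that import-and-transform flow, with plain dictionaries standing in for Salesforce and the billing system (every name here is hypothetical):

```python
# Toy extract-transform-load flow: pull from two "source systems",
# reconcile them, then store one combined record per customer.
crm_records = [  # stand-in for customer data from a CRM like Salesforce
    {"customer_id": 1, "name": "Acme Corp"},
    {"customer_id": 2, "name": "Globex"},
]
billing_records = [  # stand-in for the billing/records system
    {"customer_id": 1, "invoice_total": 1200.0},
    {"customer_id": 2, "invoice_total": 450.0},
]

def transform(crm, billing):
    # Join the two feeds on customer_id and derive a combined record:
    # the "do something with it before storing it" step.
    totals = {b["customer_id"]: b["invoice_total"] for b in billing}
    for c in crm:
        yield {"customer_id": c["customer_id"],
               "name": c["name"],
               "invoice_total": totals.get(c["customer_id"], 0.0)}

warehouse = list(transform(crm_records, billing_records))  # the "load" step
print(warehouse)
```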


Finally, the remaining two terms: distributed vs centralized. 

Centralized simply means everything is in one location. There is one system of record. It may be replicated to other systems so that they can continue functioning if the primary system goes down, but they all keep the same copies of the data.

Distributed systems tend to fragment the data and store the pieces on different servers, spreading the overall request load across multiple machines. This provides the advantage of scaling to handle more and more data.




Examples? Certainly!

I have a set of data, a simple sequence of numbers: "1234567890".

If I were storing this sequence of numbers in a database, I'll simply store it as a record and ask for it when I need it. 

To put this into a data warehouse, some system will need to import it (or my database may push it into the data warehouse), do some sort of business logic on it, maybe extract some other data, associate it with a particular salesperson, add the date it was entered, and THEN store it for later use.


In a centralized system, I'll store it on one server. Then I'll replicate that exact information to Server #2 for safekeeping. If I update the sequence to "2345678901", then I'll have to update the second server as well. But if the first server goes down, I know I can get the same info from the second server.

It may look like this:
Server #1 = "1234567890"
Server #2 = "1234567890"

In a distributed system, maybe I'll break that sequence into pieces and store it on five different servers:
Server #1 = "12"
Server #2 = "34"
Server #3 = "56"
Server #4 = "78"
Server #5 = "90"

In order to assemble the entire set, I have to ask all five servers for the info and reassemble. If I only need the first four numbers for some reason, I'll only ask the first two servers. If I need the last four numbers, I'll only ask the last two servers, and so on. This has some advantages with certain types of data, and allows some applications to grow much larger than with a single, centralized system.
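That sharding example translates almost directly into code. A toy sketch, not how production distributed stores actually work:

```python
# Toy version of the five-server example: shard the sequence, then
# reassemble only the shards a query actually needs.
data = "1234567890"
servers = [data[i:i + 2] for i in range(0, len(data), 2)]
# servers == ['12', '34', '56', '78', '90']

def read_range(start, length):
    # Ask only the servers that hold the requested digits.
    needed = range(start // 2, (start + length - 1) // 2 + 1)
    fetched = "".join(servers[i] for i in needed)
    offset = start - (start // 2) * 2
    return fetched[offset:offset + length]

print(read_range(0, 4))  # '1234' -- touches only the first two servers
print(read_range(6, 4))  # '7890' -- touches only the last two servers
```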
Simon Griffiths, Enterprise software sales and marketing:
They are two very different things:
A distributed database is a system where different parts of the database are distributed at different locations. You need some clever technology to keep the parts of the database synchronised across locations.
A data warehouse is a database which is designed for analysing data and making decisions; centralised means it is located in one place.
I don't know of any data warehouses that are distributed.
Vinay Sahni

A distributed database focuses on high throughput of small queries changing data across the database while providing simplified horizontal scaling and maintaining data integrity.

A centralized data warehouse focuses on the speed of complex analytical queries.
