Solution of the problems:
Solution of Extraction Stage:
Direct trickle feed
Assuming that an application requires a true-real time data warehouse, the simplest approach is to continuously feed the data warehouse with new data from the source system. This can be done by either directly inserting or updating data in the warehouse fact tables, or by inserting data into separate fact tables in a real-time partition.
EAI vendors such as Tibco provide solutions for real-time data transport. For systems based on the latest Java technologies, Java Messaging Service (JMS) can be used to transmit each new data element from the source system to a lightweight listener application that in turn inserts the new data into the warehouse tables. For data that is received over the Internet, the data can be transmitted in XML via HTTP using the SOAP standard, and then loaded into the warehouse.
Once the data is near the warehouse, simply inserting the new data in real-time is not particularly challenging. The problem with this approach, which will probably not be readily apparent during development and initial deployment, is that it does not scale well. The same logic as to why data warehouses exist in the first place applies-complex analytical queries do not mix well with continuous inserts and updates. Constantly updating the same tables that are being queried by a reporting or OLAP tool can cause the data warehouse’s query performance to degrade.
Under moderate to heavy usage, either from concurrent warehouse user queries or from the incoming data stream, most relational database management systems will begin to temporarily block the incoming data transactions altogether, and the data will become stale. This may also cause the data loading application to fail altogether if the warehouse becomes unresponsive. Also as queries begin to slow and if the data becomes stale, the warehouse users may become frustrated and stop using the system altogether.
Two approaches to real-time data loading that help avert this scalability problem are described in the next section. Also the issue of real-time data warehouse scalability, and four additional approaches to building a scalable real-time warehouse system, are discussed in Section 4 of this paper.
Solution 1c: Trickle ; Flip
The “Trickle & Flip” approach helps avert the scalability issues associated with querying tables that are being simultaneously updated. Instead of loading the data in real-time into the actual warehouse tables, the data is continuously fed into staging tables that are in the exact same format as the target tables. Depending on the data modeling approach being used the staging tables either contain a copy of just the data for the current day, or for smaller fact tables can contain a complete copy of all the historical data.
Then on a periodic basis the staging table is duplicated and the copy is swapped with the fact table, bring the data warehouse instantly up-to-date. If the “integrated real-time partition through views” approach is being used, this operation may simply consist of changing the view definition to include the updated table instead of the old table. Depending on the characteristics of how this swap is handled by the particular RDBMS, it might be advisable to temporally pause the OLAP server while this flip takes place so that no new queries are initiated while the swap occurs.
This approach can be used with cycle times ranging from hourly to every minute. Generally best performance is obtained with 5-10 minute cycles, but 1-2 minute cycles (or even faster) are also possible for smaller data sets or with sufficient database hardware. It is important to test this approach under full load before it is brought into production to find the cycle time that works best for the application.
Solution 2a: Modeling as Usual with Direct Fact Table Feed
When using the Direct Trickle Feed or Trickle ; Flip approaches to real-time ETL discussed above, and when the real-time data is stored along with the historical data in the same fact tables, no special data modeling approaches are required. From the query tool’s perspective, there is nothing different about a real-time data warehouse modeled in this manner than from a non-real-time warehouse.
The main thing to consider when using this approach is caching. Many query, reporting, and OLAP tools will cache reports and result sets, and assume that these caches need to be refreshed only upon the nightly or weekly warehouse load. With a real-time data feed, this type of caching needs to be either disabled, or set to the same cycle as the table flip when using the Trickle & Flip approach. Since most query tools do not have a good mechanism for coordinating cache expiration with frequent external events, disabling caching for the reports that run on real-time data is usually the best approach. In the future it is likely the query tools will become more real-time aware, and understand that some tables are frequently changing and vary their caching approaches accordingly, but for now these settings need to be set carefully and by hand.
Solution 2b: Separate Real-time Partition
One approach to modeling real-time data is to store the real-time data in separate warehouse fact tables. Depending on the type of fact table, many query tools that support table partitioning will be smart enough to automatically retrieve the real-time data from the real-time tables when required. For tables where this will not work, alternate facts and attributes can be set up to point to the real-time data tables, and the query tool can be set up to drill across from reports containing the historical data to real-time data templates.
From the query-tool configuration and administration perspective, this data modeling approach is the most complex to engineer. This approach requires either the query tool or the end user to understand where the various types of data are located and how to access them. The success of this approach depends significantly on how well the query tool insulates the end user from the extra complexities required to access real-time information.
This data modeling approach is covered in detail by Ralph Kimball in his February 2002 “Real-time Partitions” article in Intelligent Enterprise (http://www.intelligententerprise.com/020201).
Solution 2c: Integrated Real-time through Views
Another real-time data modeling approach is to store the real-time data in different tables from historical data, but in the same table structure. Then by using database views, the historical and real-time data tables are combined together so that they look like one logical table from the query tool’s perspective. This helps alleviate many of the problems associated with the separate partition approach, as the query tool or end users do not need to join two tables.
This approach is similar to the direct fact table feed, except the feed is into smaller tables that can be more easily modified or replaced when necessary by the load process. Also these tables will generally be small enough to sit in the database’s memory cache, to alleviate some of the query contention issues involved with performing OLAP queries on changing data. The caching concerns discussed in Solution 2a still apply, and care needs to be taken to ensure that the query tool doesn’t return old cache results to users who are requesting real-time data.
Solution of OLAP:
The issue of query contention and scalability is the most difficult issue facing organizations deploying real-time data warehouse solutions. Data warehouses were separated from transactional systems in the first place because the type of complex analytical queries run against warehouses don’t “play well” with lots of simultaneous inserts, updates, or deletes.
Usually the scalability of data warehouse and OLAP solutions is a direct function of the amount of data being queried and the number of users simultaneously running queries. Given a fixed amount of data, the number of users on the system is proportional to query response time. Lots of concurrent usages causes reports to take longer to execute.
While this is still true in a real-time system, the additional burden of continuously loading and updating data further strains system resources. Unfortunately the additional burden of a continuous data load is not just equivalent to one or two additional simultaneously querying users due to the contention between data inserts and typical OLAP select statements. While it depends on the database, the contention between complex selects and continuous inserts tends to severely limit scalability. Surprisingly quickly the continuous data loading process may become blocked, or what used to be fast queries may begin to take intolerably long to return.
Simplify and Limit Real-time Reporting
Many real-time warehousing applications are relatively simple. Users that want to see up-to-the-second data may have relatively simple reporting requirements. If reports based on real-time data can be limited to simple and quick single-pass queries, many relational database systems will be able to handle the contention that is introduced. Frequently the most complex queries in a data warehouse will be accessing data across a large amount of time. If these queries can be based only on the non-changing historical data, contention with the real-time load is eliminated.
Another important consideration is to examine who really needs to be able to access the real-time information. While real-time data may be interesting to a large group of users within an organization, the needs of many users may be adequately met with non-real-time data, or with near-real-time solutions.
Also many users who may be interested in real-time data may be better served by an alert notification application that sends them an email or wireless message alerting them to real-time data conditions that meet their pre-defined thresholds. Designed properly, these types of systems can be scaled to 100 or 1000 times more users than could possibly run their own concurrent real-time warehouse queries.
Apply More Database Horsepower
There is always the option of adding more hardware to deal with scalability problems. More nodes can be added to a high-end SMP database system, or a stand-alone warehouse box can be upgraded with faster processors and more memory. While this approach may overcome short-term scalability problems, it is likely to only represent a band-aid approach. Real-time query contention often has more to do with the fundamental design of a RDBMS than with the system resources available.