DB 문서들
DSN 갤러리
DBMS Tutorials 915 게시물 읽기
 News | Q&A | Columns | Tutorials | Devel | Files | Links
No. 915
Introduction to data warehousing
2004-02-05 21:29
Intro to Data Warehousing

*   also see: Connolly and Begg, Chapters 30, 31, 32

(sources for the below include Rob and Coronel, "Database Systems", 3rd edition.

*   we've been talking about operational databases, and how they can be
best designed ---

*   but, it turns out that, for decision support systems, the rules for
the needed data repository are (supposedly, at least) different --- if
you wish to use a decision support system to help with data analysis,
your needs are different;

*   (KROENKE) "A data warehouse is a store of enterprise data that is
designed to facilitate management decision making."

*   decision support system (DSS) --- an arrangement of computerized
tools used to assist managerial decision making within a business;

*   usually considered to consist of 4 high-level main components;

	*   its data repository--- data store; includes both business
	data AND business MODEL data; 

	*   data extraction and filtering component;
	*   end-user query tool 

	*   end-user presentation tool;

*   What are the differences between "operational" data vs. "decision
support" data?

	*   timespan
		*   operational data represent current (atomic)
		transactions --- a purchase order, a sales invoice, etc. 
		Shorter time frame;

		*   DSS data tend to cover a longer time frame; not each
		sale, but last month's sales, last year's sales, last FIVE 
		years' sales; or, buying PATTERN of a customer or group 
		of customers;

		*   (DSS data tend to be more HISTORIC in nature ---
		won't have up-to-the-minute data there, don't need/expect to!)

	*   granularity
		*   operational: specific transactions; 

		*   DSS: at DIFFERENT levels of aggregation, ranging
		from HIGHLY summarized to near-atomic. (And in a way that 
		manager can change levels at a whim!)

	*   dimensionality
		*   arguably most distinguishing characteristic of DSS data?

		*   from data analyst's pt of view, data are always
		related in MANY different ways; and, they want to develop 
		the larger picture in all of those different ways; (how 
		did product X fare relative to product Z? during the past 
		6 months? In a particular region? Compared to different 
		groups of customers? Etc.!)

		*   operational data doesn't focus on EFFECTS of
		transactions over time;

*   DSS data-repository requirements:

	*   schema must support complex, NON-NORMALIZED data representations;

	*   DSS database is largely created by extracting data from (one or
	more?) operational databases; so, DATA EXTRACTION AND LOADING is a
	vital capability

	*   must support the analytical user with advanced data modeling 
	and data presentation tools;

	*   DSS db's tend to be very LARGE (gigabyte, terabyte ranges
	aren't uncommon!)  they typically GROW and don't shrink over time;

* SO --- we get the concept of a DATA WAREHOUSE --- an INTEGRATED,

	*   integrated: a centralized, consolidated db that INTEGRATES
	data derived from the ENTIRE organization (and maybe more data
	besides!)  yes, that DOES mean possibly from MULTIPLE operational

		* from multiple and diverse sources, possibly with diverse
		formats --- you can see the possibility for standardization 

	*   subject-oriented: arranged and optimized to provide answers
	to questions to coming from diverse functional units within a 
	company; DELIBERATELY lots of data redundancy, to optimize this 
	kind of question-answering;

	*   time-variant: DSS data include a time element --- not true
	(in same sense) for operational data; Warehouse data reflect what
	happened last week, last month, last 5 years, etc.

		*   also: once data are periodically uploaded to the DW,
		all time-dependent aggregations are recomputed;

	*   non-volatile: one data enter the DW, they are never (?!)

*   (KROENKE) Common components of a data warehouse:

	*   data extraction tools
	*   extracted data
	*   metadata of warehouse contents
	*   warehouse DBMS(s) and OLAP servers
	*   warehouse data management tools
	*   data delivery programs
	*   end-user analysis tools
	*   user training courses and materials
	*   warehouse consultants (?!)

*   (Kroenke, 7th) Categories of Requirements for a Data Warehouse
        *   Queries and Reports with Variable Structure
        *   User-specified data aggregation
        *   User-specified drill-down
        *   Graphical outputs
        *   Integration with domain-specific programs

*   (KROENKE) Challenges for data warehouses:

	*   inconsistent data

	*   tool integration (integration of various tools that the users need)
		*   these tools often use different paradigms --- 
		*   DBMS products: table oriented
		*   OLAP products: cube oriented
		*   spreadsheets: spreadsheet oriented
		*   financial planning products: plan oriented
		*   etc.

	*   missing warehouse data management tools (that is, a LACK of 'em?)

	*   ad hoc nature of requirements

*   Data Mart: a small, SINGLE-SUBJECT Data Warehouse SUBSET that
provides decision support to a small group of people; lower cost,
shorter implementation time;

	*   let a company "try out" the DW idea! 

*   Twelve Rules that Define a Data Warehouse
(what *is* it with these "12 rule" pronouncements, anyway?! 8-) )

from: "The Twelve Rules of Data Warehouse for a Client/Server World",
W. Inmon, C. Kelley, Data Management Review, Vol. 4, No. 5, May 1994.

1.  The Data Warehouse and operational environments are separated.

2.  The Data Warehouse data are integrated.

3.  The Data Warehouse contains historical data over a long time horizon.

4.  The Data Warehouse data are snapshot data captured at a given
point in time.

5.  The Data Warehouse data are subject-oriented.

6.  The Data Warehouse data are mainly read-only with periodic batch
updates from operational data. No online updates are allowed.

7.  The Data Warehouse development life cycle differs from classical
systems development. The Data Warehouse development is data-driven;
the classical approach is process-driven.

8.  The Data Warehouse contains data with several levels of detail:
current detail data, old detail data, lightly summarized, and highly
summarized data.

9.  The Data Warehouse environment is characterized by read-only
transactions to very large data sets. The operational environment is
characterized by numerous update transactions to a few data entities
at a time.

10.  The Data Warehouse environment has a system that traces data
sources, transformations, and storage.

11.  The Data Warehouse's metadata are a critical component of this
environment. The metadata identify and define all data elements. The
metadata provide the source, transformation, integration, storage,
usage, relationships, and history of each data element.

12.  The Data Warehouse contains a charge-back mechanism for resource
usage that enforces optimal use of the data by end-users.

*   SO --- how does OLAP relate to this?

*   OLAP: On-Line Analytical Processing;

*  these are TOOLS --- within a data analysis environment that supports
decision making, business modeling, and operations research

*  OLAP may provide one (of several?) FRONT END to a DATA WAREHOUSE!!!

	* may also use their own engine --- or be atop a DBMS? 
                ROLAP/MOLAP/HOLAP mess right now!)

                *   (Kroenke, 7th) "With OLAP, data is viewed in the
		frame of a table, or with three axes, in that of a cube. 
		OLAP makes no limit on the number of axes, hence you 
		will sometimes here the term OLAP hypercube."

                *   (Kroenke, 7th) "OLAP terminology is evolving and is
		currently ambiguous in an important way. The term cube
		is used both to describe semantic structure and also 
		to describe materializations of that underlying structure."

*   4 common main characteristics, of OLAP systems:

	1.   use multidimensional data analysis techniques
	2.   provide advanced DB support
	3.   provide easy-to-use user interfaces
	4.   support client/server architecture

*   IF TIME: (no, it doesn't really fit, but if we have time!)
Chapter 14: Database Administration

*   the thing is, even with my lack of real-world database experience, I'm
pretty sure that what people think of as a DBA doesn't always
encompass everything in this chapter --- and if someone wants to hire
you as a DBA, you should be aware of this, and should probably attempt
to find out what THEY think a DBA is
*   In my limited experience, if you're the DBA, you're the person people
call with questions/problems/etc. regarding the database;

*   HSU's Oracle DBA, Peter Johnson, gave me an initial redwood Oracle
account, and accounts for those who'd registered for the course;

	*    he gave me the level of permissions required to add new
	users (for students who'd newly added the class);

	*   all student users have a level of permission that allows
	them to, for example, set their own Oracle password (but student users
	cannot add other users);

	*   when we run into Oracle-related problems, or questions we
	can't figure out an answer to from the documentation, he might very
	well be the person we'd call;

*   According to the text, pp. 767-768, the DBA is often expected to
define, communicate, and enforce procedures that cover areas such as:

	*   (I'm a little uncomfortable with some of these --- some
	DBA's are managers over many others, and some are more
	technically-focused and merely perform technical database maintenance)

	*   these latter are more likely to perform day-to-day,
	"operational" database management tasks, like making sure backups are
	done, creating new forms and views upon request, adding and removing
	and modifying levels of user access, etc;

	*   these latter are less likely to be setting policy, telling
	everyone what database methodology is to be used, etc.;

	*   my impression, based on admittedly very little real-world
	experience!!!, is that the role of a "DBA" varies wildly)

*   (perhaps) End user database requirements-gathering
	*   what documentation is required?
	*   what forms must be used?

	*   (but, some DBA's will simply be hired to maintain an
	existing database; you see, they will not necessarily be involved in
	design, then;)

*   (perhaps) Database design and modeling
	*   what database design methodology is to be used
	(normalization? object-oriented? etc.)

	*   (again, if they are involved in the database's initial
	design and development at all)

*   (perhaps) Documentation and naming conventions
	*   the DBA is often in charge of the overall data dictionary
	for each database, and often(?) does set and/or enforce naming

	*   must be used in the definition of all data elements, sets,
	and programs that access the database;

	*   (may be more enforcing existing standards than creating them;)

*   design, coding, and testing of all database application programs
	*    DBA must define the standards for application program
	coding, documentation, and testing.

	*   the DBA standards and procedures are given to the
	applications programmers, and the DBA must enforce such standards.

	*   (they'll at least be working with the applications
	programmers, in that:
		*   those programmers will come to the DBA when they are
		having trouble with the database,

		* when they have corrupted the DB and need for it to
		be reloaded 8-) ,
		*   when they need changes in their access permissions, 
		*   when the application is ready to be launched, etc.)

*   (perhaps) Database software selection
	*   (they may also simply be hired to manage the existing
	software --- these decisions may be made already, or at other levels!)

*   database security and integrity
	*   (I think these are probably key tasks for any DBA!!!

	*   DBA will surely ENFORCE the policies governing security and
	integrity; text says they may also DEFINE them, but again *I think*
	that will vary widely;

	*   database security is especially crucial; security standards
	should be clearly defined and strictly enforced.

	*   security procedures should be able to handle many security
	scenarios, to ensure a minimum of security problems;

	* (how strict will one be about this? how dire will the
	circumstances be from a breach?!
		*   low? Don't worry about so much;

		*   high? You might not even have the db server
		accessible outside the company's INTRAnet!)

*   database backup and recovery
	*   (another biggie, I think, for just about anyone calling
	themselves a DBA!)

	*   these procedures had better exist, whoever defines them! And
	it may very well be the DBA, even in a more operationally-defined DBA;

	*   database backup and recovery procedures must include the
	information necessary to guarantee proper execution and management of
	the backups;

	*   TRUE STORY: There was someone hired away from one
	institution, to be the DBA at another; let's say this was more of a
	managerial type than a technical type; Peter Principle may have been
	at work here ---

		*   this fellow did not perform proper backups of a
		pharmacology database (this was for a research
		hospital) --- and there was a database failure;

		*   we're talking data lost -- including patient data,
		that hospitals HATE to lose!
		*   this person was FIRED (and not slowly, either!!)

	*   moral: if you're a DBA (or heck, in charge of ANY computer
	software or data!), YOU'D BETTER MAKE SURE YOU BACK IT UP

*   database maintenance and operation
	*   again, bread-and-butter for just about any DBA, I think;
	*   (whether operator job logs are required, or not)

*   end user training?
	*   I think that may vary widely --- although it may happen
	informally, when a new person is hired and they ask for their db
	password (and when they get it, ask: "uh --- what do I do now?")

*   To take some of the "biggie" areas above a little further

*   data security, privacy, and integrity

*   dba needs to make use of the security and integrity mechanisms
provided by the DBMS!

*   AUTHORIZATION MANAGEMENT thus may be an important DBA task ---

	*   User access management:

		* define each user to the database (or, to each
		appropriate database); at OS level, and DBMS level

		*   assign passwords to each user

		*   define user groups

		*   classifying users according to common access needs
		into user groups may help the DBA to more reasonably
		control and manage access privileges;

	*   assign access privileges to specific users, to access
	specified databases;

		*   describes type of authorized access --- read-only?
		some combination of READ, WRITE, DELETE?

		*   (in SQL? have GRANT and REVOKE commands to do this;)

		*   how fine can the restrictions be? on a table-by-table 
		basis, at least; although Peter Johnson was telling
		310 how, if you want to limit a user to certain
		columns, you need to make a view with those columns
		and limit them to access of that view...!  and, can
		limit to read access, or allow update access...

	*   physical access control

		*   secured entrances, password-protected workstations,
		electronic personnel badges, closed-circuit video,
		voice recognition, even!

		*   at Shell, the db was available on the INTRAnet only
		(not on a network connected to the Internet) --- and,
		you wore badges, and had to sign-in with a security
		guard to get into the building

	*   view definition
		*   DBA may define, to protect and control the scope of
		the data accessible to an authorized user;

		*   CREATE VIEW can be used for this;

		*   (accounts payable needs different information than
		manufacturing than sales, etc.)

	*   DBMS *utilities* access control

		*   may place limitations on DBMS' query and reporting
		tools! Only for authorized personnel;

		*   (others can only see specialized forms and views, or
		applications built on top of DBMS)

	*   DBMS usage monitoring
		*   quite common, I think --- expected to audit the use
		of the data in the database;

		*   some DBMS' allow the creation of an audit log, which
		automatically records a brief description of the
		database operations performed by all users;

		*   (may be for all db operations, or just failed
		operations, etc.)

		*   they may enable DBA to pinpoint access violations
		--- or see how data is being used, etc.

*   a few more backup and recovery notes

*   note that there are different kinds of backups --- full,
incremental, concurrent

	*   full: a complete copy;

	*   incremental: a backup for all data since the since the last
	backup date;

	*   concurrent: take place while the users are working on the db;

*   backups need to be properly kept, and identified;

*   they must be stored conveniently, but still safely;

*   may have insurance!

*   (and may have backups at different sites! To help in case of
disasters (fire, flood, etc.)

*   if the data is vital, you probably have a quite formal backup and
recovery set of plans, procedures, etc.!

*   we're going to have to stop there, for now.   


이 글에 대한 댓글이 총 1건 있습니다.


데이터 웨어하우스 소개

#지금까지는 운영계 시스템과  가장 추전할 만한 디자인 기법을 * 운영계 데이터는 현재의  소개하였다. 

#그러나 Decision Support Systems(DSS)의 경우에는 필요로 하는 데이터저장에 관한 한 또 다른 요건을 필요로 한다. 즉 당신이 데이터분석을 위한 DSS 시스템을 원한다면 (추가설명: 물건을 팔기 위한 시스템이 아닌, 얼마나 팔았는지의 정보를 얻을수 있는 시스템을 원한다면) 당신이 원하는 것은 또 달라진다.

#(KROENKE 말하기를) 데이터 웨어하우스란 것은 경영진들이 업무결정을 하는 것을 도와주도록 만들어진 기업체의 데이터저장소이다.

#Decision Support Systmes(DSS)  --- 사업을 함에 있어서 경영진들이 업무결정을 하도록 보조해 주는 전산화 된 시스템

#일반적으로 DW는 크게 4개의 요소로 구성되어 있다.

    * 데이터 저장소 - 테이터 창고; 사업으로 인해 생성되는 데이터와 그 사업모델에  대한 데이터 (추가설명: 메타 정보를 말하는 것으로 추측됨)

     *  데이터를 추출하고, 정제하고 변환하는 요소 (추가설명: 흔히 ETL 이라고 함)

     *  최종사용자가 사용하는 쿼리 툴 (추가설명:현업들은 SQL문을 모르므고 보통 GUI로 구성되어 쉽게 원하는 정보를 얻을 수 있는 툴은 사용함)

      *  현업이 원하는 정보를 얻을 수 있도록 하는 프리젠테이션 툴 (추가설명: 기간계 시스템에서의 웹페이지나 최종 인터페이스와 같은 것임)

# 그러면 "운영계" 데이터와 "결정을 도와주는" 데이터의 차이점은 무엇인가?

      * 시간차

           * 운영계 데이터는 주문서, 청구서 등 등현재의 전산처리 상황을 보여주지만  DW 데이터는 전월의 매출금액, 작년의 매출금액, 지난 5년간의 매출 금액 등 장기간의 전산정보를 보여준다.    

              * DW는 과거 지향적이다. (추가설명: 가끔은 실시간 정보를 원합니다.)

     * 기초단위(추가설명:granularity의 직역은 낟알, 즉 물리에서는 원소 쯤으로 )

                 *운영계: 특정 전산처리정보

                 * DSS: 다양한 요약수준의 집계 정보 ( 경영진들은 그 요약수준을 조절함)

            * 차원

                    *  DSS를 가장 잘 나타내주는 요소라고 할만 함

                    *  데이터분석가의 입장에서는 데이터는 항상 서로서로 얽혀 있고, 그들은  다양한 방법으로 사업을 분석하고자 한다(즉, 상품 X 의 요금이 어떤 식으로 상품 Z에 영향을 미치는가? 지난 6 개월 동안에는? 특정 지역에서는? 특정 고객들 집단에 따라서는? 등등)

                     * 운영계 데이터는 지속적인 전산처리의 결과에 대해 초점을 두지 않는다.              

허지숙님이 2004-02-10 18:27에 작성한 댓글입니다.
이 댓글은 2004-02-10 18:44에 마지막으로 수정되었습니다. Edit
1057Howto Install Tomcat 5 + Apache 2 in WinXP
923Apache Module mod_authn_dbi
916Client/Server Database System
915Introduction to data warehousing [1]
866SQL 의 기초 (3)
865SQL 의 기초 (2)
864SQL 의 기초 (1)
Valid XHTML 1.0!
All about the DATABASE... Copyleft 1999-2020 DSN, All rights reserved.
작업시간: 0.012초, 이곳 서비스는
	PostgreSQL v13.0으로 자료를 관리합니다