Intro to Data Warehousing
* also see: Connolly and Begg, Chapters 30, 31, 32
(sources for the below include Rob and Coronel, "Database Systems", 3rd edition.
* we've been talking about operational databases, and how they can be
best designed ---
* but, it turns out that, for decision support systems, the rules for
the needed data repository are (supposedly, at least) different --- if
you wish to use a decision support system to help with data analysis,
your needs are different;
* (KROENKE) "A data warehouse is a store of enterprise data that is
designed to facilitate management decision making."
* decision support system (DSS) --- an arrangement of computerized
tools used to assist managerial decision making within a business;
* usually considered to consist of 4 high-level main components;
* its data repository--- data store; includes both business
data AND business MODEL data;
* data extraction and filtering component;
* end-user query tool
* end-user presentation tool;
* What are the differences between "operational" data vs. "decision
support" data?
* timespan
* operational data represent current (atomic)
transactions --- a purchase order, a sales invoice, etc.
Shorter time frame;
* DSS data tend to cover a longer time frame; not each
sale, but last month's sales, last year's sales, last FIVE
years' sales; or, buying PATTERN of a customer or group
of customers;
* (DSS data tend to be more HISTORIC in nature ---
won't have up-to-the-minute data there, don't need/expect to!)
* granularity
* operational: specific transactions;
* DSS: at DIFFERENT levels of aggregation, ranging
from HIGHLY summarized to near-atomic. (And in a way that
manager can change levels at a whim!)
* dimensionality
* arguably most distinguishing characteristic of DSS data?
* from data analyst's pt of view, data are always
related in MANY different ways; and, they want to develop
the larger picture in all of those different ways; (how
did product X fare relative to product Z? during the past
6 months? In a particular region? Compared to different
groups of customers? Etc.!)
* operational data doesn't focus on EFFECTS of
transactions over time;
* DSS data-repository requirements:
* schema must support complex, NON-NORMALIZED data representations;
* DSS database is largely created by extracting data from (one or
more?) operational databases; so, DATA EXTRACTION AND LOADING is a
vital capability
* must support the analytical user with advanced data modeling
and data presentation tools;
* DSS db's tend to be very LARGE (gigabyte, terabyte ranges
aren't uncommon!) they typically GROW and don't shrink over time;
* SO --- we get the concept of a DATA WAREHOUSE --- an INTEGRATED,
SUBJECT-ORIENTED, TIME-VARIANT, NON-VOLATILE database that PROVIDES
SUPPORT for DECISION-MAKING.
* integrated: a centralized, consolidated db that INTEGRATES
data derived from the ENTIRE organization (and maybe more data
besides!) yes, that DOES mean possibly from MULTIPLE operational
databases;
* from multiple and diverse sources, possibly with diverse
formats --- you can see the possibility for standardization
headaches!
* subject-oriented: arranged and optimized to provide answers
to questions to coming from diverse functional units within a
company; DELIBERATELY lots of data redundancy, to optimize this
kind of question-answering;
* time-variant: DSS data include a time element --- not true
(in same sense) for operational data; Warehouse data reflect what
happened last week, last month, last 5 years, etc.
* also: once data are periodically uploaded to the DW,
all time-dependent aggregations are recomputed;
* non-volatile: one data enter the DW, they are never (?!)
removed!
* (KROENKE) Common components of a data warehouse:
* data extraction tools
* extracted data
* metadata of warehouse contents
* warehouse DBMS(s) and OLAP servers
* warehouse data management tools
* data delivery programs
* end-user analysis tools
* user training courses and materials
* warehouse consultants (?!)
* (Kroenke, 7th) Categories of Requirements for a Data Warehouse
* Queries and Reports with Variable Structure
* User-specified data aggregation
* User-specified drill-down
* Graphical outputs
* Integration with domain-specific programs
* (KROENKE) Challenges for data warehouses:
* inconsistent data
* tool integration (integration of various tools that the users need)
* these tools often use different paradigms ---
* DBMS products: table oriented
* OLAP products: cube oriented
* spreadsheets: spreadsheet oriented
* financial planning products: plan oriented
* etc.
* missing warehouse data management tools (that is, a LACK of 'em?)
* ad hoc nature of requirements
* Data Mart: a small, SINGLE-SUBJECT Data Warehouse SUBSET that
provides decision support to a small group of people; lower cost,
shorter implementation time;
* let a company "try out" the DW idea!
* Twelve Rules that Define a Data Warehouse
(what *is* it with these "12 rule" pronouncements, anyway?! 8-) )
from: "The Twelve Rules of Data Warehouse for a Client/Server World",
W. Inmon, C. Kelley, Data Management Review, Vol. 4, No. 5, May 1994.
1. The Data Warehouse and operational environments are separated.
2. The Data Warehouse data are integrated.
3. The Data Warehouse contains historical data over a long time horizon.
4. The Data Warehouse data are snapshot data captured at a given
point in time.
5. The Data Warehouse data are subject-oriented.
6. The Data Warehouse data are mainly read-only with periodic batch
updates from operational data. No online updates are allowed.
7. The Data Warehouse development life cycle differs from classical
systems development. The Data Warehouse development is data-driven;
the classical approach is process-driven.
8. The Data Warehouse contains data with several levels of detail:
current detail data, old detail data, lightly summarized, and highly
summarized data.
9. The Data Warehouse environment is characterized by read-only
transactions to very large data sets. The operational environment is
characterized by numerous update transactions to a few data entities
at a time.
10. The Data Warehouse environment has a system that traces data
sources, transformations, and storage.
11. The Data Warehouse's metadata are a critical component of this
environment. The metadata identify and define all data elements. The
metadata provide the source, transformation, integration, storage,
usage, relationships, and history of each data element.
12. The Data Warehouse contains a charge-back mechanism for resource
usage that enforces optimal use of the data by end-users.
* SO --- how does OLAP relate to this?
* OLAP: On-Line Analytical Processing;
* these are TOOLS --- within a data analysis environment that supports
decision making, business modeling, and operations research
activities;
* OLAP may provide one (of several?) FRONT END to a DATA WAREHOUSE!!!
* may also use their own engine --- or be atop a DBMS?
ROLAP/MOLAP/HOLAP mess right now!)
* (Kroenke, 7th) "With OLAP, data is viewed in the
frame of a table, or with three axes, in that of a cube.
OLAP makes no limit on the number of axes, hence you
will sometimes here the term OLAP hypercube."
* (Kroenke, 7th) "OLAP terminology is evolving and is
currently ambiguous in an important way. The term cube
is used both to describe semantic structure and also
to describe materializations of that underlying structure."
* 4 common main characteristics, of OLAP systems:
1. use multidimensional data analysis techniques
2. provide advanced DB support
3. provide easy-to-use user interfaces
4. support client/server architecture
* IF TIME: (no, it doesn't really fit, but if we have time!)
Chapter 14: Database Administration
* the thing is, even with my lack of real-world database experience, I'm
pretty sure that what people think of as a DBA doesn't always
encompass everything in this chapter --- and if someone wants to hire
you as a DBA, you should be aware of this, and should probably attempt
to find out what THEY think a DBA is
* In my limited experience, if you're the DBA, you're the person people
call with questions/problems/etc. regarding the database;
* HSU's Oracle DBA, Peter Johnson, gave me an initial redwood Oracle
account, and accounts for those who'd registered for the course;
* he gave me the level of permissions required to add new
users (for students who'd newly added the class);
* all student users have a level of permission that allows
them to, for example, set their own Oracle password (but student users
cannot add other users);
* when we run into Oracle-related problems, or questions we
can't figure out an answer to from the documentation, he might very
well be the person we'd call;
* According to the text, pp. 767-768, the DBA is often expected to
define, communicate, and enforce procedures that cover areas such as:
* (I'm a little uncomfortable with some of these --- some
DBA's are managers over many others, and some are more
technically-focused and merely perform technical database maintenance)
* these latter are more likely to perform day-to-day,
"operational" database management tasks, like making sure backups are
done, creating new forms and views upon request, adding and removing
and modifying levels of user access, etc;
* these latter are less likely to be setting policy, telling
everyone what database methodology is to be used, etc.;
* my impression, based on admittedly very little real-world
experience!!!, is that the role of a "DBA" varies wildly)
* (perhaps) End user database requirements-gathering
* what documentation is required?
* what forms must be used?
* (but, some DBA's will simply be hired to maintain an
existing database; you see, they will not necessarily be involved in
design, then;)
* (perhaps) Database design and modeling
* what database design methodology is to be used
(normalization? object-oriented? etc.)
* (again, if they are involved in the database's initial
design and development at all)
* (perhaps) Documentation and naming conventions
* the DBA is often in charge of the overall data dictionary
for each database, and often(?) does set and/or enforce naming
conventions;
* must be used in the definition of all data elements, sets,
and programs that access the database;
* (may be more enforcing existing standards than creating them;)
* design, coding, and testing of all database application programs
* DBA must define the standards for application program
coding, documentation, and testing.
* the DBA standards and procedures are given to the
applications programmers, and the DBA must enforce such standards.
* (they'll at least be working with the applications
programmers, in that:
* those programmers will come to the DBA when they are
having trouble with the database,
* when they have corrupted the DB and need for it to
be reloaded 8-) ,
* when they need changes in their access permissions,
* when the application is ready to be launched, etc.)
* (perhaps) Database software selection
* (they may also simply be hired to manage the existing
software --- these decisions may be made already, or at other levels!)
* database security and integrity
* (I think these are probably key tasks for any DBA!!!
* DBA will surely ENFORCE the policies governing security and
integrity; text says they may also DEFINE them, but again *I think*
that will vary widely;
* database security is especially crucial; security standards
should be clearly defined and strictly enforced.
* security procedures should be able to handle many security
scenarios, to ensure a minimum of security problems;
* (how strict will one be about this? how dire will the
circumstances be from a breach?!
* low? Don't worry about so much;
* high? You might not even have the db server
accessible outside the company's INTRAnet!)
* database backup and recovery
* (another biggie, I think, for just about anyone calling
themselves a DBA!)
* these procedures had better exist, whoever defines them! And
it may very well be the DBA, even in a more operationally-defined DBA;
* database backup and recovery procedures must include the
information necessary to guarantee proper execution and management of
the backups;
* TRUE STORY: There was someone hired away from one
institution, to be the DBA at another; let's say this was more of a
managerial type than a technical type; Peter Principle may have been
at work here ---
* this fellow did not perform proper backups of a
pharmacology database (this was for a research
hospital) --- and there was a database failure;
* we're talking data lost -- including patient data,
that hospitals HATE to lose!
* this person was FIRED (and not slowly, either!!)
* moral: if you're a DBA (or heck, in charge of ANY computer
software or data!), YOU'D BETTER MAKE SURE YOU BACK IT UP
APPROPRIATELY!
* database maintenance and operation
* again, bread-and-butter for just about any DBA, I think;
* (whether operator job logs are required, or not)
* end user training?
* I think that may vary widely --- although it may happen
informally, when a new person is hired and they ask for their db
password (and when they get it, ask: "uh --- what do I do now?")
* To take some of the "biggie" areas above a little further
* data security, privacy, and integrity
* dba needs to make use of the security and integrity mechanisms
provided by the DBMS!
* AUTHORIZATION MANAGEMENT thus may be an important DBA task ---
* User access management:
* define each user to the database (or, to each
appropriate database); at OS level, and DBMS level
* assign passwords to each user
* define user groups
* classifying users according to common access needs
into user groups may help the DBA to more reasonably
control and manage access privileges;
* assign access privileges to specific users, to access
specified databases;
* describes type of authorized access --- read-only?
some combination of READ, WRITE, DELETE?
* (in SQL? have GRANT and REVOKE commands to do this;)
* how fine can the restrictions be? on a table-by-table
basis, at least; although Peter Johnson was telling
310 how, if you want to limit a user to certain
columns, you need to make a view with those columns
and limit them to access of that view...! and, can
limit to read access, or allow update access...
* physical access control
* secured entrances, password-protected workstations,
electronic personnel badges, closed-circuit video,
voice recognition, even!
* at Shell, the db was available on the INTRAnet only
(not on a network connected to the Internet) --- and,
you wore badges, and had to sign-in with a security
guard to get into the building
* view definition
* DBA may define, to protect and control the scope of
the data accessible to an authorized user;
* CREATE VIEW can be used for this;
* (accounts payable needs different information than
manufacturing than sales, etc.)
* DBMS *utilities* access control
* may place limitations on DBMS' query and reporting
tools! Only for authorized personnel;
* (others can only see specialized forms and views, or
applications built on top of DBMS)
* DBMS usage monitoring
* quite common, I think --- expected to audit the use
of the data in the database;
* some DBMS' allow the creation of an audit log, which
automatically records a brief description of the
database operations performed by all users;
* (may be for all db operations, or just failed
operations, etc.)
* they may enable DBA to pinpoint access violations
--- or see how data is being used, etc.
* a few more backup and recovery notes
* note that there are different kinds of backups --- full,
incremental, concurrent
* full: a complete copy;
* incremental: a backup for all data since the since the last
backup date;
* concurrent: take place while the users are working on the db;
* backups need to be properly kept, and identified;
* they must be stored conveniently, but still safely;
* may have insurance!
* (and may have backups at different sites! To help in case of
disasters (fire, flood, etc.)
* if the data is vital, you probably have a quite formal backup and
recovery set of plans, procedures, etc.!
* we're going to have to stop there, for now.
|