Relational Database Tutorial and Tips For Beginners
Intro To Data Modeling
- By Alex Glaros
Welcome to my online intro training class on how to design files for a relational database system. You can learn how to design files to accurately reflect almost any business problem. This is called modeling the data.
Here's the dilemma: Real world business problems can't be represented with simple lists of information.
Relational databases can accurately model all kinds of problems. But, using them is not intuitive, at least not at first!
I find it somehow comforting that the relational database idea is not some passing-fad theory but actually based on relational algebra. Its based on solid math concepts that make your information work for you in the most efficient manner possible. Relational databases will always be with us! But don't worry, there won't be any complicated math!
I've taught many people how to design data files and discovered that there's two ways people learn. By: (1) using rules or (2) intuition. Beginners with math blocks can learn intuitively just as fast or faster than skilled computer programmers!
Most novices approach creating data files as though they are typing a list on a piece of paper. For some situations this works. Spread sheets for simple problems are good for this.
Here's an example where a simple file is sufficient: you want an inventory of all the valuables in your house. It could look something like this:
1999 Toyota Corolla $15,000
Gibson Guitar $600
RCA TV - serial #1234 $300
This is good and valid.
Let's look at another situation. You are hired to manage the records for a new library. For convenience of illustration, let's say that there's only a few patrons, each with a unique name and a few books. So, you get your trusty spread sheet out and set up a single file that looks like this:
Patron Name Patron Address Book Title Book Author Date Book Checked Out
Andrew 80 Pleasant St., Santa Cruz, CA Tennis! Scott Brook 3/3/2001
Andrew 80 Pleasant St., Santa Cruz, CA Soccer! Tom Brown 3/3/2001
Martina 7 D St, Chicago, IL American! Susan Thomas 2/2/2001
It works! Now the library can keep track of who checked out what books and when! But look a little closer. Andrew checked out two books, Tennis! and Soccer! Notice that you had to type in Andrew's address twice. Bummer! Now if Andrew checks out 20 books, you'll have to type in his whole address 20 times! Redundant! Ugh!
What's just as bad is what if Andrew changes his address and moves across the street from where he used to live. Then you have to update the addresses on all 20 records!
More bad news, every time you type in a book, you have to type in the author's name.
Somehow, our beautiful single file system has gotten out of control. There must be a better way.
Enter the relational database. It can handle this messy situation cleanly!
First, the relational concept says break up the single file into three files. Each one will be dedicated to serving one of the three types of information that that the single file was trying to handle.
One type of info was keeping track of who the patrons are, so a whole file is dedicated only to patron info. Looks like this:
Patron Name Patron Address
Andrew 80 Pleasant St., Santa Cruz, CA
Martina 7 D St, Chicago, IL
The other general type of info was the book info. It looks like this:
Book Title Book Author
Tennis! Scott Brook
Soccer! Tom Brown
American! Susan Thomas
Finally, we have a cool concept that handles the fact that a patron checked out a book. We create a third file that contains just enough info to identify a patron, the book that was checked out, and when. This clever file looks like this:
Patron Name Book Title Date Book Checked Out
Andrew Tennis! 3/3/2001
Andrew Soccer! 3/3/2001
Martina American! 2/2/2001
Its called an intersection file because its kind of an intersection between the patron file and the book file
Notice that you've only had to type in Andrew's address once (in the patron file), and now that he's ready to check out books, you use the intersection file, where you only have to type in his name, the book title and when it was checked out. You don't have to type in Andrew's address over and over again! Hooray!
Hmmm...interesting you say, but so far all I've done was create 3 files. What kind of weirdo system is this? The information is more scattered around than before!
Don't worry! There's software that does all the rest for you! Software like dBase, Oracle, Microsoft Access, FoxPro, etc. The software ties together all of the information from all three files and displays it so that its logically connected. Here's what it does: first, it matches all the patron names from the intersection file with the patron names from the patron file and brings the patron and intersection info together.
So far, the software is starting to collect the data from the patron and intersection files like this:
Patron Name from intersection file : Andrew
Patron Address from patron file : 80 Pleasant St., Santa Cruz, CA
Book Title from intersection file : Tennis!
Checkout Date from intersection file : 3/3/2001
The software saw that "Andrew" was in the intersection file and that "Andrew" was in the patron file, so it took all the matching data from both files relating to "Andrew" and brought it together as shown in the table above. But its not done yet.
Then the software gets the book names from the intersection file and matches that with the book names from the book file and adds the author info from there to the information already collected in the table above. Then, the info looks like this:
(Below is a computer screen display of what you would see)
Patron Name from intersection file Patron Address from patron file Book Title from intersection file Book Author from book file Checkout Date from intersection file
Andrew 80 Pleasant St., Santa Cruz, CA Tennis! Scott Brook 3/3/2001
The software saw that "Tennis!" was in the intersection file and that "Tennis!" was in the book file, so it took all the data from both files relating to "Tennis" and brought it together.
So now, the software has done all the work matching the info from the three files and it displays the results for you in a logical manner. It then repeats all of the above for Andrew's other checked out book and Martina's checked out book.
Okay, you say, but, isn't it a hassle to keep 3 files open like that at the same time to enter the data?
The relational database software comes to the rescue again by making data entry easy. The software knows that you're working on more than one file at a time and that the files have related information, for example, it knows that each patron name in the intersection file, will have a match in the patron file.
Therefore, the software sets up your data entry so that you only have to type in the information once and it automatically copies the info that you typed to the related files. But because the software knows what you're up to, it can do some other very helpful things.
For example, let's say you mistyped Andrew's name (maybe you typed "Andy" by mistake) while adding a record to the intersection file. The software would see that there's no "Andy" in the patron file and give you a warning. This greatly cuts down on errors.
Same thing if you were typing in a book a patron was checking out and you misspelled the book name. The software would tell you "That book doesn't exist in the book file!"
Remember that Andrew's (or any patron's) address only appears once, and that's in the patron file. So if Andrew moves, his address information only has to be updated one time in one place, and no matter how many books he checks out, his address will always be current.
How many files do I need to create?
Actually, each situation requires a different number of files. Sometimes one file, sometimes 300 or more. Here's an oversimplified way to design your files.
First, gather all information regarding your project. Let's say, you're creating a computer system to run a college. Get the college info (name of college, building locations, class room locations, etc., address, phone numbers, etc.), all info kept regarding students (not the actual data for every student, but the types of info in general, for example, take note that the college maintains this info: student name, student address, student date of birth, etc.), professor information, student course info (subject taken, semester taken, professor name, etc.).
Collect a sample of each type of info, for example, student report card, professor schedule, any and all documents, existing computer files, etc.
Okay, now you have giant pile of every piece of info regarding the college. Next step, is figuring out what the entities are. An oversimplified way of describing what entities are is to say that they can be a person, place or thing. Here's what some of the entities would be for our college example. The college, the students, the professors, the buildings, student grades, courses, course schedule, etc. Separate out all of the entities.
Next step, find all of the properties of the entities and list them under the entities. For example, student address is a property of student entity. Here's more examples:
student social security number
student date of birth
course unique id
department course is taught in
professor social security number
professor home address
professor office location
professor phone number
So we started out with this giant pile of information, and now we're picking up each piece from the big pile and dropping them in little entity piles where they belong, until the big pile is gone. How about the course schedule? What properties would that have? Well, we can see that the course schedule entity, is an intersection file like in the library example.
Here's what the course schedule would look like:
course unique id (matches id from the course file)
semester (from the semester file)
classroom location (matches id from the classroom file)
class time (E.g., Mondays at 10 am)
professor social security number (matches SSN from the professor file)
Notice, like in the library example, that you don't have to redundantly type in lots of information in the course schedule file to uniquely identify a specific class. For example, you don't have to type in the course description over and over again, because the software pulls that from the course file! The course unique id is enough to let you know exactly which course it is and the software connects the two files when you need to see all of the info! Also, you don't have to type in the professor's name over and over again, as that is pulled from the professor file. Just the professors' social security numbers are all that are required in the course schedule file.
You may need many intersection files. For example, the grades file would be another intersection file.
student social security number
course unique id
grade (e.g., A, B, C, D, F)
For the grades file, notice that you don't have to type in the student's name or address over and over again. Same with the the course; just unique course id is enough so you don't have to type in the course description over and over again thousands of times for each student taking it! You only need to type in the course description once, in the course file. The software will pull the course description in and merge it with the grades and display it on a computer screen or printout when someone wants to see it.
Looking at data modeling like this, its possible to develop an intuitive feel for setting up your files with the goal of removing unnecessary redundancy. Anybody can get this! Just remember that one of the properties for each file must uniquely identify each instance of the entity. For example, the professor file absolutely must contain some property that uniquely identifies each professor, e.g., social security number. Same with the building file, it absolutely must contain a property that uniquely identifies it from all other buildings in the college, e.g., building name. This is called putting the data in third normal form, which you can study more formally in the books on the side bars.
. Each instance of the entity is called a "record". For example, in the professor file, John Doe and his properties make up a record, Susan Smith and her properties constitute another record.
. Each property is called a "field". E.g., name field, address field, date of birth field, social security number field, etc.
The property that uniquely identifies each instance is called the "key". . For example, the key for the professor records is professor social security number.
. Designing data files is called "data modeling".
. Different software programs sometimes use a different naming system.
The software connects the various files by matching keys or other fields. E.g., it will match professor social security number from the course schedule file with the social security number for the professor in the professor file. For each use that you want to make of the data, you find the files that contain your information and merge it onto your computer screen or paper printout. This merge is called a "join". The join doesn't create permanent new files, it just temporarily matches the information from the various files that you're interested in and displays them on your computer screen or paper printout.
Here's an example: the dean wants a printed report of every classroom in the college with its street address. You don't need the student or professor files for this. Here's two files that you would need:
There is a right way and a wrong way to design files. Notice that you do not want to have a "street address" field in the classroom file because that can already be derived from the building file. If you put the street address field in the classroom file, you would be unnecessarily repeating the street address over and over again. You know which building the classroom is in, and the building file has the street address in it. Are you feeling your intuition working yet? You will. Once you start seeing actual examples of your experiments, let's say the street address is repeated hundreds of times in the class room file, then you kind of start seeing that the street address really belongs in the building file. When designing your files, put some actual data in them and look for high repetitions of the same information. See if your intuition is telling you that maybe a field in one file doesn't belong there but really belongs in another file.
Continuing on with the task that the dean has assigned you, you would "join" both files, matching them by the building name, which is a property of both files. Building name is the key to the records of the building file. The result would look like this:
NAME STREET CLASSROOM
Art Building 100 Van Ness Ave., S.F. Room 300
Science Building 234 Van Ness Ave., S.F. Room 120
But the above would not be another file. Its just derived from the other files and displayed on the computer screen or paper printout at the time the viewer wants to see it.
Don't create unnecessary files (for example, a file that looks like the one above) if the information can be derived from the already existing files.
Sometimes the key of a record is made up of more than one field (property). For example, in the course schedule file above, you can't uniquely identify the schedule for a specific course without including course unique id, semester, classroom location, class time. Think about it, the exact same class could be given 3 or 4 times in the same classroom, so you absolutely need to include the time that the class meets to distinguish it from the same course being taught in the same room but at another time.
When designing your data files, notice that they do not look like the paper documents in the college! For example, the course schedule has a lot of info trimmed off compared to the paper printout version because it can derive what it doesn't have from the related course, building and professor files. This confuses a lot of beginners but, you get used to designing the data files like this. The intersection files have just the bare minimum required to uniquely identify the entities that they will be pulling the rest of the info from.
When designing files, get rid of unnecessary redundancy! Some redundancy is necessary though, for example in the course schedule file, the semester that the courses are taught in will have to be repeated over and over again, to make sure that students and professors know which semester each class will be held in. We're talking about getting rid of unnecessary redundancy, not necessary redundancy.
A single file is called a "flat" file, but the collection of related files is called a "database".
When creating reports from your college database, you may want to know how to retrieve only certain records that you are searching for. To solve this problem, software languages have been invented to manipulate and retrieve data selectively. SQL (Structured Query Language) is one such language. In essence, using a software language, you can frame questions to your system of files that you have designed, like this: "Please match all of the course, and course schedule records, and all of the professor records using the "professor social security number" and "unique course id" to match them with, then list only those courses in the sociology department this semester."
By finding a unique key for each file, then adding its attributes (fields) and removing redundancy you've designed the files so that they can be grabbed or "joined" by the software in the most flexible and most mathematically efficient manner possible! The fields that match in two or more files are like bridges that connect the files.
Now all kinds of reports can be generated on the fly by just joining the files and printing them out. Even unusual requests for reports can be easily generated on the fly, for example, lets say the dean wants a printed out report of every student that has had a specific professor. The way the files are designed, this is a snap. You would use your computer language, like SQL, to select only the professor that you are looking for. Then take that record from the professor file and join that with the course schedule file using the professor social security number as the matching field to build a bridge between the two files. Then added to that, you would join the student file using the student social security number as the matching field to build a bridge to the already joined files.
In general, you first model your data, design your files so that each entity has a unique key and then add the rest of the fields that are attributes of that entity. Then the rest is easy, just create desired reports from your data by joining your files. This means matching the keys from relevant files with the matching fields in the related files. You'll be amazed how very complicated problems can be easily solved using relational databases.
For more detailed studies, check out the books in the side bar or click here to get all books related to data modeling at Amazon.com data_modeling(http://www.amazon.com/exec/obidos/external-search?tag=pottytraining&keyword=data%20modeling&mode=blended)
Got suggestions on how to clarify or improve this? Send me an e-mail! AlexG@surfermall.com