Hate?
Let's say it is a love/hate thing, but I can't help myself. I am a script nerd and probably always will be, but even I have to "evolve", or at least adapt to what the new software cycle brings. So, preparing myself for the future, I based remote administration on Oracle Enterprise Manager Grid Control. I even updated it to the latest release, 10.2.0.3. Working with this kind of tooling makes you lazy, at least IMHO, and most of the time it doesn't teach you anything. What makes me really itchy is the following…
In a customer database environment, an alert was triggered: a tablespace had just gone over the 85% full warning threshold. So I logged in to OEM and looked into the tablespace that caused the alert. I came to the conclusion that I had to add a datafile to the tablespace; let's call it APP_DATA. The datafiles of tablespace APP_DATA were all defined as locally managed with autoextend on, up to a maxsize of 4096 MB. So I clicked on the database web page, drilled down to the tablespace section, clicked through to the details of the APP_DATA tablespace, and saw on its overview page all the datafiles that belonged to this tablespace.
Somewhere at the bottom of this page you can select the "Add Datafile" option. That was what I wanted to do, so I selected "Add Datafile" and clicked the "Go" button. Great. Here was the page where I needed to fill in some small missing details, like the next extent size, the datafile name and some other minor settings. It already knew that the new datafile should belong to the APP_DATA tablespace. Great! I filled in the missing data, checked the syntax via "Show SQL" and applied it to the database.
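For illustration, the statement behind "Show SQL" would have looked roughly like this (the datafile name and the size values here are made-up examples, not the actual figures I used):

ALTER TABLESPACE APP_DATA
  ADD DATAFILE 'F:\oracle\oradata\prod\app_data02.dbf' SIZE 100M
  AUTOEXTEND ON NEXT 10M MAXSIZE 4096M;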
Usage was now back below the 85% threshold, so the alert/warning disappeared from my overall dashboard. Great stuff, all in all: automatic alert, simple login into OEM, enter some values into a "solution" window, problem solved! Child's play, no DBA knowledge needed (as advertised). After this I logged my actions in our system and mailed them to my direct customer counterpart, to tell him that I had altered the system pro-actively, as promised.
The problem here is that I looked, but I did not really LOOK! I did not see what was presented right in front of me…
Then I got a mail from my customer counterpart that more or less thanked me for the action, but asked whether I could put the datafiles on Windows volume G in the future, instead of on Windows volume F as I had done now…
I checked with my beloved SQL*Plus this time and, yep, he was correct. The datafile had been created in F:\oracle\oradata\prod instead of the G:\oracle\oradata\prod directory. I was really angry with myself for having missed this. If only I had done it via my standard SQL*Plus command line routine, I would at least have picked it up, because I would have checked where all the datafiles of tablespace APP_DATA resided…
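The check itself is trivial; a query along these lines shows exactly where every datafile of the tablespace lives:

SELECT file_name,
       bytes/1024/1024    AS size_mb,
       autoextensible,
       maxbytes/1024/1024 AS max_mb
  FROM dba_data_files
 WHERE tablespace_name = 'APP_DATA'
 ORDER BY file_name;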
By the way, before someone comes up with remarks like "Ehh, do you know about SAME, etc.?": yeah, I do know about it, but this is the way this environment was once set up, so I am currently not in a position to alter the way things are laid out.
So where did it go wrong, where did my actions fail me?
Apparently, the web page where you fill in the name, the size, etc. of the datafile to be added automatically suggests a default directory for you. Despite the fact that all datafiles of tablespace APP_DATA reside on volume G, it still suggests volume F, because that is the volume where the first database datafiles were created… While I double-checked the sizes and the syntax of the add-datafile statement, I overlooked the single character F instead of G. These kinds of small slip-ups can really mess up your physical storage layout. Apart from being disappointed in the tool (let's be honest, it's not that hard to suggest volume G when 100% of the datafiles of this tablespace reside on volume G), I was really annoyed with myself that I hadn't seen it…
So, at the end of the working day at the customer site, I prepared myself to rectify my slip-up.
This time I opened a remote Terminal Services window and logged in, started a SQL*Plus session and did the following (according to the 9.2 manuals, as this is a 9.2.0.7 EE database):
– made sure that I had a proper backup (the database is in archivelog mode, btw)
– alter database backup controlfile to trace
– double-checked for locks and still-active user sessions
– took the tablespace APP_DATA offline
– copied the datafile from the F volume to the correct directory on the G volume
– alter tablespace APP_DATA RENAME DATAFILE 'F:\ORACLE\ORADATA\PROD\APP_DATA.DBF' to 'G:\ORACLE\ORADATA\PROD\APP_DATA.DBF'
– and brought the tablespace back online…
…Everything went OK, and the whole set of (prepared in advance) actions took approximately two minutes.
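For reference, stripped of the checks and the backup, the SQL side of it was essentially this (the file copy happens at the OS level in between):

ALTER DATABASE BACKUP CONTROLFILE TO TRACE;
ALTER TABLESPACE APP_DATA OFFLINE NORMAL;
-- copy APP_DATA.DBF from F:\ORACLE\ORADATA\PROD to G:\ORACLE\ORADATA\PROD at the OS level
ALTER TABLESPACE APP_DATA RENAME DATAFILE
  'F:\ORACLE\ORADATA\PROD\APP_DATA.DBF' TO 'G:\ORACLE\ORADATA\PROD\APP_DATA.DBF';
ALTER TABLESPACE APP_DATA ONLINE;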
Then my Oracle Enterprise Manager started to alert. A critical warning was detected: "Number of missing datafiles is 10". What the… I double-checked every data dictionary view I knew of regarding offline tablespaces, files needing recovery, etc., and I just could not find anything. Everything reported as sound and working. Nothing was logged either, neither in the alert log nor in a trace file. So I checked in Oracle Enterprise Manager to see if I could deduce what was going on.
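The kind of checks I mean are queries along these lines (all of them came back clean):

-- any datafiles needing media recovery?
SELECT * FROM v$recover_file;
-- any datafiles that are not online?
SELECT file#, name, status FROM v$datafile WHERE status NOT IN ('ONLINE', 'SYSTEM');
-- any offline tablespaces?
SELECT tablespace_name, status FROM dba_tablespaces WHERE status = 'OFFLINE';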
The tablespace overview web page gave me a Java error; no content available to look at. The datafile overview web page for this tablespace looked OK, nothing to be seen there. I commented on the alert message with the date and my initials, so it was marked as "seen". After a chilly moment (at least, that is what I think happened), the Oracle agent at the customer site refreshed the status with the latest data and cleared the critical warning. The critical warning disappeared and the tablespace overview web page worked again as before.
What worries me most now is… it should have triggered an SMS message.
Every critical alert has been configured to send an email that triggers an SMS message to my mobile. It didn't happen. It just didn't happen. What is also annoying (and this happens more often) is that I can't find the "critical message" anymore. The alerting mechanism is something I should be able to trust, and despite everything being set to trace everything and alert on everything (especially "critical warnings"), nothing happened. I still have to investigate why this happened, and otherwise report it to Oracle Support, but for now it leaves me with an unsteady feeling.
I didn't mind that it didn't comply with my wish list regarding performance tuning, general set-up rules, etc., but at the very least it should monitor things and, above all, alert on critical issues. All in all it now gives me a shaky feeling, and if I can't find the problem that caused this behavior, it is maybe time to re-think my remote administration architecture.
Still use SQL scripts — write a script which lists all datafiles, verify the output, add an “ALTER TABLESPACE ADD DATAFILE ..” to the script and run it again. Every database has a scripts directory local to the server. No GUI, no Click-and-Shoot-Yourself-In-The-Foot.
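A minimal sketch of what I mean (the script name, datafile name and sizes are just placeholders):

-- list_and_add_app_data.sql
SET LINESIZE 120 PAGESIZE 100
COLUMN file_name FORMAT A60
-- step 1: run this and verify on which volume/directory the datafiles live
SELECT file_name, bytes/1024/1024 AS size_mb
  FROM dba_data_files
 WHERE tablespace_name = 'APP_DATA'
 ORDER BY file_name;
-- step 2: only then append the ADD DATAFILE with the verified path and run it again
-- ALTER TABLESPACE APP_DATA ADD DATAFILE 'G:\oracle\oradata\prod\app_data03.dbf'
--   SIZE 100M AUTOEXTEND ON NEXT 10M MAXSIZE 4096M;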
However, the more dangerous aspect is that the second alert
a) was false, and
b) did not send an SMS.
It shouldn't have triggered at all. And if it did have to trigger, it should have sent an SMS. (Of course, if it is a false alert, you don't want the SMS going to your manager, or, as I have set up my alert scripts, to the Apps team as well.)
I totally agree, Hemant, but I am trying to set up a remote administration structure based on OEM with all the mandatory features: simple, secure, generic, good SLA reporting, etc.
Apart from the fact that Oracle wants us to use OEM (and will drive us down that lane in the future), my own secure environment forces it on me as well: only OEM is allowed through secure ports and VPN tunnels to the customer database site.
Currently it is only used by me for monitoring, alerting and reporting. I draw my performance conclusions based on tooling like statspack and common sense.
A few years ago I fell out with an earlier version of EM, when we had a network issue that prevented it from starting a cold backup.
When the network engineers fixed the network problem in the morning (a problem between the two servers that the customers hadn't even noticed, btw), the database promptly shut down for a cold backup.
Since then I prefer to do backups as cron jobs, however un-trendy that makes me!