Am I the only one who has been unable to do much with this feature due
to the woefully absent documentation? Three components of "fault
diagnosability" in particular seem very interesting:
- automatic hang detection
- automatic reactive "health checks"
- incident packages as a replacement for RDA
Hang detection seems like a great idea, but there is no information on
precisely what constitutes a "hang" according to DIAG and DIA0. These
processes seem never to wake up, even in the most dire of hanging
situations. I did find that by default in single-instance databases,
the _hang_resolution, _hm_analysis_output_disk and _hm_log_incidents
parameters are set to FALSE, which I take to mean the feature is turned
off. Even turned on, long hangs involving chains of waiters visible in
hanganalyze output do not trigger any actions that I can discern. This
is slightly complicated by the fact that two components of "fault
diagnosability" share the initials HM, and packages, parameters and
views use HM interchangeably to mean "hang manager" and "health monitor".
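
For anyone who wants to check these settings on their own instance, here is a minimal sketch in Python that only builds the lookup query and its bind variables. It assumes the commonly used X$KSPPI / X$KSPPCV fixed tables, which are readable only when connected as SYS. The helper name hidden_param_query and the HANG_PARAMS tuple are made up for illustration, and running the SQL through a driver or SQL*Plus is left to the reader.

```python
# Hidden ("underscore") parameters do not show up in V$PARAMETER unless they
# have been explicitly set, so this sketch reads them from the X$ fixed
# tables instead. Assumption: the session is connected as SYS.

# The three parameters named in the paragraph above.
HANG_PARAMS = (
    "_hang_resolution",
    "_hm_analysis_output_disk",
    "_hm_log_incidents",
)


def hidden_param_query(names):
    """Return (sql, binds) that list value and default flag for each name.

    Hypothetical helper: it only builds the statement, it does not run it.
    """
    # One bind placeholder per parameter name: :p0, :p1, ...
    placeholders = ", ".join(f":p{i}" for i in range(len(names)))
    sql = (
        "SELECT i.ksppinm AS name, v.ksppstvl AS value, "
        "v.ksppstdf AS is_default, i.ksppdesc AS description "
        "FROM x$ksppi i JOIN x$ksppcv v ON v.indx = i.indx "
        f"WHERE i.ksppinm IN ({placeholders}) "
        "ORDER BY i.ksppinm"
    )
    binds = {f"p{i}": name for i, name in enumerate(names)}
    return sql, binds


if __name__ == "__main__":
    sql, binds = hidden_param_query(HANG_PARAMS)
    print(sql)
    print(binds)
```

The same helper accepts any other underscore parameter names, so it can be reused for the other diagnosability settings. Changing hidden parameters is generally something to do only under guidance from Oracle Support.
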
As for Health Checks, there is no documentation indicating what kinds of
events or incidents might result in a "reactive" health check. The
existence of reactive health checks is repeatedly asserted in the
documentation, and there is even a parameter called _diag_hm_rc_enabled
with the description "Parameter to enable/disable Diag HM Reactive
Checks". Set to FALSE by default, this parameter does nothing in the
event of a badly degraded and hanging system either. We are left to
wonder what "reactive" health checks react to!
Finally, the incident packaging service works well enough, but is
predicated completely upon the notion that any and all problems will be
associated with a fatal error of some kind. Anything that does not dump
ORA-600 or another fatal error will not result in an "incident" and thus
there is nothing to package. There is apparently no provision for
problems that do not raise such an error. Thus, despite the incident
payloads having many of the same contents as the horrid RDA of yore, you
apparently cannot generate one on demand in a supported way. You can
shoot a server
process with a SIGSEGV, but I cannot imagine that is how Oracle intends
us to get diagnostic data for opening an SR.
You can probably detect that I am frustrated, but I have been playing
with this feature set for weeks, and it is a morass of nonworking,
undocumented wastes of server memory. Remember, we are all
now running two extra background processes, DIAG and DIA0, just for this
feature. They are up and running and using memory on all of our 11g
systems even if they do nothing and are turned off at the parameter
level by default.
I am ranting here in hopes that someone else has gotten further than I
have or knows someone on the inside who can shed some light on these
concerns.
Thanks,
Jeremiah Wilton
ORA-600 Consulting