Friday, January 30, 2009

Goodbye ORA-600, hello, Blue Gecko

Things are changing a little around here. I have been running ORA-600 Consulting for five years, and have made the big decision to join Blue Gecko, a remote administration and managed hosting provider here in Seattle. My new colleagues look familiar. They are the same people that I worked with in the first several years of building Amazon.com's IT infrastructure in the late 90s.

This was a big decision, but my direction has increasingly been toward teaching and emergency support, two areas Blue Gecko is excited to build on. So I will continue to focus on them with the backing of a great company. Best of all, I will no longer be in charge of billing, accounting, sales and toilet-cleaning duties for my own company. This should leave me more time to work on the Oracle topics I care about most.

I'm hoping my clients will all come over to Blue Gecko with me. They'll frankly get much better support from the 24x7 support staff and mature monitoring and administration infrastructure Blue Gecko provides. They can still ask for me if they want!

Friday, January 2, 2009

Log transport annoyance with ORA-01017 and ORA-16191

In 11g, when setting up log transport for Data Guard, Streams, CDC, etc., you are likely to run into this annoyance. No matter how carefully you make the SYS passwords in the passwordfile the same byte-for-byte on both the source and destination hosts, the source archiver will fail trying to log in to the destination DB with these messages written to the alert log:

ORA-01017: invalid username/password; logon denied
ORA-16191: Primary log shipping client not logged on standby

The solution is to create the passwordfiles with orapwd using the argument 'ignorecase=y':

% orapwd file=orapwORCL password=crash_on_install ignorecase=y

This issue has been documented by several bloggers and forum participants, but there is still no note in Metalink as of today. I'm not sure what part of the codepath contains this bug or if perhaps this is one of those that would be addressed with the "not a bug" response from dev.
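For what it's worth, the failure appears tied to 11g's new case-sensitive password authentication. Besides ignorecase=y, another workaround I have seen reported (hedged: I have not verified it on every patchset) is to disable case-sensitive logons and re-enter the SYS password so the passwordfile entry is refreshed:

```sql
-- Assumption: 11g, where SEC_CASE_SENSITIVE_LOGON defaults to TRUE.
ALTER SYSTEM SET sec_case_sensitive_logon = FALSE;

-- Re-enter the password so the passwordfile entry is rewritten.
ALTER USER sys IDENTIFIED BY crash_on_install;
```

Either approach gets the archiver logging in again; ignorecase=y at passwordfile creation time is the one I have used myself.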

You will probably have found this article after following the instructions in the error documentation:

Check that primary and standby are using password files and that both primary and standby have the same SYS password. Restart primary and/or standby after ensuring that password file is accessible and REMOTE_LOGIN_PASSWORDFILE initialization parameter is set to SHARED or EXCLUSIVE.

Restart the primary. I love it :-) Jeez.

All I can say is I am sorry you went through this.

Tuesday, December 9, 2008

What can I kill?

Fairly often, I have to kill Oracle background processes. At first this probably seems crazy. Do I want to destroy my database?

No, it is SOP. Archivers (ARCn) are my favorite targets. You can solve all kinds of problems by killing the archivers and letting them be automatically restarted. Often issues with redo transport services (Streams, CDC, Logical Data Guard) can be solved by forcing the archivers to restart.

Another time you might want to kill a background process is if it is leaking memory.

Unfortunately, there are many processes in Oracle that cannot be killed. Off the top of your head you might mention PMON, SMON and LGWR. Is there anyplace in the documentation that breaks down which you can kill without terminating the instance? No. So I experimented on 11g with these results:

Killable:
ARCn: Redo log archivers
CJQn: Job scheduler coordinator
Dnnn: Dispatchers
DIA0: Diagnosability process 0
DIAG: Diagnosability coordinator
FDBA: Flashback data archiver process
Jnnn: Job scheduler processes
MMNL: Manageability Monitor Process 2
MMON: Manageability Monitor Process
PING: Interconnect latency measurement
Qnnn: Queue cleanup processes
QMNC: Queue coordinator
RECO: Distributed recovery process
Snnn: Shared servers
SMCO: Space management coordinator
Wnnn: Space management processes

Instance-Critical:
ACMS: Atomic controlfile to memory server
CKPT: Checkpoint
DBRM: Resource manager process
DBWn: Database writer processes
LGWR: Redo log writer
LMDn: Global enqueue service daemons
LMON: Global enqueue service monitor
MMAN: Memory manager
PMON: Process monitor
PSPn: Process spawners
RMSn: RAC management server
RVWR: Recovery writer
SMON: System monitor process
VKTM: Virtual keeper of time process
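For the killable ones, the usual drill is to find the background process by its OS process name and kill it; PMON notices the death and respawns the process. A minimal shell sketch, assuming a Linux host and an instance named ORCL (both are assumptions):

```shell
# Print PIDs of archiver processes from "PID COMMAND" lines on stdin.
arc_pids() {
  awk '/ora_arc[0-9]+_ORCL/ {print $1}'
}

# Kill any running archivers; PMON restarts them automatically.
pids=$(ps -eo pid,comm | arc_pids)
for p in $pids; do
  kill -9 "$p"
done
echo "restarted archivers: ${pids:-none}"
```

On RAC or with nonstandard instance names, adjust the `ora_arc[0-9]+_ORCL` pattern to match your SID.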

Monday, December 8, 2008

Keeping CDC from falling apart catastrophically

CDC, especially the log-based implementations such as "asynchronous autolog," can be an effective, low-impact way to maintain a DSS system fed from a production OLTP data source.

One problem with CDC is that it uses Streams, specifically the logminer capture and apply components, which can be challenging to make work successfully. A constant danger is the CDC staging database encountering a problem that makes it unable to continue processing changes. In development, we find ourselves rebuilding the change source and sets frequently to overcome bugs, mistakes and mysterious failures.

In production, however, rebuilding the CDC environment requires rebuilding the entire DSS reporting system or data warehouse. Each change from the production source must be captured by CDC or there will be data loss. Rebuilding CDC implies losing transactions.

As it turns out, everything hinges on the availability of a recent logminer data dictionary build in the redo logs. With CDC, you can reinstantiate the CDC environment as of any point in time, as long as you have a source archivelog with a dictionary build, and all subsequent logs up to the current point in time. Thus my #1 recommendation for people using async autolog CDC:

Generate a data dictionary into the source redologs daily using a script similar to this:



SQL> variable dicscn number
SQL> exec dbms_capture_adm.build(:dicscn)
SQL> column dicscn format 9999999999999999999999
SQL> print dicscn

(Record this SCN)
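If you lose track of the recorded SCN, the source archived logs that contain a dictionary build can be located directly, since V$ARCHIVED_LOG flags them (a sketch, assuming the relevant logs have not been deleted):

```sql
-- Locate source archived logs containing a logminer dictionary build.
-- DICTIONARY_BEGIN marks the log where a build starts; use its
-- FIRST_CHANGE# as the first_scn when re-creating the change source.
SELECT name, first_change#, completion_time
FROM   v$archived_log
WHERE  dictionary_begin = 'YES'
ORDER  BY first_change#;
```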


If you ever need to reinstantiate the CDC environment, re-create the change source as of the SCN of the last available dictionary build:


SQL> exec dbms_cdc_publish.create_autolog_change_source( -
change_source_name=>'MYSOURCE', -
description=>'My OLTP Source', -
source_database=>'MYDB', -
first_scn=>'', -
online_log=>'y')


Then create a change set that begins processing changes as of the time that the previous failed CDC staging system failed:


SQL> exec dbms_cdc_publish.create_change_set( -
change_set_name=>'MYSET', -
description=>'My Change Set', -
change_source_name=>'MYSOURCE', -
begin_date=>to_date('2008-12-09 18:00:00','yyyy-mm-dd hh24:mi:ss'), -
stop_on_ddl=>'y')


Finally, create your change tables and subscriptions as usual. The changes in the change tables will begin as of the date specified in create_change_set.
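For completeness, a change table creation looks something like this (the publisher schema, source table and column names here are all hypothetical):

```sql
-- Hypothetical: publish SCOTT.ORDERS into change set MYSET.
BEGIN
  dbms_cdc_publish.create_change_table(
    owner             => 'CDCPUB',
    change_table_name => 'ORDERS_CT',
    change_set_name   => 'MYSET',
    source_schema     => 'SCOTT',
    source_table      => 'ORDERS',
    column_type_list  => 'ORDER_ID NUMBER, STATUS VARCHAR2(10)',
    capture_values    => 'both',  -- capture old and new values
    rs_id             => 'y',
    row_id            => 'n',
    user_id           => 'n',
    timestamp         => 'n',
    object_id         => 'n',
    source_colmap     => 'n',
    target_colmap     => 'y',
    options_string    => NULL);
END;
/
```

Subscribers then pick up from the new change tables with no gap, because the change set started at the failure time.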

Friday, November 21, 2008

Working for unlicensed sites

I wonder how other independent professionals feel about helping companies that are running systems on Oracle without purchasing sufficient (or any) licenses from Oracle. Where do you draw the line? A few examples can be illuminating:

One client of mine simply wanted to complete the RFP requirements to write a proposal to sell servers to a major bank. They only needed to install Oracle on a single host, run a simple load test against it for 24 hours, and submit the results with the RFP. I did not ask, but Oracle was clearly unlicensed at this site.

Another client is running a major order processing and fulfillment system, and "are not sure" if they are licensed. They definitely have no support. My sense is that when you fork over your $50K+tax per proc, you will notice the cash missing from your budget. Do you think I should push for proof of license compliance or is it none of my business?

If your answer is the latter, are there any ethics to apply to these scenarios? How far do you go? Would you download EE onto their server yourself and call it their problem if they fail to license it?

What about a site that queries v$active_session_history without paying for the tuning pack?

I routinely help those interested in complying to do so, and help in negotiations with Oracle, but what about these egregiously non-compliant sites?

Monday, November 17, 2008

How to set your server process spinning

For one of my presentations I searched for some time for a way to make my own server process get into a spinning and hung state. I found the best example in this SQL, which spins on 11.1.0.6 and can be run by any user, including non-privileged. When I had a friend run it as a non-privileged user on a Windows server, it made the server almost unusable, and necessitated stopping the Oracle service to get things back to normal.

SQL> select 1 from dual where regexp_like(' ','^*[ ]*a');

The statement will not return, and you may not be able to ^C out of it. The server process/thread will consume all idle CPU and the time will be spent in uninstrumented code (STATUS=WAITED SHORT TIME and SECONDS_IN_WAIT > 1).

Don't run this on any system that you can be fired for bringing down! You have been warned.
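If you do try this somewhere safe, the spin is visible from another session in v$session, just as described above (a sketch against 11g's v$session columns):

```sql
-- Find sessions burning CPU in uninstrumented code: nominally in a
-- short wait, but with seconds_in_wait climbing.
SELECT sid, serial#, state, event, seconds_in_wait
FROM   v$session
WHERE  state = 'WAITED SHORT TIME'
AND    seconds_in_wait > 1;
```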

The alert log, tail -f, Flatland and Windows DBAs

I was working with Dan Morgan over at PSOUG Saturday and mentioned to him that the ADR Command Interpreter (ADRCI) in 11g provides `tail -f`-like capabilities for the alert log, including on Windows. He was impressed by this, since previously we had to download third-party software to do anything like `tail -f` against the alert log on Windows.

c:\> adrci
ADRCI: Release 11.1.0.6.0 - Beta on Mon Nov 17 10:29:51 2008
Copyright (c) 1982, 2007, Oracle. All rights reserved.
adrci> show alert -tail 100 -f


I imagine the ability to follow the tail of a logfile to be nothing short of revelatory for some Windows-only DBAs, some of whose eyes I have seen visibly widen in amazement as they watched output appear in the cmd window in real time from the alert log. I described it to Dan as being like popping out from a two-dimensional world into a world where you suddenly realize there are three dimensions. Wow! Everything looks so colorful!