Saturday, 30 July 2016

Why you should restart AIMS/MapGuide when testing ....

I noticed by chance that a few layers - which are all based on one Oracle view - were significantly slower in our production system but no performance issue occurred in our test environment. Whilst investigating the issue I made a few mistakes which made the whole process of finding and fixing the issue more time consuming than necessary. Here are the details:

After noticing the issue I asked my colleagues if they had any idea why this is and whether anything had changed recently. None of them had an explanation so I spent some time checking / testing:

- running performance reports in MapGuide/AIMS noticing that time spent on "Layer" is 20times slower in production
- comparing view definitions but only cursorily (looking at WHERE clause only) : mistake #1
- re-creating spatial index for base table : no change
- running 1-click maintenance in Map-Administrator : no change
- copy a layer definition from production to test environment : no change
- trying to load view into other applications to see whether the issue is related to application or to view but GeoRaptor couldn't add layer to map due to rowid issue, QGIS rejected view complaining that view didn't have a proper spatial index and TatukGIS Viewer basically saying something similar (first the message about the spatial index issue made me re-re-create the spatial index but testing further views in QGIS/TatukGISViewer I got the same messages for all other views as well).
- running a few spatial SQL queries in SQLDeveloper but there wasn't any significant difference between test and production : mistake # 2 (this result caused quite some confusion as I did not have an explanation, it did not fit and even contradicted other findings!)

After spending some time on running these tests a colleague came back to me and told me, that the view had changed recently but when he applied the change he didn't notice any performance issue whilst generating graphics in Map. The change was made by adding a subquery to the SELECT clause. He asked if the change should be removed but I replied that I will do some tests now taking the new information into account : mistake # 3.

As the change to the view definition had only been applied in the production system I applied the change in the test system as well - here mistake # 4 occurred. I then ran the MapGuide/AIMS performance report again but the performance had not decreased. Performance was good as before. Then I changed the data source of the Oracle user I was testing with and pointed it to our production database. When I ran the test again performance was slow. I stopped any further testing for the day and did something else.

At this point my theory was that the same query has different execution plans in Oracle (production and test) therefore different execution times due to different Oracle settings. What didn't really fit was that when I ran the spatial queries in SQLDeveloper I didn't see any significant difference in performance. But I didn't have anything else to test so finally I decided to replace the view definition in production with the previous view definition - basically removing the change which had been applied recently. I re-ran the performance report in MapGuide/AIMS and suddenly I got an error message:

Failed to stylize layer: DP_xxx
An exception occurred in FDO component.
ORA-00904: "G"."OWNER_OBJTYP": invalid identifier
...

Now I realised one of my mistakes. MapGuide/AIMS caches certain settings/definitions. I don't know how it works in detail but probably roughly like that:
- when AIMS/MapGuide starts it reads the Oracle schema and caches some information
- (SELECT) statements are being build based on the cached information and used against Oracle

For a view MapGuide/AIMS seems to read the view definition and then creates a SELECT statement which includes all columns of the view. 
If the view definition in Oracle changes and new columns are added to the view MapGuide can still run the SELECT statement against the new view definition - basically just ignoring all newly added columns.  But when columns have been removed from the view the SELECT statement still includes those columns and Oracle will response with ora-00904.

Mistake 4 - when I applied the change to view definition in the test system I did not restart MapGuide/AIMS. MapGuide was still using the SELECT statement which did not include the newly added column in the view. But exactly that new column had a negative impact on performance.
If I had restarted MapGuide/AIMS at this point in time I could have avoided further testing and head-scratching.

Mistake 3 - if I had listened to my colleague who offered to remove the change in production I might have noticed the cause of the problem earlier (because even without restarting MapGuide/AIMS we would have seen the Ora-00904 message in the log file).

Mistake 2 - my spatial query was simplified, the select clause included only "fid, geom". As it did not include the column which caused the performance issue there is really no surprise that performance in both production and test were nearly the same. Conclusion: always use (if possible) exactly the same (SELECT) statement as the application when you do testing!

Mistake 1 - I should have noticed the difference in view definitions myself just by copying the view definitions from production and test into an editor and then noticing the different numbers of lines....

So at the end a trivial issue but it took longer then necessary to get it fixed.

No comments:

Post a Comment