True cause of the pages disappearing from the Google index

We had been trying for a month to figure out why our pages had disappeared from the Google index. The best way to get an answer seems to be to ask a question rather than guess. I finally got fed up with this issue, got contact information for someone at Google that we had been talking to, and asked them directly why our pages had disappeared from their index. It seems that Google does quality checks on the pages it indexes to help ensure that the results it provides are what its users want, and our site failed for the reasons below.

  1. There was spam cloaking. By that they meant that the pages we displayed to search engines were not the same pages we displayed to users. Most of our articles require a subscription, but we allow search engines access to the full content so that they can index our site. When a normal user tried to access content that required a subscription we gave them a 401. The suggestion was that we give them an abstract page for the same content instead. I tend to agree with them here and think this is a much better way of handling it. We just need to add a message on the abstract page to let the user know why they were redirected.
  2. Our site quality has not been the best of late. Since we moved our site to a new location we have had a myriad of problems causing some downtime. One of the problems mentioned was that some URLs were timing out. These URLs should have been giving 401 errors to users that do not have subscriptions, but they were taking forever to do so. The cause is described here, if you are interested.
  3. Redirecting all 404s to a sitemap seems to be a no-no, and we were doing this. Serving the sitemap as the 404 page instead of redirecting to it would probably be a better approach. That way we could send 404 as the response code and the sitemap as the content of the page.
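Points 1 and 3 come down to choosing a status code and a page body for each request. Here is a minimal sketch of that decision in plain Java. The class name, the Response holder and the notice text are all made up for illustration, and this is not the code that runs on our site.

```java
public class ArticleResponsePolicy {

    /** Minimal stand-in for an HTTP response: a status code plus the page body. */
    static final class Response {
        final int status;
        final String body;

        Response(int status, String body) {
            this.status = status;
            this.body = body;
        }
    }

    // Hypothetical notice text; the post only says that a message should be shown.
    static final String SUBSCRIPTION_NOTICE =
            "The full article requires a subscription. Here is the abstract.";

    static Response respond(boolean articleExists, boolean hasSubscription,
                            String fullText, String abstractText, String sitemapHtml) {
        if (!articleExists) {
            // Point 3: answer with a real 404 and use the sitemap as the page body,
            // instead of redirecting to the sitemap.
            return new Response(404, sitemapHtml);
        }
        if (!hasSubscription) {
            // Point 1: show the abstract plus a notice instead of a 401.
            return new Response(200, SUBSCRIPTION_NOTICE + "\n" + abstractText);
        }
        return new Response(200, fullText);
    }

    public static void main(String[] args) {
        Response r = respond(false, false, "full text", "abstract", "<ul>sitemap</ul>");
        System.out.println(r.status + " " + r.body);
    }
}
```

In a servlet the same decision would end in a setStatus call followed by writing the chosen body.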
The issues have been resolved. Now we just have to wait and see what happens. I found the Google staff to be surprisingly helpful in resolving this issue. I am pretty sure they can't do this for all sites, but I still appreciate their assistance.


Posted in Uncategorized | 2 Comments

Debugging Production Performance Issues With The Power Of The Thread Dump using jstack

Until recently I had never had to do much performance tuning on production systems, and whenever I have, I have always been able to replicate the problem in the development environment. Recently I ran across a problem that was happening in one production environment but not in another. You see, we have our live production site and our backup site. Our backup site used to be our production site, but we moved to the new environment a few months ago. Users started to get timeouts when they tried to access pages that they were not authorised to see, instead of getting 401s.

I was at a loss on how to debug this issue. However, one of the other developers showed me how to use thread dumps to see what was going on on the server at the time the problem was happening. jstack comes with the Java SDK. It prints out what looks like a stack trace for each of the threads in a particular JVM. We went to one of the offending pages, and while we were waiting for the page to load we ran jstack repeatedly and sent the output to different files, which we then compared to see if any of our code was causing the problem.
Our servers run Linux, so we ran the following command to get our application server's process id.
ps -ef | grep java
The output was something like this:
developers 3456 1 41 07:11 pts/0 00:00:16 /usr/lib/jvm/java-6-sun- -cp /home/developers/glassfish3/glassfish/modules/glassfish.jar -XX:+UnlockDiagnosticVMOptions -XX:MaxPermSize=192m -XX:NewRatio=2 -Xmx512m -javaagent:/home/developers/glassfish3/glassfish/lib/monitor/flashlight-agent.jar -client -Dfelix.fileinstall.disableConfigSave=false
Where 3456 is the process id. We then used jstack to take a bunch of thread dumps like so:
jstack -l 3456 > ~/tdump1.txt
jstack -l 3456 > ~/tdump2.txt
We took more than two dumps, and comparing them showed us where our code was stuck. We were doing something silly: catching the exception caused by the 401 and emailing it. We changed the code to write to a log file instead. This problem was part of the cause of some of our pages disappearing from the Google index.
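The idea of comparing dumps can also be sketched in Java itself. The example below is only an illustration: it looks at the threads of its own JVM through Thread.getAllStackTraces(), whereas jstack attaches to another process, and the class name and the one second interval are my own assumptions. A thread whose top frame is the same in two snapshots is a candidate for being stuck, although idle threads that are simply waiting will show up as well.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class StuckThreadFinder {

    /** Records the top stack frame of every live thread in this JVM. */
    static Map<String, String> snapshot() {
        Map<String, String> topFrames = new HashMap<>();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            StackTraceElement[] stack = entry.getValue();
            if (stack.length > 0) {
                Thread t = entry.getKey();
                // Thread names are not guaranteed unique, so the id is appended.
                topFrames.put(t.getName() + "#" + t.getId(), stack[0].toString());
            }
        }
        return topFrames;
    }

    /** Returns the threads whose top frame is identical in both snapshots. */
    static Map<String, String> unchanged(Map<String, String> first, Map<String, String> second) {
        Map<String, String> same = new TreeMap<>();
        for (Map.Entry<String, String> entry : first.entrySet()) {
            if (entry.getValue().equals(second.get(entry.getKey()))) {
                same.put(entry.getKey(), entry.getValue());
            }
        }
        return same;
    }

    public static void main(String[] args) throws InterruptedException {
        Map<String, String> first = snapshot();
        Thread.sleep(1000); // hypothetical interval between the two snapshots
        Map<String, String> second = snapshot();
        unchanged(first, second)
                .forEach((thread, frame) -> System.out.println(thread + " -> " + frame));
    }
}
```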
Posted in Performance Tuning | 2 Comments

Samba won’t let me browse my work computers

I have been using Ubuntu at work since day one, just over a year ago. I never had any problems accessing Windows shares on the network until I did a clean install of Ubuntu 11.10 on the laptop they gave me. I had upgraded the desktop at work from 10.04 to 11.10 and was still able to access the network shares. However, the fresh install of 11.10 did not let me. It kept saying password required for machineX.

I found this bug in Ubuntu and thought it was my problem. The default hostname that Ubuntu chose when I was setting up was something like developer-dell-lattitune-e4260. After doing some more research I tried changing my hostname to a2nb and that fixed it. Just for kicks I rebooted the machine into Windows and changed the network name. Windows only lets you enter 15 characters, which is the length limit for NetBIOS names. I guess that is the maximum length you can use in Ubuntu if you want to access Windows shares using Samba.
To change your hostname you must:
  1. edit /etc/hostname and type the new name that you want. 
  2. edit /etc/hosts and update the corresponding entry in there to match the name you put in the hostname file.
  3. reboot.
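Before going through those steps it is easy to check whether a hostname is over the limit. This is a small sketch that assumes the 15 character NetBIOS name limit is what matters here. The class and method names are made up, and InetAddress.getLocalHost().getHostName() may return a fully qualified name on some machines, so passing the short name as an argument is the safer way to run it.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class HostnameLengthCheck {

    // NetBIOS names can be at most 15 characters long.
    static final int NETBIOS_MAX_LENGTH = 15;

    static boolean fitsNetbiosLimit(String hostname) {
        return !hostname.isEmpty() && hostname.length() <= NETBIOS_MAX_LENGTH;
    }

    public static void main(String[] args) throws UnknownHostException {
        // Use the first argument if given, otherwise whatever the JVM reports.
        String hostname = args.length > 0 ? args[0] : InetAddress.getLocalHost().getHostName();
        System.out.println(hostname + " (" + hostname.length() + " characters): "
                + (fitsNetbiosLimit(hostname) ? "fits" : "too long for a NetBIOS name"));
    }
}
```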

I almost forgot: one other developer at work had the same issue after she installed Ubuntu on her laptop. Changing the hostname worked for her as well.


Posted in Mysteries | Leave a comment

Dual External Monitors on a Dell E4260 in Ubuntu 11.10

I basically wanted to replace my desktop at work with the laptop that they gave me, since I might get to telecommute sometimes. The desktop at work had two nice big monitors and I wanted to use these with the laptop, through the Dell docking station. This is one of those things that I think should be easy but was not for me, mostly because I looked in the wrong place for the solution at first.
The laptop has an nVidia 4200m video card with Optimus. I installed Ubuntu as normal, installed the current nVidia driver with sudo apt-get install nvidia-current, rebooted the computer, went to the console and typed nvidia-settings. I got this error message: “You do not appear to be using the NVIDIA X driver. Please edit your X configuration file (just run `nvidia-xconfig` as root), and restart the X server.” So like a lamb to the slaughter I did just what it said. Unfortunately, when I rebooted the computer I just got a blank screen.
I rebooted the computer in recovery mode, deleted the /etc/X11/xorg.conf file and rebooted again. It started normally. I did some more research and found that the bumblebee driver is supposed to work with the Optimus nVidia cards. I installed it and the system started up fine, but I still was not able to detect all 3 monitors. After some more research I found here that all I needed to do was disable the Optimus feature in the BIOS.
So I disabled Optimus in the BIOS, rebooted the computer and reinstalled the nVidia driver. Then I was able to run nvidia-settings and see all 3 monitors, but I only configured the two external ones in TwinView. I might update this later, after I figure out how to have just the laptop monitor as primary when the other 2 are not connected and one of the externals as primary when they are. I am not sure how much effort I will put into this, as I hear that the next version of Ubuntu will have more support for this kind of thing.
Posted in Simple things made difficult | 8 Comments

Cisco VPN Install On Ubuntu 11.10 64bit

So I got hired by the client that I was contracting for. I start work on Monday. They gave me a laptop today and I wanted to use that as my development machine. The computer I use at work has Ubuntu and I decided to install Ubuntu on this one as well. I don't like Unity, so I was thinking about going with a pre-Unity version of Ubuntu, but I decided against that and I am dual booting 11.10 with Windows 7. The 11.10 installation went off without incident.

When I got home I wanted to install the VPN client. I installed Sun Java and the JRE plugin using instructions found here. Then I went to the VPN URL. I logged in and was almost immediately taken to the following screen.
I had a 32 bit Ubuntu virtual machine on another computer, so I went to the manual download there and mailed the installer to myself from that computer. According to Cisco, 64 bit is not supported on Ubuntu.
When I tried to run the 32 bit installer I got the following.
Installing Cisco AnyConnect VPN Client …
Removing previous installation…
Extracting installation files to /tmp/vpn.xtpx8H/vpninst469538298.tgz…
Unarchiving installation files to /tmp/vpn.xtpx8H…
Starting the VPN agent…
/opt/cisco/vpn/bin/vpnagentd: error while loading shared libraries: wrong ELF class: ELFCLASS64
After a bit of searching I found the solution here. I needed to get the 32 bit dependencies.
  1. wget
  2. sudo dpkg -i getlibs-all.deb
  3. getlibs
  4. sudo ./
This time the installation ran without error. As always I hope this saves the next person some time.
Posted in Simple things made difficult | 6 Comments

Outputting the contents of a String variable in Freemarker

Today I ran across one of those things that should be simple, but is not. I call a remote service to get product information and I display that information in the store through a Freemarker template. Normally this is not an issue. I just write ${x}, where x is my variable, and it is all good. However, I ran across an instance where the product title had some HTML code in it. Yes, I know that the title should not have HTML code in it, but that is beside the point. When Freemarker finds HTML characters like & in a String variable, it decides to clean it up by escaping the HTML code. For example, it would replace & with &amp;. This caused my page to display badly, because all of those escaped HTML characters made one long string with no whitespace that messed up the columns of one part of the page.

I could not find anything in the Freemarker documentation that describes how to just output a String variable in raw form. I saw that if it was a String literal like " " I could do the following ${r" "} but there is nothing similar for a variable. Fortunately I did find ofbiz’s workaround for this problem. What they did is implement a StringWrapper that does nothing but hide the String in a wrapper class whose toString method calls the toString method of the wrapped String. Now Freemarker does not know it is a String and outputs the raw value from the toString method.
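A wrapper along those lines can be very small. The sketch below is written from the description above and is not copied from ofbiz, so the details of their class may differ. Whether Freemarker really leaves the value unescaped depends on how the template and data model are set up, so treat it as a starting point.

```java
public class StringWrapper {

    private final String wrapped;

    public StringWrapper(String wrapped) {
        this.wrapped = wrapped;
    }

    /** Hands back the wrapped String unchanged. */
    @Override
    public String toString() {
        return wrapped.toString();
    }

    public static void main(String[] args) {
        // Hypothetical usage: put the wrapper in the data model instead of the
        // bare String, e.g. model.put("title", new StringWrapper(title));
        StringWrapper title = new StringWrapper("Salt & Pepper <b>Set</b>");
        System.out.println(title);
    }
}
```

The template itself stays the same and still uses ${title}.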

Things like this really frustrate me. By default if I say print out a variable I expect the language to just print out the raw value. If I want it formatted for HTML I should have to do something special, not the other way around.

Posted in Simple things made difficult | 1 Comment

OutOfMemoryError – GC overhead limit exceeded

When we initially set up this application we gave it 2 gigs of RAM. The server it was running on had 4 gigs of RAM. It occasionally ran out of memory and gave the error OutOfMemoryError – GC overhead limit exceeded. We figured that the application needed more memory, so we increased the memory on the server to 8 gigs and gave the application 6 gigs of RAM. Overkill, huh?

After we increased the memory I was tasked with monitoring it to see what would happen. At first I thought it was a memory leak, but the application has some reasonably good memory management tools. It allows you to see the amount of free and used memory in the JVM, and it allows you to clear the caches and run GC. After a few days of clearing the cache and running GC I realised that the memory usage after GC was low and stable, so I decided to look elsewhere.

I ran free -m on the server and got results like those listed below.



        total     used     free   shared  buffers   cached
Mem:     8192     8172       19        0      178     1445



Then it dawned on me what the problem was. The application will basically use whatever memory is available to the JVM for caching data. By increasing the memory we had delayed the problem but not solved it. Java was using just over 6 gigs of RAM and the other applications on the server were using almost all of the other 2 gigs. There was not enough RAM left over for the operating system to do much else, so when Java wanted to garbage collect the system started thrashing. As a result the garbage collector spent a long time accomplishing nothing and was not able to free memory.
What we should have done is maybe increase the memory on that server by 500 MB to 1 GB and leave the Java memory settings as they were. I was even getting page faults when I tried to grep some relatively small files.
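The arithmetic behind that conclusion is simple enough to write down. This is a rough sketch using round numbers from this post, with about 2 gigs assumed for the other applications in both cases. It only counts the Java heap, and a real JVM needs memory beyond the heap, so the true headroom is smaller than what this prints.

```java
public class HeapBudget {

    /** RAM left for the operating system after the Java heap and the other processes. */
    static long headroomMb(long totalRamMb, long heapMb, long otherProcessesMb) {
        return totalRamMb - heapMb - otherProcessesMb;
    }

    public static void main(String[] args) {
        // What we did: 8 GB server, 6 GB heap, about 2 GB for everything else.
        System.out.println("8 GB server, 6 GB heap: " + headroomMb(8192, 6144, 2048) + " MB left");
        // What we should have done: 4 GB plus 1 GB extra, heap left at 2 GB.
        System.out.println("5 GB server, 2 GB heap: " + headroomMb(5120, 2048, 2048) + " MB left");
    }
}
```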
Posted in Performance Tuning | 9 Comments

Moving wordpress to root

So this is my first website. I set up WordPress in a subdirectory and wanted it at the root of the website. From wp-admin I clicked on Settings, then changed the WordPress address URL and site URL from the subdirectory address to the root address. Unfortunately all that did was make WordPress inaccessible. I also had to copy the contents of the wordpress directory to the root. I did it by FTPing the whole site down to my computer and then FTPing the contents of the wordpress directory to the root of my website, and that did it.

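I enabled ssh and it would have been faster to just log in via ssh and run these 2 commands:
  1. cd wordpress
  2. mv ./* ../
Note that ./* does not match hidden files such as .htaccess, so any of those would need to be moved separately.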
Posted in Website Admin | 1 Comment

The mystery of pages disappearing from the Google index

So I work for a reasonably large publishing organisation. Over the years we have changed the format of the URLs to get to our articles. There are a large number of links to these articles on the Internet and we want to continue supporting them, so we have a couple of applications that redirect users from the old URLs to the new ones. We recently changed the URL format again, this time for the last time. We moved to a nice short URL format but continue to support the old formats. This was fine for a couple of months, but last week I noticed that none of our article content was indexed by Google anymore, so I started to investigate what the cause could be.

I logged on to the Google webmaster tools and discovered a few things.

  1. One of the sites that we were redirecting to was distributing malware. There was a hidden iframe on their page that was going somewhere nasty. So we changed the redirect to another page on that site.
  2. There were some messages saying that there was a large amount of duplicate content on our site. Most of this was because of the jsessionid URL parameter. So we changed this in the webmaster tools and told Google to ignore this parameter.
    We also canonicalized our URLs. We just added <link rel="canonical" href=""> with the short URL to each of our article pages.
  3. We are going to change the redirects from the old URL formats to 301 redirects instead of 302. We are a Java shop and I did not find the implementation of the 301 redirect particularly intuitive. Most of the documentation out there for doing a redirect just says httpServletResponse.sendRedirect(""); which is a 302 redirect. After a bit of searching I found how to do a 301 redirect.

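    httpServletResponse.setStatus( HttpServletResponse.SC_MOVED_PERMANENTLY ); // send 301 instead of the default 200
    httpServletResponse.setHeader( "Location", "" );
    httpServletResponse.setHeader( "Connection", "close" );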

I think the API for HttpServletResponse should have another method for sending a redirect that accepts the status code. Just my opinion, but that would be more intuitive.

httpServletResponse.sendRedirect("" , HttpServletResponse.SC_MOVED_PERMANENTLY);
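Since the difference between a 301 and a 302 is invisible in a browser, it helps to check the status code directly. Here is a small self-contained sketch of such a check. It does not use the servlet API at all: it starts a throwaway server with the HTTP server that ships with the JDK (com.sun.net.httpserver), answers one path with a 301, and reads back the status and Location header without following the redirect. The class name and the paths are made up for the example.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;

public class RedirectStatusCheck {

    /**
     * Serves oldPath with a 301 pointing at newLocation on a local throwaway
     * server, requests it once, and returns { status code, Location header }.
     */
    static String[] probe(String oldPath, String newLocation) {
        HttpServer server = null;
        try {
            // Port 0 lets the operating system pick any free port.
            server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext(oldPath, exchange -> {
                exchange.getResponseHeaders().set("Location", newLocation);
                // 301 with no response body (-1 means "no body").
                exchange.sendResponseHeaders(HttpURLConnection.HTTP_MOVED_PERM, -1);
                exchange.close();
            });
            server.start();

            URI uri = URI.create("http://127.0.0.1:" + server.getAddress().getPort() + oldPath);
            HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection(Proxy.NO_PROXY);
            conn.setInstanceFollowRedirects(false); // we want to see the redirect itself
            String[] result = { String.valueOf(conn.getResponseCode()), conn.getHeaderField("Location") };
            conn.disconnect();
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (server != null) {
                server.stop(0);
            }
        }
    }

    public static void main(String[] args) {
        String[] result = probe("/old-format/article", "/a/1");
        System.out.println(result[0] + " -> " + result[1]);
    }
}
```

The client half of this (the HttpURLConnection with setInstanceFollowRedirects(false)) can be pointed at a real URL to see which status code a live redirect returns.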

Now we just have to wait and see what happens.

Posted in Uncategorized | 11 Comments

Why developers should not have write access to production systems

A few years ago I was working for a bill payments company. We were using CheckFree's software for our presentation of bills and we wrote our own bill payments engine. The database was MS SQL Server and we were using log shipping to send data from the master to the backup database server. It was using EJB 2.0 and batch processing was pretty slow, so I decided to delete all the transaction data from my sandbox database except for the first month's data to test something. The only problem was that I typed the delete command in the SQL window connected to the production database.

A few minutes later one of the utility companies called and said they could not see any transactions. My face went red. I actually felt the blood rush to it. I went to the DBA to ask if she could restore the database from the last backup, only to find out that they did not know they were supposed to back up that database. So I asked if we could change the backup database server to become the master, only to find out that the transaction logs had already been shipped to that database and it had the same data as the master.

I thought I was definitely going to get fired or go to jail. Fortunately I found this program called Log Explorer that let me create insert statements for all the deleted transactions before the boss found out. It cost me a couple hundred dollars for a trial version but it was worth it. I checked, and they don't seem to make it anymore. Needless to say, I have been very wary of getting write access to anything in production ever since.

All of this could have been avoided if we had:

  1. Secured the database server so that only the application could write to it.
  2. Had more than one person look over the plan of what is to be backed up.




Posted in Stupid Things | 2 Comments