Thursday, August 08, 2013

Using CloudStack's Log Files: XenServer Integration Troubleshooting

Background


CloudStack uses the XenServer plugin model to extend the XenAPI (aka XAPI).  These extensions are written in a combination of Python and Bash scripts.  Python is the programming language of the XenServer plugin model; however, calls to command-line tools are more natural to write in a Bash script.

These XenServer extensions and the management server plugins that use them generate logging information to help developers and admins diagnose problems.
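As an aside, a XAPI plugin can also be invoked by hand from Dom0 using the xe CLI, which is a handy way to see the plugin model in action.  A minimal sketch; the host UUID and the argument values are placeholders you would substitute:
 # Hypothetical manual invocation of a CloudStack plugin function (values are placeholders)  
 xe host-call-plugin host-uuid=<host-uuid> plugin=vmopspremium \  
    fn=copy_vhd_to_secondarystorage \  
    args:mountpoint=<nfs-path> args:vdiuuid=<vdi-uuid> args:sruuid=<sr-uuid>  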


Problem


Which log files are useful, where are they, and how do I use them?


Solution


The management server uses the log4j logging library to generate three logs: 
  • the general log, aka FILE
  • the CloudStack API call log, aka APISERVER
  • the AWS API log, aka AWSAPI.  
The file that each log writes to is identified in the <appender> sections of the log4j config file. Specifically, the file name is given by the ActiveFileName parameter.
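A quick way to pull the active file names out of a config is to grep for that parameter; expect one match per appender.  A sketch, using one of the config locations found in the next section:
 grep "ActiveFileName" ./client/target/generated-webapp/WEB-INF/classes/log4j-cloud.xml  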
XenServer logs sit on the Dom0 operating system in /var/log.  CloudStack's XAPI extensions write to /var/log/SMlog.  Although all XAPI events are summarised in /var/log/xensource.log, that log is not recommended for this purpose: the information it holds is limited, and the sheer number of events logged makes anything useful difficult to find.

In the following subsections, we 
  • explain how to find the management server logs for a development CloudStack
  • explain how to find the management server logs for a production CloudStack
  • explain how the XenServer logs are generated
  • walk through an example of using the logs to diagnose a problem

Developer CloudStack Management Server Logs


To find the exact log file locations, first find the log4j configuration file using grep.  

Developers that launch CloudStack in Jetty should look under the ./client folder for the log4j configuration.  The ./client folder contains the files that execute when mvn -pl :cloud-client-ui jetty:run is run.  E.g.
 root@mgmtserver:~/github/cshv3# find ./client/ | grep "log4j.*\.xml"  
 ./client/tomcatconf/log4j-cloud.xml.in  
 ./client/target/generated-webapp/WEB-INF/classes/log4j-cloud.xml  
 ./client/target/cloud-client-ui-4.2.0-SNAPSHOT/WEB-INF/classes/log4j-cloud.xml  
 ./client/target/conf/log4j-cloud.xml  
Next, examine the <appender> elements in the files.  E.g.
 root@mgmtserver:~/github/cshv3# more ./client/target/generated-webapp/WEB-INF/classes/log4j-cloud.xml   
  ...   
   <!-- ================================= -->   
   <!-- Preserve messages in a local file -->   
   <!-- ================================= -->   
   <!-- A regular appender FIXME implement code that will close/reopen logs on SIGHUP by logrotate FIXME make the paths configurable using the build system -->   
   <appender name="FILE" class="org.apache.log4j.rolling.RollingFileAppender">   
   <param name="Append" value="true"/>   
   <param name="Threshold" value="TRACE"/>   
   <rollingPolicy class="org.apache.log4j.rolling.TimeBasedRollingPolicy">   
    <param name="FileNamePattern" value="vmops.log.%d{yyyy-MM-dd}.gz"/>   
    <param name="ActiveFileName" value="vmops.log"/>   
   </rollingPolicy>   
   <layout class="org.apache.log4j.EnhancedPatternLayout">   
    <param name="ConversionPattern" value="%d{ISO8601} %-5p [%c{1.}] (%t:%x) %m%n"/>   
   </layout>   
   </appender>   
In this case, vmops.log is the general log file.
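When reproducing a problem, it helps to follow this log live.  A minimal sketch, assuming you run it from the folder Jetty was launched in and only want warnings and errors:
 # Follow the general log, filtering for warnings and errors  
 tail -f vmops.log | grep -E "WARN|ERROR"  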

Production CloudStack Management Server Logs


With production servers, the process is the same:  look for the log4j config.  

In the case of Citrix CloudPlatform 3.0.x, we would see the following:
  [root@cs1-mgr management]# find / | grep "log4j.*\.xml"   
  /etc/cloud/management/log4j-cloud.xml   
  /etc/cloud/server/log4j-cloud.xml   
Next, examine the <appender> elements in the files.  E.g. 
 [root@cs1-mgr management]# more /etc/cloud/management/log4j-cloud.xml  
 <?xml version="1.0" encoding="UTF-8"?>  
 <!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">  
 <log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/" debug="false">  
   <throwableRenderer class="com.cloud.utils.log.CglibThrowableRenderer"/>  
   <!-- ================================= -->  
   <!-- Preserve messages in a local file -->  
   <!-- ================================= -->  
   <!-- A regular appender FIXME implement code that will close/reopen logs on SIGHUP by logrotate FIXME make the paths configurable using the build system -->  
   <appender name="FILE" class="org.apache.log4j.rolling.RollingFileAppender">  
    <param name="Append" value="true"/>  
    <param name="Threshold" value="TRACE"/>  
    <rollingPolicy class="org.apache.log4j.rolling.TimeBasedRollingPolicy">  
     <param name="FileNamePattern" value="/var/log/cloud/management/management-server.log.%d{yyyy-MM-dd}.gz"/>  
     <param name="ActiveFileName" value="/var/log/cloud/management/management-server.log"/>  
    </rollingPolicy>  
    <layout class="org.apache.log4j.EnhancedPatternLayout">  
      <param name="ConversionPattern" value="%d{ISO8601} %-5p [%c{3}] (%t:%x) %m%n"/>  
    </layout>  
   </appender>  
Although there are two log4j configuration files, they point to the same log files.  In both cases the general log is /var/log/cloud/management/management-server.log.
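On a production server the log will have been rotated many times, so when hunting for an old exception I reach for zgrep, which searches the compressed archives as well as the active file.  A sketch; the exception class here is just an example:
 # Which log files, current or archived, mention the exception?  
 zgrep -l "CloudRuntimeException" /var/log/cloud/management/management-server.log*  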

CloudStack Logging on XenServer


CloudStack logs are generated by the XAPI extensions that CloudStack adds to the XenServer host.  These extensions are deployed in the /etc/xapi.d/plugins folder.  E.g.
 [root@camctxlabs ~]# find /etc/xapi.d/plugins | grep vmops  
 /etc/xapi.d/plugins/vmopsSnapshot  
 /etc/xapi.d/plugins/vmops  
 /etc/xapi.d/plugins/vmopspremium  
The XAPI extensions are written in Python, but they make use of bash shell scripts in the /opt/xensource/bin folder.  E.g.
 [root@camctxlabs ~]# find /opt/xensource/bin | grep secondarystorage  
 /opt/xensource/bin/copy_vhd_to_secondarystorage.sh  
 /opt/xensource/bin/copy_vhd_from_secondarystorage.sh  
The XAPI extensions use functions from the util module in XenServer's sm package.  The logging functions in this module write to the SMlog log file.  E.g. in the code below, the echo(fn) decorator uses util.SMlog to update the log file /var/log/SMlog.  The @echo declaration in front of copy_vhd_to_secondarystorage wraps that function: each call is routed through wrapped, which logs an enter marker, invokes the original function, and then logs an exit marker.
 [root@camctxlabs ~]# more /etc/xapi.d/plugins/vmopspremium  
 #!/usr/bin/python  
 # Licensed to the Apache Software Foundation (ASF) under one  
 ...  
 def echo(fn):  
   def wrapped(*v, **k):  
     name = fn.__name__  
     util.SMlog("#### VMOPS enter %s ####" % name )  
     res = fn(*v, **k)  
     util.SMlog("#### VMOPS exit %s ####" % name )  
     return res  
   return wrapped  
 ...  
 @echo  
 def copy_vhd_to_secondarystorage(session, args):  
   mountpoint = args['mountpoint']  
   vdiuuid = args['vdiuuid']  
   sruuid = args['sruuid']  
   try:  
     cmd = ["bash", "/opt/xensource/bin/copy_vhd_to_secondarystorage.sh", mou  
     ntpoint, vdiuuid, sruuid]  
     txt = util.pread2(cmd)  
   except:  
     txt = '10#failed'  
     return txt  
 ...  
Logging messages are also generated by the exception-handling code in the util module.  E.g. in the code above, util.pread2 is used to execute the Bash script; if the command fails, the error is reported to the SMlog file.
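Because every decorated plugin function writes enter and exit markers, you can watch CloudStack's plugin activity in real time from Dom0.  A sketch:
 # Watch CloudStack plugin calls as they happen on the XenServer host  
 tail -f /var/log/SMlog | grep "VMOPS"  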

Example of Using the CloudStack Logs


In this example, I find out that I accidentally downloaded an out-of-date vhd-util to my XenServer.

To start, I followed the instructions on how to build from master, which explain how to build and run the latest management server.  After configuring a zone with S3 secondary storage, NFS staging storage, and a XenServer cluster, I waited for the system VM template to download and for the management server to start its secondary storage VM.  After a while, I noticed that the secondary storage system VM had not started, and exceptions were appearing in the console I used to launch the management server.  Clearly, I had a problem.

The first step was to look for the exceptions that occurred in the management server log.  This is the vmops.log file, because I'm using a developer environment.

NB:  Look for the first exception and the logs just before it.  
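One hedged way to jump to that point with GNU grep (the pattern is only a starting point; refine it if it matches too much):
 # Show the first exception in the log with 20 lines of leading context  
 grep -n -B 20 -m 1 "Exception" vmops.log  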

Here's what I found:
 2013-08-05 10:16:01,853 DEBUG [c.c.h.d.HostDaoImpl] (ClusteredAgentManager Timer:null) Completed acquiring hosts for clusters not owned by any management server  
 2013-08-05 10:16:02,403 WARN [c.c.h.x.r.XenServerStorageProcessor] (DirectAgent-11:null) destoryVDIbyNameLabel failed due to there are 0 VDIs with name cloud-8e789d62-3062-4a13-8235-35ca49b7b924  
 2013-08-05 10:16:02,404 WARN [c.c.h.x.r.XenServerStorageProcessor] (DirectAgent-11:null) can not create vdi in sr 8544bcca-f54d-cbc5-151d-b91c822addba  
 WARN [c.c.h.x.r.XenServerStorageProcessor] (DirectAgent-11:null) Catch Exception com.cloud.utils.exception.CloudRuntimeException for template + due to com.cloud.utils.exception.CloudRuntimeException: can not create vdi in sr 8544bcca-f54d-cbc5-151d-b91c822addba  
 com.cloud.utils.exception.CloudRuntimeException: can not create vdi in sr 8544bcca-f54d-cbc5-151d-b91c822addba  
     at com.cloud.hypervisor.xen.resource.XenServerStorageProcessor.copy_vhd_from_secondarystorage(XenServerStorageProcessor.java:793)  
     at com.cloud.hypervisor.xen.resource.XenServerStorageProcessor.copyTemplateToPrimaryStorage(XenServerStorageProcessor.java:864)  
     at com.cloud.storage.resource.StorageSubsystemCommandHandlerBase.execute(StorageSubsystemCommandHandlerBase.java:70)  
     at com.cloud.storage.resource.StorageSubsystemCommandHandlerBase.handleStorageCommands(StorageSubsystemCommandHandlerBase.java:49)  
     at com.cloud.hypervisor.xen.resource.CitrixResourceBase.executeRequest(CitrixResourceBase.java:625)  
     at com.cloud.hypervisor.xen.resource.XenServer56Resource.executeRequest(XenServer56Resource.java:73)  
     at com.cloud.agent.manager.DirectAgentAttache$Task.run(DirectAgentAttache.java:186)  
     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)  
     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)  
     at java.util.concurrent.FutureTask.run(FutureTask.java:166)  
     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)  
     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)  
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)  
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)  
     at java.lang.Thread.run(Thread.java:724)  
 INFO [c.c.v.VirtualMachineManagerImpl] (secstorage-1:ctx-8124c385) Unable to contact resource.  
 com.cloud.exception.StorageUnavailableException: Resource [StoragePool:1] is unreachable: Unable to create Vol[1|vm=1|ROOT]:Catch Exception com.cloud.utils.exception.CloudRuntimeException for template + due to com.cloud.utils.exception.CloudRuntimeException: can not create vdi in sr 8544bcca-f54d-cbc5-151d-b91c822addba  
Next, look at the management server code that was executing at the time.  This code path ends at the first frame named in the stack trace.  E.g. 
 com.cloud.hypervisor.xen.resource.XenServerStorageProcessor.copy_vhd_from_secondarystorage(XenServerStorageProcessor.java:793)
It starts around the code that wrote the last log message, which grep can locate.  E.g.
 root@mgmtserver:~/github/cshv3# grep -n -R "destoryVDIbyNameLabel" * --include *.java  
 plugins/hypervisors/xen/src/com/cloud/hypervisor/xen/resource/CitrixResourceBase.java:3069:        s_logger.warn("destoryVDIbyNameLabel failed due to there are " + vdis.size() + " VDIs with name " + nameLabel);  
 plugins/hypervisors/xen/src/com/cloud/hypervisor/xen/resource/XenServerStorageProcessor.java:800:        s_logger.warn("destoryVDIbyNameLabel failed due to there are " + vdis.size() + " VDIs with name " + nameLabel);  
Have a look at the source code in this area.  E.g.
   private String copy_vhd_from_secondarystorage(Connection conn, String mountpoint, String sruuid, int wait) {  
     String nameLabel = "cloud-" + UUID.randomUUID().toString();  
     String results = hypervisorResource.callHostPluginAsync(conn, "vmopspremium", "copy_vhd_from_secondarystorage",  
         wait, "mountpoint", mountpoint, "sruuid", sruuid, "namelabel", nameLabel);  
     String errMsg = null;  
     if (results == null || results.isEmpty()) {  
       errMsg = "copy_vhd_from_secondarystorage return null";  
     } else {  
       String[] tmp = results.split("#");  
       String status = tmp[0];  
       if (status.equals("0")) {  
         return tmp[1];  
       } else {  
         errMsg = tmp[1];  
       }  
     }  
     String source = mountpoint.substring(mountpoint.lastIndexOf('/') + 1);  
     if( hypervisorResource.killCopyProcess(conn, source) ) {  
       destroyVDIbyNameLabel(conn, nameLabel);  
     }  
     s_logger.warn(errMsg);  
     throw new CloudRuntimeException(errMsg);  
   }  
In the source above, we see that a call was made to the XenServer extension, and that the extension replies with a string of the form status#message (the Python code earlier returns '10#failed', for example).  A non-zero status becomes the CloudRuntimeException, so the underlying cause has to be found in the SMlog on the XenServer itself for that period of time.  

NB:  The SMlog file is regularly archived, so make a copy of it.  If the copy does not cover the time period you need, look at the dates of the archived versions to find the right one, then make a copy and use gunzip to uncompress it.
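On my hosts the rotated files follow the usual logrotate naming, so finding and unpacking the right one looks roughly like this (the archive name is an example):
 # Check the dates to see which archive covers the period of interest  
 ls -lt /var/log/SMlog*  
 # Copy the candidate somewhere safe and uncompress the copy  
 cp /var/log/SMlog.1.gz /tmp/  
 gunzip /tmp/SMlog.1.gz  
 grep "2013-08-05 10:0" /tmp/SMlog.1  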


In my case, the SMlog file shows that a call to the copy_vhd_from_secondarystorage extension threw an exception.  E.g.

 [2106] 2013-08-05 10:07:34.249054    #### VMOPS enter copy_vhd_from_secondarystorage ####  
 [2106] 2013-08-05 10:07:34.249197    ['bash', '/opt/xensource/bin/copy_vhd_from_secondarystorage.sh', '10.70.176.36:/mnt/cshv3/secondarystorage/template/tmpl/1/1/e35bed31-3d25-465a-9e56-02d817491550.vhd', '8544bcca-f54d-cbc5-151d-b91c822addba', 'cloud-8e789d62-3062-4a13-8235-35ca49b7b924']  
 ...  
 [2127] 2013-08-05 10:07:34.824402    ['/usr/sbin/vhd-util', 'create', '--debug', '-n', '/dev/VG_XenStorage-8544bcca-f54d-cbc5-151d-b91c822addba/VHD-9a7e2a96-511b-4d94-a745-2619abb99919', '-s', '2000', '-S', '2097152']  
 [2127] 2013-08-05 10:07:34.835447    FAILED: (rc 22) stdout: 'options: <-n name> <-s size (MB)> [-r reserve] [-h help]  
 ', stderr: 'create: invalid option -- S  
 '  
 [2127] 2013-08-05 10:07:34.835849    lock: released /var/lock/sm/8544bcca-f54d-cbc5-151d-b91c822addba/sr  
 [2127] 2013-08-05 10:07:34.843383    ***** vdi_create: EXCEPTION util.CommandException, 22  
  File "/opt/xensource/sm/SRCommand.py", line 94, in run  
   return self._run_locked(sr)  
 ...  
 [2106] 2013-08-05 10:07:35.021318    #### VMOPS exit copy_vhd_from_secondarystorage ####  
The cause of the exception is a call to vhd-util with an invalid option, -S.  Indeed, the version of vhd-util in use on this machine does not support the '-S' option.  E.g.
 [root@camctxlabs ~]# ./vhd-util create -help  
 options: <-n name> <-s size (MB)> [-r reserve] [-h help]  
Another XenServer in my data center does not have this problem.  E.g.
 [root@camctxlabs2 ~]# vhd-util create -help  
 options: <-n name> <-s size (MB)> [-r reserve] [-h help] [<-S size (MB) for metadata preallocation (see vhd-util resize)>]  
From this, I learnt that I need to find an updated version of vhd-util.
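Once a replacement is installed, a quick sanity check confirms that the new binary recognises -S.  A sketch:
 # The usage text should now mention the -S option  
 vhd-util create -help 2>&1 | grep -- "-S"  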

Final Remarks


The system VMs and the web container for the management server have additional log files.  In the case of Citrix CloudPlatform, these are listed in the CloudPlatform documentation.

1 comment:

n4th4nr1ch said...

Interesting to see that the logs are on the hosts themselves as well.

However I've got two requests for your article.

1. You don't mention how you were able to determine *which* hypervisor to look on
2. You don't mention where to get this new vhd-util

Thanks for the info