Java Performance on POWER7-Best Practice


A useful IBM whitepaper: Java Performance on POWER7 Best Practice.

Here are some highlights from a system administrator’s perspective.

  • The Power Architecture® does not require running applications in 64-bit mode to achieve best performance, since 32-bit and 64-bit modes have the same number of processor registers.

For best performance, use 32-bit Java unless the application's memory requirement forces it to run in 64-bit mode. Interesting!

Use 32-bit Java for workloads that need less than 2.5 GB of heap and spawn at most a few hundred threads.

  • Medium and Large Pages for Java Heap and Code Cache

Configure large pages (this example reserves 1 GB as 64 regions of 16 MB each; the -r option records the setting for the next reboot):

# vmo -r -o lgpg_regions=64 -o lgpg_size=16777216
# bosboot -a
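
The region count is simple arithmetic: the amount of memory to back with large pages divided by the 16 MB page size, rounded up. Here is a minimal sketch of that calculation. The function name lgpg_regions_for_mb is made up for this example, and the script only prints the vmo command instead of running it:

```shell
#!/bin/sh
# Hypothetical helper (not part of AIX): number of 16 MB large-page regions
# needed to cover a given amount of memory in MB, rounded up.
lgpg_regions_for_mb() {
    mb=$1
    page_mb=16
    echo $(( (mb + page_mb - 1) / page_mb ))
}

# Print (do not run) the vmo command for a 1 GB large-page pool.
regions=$(lgpg_regions_for_mb 1024)
echo "vmo -r -o lgpg_regions=$regions -o lgpg_size=16777216"
```

For 1024 MB this prints the same vmo line as above, with lgpg_regions=64.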

Non-root users must have the CAP_BYPASS_RAC_VMM capability enabled to use large pages. The system administrator can add this capability using the chuser command like in the example below:

# chuser capabilities=CAP_BYPASS_RAC_VMM,CAP_PROPAGATE <user_id>

  • Application Scaling

Scale JVM/WAS instances with one instance for every 2 cores.

Choosing the right SMT mode:

Most applications benefit from SMT. However, some applications do not scale with an increased number of logical CPUs on an SMT enabled system.

Using Resource Sets:

Resource sets (rsets) allow specifying on which logical CPUs an application can run. This is
useful when an application that doesn’t scale beyond a certain number of logical CPUs has to
run on a large LPAR. For example, an application that scales well up to 8 logical CPUs can be
confined to 8 of them on an LPAR that has 64 logical CPUs.
The following example demonstrates how to use execrset to create an rset with CPUs 4 to 7
and start the application attached to it:

execrset -c 4-7 -e <application>

In addition to running the application attached to an rset, the MEMORY_AFFINITY
environment variable should be set to MCM to ensure that the application’s private and shared
memory is allocated from memory that is local to the logical CPUs of the rset:
MEMORY_AFFINITY=MCM
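
One way to combine the two is to set the variable on the same command line as execrset, so that it applies only to the application being started. The sketch below just builds and prints that command line. The helper name rset_launch_cmd and the path /opt/app/start.sh are placeholders invented for this example:

```shell
#!/bin/sh
# Hypothetical helper: build the command line that starts an application
# on a logical CPU range with MEMORY_AFFINITY=MCM set for that process only.
rset_launch_cmd() {
    cpus=$1      # logical CPU range, e.g. 4-7
    shift        # remaining arguments: the application and its arguments
    echo "MEMORY_AFFINITY=MCM execrset -c $cpus -e $*"
}

# Print the command for CPUs 4 to 7 (placeholder application path).
rset_launch_cmd 4-7 /opt/app/start.sh
```
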
In general, rsets should be created on core boundaries. For example, a system with four virtual
processors (cores) running in SMT4 mode will have 16 logical CPUs. An rset with four
logical CPUs should be created by selecting the four SMT threads that belong to one core. An rset
with eight logical CPUs should be created by selecting the eight SMT threads that belong to two cores.
The smtctl command can be used to determine which logical CPUs belong to which core:

smtctl
 This system is SMT capable.
 This system supports up to 4 SMT threads per processor.
 SMT is currently enabled.
 SMT boot mode is not set.
 SMT threads are bound to the same physical processor.
 proc0 has 4 SMT threads.
 Bind processor 0 is bound with proc0
 Bind processor 1 is bound with proc0
 Bind processor 2 is bound with proc0
 Bind processor 3 is bound with proc0
 proc4 has 4 SMT threads.
 Bind processor 4 is bound with proc4
 Bind processor 5 is bound with proc4
 Bind processor 6 is bound with proc4
 Bind processor 7 is bound with proc4

The smtctl output above shows that the system is running in SMT4 mode with bind processors
(logical CPUs) 0 to 3 belonging to proc0 and bind processors 4 to 7 belonging to proc4. An rset
with four logical CPUs should be created either for CPUs 0 to 3 or for CPUs 4 to 7.
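
Reading the core boundaries off by eye gets tedious on a big LPAR. The sketch below groups the "Bind processor" lines of smtctl-style output by core and prints one CPU range per core. The function name cores_from_smtctl is made up, and it assumes the logical CPUs of a core are listed contiguously and in ascending order, as in the sample output:

```shell
#!/bin/sh
# Hypothetical helper: read smtctl-style output on stdin and print one line
# per core in the form "<proc>: <first cpu>-<last cpu>".
# Assumes each core's logical CPUs are contiguous and listed in ascending order.
cores_from_smtctl() {
    awk '
        /^ *Bind processor [0-9]+ is bound with / {
            cpu = $3; core = $NF
            # Remember the first CPU seen for each core, in order of appearance.
            if (!(core in first)) { first[core] = cpu; order[++n] = core }
            last[core] = cpu
        }
        END {
            for (i = 1; i <= n; i++)
                printf "%s: %s-%s\n", order[i], first[order[i]], last[order[i]]
        }
    '
}
```

On a live system this would be used as smtctl | cores_from_smtctl, and each printed range is a candidate for execrset -c.
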
To achieve best performance with rsets that are created across multiple cores, all cores of the rset
should be in the same scheduler resource allocation domain (SRAD). The lssrad command can
be used to determine which logical CPUs belong to which SRAD:

lssrad -av
 REF1   SRAD        MEM      CPU
 0
           0   22397.25     0-31
 1
           1   29801.75    32-63

The example output above shows a system that has two SRADs. CPUs 0 to 31 belong to the first
SRAD while CPUs 32 to 63 belong to the second SRAD. In this example, an rset with multiple
cores should be created using CPUs from either the first SRAD or the second SRAD, not both.
Note: A user must have root authority or have CAP_NUMA_ATTACH capability to use rsets.
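
Before settling on a CPU range for an rset, it is worth checking that the whole range sits inside one SRAD. The sketch below does that against lssrad -av style output. The function name srad_for_cpus is made up, and it assumes each SRAD row has exactly three fields (SRAD, MEM, and a single CPU range of the form lo-hi), as in the sample output:

```shell
#!/bin/sh
# Hypothetical helper: print the SRAD that fully contains the CPU range
# given as $1 (e.g. 4-11), reading lssrad -av style output on stdin.
# Returns exit status 1 if no single SRAD covers the range.
# Assumes each SRAD row looks like "<srad> <mem> <lo>-<hi>".
srad_for_cpus() {
    awk -v range="$1" '
        BEGIN { split(range, want, "-"); found = 0 }
        NF == 3 && $3 ~ /^[0-9]+-[0-9]+$/ {
            split($3, have, "-")
            # Compare numerically: the wanted range must lie inside this row.
            if (want[1] + 0 >= have[1] + 0 && want[2] + 0 <= have[2] + 0) {
                print $1; found = 1
            }
        }
        END { exit (found ? 0 : 1) }
    '
}
```

On a live system this would be used as lssrad -av | srad_for_cpus 4-11. A range such as 28-35 in the two-SRAD example would print nothing and return a non-zero status, because it straddles both SRADs.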

  • Data Pre-fetching

For most Java applications, it is recommended to turn off HW data prefetch with the AIX command “dscrctl -n -s 1”. Java 6 SR7 exploits HW transient prefetch on POWER7.

 

The full whitepaper can be downloaded here:

 

Java Performance on POWER7-Best Practice

Tuning AIX NIM master with no options, max_nimesis_threads, global_export parameters


Occasionally the AIX NIM master needs tuning as your environment grows.
1) To support a high number (16 or more) of simultaneous installs, consider increasing max_nimesis_threads:

nim -o change -a max_nimesis_threads=60 master

2) The no options tcp_sendspace, tcp_recvspace and rfc1323 should already be set by the default AIX install. Check them in the ifconfig -a output, and verify that use_isno is on.

# ifconfig en0
 en0: flags=1e080863,4c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN>
 inet 9.19.51.115 netmask 0xffffff00 broadcast 9.19.51.255
 tcp_sendspace 262144 tcp_recvspace 262144 rfc1323 1
# no -a | grep isno
 use_isno = 1
# no -F -a | grep isno 
 use_isno = 1

Note: use_isno is a restricted setting in AIX 6.1, so use no -F to display it.
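
If you need to check these values on many interfaces or hosts, the per-interface settings can be pulled out of the ifconfig output with a few lines of awk. This is only a sketch: the function name isno_value is made up, and it assumes the keyword and its value appear as adjacent whitespace-separated fields, as in the output above:

```shell
#!/bin/sh
# Hypothetical helper: print the value that follows a given keyword
# (e.g. tcp_sendspace) in ifconfig-style output read from stdin.
# Prints nothing if the keyword is not present.
isno_value() {
    awk -v key="$1" '
        {
            for (i = 1; i < NF; i++)
                if ($i == key) { print $(i + 1); exit }
        }
    '
}
```

On a live system this would be used as ifconfig en0 | isno_value tcp_sendspace.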

3) Consider setting global_export=yes. By default, when one install completes, the master unexports the NFS exports, removes the completed client from the export lists and re-exports the filesystems. If you perform frequent simultaneous installs, other “in-flight” client installs may see the message “NFS server not responding, still trying” on the client console during this interval.
As an alternative, set global_export. With no clients enabled for install:

# nim -o change -a global_export=yes master

 
In this configuration, resources are exported read-only for every enabled client, and held exported until the last client completes.

Before the change, the export list names every specific client allowed to mount:

# showmount -e
 export list for bmark29:
 /export/mksysb/image_53ML3 sq07.dfw.ibm.com,sq08.dfw.ibm.com
 /export/53/lppsource_53ML3 sq07.dfw.ibm.com,sq08.dfw.ibm.com
 /export/53/spot_53ML2/usr sq07.dfw.ibm.com,sq08.dfw.ibm.com

With global_export, exports are read-only for everyone:

# exportfs
 /export/mksysb/image_53ML3 -ro,anon=0
 /export/53/lppsource_53ML3 -ro,anon=0
 /export/53/spot_53ML3/usr -ro,anon=0

Be aware, of course, that anyone can mount these exports, even a host that is not a NIM client.
(They are read-only and contain only AIX install content, so this is probably not a security issue in most cases.)
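
Since the exports are open to everyone, a quick sanity check is to confirm that nothing is exported writable. The sketch below lists any export whose option field lacks a standalone ro option. The function name writable_exports is made up, and it assumes exportfs output in the "path -opt,opt" form shown above:

```shell
#!/bin/sh
# Hypothetical helper: read exportfs-style output on stdin and print every
# exported path that is NOT read-only, i.e. whose option field has no
# standalone "ro" option. An export with no options counts as writable.
writable_exports() {
    awk '
        $1 ~ /^\// {
            opts = $2
            sub(/^-/, "", opts)          # drop the leading dash
            n = split(opts, o, ",")
            ro = 0
            for (i = 1; i <= n; i++) if (o[i] == "ro") ro = 1
            if (!ro) print $1
        }
    '
}
```

On the NIM master this would be used as exportfs | writable_exports. No output means every export is read-only.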

IO queuing and how to tune IO queues on AIX


A really useful IBM document that describes how IO queuing works and explains how to tune the queues to improve performance, including in VIO environments. This will help ensure you don’t have unnecessary IO bottlenecks at these queues.

It also documents tools available for monitoring disk performance in these environments.

Download here:

AIX-VIOS_DiskAndAdapterQueueTuningV1.2.pdf
