Wednesday, January 14, 2015

Linux memory management configuration

By default, Linux uses a large share of available memory for disk caches, uses swap, and manages memory in 4kB pages. In an in-memory processing system, data handling is bound to physical RAM, and much of the typical memory management is unnecessary (at least to some extent).

Rationale
---------
The configuration described in this article minimizes swap usage, minimizes the disk cache, and enables HugePages. Enabling HugePages relieves the kernel and the CPU of memory management work, because fewer mappings between virtual and physical memory are needed (less TLB pressure). On a system hosting Coherence services, it guarantees that JVM memory backed by huge pages is never paged out to swap. With all of these settings in place it's possible to assign the maximum amount of memory to Java processes. Without them, Linux was paging JVM memory out to swap because the disk cache layer competed with the JVM processes for memory, which completely breaks the low latency and stability required of a system running Coherence.

Check your system
-----------------
To check the real memory consumption of the JVMs, use 'top' (press F-o-Enter to sort by VIRT).

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
19047 CohUser  22   0 13.9g 470m  13m S  0.0  0.5 117:54.59 java
18362 CohUser  21   0 13.9g 460m  13m S  0.0  0.5 126:35.87 java
18704 CohUser  24   0 13.9g 470m  13m S  0.3  0.5 122:12.50 java
18535 CohUser  23   0 13.9g 468m  13m S  0.0  0.5 126:24.88 java
18870 CohUser  22   0 13.9g 461m  13m S  0.0  0.5 119:23.77 java
19329 CohUser  25   0 8941m 317m  13m S  0.0  0.3  72:15.52 java
19213 CohUser  25   0 8936m 314m  13m S  0.0  0.3  66:43.44 java
19451 CohUser  25   0 1503m 109m  13m S  0.0  0.1  32:01.73 java

In the above configuration the JVMs are configured with 1.5GB, 8GB, and 13GB heaps. As you can see, virtual memory is a little bigger than the heap, and resident memory (RES) is 100-500MB depending on heap size. In this situation the JVMs need, outside of huge pages, 5*500MB+2*350MB+110MB = about 3.3GB of standard 4k memory, in addition to about 83GB (5*13+2*8+1.5 = 82.5) for heaps. Linux needs some memory for other processes (management, kernel) and for network and disk buffers; it's very safe to reserve 5GB for this purpose. In the end it turned out that, on a 96GB system, reserving 85GB for huge pages and leaving 11GB for Linux keeps the operating system stable. More fine-grained tuning could gain an additional 5GB per box.
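The 4k-page footprint of the JVMs can also be totalled from the process table instead of being read off the RES column by eye. This is a minimal sketch, assuming procps `ps` is available and that all JVMs run under the command name `java`; the helper name `sum_rss_mb` is made up for illustration.

```shell
#!/bin/sh
# Sum resident set sizes (kB, one value per line on stdin) and print whole MB.
# sum_rss_mb is a hypothetical helper name, not part of any tool.
sum_rss_mb() {
  awk '{ kb += $1 } END { printf "%d\n", kb / 1024 }'
}

# Usage on a live box (rss= prints RSS in kB without a header line):
#   ps -C java -o rss= | sum_rss_mb
```

The result is only a snapshot; resident memory grows as the JVMs warm up, so it is worth sampling under load before fixing the budget.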

Note that the huge pages configuration made it possible to assign an additional 12GB to JVM processes. Previously the system was configured with 73GB assigned to heaps.

Preparations
------------
The proper number of huge pages depends on (a) available memory, (b) memory assigned to each JVM, and (c) the number of JVMs. Note that Linux should have some memory available for other tasks. Even the JVM needs some 4k memory to operate, as it also uses some off-heap memory. It's safe to assume that non-heap memory takes 5% of heap. Another 5% of physical memory is still needed for Linux. Finally, for a 96GB system, about 10GB should remain available as 4kB pages, leaving about 85GB for huge pages.
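The conversion from a memory budget in GB to kernel values is simple integer arithmetic. The sketch below assumes a 2048 kB huge page size (the Hugepagesize value reported by /proc/meminfo on this platform); the function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers; a 2 MB (2048 kB) huge page size is assumed.
hugepages_for_gb() {    # $1 = GB to reserve as huge pages
  echo $(( $1 * 1024 / 2 ))              # GB -> MB, then 2 MB per page
}
gb_to_bytes() {         # $1 = GB, e.g. for a byte-valued setting such as kernel.shmmax
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

hugepages_for_gb 85     # prints 43520
gb_to_bytes 92          # prints 98784247808
```

Both results match the values used for vm.nr_hugepages and kernel.shmmax in the configuration section.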

Configuration
-------------
Note that the changes below require root privileges and thus should be sent to the Unix team via a Change Request.

1. update kernel parameters

/etc/sysctl.conf should be updated with:

#maximum shared memory segment size, here set to 92GB: 92*1024^3
kernel.shmmax = 98784247808
#calculate as: 85*1024^3/(2*1024^2), where 85 is 85GB of needed huge pages
vm.nr_hugepages = 43520
#group id permitted to use huge pages, here CohGroup
vm.hugetlb_shm_group = 1342
#minimize the kernel's tendency to swap
vm.swappiness = 0
#minimum free memory (kB) the kernel keeps in reserve before it starts aggressive memory management
vm.min_free_kbytes = 1024
#disk cache tuning - % of memory that may be dirty before background writeback starts. 2.88GB here
vm.dirty_background_ratio = 3
#disk cache tuning - % of memory that may be dirty before writing processes must flush themselves. 5.76GB here
vm.dirty_ratio = 6
#disk cache tuning - flush timing: age (1/100 s) after which dirty data is written out
vm.dirty_expire_centisecs = 500
#disk cache tuning - flush timing: interval (1/100 s) between flusher wakeups
vm.dirty_writeback_centisecs = 100

2. update ulimit parameters

/etc/security/limits.conf should be updated with:

CohUser soft memlock unlimited       #CohUser user is allowed to use huge pages
CohUser hard memlock unlimited       #CohUser user is allowed to use huge pages

3. reboot machines.
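After the reboot, the account that runs the JVMs can confirm that the limits and the group setting took effect. A minimal sketch, assuming the group id 1342 from the kernel parameters above; the function name `can_use_hugepages` is hypothetical, and it takes its inputs as arguments so it can be tried without touching the system.

```shell
#!/bin/sh
# can_use_hugepages MEMLOCK GID "GROUP_IDS"
# Prints "ok" when the memlock limit is unlimited and GID is among the
# user's group ids; otherwise prints the reason and returns 1.
can_use_hugepages() {
  memlock=$1; gid=$2; groups=$3
  if [ "$memlock" != "unlimited" ]; then
    echo "memlock is $memlock, expected unlimited"
    return 1
  fi
  for g in $groups; do
    if [ "$g" = "$gid" ]; then
      echo "ok"
      return 0
    fi
  done
  echo "user is not in group $gid"
  return 1
}

# Usage, run as CohUser in a fresh login session:
#   can_use_hugepages "$(ulimit -l)" 1342 "$(id -G)"
```

A fresh login matters because limits.conf is applied by PAM at session start, so an already open shell keeps its old limits.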

Verification
------------
1. use top to check swap usage level

top - 08:12:47 up 34 days, 16:38,  1 user,  load average: 0.20, 0.33, 0.39
Tasks: 488 total,   1 running, 487 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.1%us,  0.0%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  98833664k total, 96415852k used,  2417812k free,   507856k buffers
Swap:  2097144k total,        0k used,  2097144k free,  2463688k cached
                           /\
        here               |
----------------------------

2. check if the kernel has huge pages enabled

cat /proc/meminfo | grep -i "huge" 

HugePages_Total: 43520
HugePages_Free:  28347
HugePages_Rsvd:  27139
Hugepagesize:    2048 kB
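
These counters are easy to misread: HugePages_Free still includes pages that a process has reserved but not yet touched (HugePages_Rsvd), so the pages really left over are Free minus Rsvd. The sketch below does that arithmetic on /proc/meminfo-style input; the function name `hugepage_summary` and the output labels are made up for illustration.

```shell
#!/bin/sh
# Reads /proc/meminfo-style text on stdin and prints:
#   touched   = pages already faulted in (Total - Free)
#   reserved  = pages promised to a process but not yet touched (Rsvd)
#   unclaimed = pages nobody has claimed (Free - Rsvd), also in MB
hugepage_summary() {
  awk '
    /^HugePages_Total:/ { total = $2 }
    /^HugePages_Free:/  { free = $2 }
    /^HugePages_Rsvd:/  { rsvd = $2 }
    /^Hugepagesize:/    { size_kb = $2 }
    END {
      printf "touched=%d reserved=%d unclaimed=%d unclaimed_mb=%d\n",
             total - free, rsvd, free - rsvd, (free - rsvd) * size_kb / 1024
    }'
}

# Usage:
#   hugepage_summary < /proc/meminfo
```

For the numbers shown above this gives 15173 touched pages, 27139 reserved pages, and 1208 unclaimed pages (2416MB), which means nearly the whole pool is already spoken for by the JVMs.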

3. kernel parameters

function checkKernel {
  # Dump all kernel parameters once, then filter the ones set in this article.
  local tmp
  tmp=$(mktemp) || return 1
  sysctl -a 2>/dev/null >"$tmp"
  grep -F 'kernel.shmmax' "$tmp"
  grep -F 'vm.' "$tmp" | grep huge
  grep -F 'vm.swappiness' "$tmp"
  grep -F 'vm.min_free' "$tmp"
  grep -F 'vm.dirty' "$tmp"
  rm -f "$tmp"
}
checkKernel

kernel.shmmax = 98784247808
vm.hugetlb_shm_group = 1342
vm.nr_hugepages = 43520
vm.swappiness = 0
vm.min_free_kbytes = 1024
vm.dirty_background_bytes = 0
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 500
vm.dirty_writeback_centisecs = 100
vm.dirty_ratio = 15
vm.dirty_background_ratio = 3
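
Comparing such a listing with the intended values by eye is error-prone; in the output above, for example, vm.dirty_ratio reads 15 while the configuration section sets 6. A minimal sketch of an automated comparison follows. It assumes two files of "key = value" lines, one with the expected settings (comments on their own lines) and one captured from `sysctl -a`; the function name `check_sysctl` and the file names in the usage note are hypothetical.

```shell
#!/bin/sh
# check_sysctl EXPECTED_FILE ACTUAL_FILE
# Prints one line per mismatching or missing key; returns 1 if any were found.
check_sysctl() {
  awk -F'[ \t]*=[ \t]*' '
    # First file: remember the expected value per key, skipping comment lines.
    NR == FNR {
      if ($0 ~ /^[ \t]*[#;]/ || NF < 2) next
      want[$1] = $2
      next
    }
    # Second file: compare every key we have an expectation for.
    ($1 in want) {
      seen[$1] = 1
      if ($2 != want[$1]) {
        printf "%s: expected %s, got %s\n", $1, want[$1], $2
        bad = 1
      }
    }
    END {
      for (k in want) if (!(k in seen)) { printf "%s: not found\n", k; bad = 1 }
      exit bad + 0
    }
  ' "$1" "$2"
}

# Usage (expected.conf holds only the settings from this article):
#   sysctl -a 2>/dev/null > actual.txt
#   check_sysctl expected.conf actual.txt
```

The sketch only handles single-valued settings like the ones used here; multi-valued keys would need their whitespace normalized first.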

Next steps
----------
It's recommended to investigate the Transparent Hugepage Support (THS) feature available in EL6. Moreover, it looks like the JVM in version 7 behaves with THS in a similar way as on Solaris, enabling huge page support by default. On RHEL 7 it is probably possible to minimize swappiness and disk cache while leaving huge pages at the transparent level. Note that explicitly reserving huge pages at boot time (the current approach) guarantees that the memory will be available for the JVM; Linux will not use these memory segments for other purposes.

CohUser@CohBox$ cat /proc/meminfo | grep HugePages
AnonHugePages:  67004416 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
CohUser@CohBox$  uname -a
Linux CohBox 2.6.32-358.2.1.el6.x86_64 #1 SMP Wed Feb 20 12:17:37 EST 2013 x86_64 x86_64 x86_64 GNU/Linux
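
The non-zero AnonHugePages value above suggests that transparent huge pages are already active on that box. Which mode is selected can be read from sysfs, where the kernel marks the active choice with square brackets. The sketch below extracts that choice; the function name `thp_mode` is hypothetical, and the sysfs paths in the usage note are the ones used by mainline and by EL6 kernels respectively, so they should be checked on the target system.

```shell
#!/bin/sh
# Print the bracketed (active) mode from a transparent huge page "enabled"
# file, e.g. "[always] madvise never" -> always
thp_mode() {
  sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}

# Usage:
#   thp_mode < /sys/kernel/mm/transparent_hugepage/enabled
#   thp_mode < /sys/kernel/mm/redhat_transparent_hugepage/enabled   # EL6 kernels
```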

References
----------
1. https://www.novell.com/support/kb/doc.php?id=7010287
2. https://www.kernel.org/doc/Documentation/sysctl/vm.txt

Change list
-----------
1.0 - initial version, tested
2.0 - anonymized, fixed page cache parameters, added references