主机内存脏数据写入超时导致主机死机,提示:echo 0 > /proc/sys/kernel/hung_task_timeout_secs.

2019/07/27 Linux Read:

今天发现突然有一台主机无缘无故死机了,于是翻看了/var/log/message日志,发现提示: echo 0 > /proc/sys/kernel/hung_task_timeout_secs.

完整的日志如下:

Mar 28 03:12:47 namenode01 collectd[1731]: write_graphite plugin: send to epc-graphite-server:2003 (tcp) failed with status 110 (Connection timed out)
Mar 28 03:12:47 namenode01 collectd[1731]: Filter subsystem: Built-in target `write': Dispatching value to all write plugins failed with status -1.
Mar 28 03:12:47 namenode01 collectd[1731]: Filter subsystem: Built-in target `write': Some write plugin is back to normal operation. `write' succeeded.
Mar 28 03:14:31 namenode01 kernel: INFO: task kworker/15:2:22246 blocked for more than 120 seconds.
Mar 28 03:14:31 namenode01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 28 03:14:31 namenode01 kernel: kworker/15:2    D ffff88208618bdd8     0 22246      2 0x00000080
Mar 28 03:14:31 namenode01 kernel: ffff88208618bbf0 0000000000000046 ffff8828e3130000 ffff88208618bfd8
Mar 28 03:14:31 namenode01 kernel: ffff88208618bfd8 ffff88208618bfd8 ffff8828e3130000 ffff88208618bd60
Mar 28 03:14:31 namenode01 kernel: 7fffffffffffffff ffff88208618bd58 ffff8828e3130000 ffff88208618bdd8
Mar 28 03:14:31 namenode01 kernel: Call Trace:
Mar 28 03:14:31 namenode01 kernel: [<ffffffff816a94e9>] schedule+0x29/0x70
Mar 28 03:14:31 namenode01 kernel: [<ffffffff816a6ff9>] schedule_timeout+0x239/0x2c0
Mar 28 03:14:31 namenode01 kernel: [<ffffffff8105aeae>] ? physflat_send_IPI_mask+0xe/0x10
Mar 28 03:14:31 namenode01 kernel: [<ffffffff81050b5c>] ? native_smp_send_reschedule+0x4c/0x70
Mar 28 03:14:31 namenode01 kernel: [<ffffffff810c0548>] ? resched_curr+0xa8/0xc0
Mar 28 03:14:31 namenode01 kernel: [<ffffffff810c12c8>] ? check_preempt_curr+0x78/0xa0
Mar 28 03:14:31 namenode01 kernel: [<ffffffff810c1309>] ? ttwu_do_wakeup+0x19/0xd0
Mar 28 03:14:31 namenode01 kernel: [<ffffffff816a989d>] wait_for_completion+0xfd/0x140
Mar 28 03:14:31 namenode01 kernel: [<ffffffff810c4810>] ? wake_up_state+0x20/0x20
Mar 28 03:14:31 namenode01 kernel: [<ffffffff810b07ea>] kthread_create_on_node+0xaa/0x140
Mar 28 03:14:31 namenode01 kernel: [<ffffffff810a93c0>] ? manage_workers.isra.24+0x2a0/0x2a0
Mar 28 03:14:31 namenode01 kernel: [<ffffffff810a8f8b>] create_worker+0xeb/0x200
Mar 28 03:14:31 namenode01 kernel: [<ffffffff810a9216>] manage_workers.isra.24+0xf6/0x2a0
Mar 28 03:14:31 namenode01 kernel: [<ffffffff810a9743>] worker_thread+0x383/0x3c0
Mar 28 03:14:31 namenode01 kernel: [<ffffffff810a93c0>] ? manage_workers.isra.24+0x2a0/0x2a0
Mar 28 03:14:31 namenode01 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0
Mar 28 03:14:31 namenode01 kernel: [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
Mar 28 03:14:31 namenode01 kernel: [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
Mar 28 03:14:31 namenode01 kernel: [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
Mar 28 03:16:31 namenode01 kernel: INFO: task kworker/2:2:11678 blocked for more than 120 seconds.
Mar 28 03:16:31 namenode01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 28 03:16:31 namenode01 kernel: kworker/2:2     D ffff88003922bdd8     0 11678      2 0x00000080
Mar 28 03:16:31 namenode01 kernel: ffff88003922bbf0 0000000000000046 ffff881fffd86eb0 ffff88003922bfd8
Mar 28 03:16:31 namenode01 kernel: ffff88003922bfd8 ffff88003922bfd8 ffff881fffd86eb0 ffff88003922bd60
Mar 28 03:16:31 namenode01 kernel: 7fffffffffffffff ffff88003922bd58 ffff881fffd86eb0 ffff88003922bdd8
Mar 28 03:16:31 namenode01 kernel: Call Trace:
Mar 28 03:16:31 namenode01 kernel: [<ffffffff816a94e9>] schedule+0x29/0x70
Mar 28 03:16:31 namenode01 kernel: [<ffffffff816a6ff9>] schedule_timeout+0x239/0x2c0
Mar 28 03:16:31 namenode01 kernel: [<ffffffff810c8f18>] ? __enqueue_entity+0x78/0x80
Mar 28 03:16:31 namenode01 kernel: [<ffffffff810cf90c>] ? enqueue_entity+0x26c/0xb60
Mar 28 03:16:31 namenode01 kernel: [<ffffffff816a989d>] wait_for_completion+0xfd/0x140
Mar 28 03:16:31 namenode01 kernel: [<ffffffff810c4810>] ? wake_up_state+0x20/0x20
Mar 28 03:16:31 namenode01 kernel: [<ffffffff810b07ea>] kthread_create_on_node+0xaa/0x140
Mar 28 03:16:31 namenode01 kernel: [<ffffffff810a93c0>] ? manage_workers.isra.24+0x2a0/0x2a0
Mar 28 03:16:31 namenode01 kernel: [<ffffffff810a8f8b>] create_worker+0xeb/0x200
Mar 28 03:16:31 namenode01 kernel: [<ffffffff810a9216>] manage_workers.isra.24+0xf6/0x2a0
Mar 28 03:16:31 namenode01 kernel: [<ffffffff810a9743>] worker_thread+0x383/0x3c0
Mar 28 03:16:31 namenode01 kernel: [<ffffffff810a93c0>] ? manage_workers.isra.24+0x2a0/0x2a0
Mar 28 03:16:31 namenode01 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0
Mar 28 03:16:31 namenode01 kernel: [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
Mar 28 03:16:31 namenode01 kernel: [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
Mar 28 03:16:31 namenode01 kernel: [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
Mar 28 03:16:31 namenode01 kernel: INFO: task kworker/15:2:22246 blocked for more than 120 seconds.
Mar 28 03:16:31 namenode01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 28 03:16:31 namenode01 kernel: kworker/15:2    D ffff88208618bdd8     0 22246      2 0x00000080
Mar 28 03:16:31 namenode01 kernel: ffff88208618bbf0 0000000000000046 ffff8828e3130000 ffff88208618bfd8
Mar 28 03:16:31 namenode01 kernel: ffff88208618bfd8 ffff88208618bfd8 ffff8828e3130000 ffff88208618bd60
Mar 28 03:16:31 namenode01 kernel: 7fffffffffffffff ffff88208618bd58 ffff8828e3130000 ffff88208618bdd8

原因在于,默认情况下, Linux会最多使用40%的可用内存作为文件系统缓存。当超过这个阈值后,文件系统会把将缓存中的内存全部写入磁盘, 导致后续的IO请求都是同步的。将缓存写入磁盘时,有一个默认120秒的超时时间。

出现上面的问题的原因是IO子系统的处理速度不够快,不能在120秒将缓存中的数据全部写入磁盘。IO系统响应缓慢,导致越来越多的请求堆积,最终系统内存全部被占用,导致系统失去响应。

这个Linux延迟写机制带来的问题,并且在主机内存越大时,出现该问题的可能性更大。这篇文章也提到This is a know bug

This is a know bug. By default Linux uses up to 40% of the available memory for file system caching. After this mark has been reached the file system flushes all outstanding data to disk causing all following IOs going synchronous. For flushing out this data to disk this there is a time limit of 120 seconds by default. In the case here the IO subsystem is not fast enough to flush the data withing 120 seconds. This especially happens on systems with a lof of memory.

The problem is solved in later kernels and there is not “fix” from Oracle. I fixed this by lowering the mark for flushing the cache from 40% to 10% by setting “vm.dirty_ratio=10” in /etc/sysctl.conf. This setting does not influence overall database performance since you hopefully use Direct IO and bypass the file system cache completely.

链接:nfo-task-blocked-for-more-than-120-seconds

关于脏数据,有几个配置:

vm.dirty_background_ratio 是内存可以填充“脏数据”的百分比。这些“脏数据”在稍后是会写入磁盘的,pdflush/flush/kdmflush这些后台进程会稍后清理脏数据。举一个例子,我有32G内存,那么有3.2G的内存可以待着内存里,超过3.2G的话就会有后来进程来清理它。

vm.dirty_ratio 是绝对的脏数据限制,内存里的脏数据百分比不能超过这个值,如果超过,将强制刷写到磁盘。如果脏数据超过这个数量,新的IO请求将会被阻挡,直到脏数据被写进磁盘。这是造成IO卡顿的重要原因,但这也是保证内存中不会存在过量脏数据的保护机制。

vm.dirty_expire_centisecs 指定脏数据能存活的时间。在这里它的值是30秒。当 pdflush/flush/kdmflush 进行起来时,它会检查是否有数据超过这个时限,如果有则会把它异步地写到磁盘中。毕竟数据在内存里待太久也会有丢失风险。

vm.dirty_writeback_centisecs 指定多长时间 pdflush/flush/kdmflush 这些进程会起来一次。

调优

我们可以从以下思路进行调优:

  1. 减少脏数据的比例,避免刷写超时
  2. 减小脏数据在内存中的存放时间,避免积少成多

修改相应参数

临时修改

# sysctl -w vm.dirty_ratio=10
# sysctl -w vm.dirty_background_ratio=5
# sysctl -p

永久修改

#vi /etc/sysctl.conf
写入
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
sysctl -p

参考


Search

    Table of Contents