摘要

为了对GAP3程序进行调试和优化, 准备使用Valgrind对其内存调用进行检查. 本文是对Valgrind功能的初步探索, 对一个简单程序的Valgrind输出进行了分析.

八卦

很早就听说过Valgrind这个软件, 对它的名字和Logo很好奇, 粗略搜了一下名字, 在官方FAQ里发现这样一段话

Q: Where does the name “Valgrind” come from?

A: From Nordic mythology. Originally (before release) the project was named Heimdall, after the watchman of the Nordic gods. He could “see a hundred miles by day or night, hear the grass growing, see the wool growing on a sheep’s back”, etc. This would have been a great name, but it was already taken by a security package “Heimdal”.

Keeping with the Nordic theme, Valgrind was chosen. Valgrind is the name of the main entrance to Valhalla (the Hall of the Chosen Slain in Asgard). Over this entrance there resides a wolf and over it there is the head of a boar and on it perches a huge eagle, whose eyes can see to the far regions of the nine worlds. Only those judged worthy by the guardians are allowed to pass through Valgrind. All others are refused entrance.

It’s not short for “value grinder”, although that’s not a bad guess.

大意就是, Valgrind取自北欧神话中Valhalla[1]入口的名字, 这个地方挂着熊头, 栖息着一匹狼和一只巨鹰, 视野达九界终焉. 只有被这些守卫者认为具有价值者才可通过Valgrind. 真是读书人呀 :P

安装

Valgrind的安装比较容易, 可以下载源码, 用./configure; make; make install的方式. 在Fedora上可以从源安装,

1
sudo dnf install valgrind

撰写本文时Valgrind最新版本为3.15.0, macOS支持到10.13, 但不支持10.14 Mojave.

程序编译选项

为了让Valgrind能够工作, 在编译程序时, 应该采用-g -O1选项进行编译, 原因是

  • 使用-g产生所有debug需要的symbol.
  • 优化选项推荐用O1. 可以用O0, 但那会非常慢, 因为在用Valgrind跑程序本身就要慢上20-30倍并用上数倍的内存.
  • 不推荐用O2及以上, 那样会产生很多uninitialised value错误, 但这些error其实是由于优化产生, 并不实际存在于code当中.

编译完程序以后, 在需要运行程序(比如hello.out)的地方执行

1
valgrind --leak-check=yes ./hello.out

将默认使用Memcheck工具, 检查内存leakage问题. 将yes改成full可以输出具体细节.

如果需要检查内存泄漏以外的问题, 尽量使用-O0. 这个可以在后面的例子里看到.

实例: 越界赋值与内存泄漏

Fortran源码

下面是一个Fortran程序例子abf.f90, 参考自Jason Blevins的代码

1
2
3
4
5
6
7
8
9
10
11
12
program abc
integer :: i
integer, allocatable :: data(:)

allocate(data(5))
print*, rank(data), size(data), loc(data)
do i = 1, 5
data(i-1) = i
end do
print*, data(1)
print*, rank(data), size(data), loc(data)
end program abc

做一点说明

  • 这个程序存在两个问题, 一是对data(0)赋值, 但Fortran分配数组时下标从1开始. 这是一个heap block overrun问题. 二是没有对data数组进行deallocate内存释放
  • rank, sizeloc内建函数分别返回data的维度, 数组中确定类型的数的个数, 即所有维度上长度连乘, 以及array descriptor的位置.

编译和运行

编译abc.f90

1
gfortran abc.f90 -o abc -g -O0

为了不让程序直接报错, 没有加上debug需要的-fbounds-check. 运行Valgrind

1
valgrind ./abc --leak-check=full

输出解读

运行Valgrind给出的输出是

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
==28740== Memcheck, a memory error detector
==28740== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==28740== Using Valgrind-3.14.0 and LibVEX; rerun with -h for copyright info
==28740== Command: ./abc --leak-check=full
==28740==
1 5 97983168
==28740== Invalid write of size 4
==28740== at 0x40098C: MAIN__ (abc.f90:8)
==28740== by 0x400A8F: main (abc.f90:12)
==28740== Address 0x5d71abc is 4 bytes before a block of size 20 alloc'd
==28740== at 0x4C2CDCB: malloc (vg_replace_malloc.c:299)
==28740== by 0x400865: MAIN__ (abc.f90:5)
==28740== by 0x400A8F: main (abc.f90:12)
==28740==
1 5 97983168
==28740==
==28740== HEAP SUMMARY:
==28740== in use at exit: 20 bytes in 1 blocks
==28740== total heap usage: 22 allocs, 21 frees, 13,516 bytes allocated
==28740==
==28740== LEAK SUMMARY:
==28740== definitely lost: 20 bytes in 1 blocks
==28740== indirectly lost: 0 bytes in 0 blocks
==28740== possibly lost: 0 bytes in 0 blocks
==28740== still reachable: 0 bytes in 0 blocks
==28740== suppressed: 0 bytes in 0 blocks
==28740== Rerun with --leak-check=full to see details of leaked memory
==28740==
==28740== For counts of detected and suppressed errors, rerun with: -v
==28740== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

对这些输出做一些说明

  • 运行./abc不会报错, 两个print均正常打印结果, 输出data(1)值为2. loc在赋值前后不变.
  • 28740是运行任务的PID
  • Invalid write部分指出越界赋值的问题. at 0x40098C: MAIN__ (abc.f90:8)表示该write在abc.f90的第8行. 如果采用-O1优化, 这个错误会被隐藏起来.
  • HEAP SUMMARY给出占用内存的信息. 可以看到, 由于没有free掉整数类型(4 bytes)的data(5), 在退出时还有20 bytes被占用(in use at exit)

关于memory lost的类型

LEAK SUMMARY给出内存泄漏的总结. 不同的lost对应的含义(总结于Valgrind FAQ)

类型 解释
definitely lost 程序有内存泄漏. 这些泄漏必须修正.
indirectly lost 程序的一个基于指针的结构里存在内存泄漏
possibly lost 程序存在内存泄漏, 除非做一些操作使得指针重新指向已分配的内存块.
still reachable It didn’t free some memory it could have.
suppressed 存在内存泄漏, 但这些泄漏信息因为Valgrind设置的关系, 被抑制输出了

做一点说明

  • 对于“indirectly lost”的解释是, 例如一个二叉树根节点definitely lost, 那么它的子节点就是indirectly lost. Indirectly lost通常在修复definitely lost后消失.
  • “still reachable”的含义还不是很清楚, 只是把FAQ原文抄录了, 因为自己还没有遇到, FAQ也说它是一个“常见和合理的错误”, 因此不做展开.

总结

本文基于一个简单的Fortran程序, 从编译源代码开始, 展示了用Valgrind对程序内存调用进行检查的过程. 对Valgrind输出信息进行了初步解读.


  1. 1.Valhalla在老伊达(The Poetic Edda)里有记录, 想起来之前读文史大纲的时候遇到过, 见郑振铎全集第十卷P343.

Comments