by Joerg Walter
We implemented a so called 'long running server application'
with Microsoft Visual C++ under Windows NT, extensivly using
dynamic memory allocation via the new/delete
features of the programming language. In production we then
experienced the following diffuse memory problem: some time after
starting the process the performance monitor of Windows NT showed
us a relative low value for the 'Private Bytes' and an unexpected
high, continuously growing value for the 'Virtual Bytes' counter.
The analysis and resolution of this memory problem led to this
working note.
First step in today's world is to search the Internet, whether anybody else out there had a similar problem in the past and was able to solve it. If you do so, you soon will find links to the following related sites:
| The Memory Management Reference | The most comprehensive overview of the topic, we have seen so far. |
| Dynamic Storage Allocation Information Repository | A useful site maintained by Mr. Zorn from Microsoft (sic!). |
| Doug Lea's Workstation | There are many references to the allocator, Mr. Lea has written. |
You then will get the following public available, excellent survey article on the subject:
Paul R. Wilson, Mark S. Johnstone, Michael Neely, David Boles. 1995. Dynamic Storage Allocation: A Survey and Critical Review. University of Texas at Austin. Online copy.
It mentions widely unknown facts about memory fragmentation of allocators, which were first proven in the following article:
J. M. Robson. 1977. Worst case fragmentation of first fit and best fit storage allocation strategies. ACM. ACM Computer Journal, 20(3):242-244.
Next step is to understand these critical limits for memory allocators. It is possible to estimate a lower bound for the worst case fragmentation of storage allocation strategies with the following formula:
Given the quantities
M: net size of all memory requested by an application,
nmin: minimal requested
chunk size,
nmax: maximal requested
chunk size,
in the worst case the gross size G of memory needed by an
allocator to fulfill the requests is O(M log2(nmax
/ nmin)).
Let us give a simplified sketch of the proof, assuming that all quantities are powers of two. Further the imagined allocator may be so constructed, that it leaves no unused holes in memory and requests new memory from the operating system only if needed. We use natural induction over log2(nmax / nmin).
1. We begin with the simplest case log2(nmax / nmin) = 1, i.e. nmax = 2 nmin. Let X >= nmax be some fixed memory size. First allocate X / nmin chunks of size nmin. Then G >= X holds. In our simplified world the allocated chunks may be consecutive and numbered ci by ascending address . Deallocate every second chunk c2i + 1 of size nmin, resulting in a hole of size nmin < nmax. Next allocate X / nmax chunks of size nmax. Now G >= 2 X holds. Deallocate every second chunk c2i + 1 of size nmax, resulting in a hole of size nmax. Last deallocate all chunks c4i + 2 of size nmin resulting in holes of size 3 nmin < 2 nmax. This results in a memory layout with allocated chunks of net size (1 / 2 + 1 / 4) X and gross size G >= 2 X having holes of maximum size smaller 2 nmax.
2. Now assume, the previous statement holds for log2(nmax / nmin) = k, i.e. we have a memory layout with allocated chunks of net size (1 / 2 + ... + 1 / 2k + 1) X and gross size G >= (k + 1) X having holes of maximum size smaller 2 nmax. Allocate X / (2 nmax ) chunks of size 2 nmax. Now G >= (k + 2) X holds. Deallocate every second chunk c2i + 1 of size 2 nmax, resulting in a hole of size 2 nmax. Then deallocate all chunks c4i + 2 of size nmax resulting in holes of size 3 nmax < 4 nmax and proceed for the other chunk sizes down to nmin. This results in a memory layout with allocated chunks of net size (1 / 2 + ... + 1 / 2k+2) X and gross size G = (k + 2) X having holes of maximum size smaller 4 nmax.
Now using (1 + 1 / 2 + ... + 1 / 2k + 1) X <= 2 X gives our result G >= M / 2 (log2(nmax / nmin) + 1) = O(M log2(nmax / nmin)) for M = 2 X.
From the theoretical perspective one distinguishes between two different kinds of fragmentation, namely internal and external fragmentation. Internal fragmentation denotes the distribution of application allocated memory in the arenas reserved to the allocator, external fragmentation denotes the distribution of arenas reserved to the allocator in the address space of the application.
In the practical perspective, fragmentation on Microsoft Windows may incorporate more phenomena. First one has the distribution of application allocated memory in the committed memory. Then one may see some distribution of committed memory in the reserved memory. Furthermore one has the distribution of reserved memory in the address space of the application.
You can implement the above shown constructive procedure in C++ leading to a function like this:
typedef std::pair< char *, bool > Entry;
typedef std::set< Entry * > EntrySet;
typedef EntrySet::iterator EntrySetIterator;
void robson_test1 (int runs, int M, int n_min, int n_max) {
CTimingStats timing_stats;
CMemoryStats memory_stats_max;
CMemoryStats memory_stats_min;
unsigned allocated = 0, min_address = UINT_MAX, max_address = 0;
CStatsCounter allocated_stats, min_address_stats, max_address_stats;
for (int k = 0; k < runs; k ++) {
timing_stats.Start ();
EntrySet entry_set [32];
for (int n = n_min; n <= n_max; ++ n) {
int i, j;
for (i = 0; i < (1 << (M - 1 - n)); ++ i) {
char *p = new char [1 << n;];
if (! p)
return;
entry_set [n].insert (new Entry (p, true));
allocated += 1 << n;
min_address = min (min_address, (unsigned) p);
max_address = max (max_address, (unsigned) p);
}
if (n == n_max) {
allocated_stats.Collect (allocated / 1024);
min_address_stats.Collect (min_address / 1024);
min_address = UINT_MAX;
max_address_stats.Collect (max_address / 1024);
max_address = 0;
memory_stats_max.Collect ();
}
EntrySetIterator entry_set_iterator;
for (i = n; i >= n_min; -- i) {
for (j = 1, entry_set_iterator = entry_set [i].begin ();
entry_set_iterator != entry_set [i].end ();
++ j, ++ entry_set_iterator) {
if (j % (1 << (n + 1 - i))) {
if ((*entry_set_iterator)->second) {
delete [] (*entry_set_iterator)->first;
(*entry_set_iterator)->second = false;
allocated -= 1 << i;
}
}
}
}
}
EntrySetIterator entry_set_iterator;
for (n = n_min; n <= n_max; ++ n) {
for (entry_set_iterator = entry_set [n].begin ();
entry_set_iterator != entry_set [n].end ();
++ entry_set_iterator) {
if ((*entry_set_iterator)->second) {
delete [] (*entry_set_iterator)->first;
(*entry_set_iterator)->second = false;
allocated -= 1 << n;
}
delete *entry_set_iterator;
}
entry_set [n].clear ();
}
memory_stats_min.Collect ();
timing_stats.Stop ();
}
allocated_stats.Dump ("allocated memory ", "", " kbytes");
min_address_stats.Dump ("minimal virtual address ", "", " kbyte");
max_address_stats.Dump ("maximal virtual address ", "", " kbyte");
memory_stats_max.Dump ("at max ");
memory_stats_min.Dump ("at min ");
timing_stats.Dump ("");
}
With this function you are able to induce a fragmentation effect near the theoretical limit to every allocator. So this one seems to be a good test case, if you are examining fragmentation phenomena in memory allocators.
The above referenced measuring classes CTimingStats
and CMemoryStats may be implemented using Windows NT
systems calls.
class CStatsCounter {
private:
int m_count, m_min, m_sum, m_max;
public:
CStatsCounter (void):
m_count (0), m_min (INT_MAX), m_sum (0), m_max (INT_MIN) {}
~CStatsCounter (void) {}
void Collect (int value) {
++ m_count;
m_min = min (m_min, value);
m_sum += value;
m_max = max (m_max, value);
}
void Dump (const char *what, const char *point, const char *unit) {
std::cout << "<tr>" << std::endl;
std::cout << "<td><code>" << what << point << "</code></td>" << std::endl;
std::cout << "<td><code>" << m_count << "</code></td>" << std::endl;
std::cout << "<td><code>" << m_min << unit << "</code></td>" << std::endl;
std::cout << "<td><code>" << m_sum / m_count << unit << "</code></td>" << std::endl;
std::cout << "<td><code>" << m_max << unit << "</code></td>" << std::endl;
std::cout << "</tr>" << std::endl;
}
};
class CMemoryStats {
private:
MEMORY_BASIC_INFORMATION m_memory_info;
CStatsCounter m_region_stats, m_free_stats, m_reserve_stats, m_commit_stats, m_frag_stats;
public:
CMemoryStats (void) {}
~CMemoryStats (void) {}
void Collect (void) {
memset (&m_memory_info, 0, sizeof (m_memory_info));
unsigned region = 0;
unsigned sum_free = 0, max_free = 0;
unsigned sum_reserve = 0, max_reserve = 0;
unsigned sum_commit = 0, max_commit = 0;
while (::VirtualQuery(m_memory_info.BaseAddress, &m_memory_info, sizeof (m_memory_info))) {
++ region;
switch (m_memory_info.State) {
case MEM_FREE:
sum_free += m_memory_info.RegionSize;
max_free = max (max_free, m_memory_info.RegionSize);
break;
case MEM_RESERVE:
sum_reserve += m_memory_info.RegionSize;
max_reserve = max (max_reserve, m_memory_info.RegionSize);
break;
case MEM_COMMIT:
sum_commit += m_memory_info.RegionSize;
max_commit = max (max_commit, m_memory_info.RegionSize);
break;
}
m_memory_info.BaseAddress = (char *) m_memory_info.BaseAddress + m_memory_info.RegionSize;
}
m_region_stats.Collect (region);
m_free_stats.Collect (sum_free / 1024);
m_reserve_stats.Collect (sum_reserve / 1024);
m_commit_stats.Collect (sum_commit / 1024);
m_frag_stats.Collect (100 * (1. - double (max_free + max_reserve + max_commit) / double (sum_free + sum_reserve + sum_commit)));
}
void Dump (const char *point) {
m_region_stats.Dump ("regions ", point, "");
m_free_stats.Dump ("free memory ", point, " kbytes");
m_reserve_stats.Dump ("reserved memory ", point, " kbytes");
m_commit_stats.Dump ("committed memory ", point, " kbytes");
m_frag_stats.Dump ("fragmentation ", point, " %");
}
};
class CTimingStats {
private:
unsigned m_start, m_stop;
CStatsCounter m_stats;
public:
CTimingStats (void):
m_start (0), m_stop (0) {}
~CTimingStats (void) {}
void Start (void) {
m_start = ::GetTickCount ();
}
void Stop (void) {
m_stop = ::GetTickCount ();
m_stats.Collect (m_stop - m_start);
}
void Dump (const char *point) {
m_stats.Dump ("timing ", point, " ms");
}
};
We decided to test new/delete of
Microsoft's C++ runtime library and Doug Lea's malloc/free
against a simple own allocator to get a
picture, how these behave. The tests were implemented with
Microsoft's Visual C++ 6.0 SP3 and Doug Lea's allocator version
2.6.6. The tests were independently run on an Intel Pentium II
machine with 64 MB RAM running Windows 95, an Intel Pentium II
machine with 128 MB RAM running Windows NT 4.0 SP6 and an
Alpha-221164 machine with 128 MB RAM running Windows NT 4.0 SP5.
The tables on the following pages present the results for different runs with our test frame.
The tests showed us, that both allocators, new/delete
of Microsoft's C++ runtime library and Doug Lea's malloc/free,
behave well for small and medium buffers. For big buffers, the
CPU time used by Microsoft's allocator is unexpected high. In the
critical big buffer test, Doug Lea's allocator does not
release committed memory. We assume, that this is only a defect
in the Windows port of his otherwise very good allocator.
We are also able to explain with these tests, why there may be
a bad ratio of private bytes vs. virtual bytes for our long
running server application: under certain circumstances, i.e. the
allocation of big buffers, the implementation of HeapAlloc/HeapFree
seems to reserve big chunks of the virtual address space
and even worse, seems not to be able to release this reserved
chunks in the appropriate extent.
On the other hand the implementation of VirtualAlloc/VirtualFree
seems to fragment the virtual address space, if one lets
the operating system choose the region boundaries. The only way
we found to prevent VirtualAlloc from fragmenting
the virtual address space was to specifiy the (somewhat unclear
documented) MEM_TOP_DOWN flag in its parameter set.
Given the previous results of our examination, we saw no other possibility as to integrate our simple own allocator into production software. Beside the fact, that this was a serious intervention into our software, it might be interesting to note,
new/delete
and to achieve an almost complete coverage of allocation calls.
With these changes incorporated our software is now back in
production and we are not experiencing the mentioned problem
anymore. However the problems in new/delete
of Microsoft's C++ runtime library and Doug Lea's malloc/free
may lead to another tale...
Thanks to Joachim Pyras for discussing this.
© 2000-2002 GeNeSys
mbH & Co. KG
Last revised: 12/14/2000