Comparison Of C/C++ Memory Allocators Under Microsoft Windows

by Joerg Walter

Introduction

We implemented a so called 'long running server application' with Microsoft Visual C++ under Windows NT, extensivly using dynamic memory allocation via the new/delete features of the programming language. In production we then experienced the following diffuse memory problem: some time after starting the process the performance monitor of Windows NT showed us a relative low value for the 'Private Bytes' and an unexpected high, continuously growing value for the 'Virtual Bytes' counter. The analysis and resolution of this memory problem led to this working note.

Resources

First step in today's world is to search the Internet, whether anybody else out there had a similar problem in the past and was able to solve it. If you do so, you soon will find links to the following related sites:

The Memory Management Reference The most comprehensive overview of the topic, we have seen so far.
Dynamic Storage Allocation Information Repository A useful site maintained by Mr. Zorn from Microsoft (sic!).
Doug Lea's Workstation There are many references to the allocator, Mr. Lea has written.

You then will get the following public available, excellent survey article on the subject:

Paul R. Wilson, Mark S. Johnstone, Michael Neely, David Boles. 1995. Dynamic Storage Allocation: A Survey and Critical Review. University of Texas at Austin. Online copy.

It mentions widely unknown facts about memory fragmentation of allocators, which were first proven in the following article:

J. M. Robson. 1977. Worst case fragmentation of first fit and best fit storage allocation strategies. ACM. ACM Computer Journal, 20(3):242-244.

Fragmentation in Theory

Next step is to understand these critical limits for memory allocators. It is possible to estimate a lower bound for the worst case fragmentation of storage allocation strategies with the following formula:

Given the quantities
  M: net size of all memory requested by an application,
  n
min: minimal requested chunk size,
  n
max: maximal requested chunk size,
in the worst case the gross size G of memory needed by an allocator to fulfill the requests is O(M log
2(nmax / nmin)).

Let us give a simplified sketch of the proof, assuming that all quantities are powers of two. Further the imagined allocator may be so constructed, that it leaves no unused holes in memory and requests new memory from the operating system only if needed. We use natural induction over log2(nmax / nmin).

1. We begin with the simplest case log2(nmax / nmin) = 1, i.e. nmax = 2 nmin. Let X >= nmax be some fixed memory size. First allocate X / nmin chunks of size nmin. Then G >= X holds. In our simplified world the allocated chunks may be consecutive and numbered ci by ascending address . Deallocate every second chunk c2i + 1 of size nmin, resulting in a hole of size nmin < nmax. Next allocate X / nmax chunks of size nmax. Now G >= 2 X holds. Deallocate every second chunk c2i + 1 of size nmax, resulting in a hole of size nmax. Last deallocate all chunks c4i + 2 of size nmin resulting in holes of size 3 nmin < 2 nmax. This results in a memory layout with allocated chunks of net size (1 / 2 + 1 / 4) X and gross size G >= 2 X having holes of maximum size smaller 2 nmax.

2. Now assume, the previous statement holds for log2(nmax / nmin) = k, i.e. we have a memory layout with allocated chunks of net size (1 / 2 + ... + 1 / 2k + 1) X and gross size G >= (k + 1) X having holes of maximum size smaller 2 nmax. Allocate X / (2 nmax ) chunks of size 2 nmax. Now G >= (k + 2) X holds. Deallocate every second chunk c2i + 1 of size 2 nmax, resulting in a hole of size 2 nmax. Then deallocate all chunks c4i + 2 of size nmax resulting in holes of size 3 nmax < 4 nmax and proceed for the other chunk sizes down to nmin. This results in a memory layout with allocated chunks of net size (1 / 2 + ... + 1 / 2k+2) X and gross size G = (k + 2) X having holes of maximum size smaller 4 nmax.

Now using (1 + 1 / 2 + ... + 1 / 2k + 1) X <= 2 X gives our result G >= M / 2 (log2(nmax / nmin) + 1) = O(M log2(nmax / nmin)) for M = 2 X.

From the theoretical perspective one distinguishes between two different kinds of fragmentation, namely internal and external fragmentation. Internal fragmentation denotes the distribution of application allocated memory in the arenas reserved to the allocator, external fragmentation denotes the distribution of arenas reserved to the allocator in the address space of the application.

Fragmentation in Practice

In the practical perspective, fragmentation on Microsoft Windows may incorporate more phenomena. First one has the distribution of application allocated memory in the committed memory. Then one may see some distribution of committed memory in the reserved memory. Furthermore one has the distribution of reserved memory in the address space of the application.

You can implement the above shown constructive procedure in C++ leading to a function like this:

typedef std::pair< char *, bool > Entry;
typedef std::set< Entry * > EntrySet;
typedef EntrySet::iterator EntrySetIterator;

void robson_test1 (int runs, int M, int n_min, int n_max) {
    CTimingStats timing_stats;
    CMemoryStats memory_stats_max;
    CMemoryStats memory_stats_min;
    unsigned allocated = 0, min_address = UINT_MAX, max_address = 0;
    CStatsCounter allocated_stats, min_address_stats, max_address_stats;
    for (int k = 0; k < runs; k ++) {
        timing_stats.Start ();
        EntrySet entry_set [32];
        for (int n = n_min; n <= n_max; ++ n) {
            int i, j;
            for (i = 0; i < (1 << (M  - 1 - n)); ++ i) {
                char *p = new char [1 << n;];
                if (! p)
                    return;
                entry_set [n].insert (new Entry (p, true));
                allocated += 1 << n;
                min_address = min (min_address, (unsigned) p);
                max_address = max (max_address, (unsigned) p);
            }
            if (n == n_max) {
                allocated_stats.Collect (allocated / 1024);
                min_address_stats.Collect (min_address / 1024);
                min_address = UINT_MAX;
                max_address_stats.Collect (max_address / 1024);
                max_address = 0;
                memory_stats_max.Collect ();
            }
            EntrySetIterator entry_set_iterator;
            for (i = n; i >= n_min; -- i) {
                for (j = 1, entry_set_iterator = entry_set [i].begin (); 
                    entry_set_iterator != entry_set [i].end (); 
                        ++ j, ++ entry_set_iterator) {
                    if (j % (1 << (n + 1 - i))) {   
                        if ((*entry_set_iterator)->second) {
                            delete [] (*entry_set_iterator)->first;
                            (*entry_set_iterator)->second = false;
                            allocated -= 1 << i;
                        }
                    }
                }
            }
        }
        EntrySetIterator entry_set_iterator;
        for (n = n_min; n <= n_max; ++ n) {
            for (entry_set_iterator = entry_set [n].begin (); 
                entry_set_iterator != entry_set [n].end (); 
                    ++ entry_set_iterator) {
                if ((*entry_set_iterator)->second) {
                    delete [] (*entry_set_iterator)->first;
                    (*entry_set_iterator)->second = false;
                    allocated -= 1 << n;
                }
                delete *entry_set_iterator;
            }
            entry_set [n].clear ();
        }
        memory_stats_min.Collect ();
        timing_stats.Stop ();
    }
    allocated_stats.Dump ("allocated memory ", "", " kbytes");
    min_address_stats.Dump ("minimal virtual address ", "", " kbyte");
    max_address_stats.Dump ("maximal virtual address ", "", " kbyte");
    memory_stats_max.Dump ("at max ");
    memory_stats_min.Dump ("at min ");
    timing_stats.Dump ("");
}

With this function you are able to induce a fragmentation effect near the theoretical limit to every allocator. So this one seems to be a good test case, if you are examining fragmentation phenomena in memory allocators.

The above referenced measuring classes CTimingStats and CMemoryStats may be implemented using Windows NT systems calls.

class CStatsCounter {
private:
    int m_count, m_min, m_sum, m_max;

public:
    CStatsCounter (void): 
        m_count (0), m_min (INT_MAX), m_sum (0), m_max (INT_MIN) {}
    ~CStatsCounter (void) {}

    void Collect (int value) {
        ++ m_count;
        m_min = min (m_min, value);
        m_sum += value;
        m_max = max (m_max, value);
    }

    void Dump (const char *what, const char *point, const char *unit) {
        std::cout << "<tr>" << std::endl;
        std::cout << "<td><code>" << what << point << "</code></td>" << std::endl;
        std::cout << "<td><code>" << m_count << "</code></td>" << std::endl;
        std::cout << "<td><code>" << m_min << unit << "</code></td>" << std::endl;
        std::cout << "<td><code>" << m_sum / m_count << unit << "</code></td>" << std::endl;
        std::cout << "<td><code>" << m_max << unit << "</code></td>" << std::endl;
        std::cout << "</tr>" << std::endl;
    }
};

class CMemoryStats {
private:
    MEMORY_BASIC_INFORMATION m_memory_info;
    CStatsCounter m_region_stats, m_free_stats, m_reserve_stats, m_commit_stats, m_frag_stats;

public:
    CMemoryStats (void) {}
    ~CMemoryStats (void) {}

    void Collect (void) {
        memset (&m_memory_info, 0, sizeof (m_memory_info));
        unsigned region = 0;
        unsigned sum_free = 0, max_free = 0;
        unsigned sum_reserve = 0, max_reserve = 0;
        unsigned sum_commit = 0, max_commit = 0;
        while (::VirtualQuery(m_memory_info.BaseAddress, &m_memory_info, sizeof (m_memory_info))) {
            ++ region;
            switch (m_memory_info.State) {
            case MEM_FREE:
                sum_free += m_memory_info.RegionSize;
                max_free = max (max_free, m_memory_info.RegionSize);
                break;
            case MEM_RESERVE:
                sum_reserve += m_memory_info.RegionSize;
                max_reserve = max (max_reserve, m_memory_info.RegionSize);
                break;
            case MEM_COMMIT:
                sum_commit += m_memory_info.RegionSize;
                max_commit = max (max_commit, m_memory_info.RegionSize);
                break;
            }
            m_memory_info.BaseAddress = (char *) m_memory_info.BaseAddress + m_memory_info.RegionSize;
        } 
        m_region_stats.Collect (region);
        m_free_stats.Collect (sum_free / 1024);
        m_reserve_stats.Collect (sum_reserve / 1024);
        m_commit_stats.Collect (sum_commit / 1024);
        m_frag_stats.Collect (100 * (1. - double (max_free + max_reserve + max_commit) / double (sum_free + sum_reserve + sum_commit)));
    }

    void Dump (const char *point) {
        m_region_stats.Dump ("regions ", point, "");
        m_free_stats.Dump ("free memory ", point, " kbytes");
        m_reserve_stats.Dump ("reserved memory ", point, " kbytes");
        m_commit_stats.Dump ("committed memory ", point, " kbytes");
        m_frag_stats.Dump ("fragmentation ", point, " %");
    }
};

class CTimingStats {
private:
    unsigned m_start, m_stop;
    CStatsCounter m_stats;

public:
    CTimingStats (void): 
        m_start (0), m_stop (0) {}
    ~CTimingStats (void) {}

    void Start (void) {
        m_start = ::GetTickCount ();
    }
    void Stop (void) {
        m_stop = ::GetTickCount ();
        m_stats.Collect (m_stop - m_start);
    }

    void Dump (const char *point) {
        m_stats.Dump ("timing ", point, " ms");
    }
};

Testing Different Allocators

We decided to test new/delete of Microsoft's C++ runtime library and Doug Lea's malloc/free against a simple own allocator to get a picture, how these behave. The tests were implemented with Microsoft's Visual C++ 6.0 SP3 and Doug Lea's allocator version 2.6.6. The tests were independently run on an Intel Pentium II machine with 64 MB RAM running Windows 95, an Intel Pentium II machine with 128 MB RAM running Windows NT 4.0 SP6 and an Alpha-221164 machine with 128 MB RAM running Windows NT 4.0 SP5.

The tables on the following pages present the results for different runs with our test frame.

Discussion

The tests showed us, that both allocators, new/delete of Microsoft's C++ runtime library and Doug Lea's malloc/free, behave well for small and medium buffers. For big buffers, the CPU time used by Microsoft's allocator is unexpected high. In the critical big buffer test, Doug Lea's allocator does not release committed memory. We assume, that this is only a defect in the Windows port of his otherwise very good allocator.

We are also able to explain with these tests, why there may be a bad ratio of private bytes vs. virtual bytes for our long running server application: under certain circumstances, i.e. the allocation of big buffers, the implementation of HeapAlloc/HeapFree seems to reserve big chunks of the virtual address space and even worse, seems not to be able to release this reserved chunks in the appropriate extent.

On the other hand the implementation of VirtualAlloc/VirtualFree seems to fragment the virtual address space, if one lets the operating system choose the region boundaries. The only way we found to prevent VirtualAlloc from fragmenting the virtual address space was to specifiy the (somewhat unclear documented) MEM_TOP_DOWN flag in its parameter set.

Given the previous results of our examination, we saw no other possibility as to integrate our simple own allocator into production software. Beside the fact, that this was a serious intervention into our software, it might be interesting to note,

to achieve an almost complete coverage of allocation calls.

With these changes incorporated our software is now back in production and we are not experiencing the mentioned problem anymore. However the problems in new/delete of Microsoft's C++ runtime library and Doug Lea's malloc/free may lead to another tale...

Acknowledgements

Thanks to Joachim Pyras for discussing this.


Navigate: Main, Top

© 2000-2002 GeNeSys mbH & Co. KG
Last revised: 12/14/2000