SQLDTS.comSQLIS.comSQLIS Wiki
Data Flow: DefaultMaxBufferRows and DefaultBufferSize
By Ashvini Sharma
Version 2005
Level Beginner

Blogs From The Inside Series.

Reproduced with kind permission from the blog of Ashvini Sharma (MSFT)

There're two properties on the Data Flow task that have piqued enough interest to cause this post. Note that while I'm describing numbers below, the SSIS team reserves the right to change them in response to our performance testing. There're no secrets here, most of this information is exposed plainly and some other (like MaxBufferSize) can be retrieved easily as well. This post hopefully puts it together.

First the basics:

DefaultMaxBufferRows: maximum number of rows in a buffer. Defaults to 10,000.

DefaultBufferSize: maximum size of the buffer (in bytes). Defaults to 10MB.

Now, how do these inter-relate? I.e. how many rows will go down a path in a buffer?

Well, first of all that depends on the type of buffer and what columns will be in it. For the purposes of this post, assume that one execution tree has one type of a buffer. This means in my previous post's data flow task, there will be three types of buffers. Moving from one execution tree to another means a new buffer type and that means memcpy, something that users should typically avoid. However, getting an execution tree could also mean the scheduler might allocate a different thread for components on this execution tree, so you might get more CPUs being utilized. More on that when we discuss execution plans.

Back to the buffer sizes. Lets take one execution tree (#1) that goes from Flat file -> Data Convert -> Lookup and ends up going into the Sort. Note that the buffer type is the same even though in my case I'm creating new columns in Data Convert and bringing in new columns from the reference table in the Lookup. Why? Because memcpy just for reformatting memory is bad for performance so we try to do as little of that as possible. That means the underlying buffer that the Flat File produces has some extra columns that are not filled in until downstream components get a chance. There're also some optimizations the pipeline does to reuse columns in the buffer when possible like when a multicast is producing two outputs, both of which have a derived column that produce a column of the same type. So, it's not a trivial exercise to find out what columns are on a buffer but hopefully you can get a ballpark. Lets say this is RowSize in a buffer.

A few other variables should be called out here.

MaxBufferSize (not the DefaultMaxBufferSize). This cannot be changed externally. Currently it's set to 100MB but that might change. The reason for this maximum is to avoid wastage of memory where most of the buffer is empty. You'll get an error if you attempt to increase DefaultMaxBufferSize to more than this.

MinBufferSize. This is the minimum size of the buffer and currently has the granularity of VirtualAlloc. On my machine, for example, this is exposed in the following BufferSizeTuning LogEvent:

Rows in buffer type 13 would cause a buffer size less than allocation minimum, which is 65536 bytes. There will be 1365 rows in buffers of this type.

So, I've got a MinBufferSize of 64K.

Given these, the pipeline initialization code (that creates buffer types and sets metadata about them such as what columns and how many rows) then gets busy to calculate the buffer sizes.

If the size of one buffer (RowSize * DefaultMaxBufferRows) will exceed DefaultMaxBufferSize, the number of rows will be reduced to keep the memory requirements predictable. Note that one row cannot span multiple buffers, so in a rare case there'd be a huge buffer with just one row.

If the size of one buffer will be less than MinBufferSize, the number of rows will be increased to at least match MinBufferSize which is the allocation granularity to maximize memory utilization.

If the size of one buffer is between the Min and Max, the memory for a buffer will be a multiple of the allocation granularity that most closely fits the number of rows requested. This is again to maximize memory utilization. The number of rows will be adjusted a little to fit the allocated space.

Drop me a note if you play around these and have some interesting information to share, or better still, write/present on it. I'm also interested in hearing about what further information you need to gain insights into perf tuning the pipeline.

later!

Page Last Updated: 06 May 2005 (50)