[Repost] Thinking Set-Based .... or not?

Reposted from http://weblogs.sqlteam.com/jeffs/archive/2007/04/30/60192.aspx

Thinking "Set-Based"

So, I hear you're a "set-based SQL master"!

As Yoda once said, you've "unlearned what you have learned". You've trained yourself to attack your database code not from a procedural, step-by-step angle, but rather from the set-based "do it all at once" approach. It may have taken weeks, months or even years to finally obtain this enlightened state of "database zen", but it was worth it. Your SQL code is short, fast, and efficient. There is not a cursor in sight. You have reached the point where you can write a single SELECT that replaces hundreds of lines of cursors, temp tables and client-side processing. Life is good.

As I read somewhere once, you don't tell SQL how to do it, you tell SQL what you want, and that's a great way of thinking about it. A procedural programmer gets bogged down with the details, and has to concentrate on breaking things down into small pieces, explicitly reading and processing one row of data at a time, and figuring out how to combine those results together at the end to make it all work. A set-based SQL programmer worries about none of those things: In the set-based world, you state your relations and join the tables together, add some grouping and criteria, and it is the database engine that worries about the specifics.

Well, maybe not ... You might not want to abandon all of the things that you learned from your procedural background. There's a danger in assuming that set-based programming means "doing it all at once", and that it forbids processing things "one at a time" or "in steps". Sometimes, when you get too comfortable in the set-based way of thinking, you abandon the good things that you learned as a procedural programmer. The two mindsets aren't as different as you might think!

Approaching a Problem

What if I ask you to write a somewhat complicated SELECT, something like this:

"Write a SELECT that returns, for a given @Year, the total sales by office, and also the office's top salesperson (highest total sales for the year) with their salary (as of the last day of that year), their total bonuses for that year, and their hire date."

While this isn't rocket science, what makes this request slightly complicated is that it appears there are at least 3 different transactional queries (sales by employee, sales by office, bonus totals by employee) that we need to put all together, as well as some point-in-time reporting off of a history table (employee salaries), which can be difficult depending on how the table is structured.

Now, how does a "set-based" programmer attack this? The schema and the specifics are not important; it is really just the general approach that I am commenting on.

Do you start by immediately finding all of the necessary tables and put them all into 1 big SELECT by joining everything that matches? Then, from there, you may start adding columns and expressions to your GROUP BY clause, adding in criteria and CASE expressions, maybe a DISTINCT before it all? And then, if that doesn't work, maybe you add some correlated subqueries to your SELECT list, or move things in and out of derived tables? Then more GROUPING, more criteria, more JOINs, more moving things and shifting parts of the SELECT around until it "looks right" and it "seems to work"?

Well, that does seem to be the set-based approach for many, since you get so trained and so used to thinking of the "big picture", and not worrying about details, that you just assume that you can dive right in and start joining and selecting and eventually you'll get there. We've all done it. That's what you want to do, after all. We don't want to think that we need to break things down into smaller, discrete steps, or that things should be "processed" one step at a time. It goes against everything that we've been trying to train ourselves to do ever since we embraced the concept of relational database programming, right?


Thinking in Sets = Thinking in Steps

It is so important to understand that "thinking set-based" does not conflict with "thinking in steps"! In fact, thinking in steps is more important than ever in some ways, especially as your data and your schemas and your requirements become more complex.

In the above example, if you "dive right in" and start joining and selecting and grouping and seeing how things work, that is exactly the wrong way to do it! You need to remember that the skill you learned from your procedural world -- breaking larger problems down into smaller parts -- still applies even when writing SQL.

Looking at the above statement, a really good SQL developer will immediately break the problem down into smaller, completely separate parts:
a SELECT that returns 1 row per Office, with each Office's total sales for a @Year
a SELECT that returns 1 row per employee, with their salary as of the last day of a given @Year
a SELECT that returns 1 row per employee, with their total bonus amount for a given @Year
a SELECT that returns 1 row per Office, with the top salesperson (Employee) and their sales amount, for a @Year
Starting with those 4 basic pieces, all of which are completely isolated from the others, is the way to begin to approach the problem. You don't focus on returning employee names, or sorting, or formatting dates -- you focus on the data, and returning it in small parts that will eventually all fit together. For each SELECT, you can test it and optimize it and verify the data, and only at the very end, when all the individual parts are working, do you put them together. This sounds familiar, doesn't it? Much like a procedural programmer who breaks their application down into smaller parts via functions or classes or whatever tools their language provides, I am suggesting that the overall approach is still valid and in fact a great idea even when writing SQL!
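As a rough sketch of what "isolated pieces" can look like, here are three of the four pieces written as stand-alone queries, each of which can be run and checked on its own. The original gives no schema, so the tables and columns used here (Employees, Sales, SalaryHistory), the sample rows, and the :start / :end parameters standing in for @Year are all made up for illustration. It runs on SQLite through Python's sqlite3 module rather than on SQL Server, purely so that it is self-contained.

```python
import sqlite3

# Hypothetical schema and data; nothing here comes from the original article.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employees (EmployeeID INTEGER PRIMARY KEY, OfficeID INTEGER);
    CREATE TABLE Sales (EmployeeID INTEGER, SaleDate TEXT, Amount INTEGER);
    CREATE TABLE SalaryHistory (EmployeeID INTEGER, EffectiveDate TEXT, Salary INTEGER);

    INSERT INTO Employees VALUES (1, 10), (2, 10), (3, 20);
    INSERT INTO Sales VALUES
        (1, '2006-03-01', 100), (1, '2006-09-15', 250),
        (2, '2006-05-20', 400), (3, '2006-07-04', 75),
        (3, '2007-01-10', 999);  -- outside 2006, must be ignored
    INSERT INTO SalaryHistory VALUES
        (1, '2005-01-01', 50000), (1, '2006-06-01', 55000),
        (2, '2004-01-01', 60000), (2, '2007-02-01', 65000),
        (3, '2006-01-01', 45000);
""")

year = 2006
params = {"start": f"{year}-01-01", "end": f"{year + 1}-01-01"}

# Piece 1: one row per office, with the office's total sales for the year.
office_sales = conn.execute("""
    SELECT e.OfficeID, SUM(s.Amount) AS TotalSales
    FROM Sales s
    JOIN Employees e ON e.EmployeeID = s.EmployeeID
    WHERE s.SaleDate >= :start AND s.SaleDate < :end
    GROUP BY e.OfficeID
    ORDER BY e.OfficeID
""", params).fetchall()

# Piece 2: one row per employee, with the salary in effect on the last day
# of the year (the latest history row dated before the next year starts).
salary_at_year_end = conn.execute("""
    SELECT h.EmployeeID, h.Salary
    FROM SalaryHistory h
    WHERE h.EffectiveDate = (SELECT MAX(h2.EffectiveDate)
                             FROM SalaryHistory h2
                             WHERE h2.EmployeeID = h.EmployeeID
                               AND h2.EffectiveDate < :end)
    ORDER BY h.EmployeeID
""", params).fetchall()

# Piece 4: one row per office, with its top salesperson and their sales.
# Note: a tie for first place would return more than one row per office.
top_salesperson = conn.execute("""
    WITH EmpSales AS (
        SELECT e.OfficeID, s.EmployeeID, SUM(s.Amount) AS TotalSales
        FROM Sales s
        JOIN Employees e ON e.EmployeeID = s.EmployeeID
        WHERE s.SaleDate >= :start AND s.SaleDate < :end
        GROUP BY e.OfficeID, s.EmployeeID
    )
    SELECT es.OfficeID, es.EmployeeID, es.TotalSales
    FROM EmpSales es
    WHERE es.TotalSales = (SELECT MAX(x.TotalSales)
                           FROM EmpSales x
                           WHERE x.OfficeID = es.OfficeID)
    ORDER BY es.OfficeID
""", params).fetchall()

print(office_sales)
print(salary_at_year_end)
print(top_salesperson)
```

Each query returns only key columns plus the value being calculated, so each result can be verified against the raw rows by hand before anything is joined together. The bonus-total piece would follow the same pattern as piece 1, grouped by employee instead of office.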

In fact, when writing a SELECT that requires multiple non-related transactional tables this is really the only way to go about solving this problem, since each one must be fully grouped and summarized and ready to join on matching key columns before we can begin to even think about combining the results. In this case, it is only at the very end, when all of our individual SELECTs are grouped by Office or Employee, that we join them together as derived tables.
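To see why each transactional table has to be summarized before it is joined, here is a small sketch that compares the "join everything, then group" query with the "group each piece, then join" query. The Sales and Bonuses tables and their rows are hypothetical, the year filter is left out to keep it short, and it again uses SQLite through Python's sqlite3 module only so that it can be run as-is.

```python
import sqlite3

# Hypothetical tables: one employee with 3 sales and 2 bonuses.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Sales (EmployeeID INTEGER, Amount INTEGER);
    CREATE TABLE Bonuses (EmployeeID INTEGER, Amount INTEGER);
    INSERT INTO Sales VALUES (1, 100), (1, 200), (1, 300);  -- real total: 600
    INSERT INTO Bonuses VALUES (1, 10), (1, 20);            -- real total: 30
""")

# "Dive right in": join the two transactional tables, then group.
# Every sale row pairs with every bonus row (3 x 2 = 6 rows), so each
# sale is counted twice and each bonus three times.
naive = conn.execute("""
    SELECT s.EmployeeID, SUM(s.Amount), SUM(b.Amount)
    FROM Sales s
    JOIN Bonuses b ON b.EmployeeID = s.EmployeeID
    GROUP BY s.EmployeeID
""").fetchall()

# "In steps": summarize each table to one row per employee first,
# then join the two summaries on the key column.
stepwise = conn.execute("""
    SELECT st.EmployeeID, st.TotalSales, bt.TotalBonus
    FROM (SELECT EmployeeID, SUM(Amount) AS TotalSales
          FROM Sales GROUP BY EmployeeID) st
    JOIN (SELECT EmployeeID, SUM(Amount) AS TotalBonus
          FROM Bonuses GROUP BY EmployeeID) bt
      ON bt.EmployeeID = st.EmployeeID
""").fetchall()

print(naive)     # inflated totals
print(stepwise)  # correct totals
```

The naive query reports 1200 in sales and 90 in bonuses instead of 600 and 30. Adding DISTINCT or more GROUP BY columns does not reliably repair this; summarizing each table down to its key first does.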

In addition, the "step-based" approach involves understanding that things like formatting dates, deciding on how to output a name (first/last or last/first, etc), or sorting are irrelevant to the larger problem. In a complicated SELECT with lots of calculations or point-in-time reporting, if you can write a SELECT that returns 1 row per employee (determined by the employee's primary key column, let's say EmployeeID), that is all you need; if you know that the Employee table has first name, last name, hire date, and a simple relationship to their Department, then don't worry about any of that until the very last step! Just focus on returning a reference to the entity (EmployeeID) and calculating the results or values that you are trying to return per entity (total sales, salary, bonus), and only when everything is accurate and correct should you dress things up with the other attributes of the entity which are trivial to obtain (employee name, hire date) through simple joins.

Putting it all Together

In the end, it really does resemble procedural programming quite a bit in that each of these little, self-contained parts, all of which are responsible for doing their job accurately and efficiently, are much like functions or classes. And our primary SELECT is like the main program that calls each of them and in the end puts them all together:

select OfficeSales.OfficeID,
       OfficeSales.TotalSales as OfficeSales,
       TopSalesPerson.TotalSales as EmployeeSales
       -- plus the salary, bonus, name and hire date columns
from
    ( .... ) OfficeSales
inner join
    ( .... ) TopSalesPerson on OfficeSales.OfficeID = TopSalesPerson.OfficeID
inner join
    ( .... ) EmpSalaries on TopSalesPerson.EmployeeID = EmpSalaries.EmployeeID
inner join
    ( ... ) EmpBonus on TopSalesPerson.EmployeeID = EmpBonus.EmployeeID
inner join
    Employees on TopSalesPerson.EmployeeID = Employees.EmployeeID
inner join
    Offices on OfficeSales.OfficeID = Offices.OfficeID

When all the code is in place, this will probably be a very large, complicated SELECT. But looking at it this way, doesn't it look pretty simple? And each of those derived tables, on its own, will also be quite simple. That's the approach we want to take!

(note: In addition to using derived tables, you can use Common Table Expressions to facilitate this approach, since they work essentially the same way but are often easier to read and incorporate into your complicated SELECT statements. Views and parameterized User Defined Functions can be useful as well. The same concepts still apply -- divide and conquer!)

Only now, at the very end, do we worry if some of those joins should become LEFT OUTER JOINs, since maybe some employees might not have a bonus for a given year, and so on. Getting the employee's Name and HireDate and the name of each Office is done here, at the very end, where it is very easy and clear since we have just focused on returning the key columns for both of those entities in our derived table results.
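Here is a minimal sketch of that last step, using Common Table Expressions for the pieces: the same two summaries are combined once with an inner join and once with a LEFT OUTER JOIN, and the employee's name and hire date are attached only in the final SELECT. The tables, columns and rows are hypothetical, the year filter is omitted for brevity, and it runs on SQLite through Python's sqlite3 module rather than SQL Server.

```python
import sqlite3

# Hypothetical data: Bob has sales but no bonus rows at all.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employees (EmployeeID INTEGER PRIMARY KEY, Name TEXT, HireDate TEXT);
    CREATE TABLE Sales (EmployeeID INTEGER, Amount INTEGER);
    CREATE TABLE Bonuses (EmployeeID INTEGER, Amount INTEGER);
    INSERT INTO Employees VALUES (1, 'Ann', '2001-04-02'), (2, 'Bob', '2005-11-14');
    INSERT INTO Sales VALUES (1, 100), (1, 200), (2, 50);
    INSERT INTO Bonuses VALUES (1, 25);
""")

# The already-tested pieces, each returning one row per EmployeeID.
PIECES = """
WITH EmpSales AS (
    SELECT EmployeeID, SUM(Amount) AS TotalSales FROM Sales GROUP BY EmployeeID
),
EmpBonus AS (
    SELECT EmployeeID, SUM(Amount) AS TotalBonus FROM Bonuses GROUP BY EmployeeID
)
"""

# Inner join: an employee with no bonus silently drops out of the result.
inner_rows = conn.execute(PIECES + """
SELECT e.Name, es.TotalSales, eb.TotalBonus
FROM EmpSales es
JOIN EmpBonus eb ON eb.EmployeeID = es.EmployeeID
JOIN Employees e ON e.EmployeeID = es.EmployeeID
ORDER BY es.EmployeeID
""").fetchall()

# Left outer join: every employee with sales is kept, and a missing bonus
# is reported as 0. Name and HireDate are added here, at the very end.
outer_rows = conn.execute(PIECES + """
SELECT e.Name, e.HireDate, es.TotalSales, COALESCE(eb.TotalBonus, 0) AS TotalBonus
FROM EmpSales es
LEFT OUTER JOIN EmpBonus eb ON eb.EmployeeID = es.EmployeeID
JOIN Employees e ON e.EmployeeID = es.EmployeeID
ORDER BY es.EmployeeID
""").fetchall()

print(inner_rows)
print(outer_rows)
```

Because both pieces are already one row per EmployeeID, switching between the two join types changes only which employees appear, never the totals themselves.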

Think again!

So, the next time you dive right in and start joining and selecting because you know that a "set-based master" doesn't worry about breaking down the details, consider instead becoming a "step-based set-based programmer", and break down your large problem into smaller, easily solvable steps. Even in T-SQL, this is the way to go, and it will make your life easier, your code simpler, and often more efficient as well. Don't completely disregard your past experience as you become a relational database programmer; learn how to combine the best of both worlds.

posted on 2007-09-28 12:42 季阳