Tuesday, March 21, 2017

What kind of JOIN syntax is this?

I am not often puzzled when looking at TSQL syntax, but this one time I could not figure out the syntax. Syntax looked similar to the example below.
 
SELECT  t.*
      , s.*
FROM    Teacher AS t
LEFT OUTER JOIN dbo.Student AS s
JOIN dbo.Grades AS gr
    ON s.ID = gr.StudentID
       AND  gr.GradeLetter = 'A'
    ON s.TeacherID = t.ID


My first gut reaction was that this code is broken and would not run. To my amazement code ran just fine. Now came the hard part, which was to figure out what the code was doing because I have never seen this syntax before. Since I did not understand what I was looking at I could not BING "weird join syntax" to get an answer. As a developer, I learned long time ago to break down code into smallest possible chunks to get the answer.

After I have figured out the relationship between tables, I was able to understand what query was doing. To be able to read query better it can be rewritten in the following way.
 
SELECT  t.*
      , s.*
FROM    Teacher AS t
LEFT OUTER JOIN
            (
                SELECT  s.*
                      , gr.*
                FROM    dbo.students AS s
                JOIN dbo.Grades AS gr
                    ON s.ID                = gr.StudentID
                       AND  gr.GradeLetter = 'A'
            ) AS s
    ON s.TeacherID = t.ID
This code is much easier to read and to understand. Students filtered by Grades and we are only interested in Students who have "A" grade. Therefore we get all records from Teacher table but only get Students who have good grades. If we would try to rewrite the query in the following matter, we would only get Teachers that have Students with perfect grades as we specified inner join; therefore, it takes precedence over the left outer join.
 
SELECT  t.*
      , s.*
FROM    Teacher AS t
LEFT OUTER JOIN dbo.Student AS s
    ON s.TeacherID         = t.ID
JOIN dbo.Grades AS gr
    ON s.ID                = gr.StudentID
       AND  gr.GradeLetter = 'A'
Now that I knew what code was doing I wanted to know performance because I do not care how the code looks as long as it does not perform fast it is useless. The execution plans of the two queries are almost identical with the only difference being that when I rewrote the original query as sub-query, I got additional operation added to the plan. Physical joins were identical and execution time was almost exact.


I could not believe I learned something new after writing T-SQL for many years. All of the sudden I got another tool in my toolbelt to write better code. Now if I ever need to filter tables that should not effect result I can write with this new syntax. To be honest, I have not taken this to the extreme to see how many JOIN operations I can nest before it becomes unreadable. However, I am sure it would not take many before someone else could not figure out what was going on and would have to break it down piece by piece to see it clear.

Just imagine having to support code that looks like example 1 below instead of example 2. Even though the code in the following case produces same execution plan, it becomes hard to read and therefore hard to maintain, and if you have junior developers who are not strong in SQL, they are more likely to make mistakes.
 
--Example 1:
SELECT  t.FullName AS TeacherName
      , s.FullName AS StudentName
      , gr.Class
FROM    dbo.Teacher AS t
JOIN dbo.Student AS s
JOIN dbo.Grades AS gr
JOIN dbo.Class AS c
JOIN dbo.Room AS r
    ON c.ID = gr.ClassID
    ON r.ClassID = c.ID
    ON s.ID = gr.StudentID
    ON s.TeacherID = t.ID

--Example 2:
SELECT  t.FullName AS TeacherName
      , s.FullName AS StudentName
      , gr.Class
FROM  dbo.Teacher AS t
JOIN dbo.Student AS s
    ON t.ID = s.TeacherID
JOIN dbo.Grades AS gr
    ON s.ID = gr.StudentID
JOIN dbo.Class AS c
    ON c.ID = r.ClassID
JOIN dbo.Room AS r
    ON c.ID = gr.ClassID
P.S. I still don't know if this JOIN syntax has special name, but at least I now know what it does.

15 comments :

  1. Oh Vlad, there are many special names for this steaming pile of ridiculous hyper-optimization and intentional obfuscation that I can think of. The real problem, which you have correctly identified, is the complete antipathy of the author to easy understanding and maintainability.

    ReplyDelete
  2. Randomly, this JOIN syntax causes the TransactSql ScriptDom to parse much more slowly - so if you're doing any sort of AST based analysis to walk dependencies - then this syntax actually makes that a lot more processor intensive as well.

    ReplyDelete
  3. It's called "Logical Join Evaluation Order". You can find explanation and example in "Inside Microsoft SQL Server 2008: T-SQL Querying" on page 408.

    ReplyDelete
  4. Hi! Like you I saw a similar code time ago and I had the same impressions regarding the script was wrong (but it wasn't). After google a little I found that this approach seems to change the execution orden in joins, which sometimes it will help you for performance issues.
    Mauricio

    ReplyDelete
  5. I have inherited a lot of legacy code that is written this way. There are lots of filtering subqueries etc. Fortunately for me there are not a lot of select * in these, so you are able to determine WTH they were trying to do. Cryptic but functional, I do prefer temp tables when possible as not to stress out memory and cache. Thanks for shedding the light, at least I know I am not alone in this discover.

    ReplyDelete
  6. You see it more often in generated queries. I wrote about it in one of the "MVP Deep Dives" books, including a description of how to read it. That is, to find the first ON, and then work backwards from there to find the last (unmatched) JOIN. Everything between the two is the right side of the join (so ident, put parentheses, whatever). Then keep looking backwards until you reach another unmatched JOIN or the FROM. That's the left side of your join. Now find the next ON, and repeat. Works every time, and lets you understand nasty things that are code-generated queries.

    ReplyDelete
  7. When we created the in fixed join notation, we were basing it on some notes from Chris Date in an article in one of the database magazines at the time. He never actually propose the syntax; in fact, neither he nor Dr. Codd ever sent anything into ANSI X3H2 for consideration.

    Our rules were that the ON clauses associate with the nearest join (assuming the particular join could take an on clause; for example, CROSS JOIN does not) and that none of the infixed joins have precedent. If you wanted that you had to use parentheses. Unfortunately, your example is pretty bad; you believe in the mysterious, magical, universal, ever-changing "id", but when you take a course in data modeling will find out that identifiers have to be the identifier of something in particular. There are no generics! You also don't seem to understand that tables are sets, and that sets should be named with either collective or plural nouns. Is there really only one room, only one student, only one teacher? Probably not. One of the horrors that SQL Server people commit is assuming that the table property IDENTITY can ever be a valid relational key. This, of course, is completely false and will make Dr. Codd very, very sick. Essentially, it's how non-RDBMS programmers mimic the old pointer chains which they learned in their previous languages.

    This led to some interesting things. If you are doing nothing but INNER JOINs, then the ON clauses can be all clustered together at the end of the FROM clause. You just trust in the optimizer to rearrange things for you. This is the same thing we did in procedural programming languages with arithmetic expressions.

    The problem is that when you have a nice long list of tables, it's hard to see which piece of logic goes with which table. You just trust in the optimizer to match things up correctly. The bad news is we had to turn one of those inner joins into an outer join, they don't always match up the way you might want them to.

    It continues to get worse. If you do use parens, then the parenthesized expression can be given a correlation name. This changes the scoping rules for references to the tables and hides them. I mention some of these things and some of my books but may be able to get the an article on one of the websites.



    ReplyDelete
  8. In the old days, the optimizer gave up beyond a certain point, so you had to optimize the plan your self. Try this query syntax with option force order, or any join hint, then switch around the join and ON clauses, see how it produces different joins orders.

    ReplyDelete
  9. The joins happen based on the order of the ON statements not the join statements. Student is first joined to grades (first ON statement) and then the results are left joined to teacher (second on statement).

    It could also be written as:

    SELECT t.*
    , s.*
    FROM dbo.Student AS s
    JOIN dbo.Grades AS gr
    ON s.ID = gr.StudentID
    AND gr.GradeLetter = 'A'
    RIGHT OUTER JOIN Teacher AS t
    ON s.TeacherID = t.ID

    Which gives the identical query plan as the original query.

    I'm not aware of any special name for it as it's how join statements work. A leftover from the days when all joins were done in the where clause.

    ReplyDelete
  10. I think this join syntax is called a chiastic join, as you mention if you need to nest a join with another inner table before joining to the outer table you can use this method.
    Itzik Ben-Gan describes this join in the book Inside Microsoft SQL Server 2008 T_SQL Querying: T-SQL Querying

    ReplyDelete
  11. I think this join syntax is called a chiastic relationship and is described by Itzik Ben-Gan in the book Inside Microsoft SQL Server 2008 T-SQL Querying: T-SQL Querying

    ReplyDelete
  12. In other word SQL JOIN and ON statements are not required to be adjacent. Both examples show that. IMHO it's better practice, as Vlad noted, to place the ON next to the JOIN whose table it references.

    ReplyDelete
  13. That's the "syntax" I teach and use (I'm an instructor). It's a consecuencuence of following the model. After joining two tables you can treat the resultset as another table so this style results naturally when we introduce outer joins.
    I would write the query like this to explain it:

    SELECT t.*,
    s.*
    FROM Teacher AS t
    LEFT JOIN (
    /*SELECT s.*,
    gr.* */
    dbo.Student AS s
    INNER JOIN dbo.Grades AS gr
    ON s.ID = gr.StudentID
    AND gr.GradeLetter = 'A'
    )
    ON s.TeacherID = t.ID

    What to me appears weird is that developers cluster join operations by type in queries once an outer join is written. In your example, to obtain the intended results, a developer would usually write it like this:

    SELECT t.*, s.*
    FROM Teacher AS t
    LEFT JOIN dbo.Student AS s
    ON s.TeacherID = t.ID
    LEFT JOIN dbo.Grades AS gr
    ON s.ID = gr.StudentID
    AND gr.GradeLetter = 'A'

    Asuming that there's a subjects table and you want to present the subject in the final resultset, it should be obtained with an inner join since there's no grade without a subject. But developers usually would just write another left join.

    ReplyDelete
  14. I would call it a form of structured or nested query.
    https://docs.microsoft.com/en-us/sql/t-sql/queries/from-transact-sql shows that a "joined table" can be like a Russian doll of joined tables within joined tables.
    It is not obvious or intuitive, I prefer to expand things like this as common table expressions which help me build up my understanding of the nested constructs that SQL facilitates.

    ReplyDelete