SQL Puzzle

Sebastian Bergmann » 03 December 2007 » in MySQL, PHP » 8 Comments

Dear lazyweb,

I want to mine a code repository for data to map past bugs to sourcecode files.

I have written a small PHP script (the initial version of the script can be found here) to import the relevant data from a Subversion repository into the following tables of a relational database:
bugs            changes         paths
--------        --------        -------
bug_id          path_id   <-->  path_id
revision  <-->  revision        path
What I need now is two queries to ask the database for
  • paths that are most commonly changed during bugfix commits and
  • paths that are commonly changed together
Your suggestions are most welcome in the comments to this posting :-)
Defined tags for this entry: , , ,

Trackback specific URI for this entry

8 Comments to "SQL Puzzle"

Display comments as (Linear | Threaded)
  1. Bill Karwin
    03/12/2007 at 21:16 Permalink
    Paths that are most commonly changed during bugfix commits:

    SELECT path, COUNT(*) AS cnt
    FROM paths
    JOIN changes USING (path_id)
    JOIN bugs USING (revision)
    GROUP BY revision
    ORDER BY cnt DESC

    Paths that are commonly changed together:

    SELECT p1.path AS path1, p2.path AS path2, COUNT(*) AS cnt
    FROM paths AS p1
    JOIN changes AS c1 ON (p1.path_id = c1.path_id)
    JOIN changes AS c2 ON (c1.revision = c2.revision AND c1.path_id < c2.path_id)
    JOIN paths AS p2 ON (p2.path_id = c2.path_id)
    GROUP BY path1, path2
    ORDER BY cnt DESC

    Reply

  2. Sebastian Bergmann
    04/12/2007 at 09:01 Permalink
    Sorry, I was not clear enough: for each path I would like a list of other paths that are commonly changed at the same time.

    Reply

  3. Chuck Burgess
    03/12/2007 at 21:34 Permalink
    If I'm following you, I think these will do it (though I'm writing without testing):

    -- paths that are most commonly changed
    -- during bugfix commits
    select path, count(*)
    from paths
    group by path
    order by 2 desc
    ;



    -- paths that are commonly changed together
    select p1.path, p2.path, count(*)
    from changes c1
    join changes c2 on c1.revision = c2.revision
    and c1.path_id != c2.path_id
    join paths p1 on c1.path_id = p1.path_id
    join paths p2 on c2.path_id = p2.path_id
    group by p1.path, p2.path
    order by 3 desc
    ;

    Reply

  4. Chuck Burgess
    03/12/2007 at 21:47 Permalink
    Might be interesting to carry that table layout further by also recording the last committer that touched a given file _before_ the current commit...

    Reply

  5. Sebastian Bergmann
    04/12/2007 at 09:07 Permalink
    The author of the previous change is not neccessarily the one responsible for the bug.

    Reply

  6. Chuck Burgess
    04/12/2007 at 19:07 Permalink
    Ah, true... the change tracking here is at the file level, not the line level...

    Reply

  7. Sebastian Bergmann
    04/12/2007 at 19:17 Permalink
    Even if you use "svn blame" on the line level, who says that the directly preceeding committer broke it?

    Reply

  8. Hubert Roksor
    03/12/2007 at 23:42 Permalink
    I didn't notice comments had already been posted when I started writing this. It's nearly identical to Bill's except that it will perform better if your repository is big and you only need the top X of bugs/paths, that's why I post it anyway.


    SELECT tmp.bug_count, p.*
    FROM (
    SELECT c.path_id, COUNT(*) AS bug_count
    FROM changes c
    JOIN bugs b USING (revision)
    GROUP BY c.path_id
    ORDER BY bug_count DESC
    LIMIT 10
    ) AS tmp
    JOIN paths p USING (path_id)

    SELECT cnt, p1.path AS path1, p2.path AS path2
    FROM (
    SELECT c1.path_id AS id1, c2.path_id AS id2, COUNT(*) AS cnt
    FROM changes c1
    JOIN changes c2 ON c2.revision = c1.revision
    AND c2.path_id > c1.path_id
    GROUP BY c1.path_id, c2.path_id
    ORDER BY cnt DESC
    LIMIT 10
    ) AS tmp
    JOIN paths p1 ON p1.path_id = tmp.path_id
    JOIN paths p2 ON p2.path_id = tmp.path_id

    Reply

2 Trackbacks to "SQL Puzzle"

  1. PHPDeveloper.org 05/12/2007 at 14:34
    Sebastian Bergmann has figured out half of the puzzle he was ...
  2. Sebastian Bergmann 06/12/2007 at 12:07
    The script to mine a Subversion repository for bug information that I previously mentioned here and here has found a (most likely temporary) home here.

Add Comment


To prevent automated Bots from commentspamming, please enter the string you see in the image below in the appropriate input box. Your comment will only be submitted if the strings match. Please ensure that your browser supports and accepts cookies, or your comment cannot be verified correctly.
CAPTCHA

Submitted comments will be subject to moderation before being displayed.