  1. #1 MrSmite

    Looking for a fast array diff function

    If you have two byte arrays: array_a and array_b, what's the fastest way to create an array_c which is a diff of the two sources? Meaning that array_c would contain all the bytes in array_b that do not appear in array_a...

    The only way I know is to read each file into an array of x bytes and then use memcmp to see if they are equal. If not, iterate through each element to find the discrepancies. This is unfortunately slow for large files.
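
    Roughly the approach I described, as a C++ sketch (the paths and block size are placeholders; it only compares bytes at the same offsets, so it assumes both files are the same size):

    Code:
    #include <cstdio>
    #include <cstdint>
    #include <cstring>
    #include <cstddef>
    #include <vector>

    // Compare two files block by block and print the absolute offsets of
    // differing bytes. This is a plain positional comparison, not a real diff:
    // it assumes both files are the same length and nothing is ever inserted
    // or removed, only changed in place.
    void compare_files(const char* path_a, const char* path_b,
                       std::size_t block = 1 << 20)
    {
        std::FILE* fa = std::fopen(path_a, "rb");
        std::FILE* fb = std::fopen(path_b, "rb");
        if (!fa || !fb) {
            if (fa) std::fclose(fa);
            if (fb) std::fclose(fb);
            return;
        }

        std::vector<unsigned char> buf_a(block), buf_b(block);
        std::uint64_t offset = 0;

        for (;;)
        {
            std::size_t na = std::fread(buf_a.data(), 1, block, fa);
            std::size_t nb = std::fread(buf_b.data(), 1, block, fb);
            std::size_t n = na < nb ? na : nb;
            if (n == 0) break;

            // fast path: memcmp skips identical blocks cheaply
            if (std::memcmp(buf_a.data(), buf_b.data(), n) != 0)
            {
                // slow path: pinpoint the differing bytes in this block
                for (std::size_t i = 0; i < n; ++i)
                    if (buf_a[i] != buf_b[i])
                        std::printf("difference at offset %llu\n",
                                    (unsigned long long)(offset + i));
            }
            offset += n;
        }

        std::fclose(fa);
        std::fclose(fb);
    }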

    What I'm trying to replicate is the "Synchronize and Compare" feature in WinHex. When you open two files and click this option, it will very quickly (almost instantly if the files are < 1 GB) highlight the bytes that are different.

    Thanks.

  2. #2 ComicSansMS (Trier, Germany)
    First of all, this problem is an instance of the Longest Common Subsequence (LCS) problem, which is of polynomial complexity for your case of two arrays. This means it *will* get slow for large files, no matter what you do. Using memcmp does not give you the real diff here, as it just tests for equality.
    The Wikipedia entry on the LCS problem has a discussion of how to solve it efficiently. The algorithm is not too difficult, but it's certainly no one-liner. Since it uses dynamic programming, you can compute only parts of the solution for the initial display to get better response times with large files.
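
    To give you an idea of the shape of it, the table-filling step of the dynamic-programming solution looks roughly like this in C++ (a sketch of the textbook version, only practical for small inputs; the names are mine, not from any library):

    Code:
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Classic LCS length table: cell (i + 1, j + 1) holds the length of the
    // longest common subsequence of a[0..i] and b[0..j]. Needs
    // (m + 1) * (n + 1) cells, so it is only feasible for small inputs.
    std::vector<std::vector<std::size_t>>
    lcs_table(const std::vector<unsigned char>& a,
              const std::vector<unsigned char>& b)
    {
        std::vector<std::vector<std::size_t>> t(
            a.size() + 1, std::vector<std::size_t>(b.size() + 1, 0));

        for (std::size_t i = 0; i < a.size(); ++i)
            for (std::size_t j = 0; j < b.size(); ++j)
                t[i + 1][j + 1] = (a[i] == b[j])
                                      ? t[i][j] + 1
                                      : std::max(t[i][j + 1], t[i + 1][j]);
        return t;
    }

    The memory use grows with the product of the two input sizes, which is also why the naïve version becomes useless for large files.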

    For your special case of a binary diff à la WinHex, there is the open-source xdelta library, which implements a very efficient, hash-based optimization of the LCS approach, tailored to exactly this problem.

  3. #3 MrSmite
    @ComicSansMS

    Thanks for the info. I knew it had to be more complex than simple nested FOR loops but I couldn't remember exactly what it was called. That Wikipedia link you provided was exactly what I was looking for. It will be interesting to try writing my own implementation.

    I played around with it a bit and was able to write a rudimentary implementation in VB6 for text strings...

  4. #4 MrSmite
    So I found a few implementations of the LCS in C++, VBA and Python, and after looking at them it doesn't seem like they'd be any faster than a byte-to-byte comparison on large files.

    Having to build the LCS table from the two arrays and then compare the arrays against the table again requires so many loops that you might as well just go from 0..N once and compare each byte.
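
    For reference, the backtrace step I mean looks roughly like this in C++ (my own sketch; it needs the full (m + 1) x (n + 1) length table from the fill step, which is fine for small arrays and hopeless for hundreds of MB):

    Code:
    #include <cstdio>
    #include <cstddef>
    #include <vector>

    // Walk a filled LCS length table backwards and print what has to change to
    // turn array a into array b. t must be the (a.size() + 1) x (b.size() + 1)
    // table from the fill step. Because it walks from the end, the insert and
    // delete operations come out back to front.
    void backtrace_diff(const std::vector<std::vector<std::size_t>>& t,
                        const std::vector<unsigned char>& a,
                        const std::vector<unsigned char>& b)
    {
        std::size_t i = a.size(), j = b.size();
        while (i > 0 || j > 0)
        {
            if (i > 0 && j > 0 && a[i - 1] == b[j - 1])
            {
                --i; --j;                       // byte is common, keep it
            }
            else if (j > 0 && (i == 0 || t[i][j - 1] >= t[i - 1][j]))
            {
                std::printf("insert b[%zu] = 0x%02X\n", j - 1, (unsigned)b[j - 1]);
                --j;
            }
            else
            {
                std::printf("delete a[%zu] = 0x%02X\n", i - 1, (unsigned)a[i - 1]);
                --i;
            }
        }
    }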

    It works well on small arrays but once you get into large (hundreds of MBs) arrays the performance seems questionable.

  5. #5 ComicSansMS
    Quote Originally Posted by MrSmite
    So I found a few implementations of the LCS in C++, VBA and Python, and after looking at them it doesn't seem like they'd be any faster than a byte-to-byte comparison on large files.
    Except that byte-for-byte comparison is not enough to get common subsequences. A diff is supposed to detect things like

    ABCD <-> ABFGHCD   (a real diff sees that A, B, C and D are common and that F, G, H were inserted in the middle)

    which can't be done in a single iteration. With one iteration, the best you can do is:
    ABCD <-> ABFGHCD   (only the common prefix A, B is matched; everything from the first mismatch onward counts as different)

    which is simple prefix-detection, a problem that can be solved in linear time and has nothing to do with diff.
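
    Prefix detection really is that trivial, which is exactly why it tells you so little (a C++ sketch; the function name is mine):

    Code:
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Length of the common prefix of two byte arrays: one linear pass, nothing more.
    std::size_t common_prefix(const std::vector<unsigned char>& a,
                              const std::vector<unsigned char>& b)
    {
        const std::size_t n = std::min(a.size(), b.size());
        auto ends = std::mismatch(a.begin(), a.begin() + n, b.begin());
        return static_cast<std::size_t>(ends.first - a.begin());
    }

    Everything after the first mismatching byte simply counts as different, no matter how much of it is actually shared between the two arrays.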

    As I've said before, if your main concern is performance, you can calculate only parts of the solution for better response time.

  6. #6 MrSmite
    My ultimate goal is to make a binary diff tool that will take File_A and File_B, compare them and create File_C which contains only the bytes necessary to turn File_A into File_B.

    This can be done with byte-by-byte comparison in a single loop but the two files are 3.5 GB each so this takes hours to complete. I thought using the LCS to diff these would help but I just don't see how it would benefit (speed wise) since there are so many iterations involved to create the table, do the backtrace and then do the diff.
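
    Just to be clear about what I mean by a single loop, here is a rough C++ sketch of the kind of patch record I had in mind (names are placeholders; it assumes both files are the same size and that bytes only ever change in place, never get inserted or removed):

    Code:
    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // One record per run of changed bytes: where the run starts and the
    // replacement bytes from File_B. Only valid if both files are the same
    // size and data is only ever overwritten in place, never inserted or removed.
    struct PatchRecord
    {
        std::uint64_t offset;
        std::vector<unsigned char> new_bytes;
    };

    std::vector<PatchRecord> make_patch(const std::vector<unsigned char>& a,
                                        const std::vector<unsigned char>& b)
    {
        std::vector<PatchRecord> patch;
        const std::size_t n = std::min(a.size(), b.size());
        std::size_t i = 0;
        while (i < n)
        {
            if (a[i] == b[i]) { ++i; continue; }

            PatchRecord rec;
            rec.offset = i;
            while (i < n && a[i] != b[i])       // collect the whole changed run
                rec.new_bytes.push_back(b[i++]);
            patch.push_back(rec);               // apply by overwriting File_A at rec.offset
        }
        return patch;
    }

    I realize that assumption is exactly what a real diff doesn't have to make.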

    My reason for this project is partly curiosity and partly because a few games I play have stopped releasing standalone patches and everything is updated on the fly. I wanted to be able to create my own patches for archival purposes by diffing the binaries before and after the patch. That way if I need to I can reinstall the game and then apply the smaller patches rather than downloading all the data again.

    The last time I did a reinstall it took almost 12 hours to download the patch data to get current with the servers.

  7. #7 ComicSansMS
    Quote Originally Posted by MrSmite
    My ultimate goal is to make a binary diff tool that will take File_A and File_B, compare them and create File_C which contains only the bytes necessary to turn File_A into File_B.

    This can be done with byte-by-byte comparison in a single loop but the two files are 3.5 GB each so this takes hours to complete. I thought using the LCS to diff these would help but I just don't see how it would benefit (speed wise) since there are so many iterations involved to create the table, do the backtrace and then do the diff.
    You seem to be confused about what you are trying to achieve. Again: A full diff can *not* be done in a single loop with byte-by-byte comparison. This is a fundamental limitation in the complexity of the problem that you cannot cheat around.
    Finding the diff is an instance of the LCS problem, which is inherently of polynomial (essentially quadratic) complexity, i.e. nested loops. The dynamic-programming approach described on the Wikipedia page is the standard way to solve it; there is no shortcut that gets you a real diff in a single linear pass.
    You do indeed need all those many loops for building the table to solve the problem. If your algorithm only does the single loop, your solution will not be a diff, but something else, probably a simple prefix (see my example from the last post).

    That being said, a single loop over 3.5 GB of data should *never* take hours unless your machine is from the early nineties. You probably messed up somewhere in that code. If you want, paste your code and we can do a review.

    If you plan on using this in production code, I'd strongly recommend using a library like xdelta. Their implementation is certainly way better than anything either of us could come up with. So unless you are just trying to understand the algorithm, don't reinvent the wheel.

  8. #8 MrSmite
    Ok, here's a function I wrote a while ago in VB 6 which was painfully slow:

    Code:
    Public Function lcsArr_byte(ByRef a1() As Byte, ByRef a2() As Byte) As Long()

     ' Builds the LCS length table for two arrays of Bytes.
     ' Cell (i + 1, j + 1) holds the LCS length of a1(0..i) and a2(0..j),
     ' so the table needs one extra row and column beyond the input sizes.
     ' Lengths are stored as Long because an LCS can easily be longer than 255.

     Dim res() As Long
     Dim i As Long
     Dim j As Long
     Dim lSizeA As Long
     Dim lSizeB As Long

     lSizeA = UBound(a1)
     lSizeB = UBound(a2)

     ReDim res(0 To lSizeA + 1, 0 To lSizeB + 1)

     For i = 0 To lSizeA

        For j = 0 To lSizeB

            If a1(i) = a2(j) Then
                res(i + 1, j + 1) = res(i, j) + 1
            Else
                res(i + 1, j + 1) = MaxL(res(i, j + 1), res(i + 1, j))
            End If

        Next ' j

     Next ' i

     lcsArr_byte = res

    End Function

    Public Function MaxL(ByVal a As Long, ByVal b As Long) As Long

     ' Returns the larger of a pair of Longs

     If a >= b Then
        MaxL = a
     Else
        MaxL = b
     End If

    End Function
    So if you read 1 MB (1,048,576 bytes) into each buffer, just building the table takes 1,048,576 x 1,048,576 = 1,099,511,627,776 iterations, because the inner loop (j) runs once for every pass of the outer loop (i). Now imagine doing this for every 1 MB of a 3.5 GB file.

    That's just for computing the LCS table, never mind actually determining the diff.
    Last edited by MrSmite; 03-30-2012 at 01:24 AM.

  9. #9 ComicSansMS
    Quote Originally Posted by ComicSansMS
    That being said, a single loop over 3.5 GB of data should *never* take hours unless your machine is from the early nineties.
    ...or if your programming language is. Why are you using VB6 when posting on a C++ forum?

    VB6 is a horrible choice for this kind of problem. The language itself is ancient and the compiler is not very good at creating fast code. Upgrading to a more recent version might already result in a significant performance gain. Furthermore, the language was never designed to do large-scale data processing like this.

    I was asking for your original byte-by-byte comparison code, but since you posted the simple LCS algorithm, we might as well talk about it: what you implemented is the naïve solution, which is great for illustrating how the algorithm works, but very bad for processing large amounts of data. The full table grows with the product of the two input sizes, so for files in the gigabyte range it would not even come close to fitting in RAM; the machine would swap itself to death long before finishing. You already noticed the bane of quadratic runtime yourself. The Wikipedia page has some suggestions for possible optimizations, and Google will turn up a lot more.

    Also, make sure you understand the dynamic nature of the problem: you don't have to solve the whole thing at once. Divide it into manageable chunks of data and use memoization techniques to reuse results between passes. With VB6 you will have to do this anyway, since the VB6 compiler only generates 32-bit binaries, which can address well under 4 GB of memory in practice.
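
    To illustrate the memory point: if all you need at some stage are LCS lengths (for example to decide where to split the problem), two rows of the table are enough. A rough C++ sketch (names are mine):

    Code:
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // LCS length using only two rows of the table instead of the full matrix.
    // Memory drops from (m + 1) * (n + 1) cells to 2 * (n + 1); the runtime
    // stays quadratic.
    std::size_t lcs_length(const std::vector<unsigned char>& a,
                           const std::vector<unsigned char>& b)
    {
        std::vector<std::size_t> prev(b.size() + 1, 0), curr(b.size() + 1, 0);

        for (std::size_t i = 0; i < a.size(); ++i)
        {
            for (std::size_t j = 0; j < b.size(); ++j)
                curr[j + 1] = (a[i] == b[j]) ? prev[j] + 1
                                             : std::max(prev[j + 1], curr[j]);
            std::swap(prev, curr);
        }
        return prev[b.size()];
    }

    You still pay the quadratic runtime, but the memory footprint becomes linear in the second input instead of quadratic in both.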

    If that sounds too complicated, just write a DLL wrapper for a library like XDelta and call that from VB. After all, that's why people write libraries in the first place...
    Last edited by ComicSansMS; 03-30-2012 at 05:06 AM.

  10. #10 MrSmite
    Thanks, I'll have to do some more research.

    The reason I posted the VB6 code here is that I'm in the process of porting it to C++, but since it's extremely slow I figured I'd first see if there was something wrong with it, or whether there's a better approach altogether.
