Externel sort
#1
Hi,

I am looking for implementation of some externel sorts (no in memory, but on  disk)
like POLYPHASE MERGE.

I would like to understand  the algorithm.
Thanks.
Reply
#2
Caylus,

This is the first time I ever heard or thought about external sorting as I never had to deal with so much data that it wouldn't fit into internal memory. This is a great subject to do research in, though I would never have any use for it (not at the moment anyway).

I did a little research on "Polyphase Merge Sort" through Google and found a ton of information on it. As with most things in programming, my head started trying to fully understand the concept and then devise a way to implement the algorithm. I had to stop myself from diving too deep into the subject as I would get to the point that I would drop my current projects and start work on that. So I gathered some links for you, and I hope they help you realize your next project.

Here is a lecture written by Russell C. Bjork from the Department of Mathematics and Computer Science at Gordon College in Wenham Massachusetts (USA). The lecture was so well written and very easy to understand, that I kept forgetting it was college level information. Mr. Bjork, if you happened to read this post, great job! The lecture is located at: CS321 Lecture: External Sorting (Date: 3/21/88 - revised 11/22/99).

Mr. Bjork has several other great lectures which are a must read, and they are located at: More lecture notes from Russell C. Bjork.

There is a whole book on sorting algorithms that might be of interest: Art of Computer Programming, Volume 3: Sorting and Searching (2nd Edition) Hardcover – May 4, 1998. That's 800 pages worth of information on sorting internal and external.


Thanks,
Walter Whitman
The Joyful Programmer
Dedicated to empowering computer programming hobbyists, tinkerers, amateurs, and enthusiasts.
profile for Walter Whitman at Stack Overflow, Q&A for professional and enthusiast programmers


Reply
#3
Working with data files is what I do all the time, and sometimes I have to work with MASSIVE file sizes, so the concept of external sorting is second nature for me now.  Wink


For instance, let's say that I want to sort a 20 GB Database.  My PC is ran under a 32-bit environment.  The largest amount of memory I could conceivably sort at a time would be about 1GB.  How the heck could I ever sort it??

EASY!

First, read in 1 GB and sort that and save it as a file.  FILE001.bin
Then read in the next GB and sort that and save it.  FILE002.bin
Continue until you read the whole original file.  You'll now have 20+ 1GB files on your drive.

Open those files and only read in a single record from each.  Look for the first one.  Write it to disk.
Read the next entry from that existing file, compare and repeat.
Do the same until you've merged all those files into one.
Delete the temp 1GB files.
Delete the original (if wanted).
Rename the sorted, merged file to the original filename.  (If wanted.)

You've just externally sorted a massive data structure!

It's basically that simple a process.  Wink
Reply
#4
Steve,

What your talking about, sounds just like the Polyphase Merge Sort algorithm. From what I was reading about it, the algorithm does seem quite simple to use.

Of course, all we need now is for someone to write a Polyphase Merge Sort demo in QB64.


Thanks,
Walter Whitman
The Joyful Programmer
Dedicated to empowering computer programming hobbyists, tinkerers, amateurs, and enthusiasts.
profile for Walter Whitman at Stack Overflow, Q&A for professional and enthusiast programmers


Reply
#5
This is a technique I use for merging multiple unsorted files. If you KNOW they are sorted, the MergeSort() can safely be removed.

Code:
basename$ = "deleteme"
MaxFiles& = 3 ' * 1 less than the actual number of files being merged. 0 counts as 1
XtLen& = NDigits&(MaxFiles&)
RangeMaxNum& = 131071: NumDigitsRangeMaxNum& = NDigits&(RangeMaxNum&)
FOR pass& = 0 TO MaxFiles&
    p$ = LTRIM$(STR$(pass&))
    WHILE LEN(p$) < XtLen&
        p$ = "0" + p$
    WEND
    testout% = FREEFILE
    OPEN basename$ + p$ FOR OUTPUT AS testout%
    numberout& = 0
    DO
        IF RND <= 1 / 2 THEN
            numberout& = numberout& + 1
        END IF
        p$ = LTRIM$(STR$(numberout&))
        WHILE LEN(p$) < NumDigitsRangeMaxNum&
            p$ = "0" + p$
        WEND
        PRINT #testout%, p$
    LOOP UNTIL numberout& > RangeMaxNum& - 1
    CLOSE testout%
NEXT
mwrite& = 0
swrite& = 0
MergeCount& = 0
MaxMerges& = 0: MaxMergesNplaces& = NPlaces&(MaxMerges&) '* left for FileSplit(), coming soon to code near you.
'* Polyphase Merge on sorted files.
DO

    '* reset these each merge phase
    m$ = "" '* assume empty primary file (aka master file)
    s$ = "" '* assume empty secondary file (aka slave file)
    '***************

    f$ = "deleteme"
    IF MergeCount& > 0 THEN
        '* outputfile now becomes the new input file
        '* and the next sequential deletemeX the next chunk read in
        master$ = "outputfile"
        '* master% = FREEFILE: OPEN master$ FOR INPUT AS master%

        ext1$ = LTRIM$(STR$(MergeCount&))
        ext1$ = STRING$(MaxMergesNplaces& - LEN(ext0$), "0") + ext1$
        slave$ = "deleteme" + ext1$
        '* slaver% = FREEFILE: OPEN slave$ FOR INPUT AS slaver%

        outf$ = "outputfilex"
        outpfile% = FREEFILE: OPEN outf$ FOR OUTPUT AS outpfile%

    ELSE
        '* this is the first merge, so "deletemeX" and "deletemeY" are the
        '* first two files to be merged.

        ext0$ = LTRIM$(STR$(MergeCount&))
        ext0$ = STRING$(MaxMergesNplaces& - LEN(ext0$), "0") + ext0$
        '* master% = FREEFILE: OPEN "deleteme" + ext0$ FOR INPUT AS master%
        master$ = "deleteme" + ext0$
        LoadAndSort master$ '* only has to be done on first run. following, it will always be sorted

        ext1$ = LTRIM$(STR$(MergeCount& + 1))
        ext1$ = STRING$(MaxMergesNplaces& - LEN(ext0$), "0") + ext1$
        '* slaver% = FREEFILE: OPEN "deleteme" + ext1$ FOR INPUT AS slaver%
        slave$ = "deleteme" + ext1$

        outpfile% = FREEFILE: OPEN "outputfile" FOR OUTPUT AS outpfile%
    END IF

    master% = FREEFILE
    OPEN master$ FOR INPUT AS master%
    WriteAndRead m$, master%, outpfile%, mwrite&, totalReads&

    LoadAndSort slave$
    slaver% = FREEFILE
    OPEN slave$ FOR INPUT AS slaver%
    WriteAndRead s$, slaver%, outpfile%, swrite&, totalReads&

    PRINT MergeCount&
    DO
        IF s$ > m$ OR s$ = "" THEN
            WriteAndRead m$, master%, outpfile%, mwrite&, totalReads&
        ELSE
            WriteAndRead s$, slaver%, outpfile%, swrite&, totalReads&
        END IF
        LOCATE 1, 1: PRINT USING "################"; totalReads&; mwrite&; swrite&;
    LOOP UNTIL (m$ = "") AND (m$ = "")

    '* PRINT "*******************"
    CLOSE
    KILL slave$ '* clear the storage this file takes.
    IF MergeCount& > 0 THEN
        KILL "outputfile"
        NAME "outputfilex" AS "outputfile" '* the output from these merge stages now becomes the input for subsequent merge stages
        MergeCount& = MergeCount& + 1
    ELSE
        KILL master$
        '* in the first merge, 2 chunks are used to create the output file
        MergeCount& = MergeCount& + 2
    END IF
LOOP UNTIL MergeCount& > MaxFiles&
CLOSE master%
CLOSE slave%
CLOSE outpfile%

LOCATE 2, 1
PRINT "done... now sequence checking...";
Sequence% = FREEFILE
sequencecheck$ = ""
errorcon% = 0
OPEN "outputfile" FOR BINARY AS #Sequence%
xcount& = 0
DO
    LINE INPUT #Sequence%, sequence$
    IF sequence$ < sequencecheck$ THEN
        PRINT sequence$
        PRINT sequencecheck$
        errorcon% = -1
        EXIT DO
    ELSEIF sequence$ > sequencecheck$ THEN
        sequencecheck$ = sequence$
    END IF
    xcount& = xcount& + 1
    LOCATE 5, 1
    PRINT USING "#############"; xcount&
LOOP UNTIL EOF(Sequence%)
CLOSE Sequence%
IF errorcon% THEN
    PRINT "Sequence Error detected..."
ELSE
    PRINT mwrite&, swrite&, totalReads&
END IF
'******************* end of main **************

SUB WriteAndRead (s$, ichannel%, ochannel%, writecount&, readcount&)
IF s$ > "" THEN
    PRINT #ochannel%, s$
    writecount& = writecount& + 1
END IF
IF EOF(ichannel%) THEN
    s$ = ""
ELSE
    readcount& = readcount& + 1
    'LOCATE 1, 1: PRINT USING "#########"; readcount&;
    LINE INPUT #ichannel%, s$
    WHILE LEN(s$) < 5
        s$ = "0" + s$
    WEND
END IF
END SUB

FUNCTION NDigits& (x&)
xt& = ABS(x&)
NDigits& = 0
DO
    m& = xt& MOD 10
    xt& = (xt& - m&) / 10
    NDigits& = NDigits& + 1
LOOP UNTIL xt& < 1
END FUNCTION

SUB SplitFiles (f$, MaxSize&, Nfiles&)
iosf% = FREEFILE
tacclen& = 0
OPEN f$ FOR INPUT AS #iosf%
WHILE NOT EOF(iosf%)
WEND
CLOSE iosf%
END SUB

SUB LoadAndSort (f$)
REDIM array(0 TO 1048575) AS STRING
LoadAndSortIo% = FREEFILE
OPEN f$ FOR INPUT AS #LoadAndSortIo%
finish& = 0
WHILE NOT EOF(LoadAndSortIo%)
    LINE INPUT #LoadAndSortIo%, array(finish&)
    IF NOT EOF(LoadAndSortIo%) THEN
        finish& = finish& + 1
        IF finish& > UBOUND(array) THEN
            REDIM _PRESERVE array(0 TO finish& + (finish& + 10) \ 10)
        END IF
    END IF
WEND
CLOSE LoadAndSortIo%
MergeSort array(), LBOUND(array), finish& - 1, 1
OPEN f$ FOR OUTPUT AS #LoadAndSortIo%
FOR c& = 0 TO finish& - 1
    PRINT #LoadAndSortIo%, array(c&)
NEXT
CLOSE LoadAndSortIo%
ERASE array
END SUB

SUB MergeSort (Array() AS STRING, start&, finish&, order&)
SELECT CASE finish& - start&
    CASE IS > 31
        middle& = start& + (finish& - start&) \ 2
        MergeSort Array(), start&, middle&, order&
        MergeSort Array(), middle& + 1, finish&, order&
        IF order& = 1 THEN
            EfficientMerge Array(), start&, finish&, order&
        ELSE
            MergeRoutine Array(), start&, finish&, order&
        END IF
    CASE IS > 0
        InsertionSort Array(), start&, finish&, order&
END SELECT
END SUB

SUB MergeRoutine (Array() AS STRING, start&, finish&, order&)
length& = finish& - start&
middle& = start& + length& \ 2
DIM temp(0 TO length&) AS STRING
FOR i& = 0 TO length&
    temp(i&) = Array(start& + i&)
NEXT
'* for refactoring purposes,
'* mptr& = 0
'* sptr& = middle& - start& + 1
'* could be omitted from the select case blocks and declared here instead. However, I am leaving them as is
'* so code between SELECT CASE conditional checks can simply be copied for a fully functioning merge.

SELECT CASE order&
    CASE 1
        mptr& = 0
        sptr& = middle& - start& + 1
        FOR i& = 0 TO length&
            IF sptr& <= finish& - start& THEN
                IF mptr& <= middle& - start& THEN
                    IF temp(mptr&) > temp(sptr&) THEN
                        Array(i& + start&) = temp(sptr&)
                        sptr& = sptr& + 1
                    ELSE
                        Array(i& + start&) = temp(mptr&)
                        mptr& = mptr& + 1
                    END IF
                ELSE
                    Array(i& + start&) = temp(sptr&)
                    sptr& = sptr& + 1
                END IF
            ELSE
                Array(i& + start&) = temp(mptr&)
                mptr& = mptr& + 1
            END IF
        NEXT
    CASE ELSE
        mptr& = 0
        sptr& = middle& - start& + 1
        FOR i& = 0 TO length&
            IF sptr& <= finish& - start& THEN
                IF mptr& <= middle& - start& THEN
                    '* i see what you did there -- change from
                    '* temp(mptr&) > temp(sptr&) to temp(sptr&) > temp(mptr&)
                    IF temp(sptr&) > temp(mptr&) THEN
                        Array(i& + start&) = temp(sptr&)
                        sptr& = sptr& + 1
                    ELSE
                        Array(i& + start&) = temp(mptr&)
                        mptr& = mptr& + 1
                    END IF
                ELSE
                    Array(i& + start&) = temp(sptr&)
                    sptr& = sptr& + 1
                END IF
            ELSE
                Array(i& + start&) = temp(mptr&)
                mptr& = mptr& + 1
            END IF
        NEXT
END SELECT
ERASE temp
END SUB

SUB EfficientMerge (righthalf() AS STRING, start&, finish&, order&)
half& = start& + (finish& - start&) \ 2
REDIM lefthalf(start& TO half&) AS STRING '* hold the first half of the array in LeftHalf() -- must be the same type as RightHalf()
FOR LoadLeft& = start& TO half&
    lefthalf(LoadLeft&) = righthalf(LoadLeft&)
NEXT
SELECT CASE order&
    CASE 1
        i& = start&
        j& = half& + 1
        insert& = start&
        DO
            IF i& > half& THEN '* LeftHalf() exhausted
                IF j& > finish& THEN '* RightHalf() exhausted
                    EXIT DO
                ELSE
                    '* stuff remains in right to be inserted, so flush RightHalf()
                    WHILE j& <= finish&
                        righthalf(insert&) = righthalf(j&)
                        j& = j& + 1
                        insert& = insert& + 1
                    WEND
                    EXIT DO
                    '* and exit
                END IF
            ELSE
                IF j& > finish& THEN
                    WHILE i& < LoadLeft&
                        righthalf(insert&) = lefthalf(i&)
                        i& = i& + 1
                        insert& = insert& + 1
                    WEND
                    EXIT DO
                ELSE
                    IF righthalf(j&) < lefthalf(i&) THEN
                        righthalf(insert&) = righthalf(j&)
                        j& = j& + 1
                    ELSE
                        righthalf(insert&) = lefthalf(i&)
                        i& = i& + 1
                    END IF
                    insert& = insert& + 1
                END IF
            END IF
        LOOP
    CASE ELSE
        i& = half&
        j& = finish&
        insert& = finish&
        DO
            IF i& < start& THEN '* LeftHalf() exhausted
                IF j& < half& + 1 THEN '* RightHalf() exhausted
                    EXIT DO
                ELSE
                    '* stuff remains in right to be inserted, so flush RightHalf()
                    WHILE j& >= half&
                        righthalf(insert&) = righthalf(j&)
                        j& = j& - 1
                        insert& = insert& - 1
                    WEND
                    EXIT DO
                    '* and exit
                END IF
            ELSE
                IF j& < half& + 1 THEN
                    WHILE i& >= start&
                        righthalf(insert&) = lefthalf(i&)
                        i& = i& - 1
                        insert& = insert& - 1
                    WEND
                    EXIT DO
                ELSE
                    IF lefthalf(i&) > righthalf(j&) THEN
                        righthalf(insert&) = righthalf(j&)
                        j& = j& - 1
                    ELSE
                        righthalf(insert&) = lefthalf(i&)
                        i& = i& - 1
                    END IF
                    insert& = insert& - 1
                END IF
            END IF
        LOOP
END SELECT
ERASE lefthalf
END SUB

SUB InsertionSort (Array() AS STRING, start&, finish&, order&)
SELECT CASE order&
    CASE 1
        FOR i& = start& + 1 TO finish&
            FOR j& = i& TO start& + 1 STEP -1
                IF Array(j& - 1) > Array(j&) THEN
                    SWAP Array(j&), Array(j& - 1)
                ELSE
                    EXIT FOR
                END IF
            NEXT
        NEXT
    CASE ELSE
        FOR i& = start& + 1 TO finish&
            FOR j& = i& TO start& + 1 STEP -1
                IF Array(j& - 1) < Array(j&) THEN
                    SWAP Array(j&), Array(j& - 1)
                ELSE
                    EXIT FOR
                END IF
            NEXT
        NEXT
END SELECT
END SUB
Reply
#6
To the best of my knowledge, this is the Polyphase Merge, which unfortunately does not appear in any qbxx form I can find otherwise. This implementation retains stability, using an O(NLogN) MergeSort(), with modified merge for ascending order that requires only 1/2 the auxiliary of standard MergeSort(). It was a joy writing this code for others to use freely. I am another one of thejoyfulprogrammer(s). It's a shame so many useful algorithms have not been ported to qbxx. This is an especially important one. As you will see by the execution, it erases temp files, so the space they formerly consumed is freed as soon as the merging is done. I suppose I could have deleted the slave file once it has reached eof(), but this way is fine unless you ABSOLUTELY require the space it took. You COULD conceivably write chunks of data no bigger than the required maximum memory available, but this will be done another time. A SplitFile() utility will be coded eventually.
Reply
#7
I thought the last time we talked, you told me you were out of sorts... or was that a bit out of sorts?

Not bad for a "Newbie."

Pete Big Grin
Reply
#8
I am FULL of sorts of many kinds, including variations on a theme. All wonderfully coded and tested. Including verification methods. I have dedicated MUCH time to researching even obscure methods that can be parallelized (uh, is that a word?), even my efficient version of MergeSort(), which only requires 1/2 the memory to be implemented versus the original version. I had to decipher a LOT of crappy code, test and test some more until I understood fully how the code worked and where it could be improved (in some cases a LOT everywhere). Even developed some from terse, crappy pseudocode. You wouldn't BELIEVE how much time has gone into researching even my own algorithms (PrimeGapSort(), anyone?), verifying correctness and tuning. It's like being a one-man software company. It's done as a labor of love for the art and to help keep others from having to "reinvent the wheel." I am glad if ANYONE out there uses this library.
Reply
#9
I am FULL of sorts of many kinds, including variations on a theme. All wonderfully coded and tested. Including verification methods. I have dedicated MUCH time to researching even obscure methods that can be parallelized (uh, is that a word?), even my efficient version of MergeSort(), which only requires 1/2 the memory to be implemented versus the original version. I had to decipher a LOT of crappy code, test and test some more until I understood fully how the code worked and where it could be improved (in some cases a LOT everywhere). Even developed some from terse, crappy pseudocode. You wouldn't BELIEVE how much time has goe into researching even my own algorithms, verifying correctness and tuning. It's like being a one-man software company. It's done as a labor of love for the art and to help keep others from having to "reinvent the wheel."
Reply