Bytestring In Haskell Let Your Big File Fly High
A normal routine when your want to read and process file in haskell is like:
main = do
handle <- openFile "todo.txt" ReadMode
contents <- hGetContents handle
......(do process)
The hGetContents read the contents of the file, and store the result as a haskell list which type is [Char]; As you already know, list evaluating in haskell is very lazy(^_^ if you don’t use a element, the element will not be evaluated), so if your print one character a time in contents, the list will be evaluated n times, which n equals to the number of characters in the file. Use your toes to think, this is such a slow process when the file is bigger.
1. Data.ByteString : the strict guy
The Data.ByteString eliminate the laziness completely, it just stands a series of bytes in an array. A infinite strict bytestring? No such thing exists!!! It means you load all the contents of file into memory, when the file is big, this is always a problem.
2. Data.ByteString.Lazy : the new kind of laziness
The lazy version ByteString read 64KB contents into memory each time. This 64KB ByteString is really a strict ByteString. But the contents of 64KB+ will not be evaluated until you really use them. It just like some reading buffer.
3. Compare Normal File Process and Lazy File Process
First, let's see how to reverse all lines in a file, using the normal method:
import System.Environment
import System.IO
import Data.List
main = do
(fileName1:fileName2:_) <- getArgs
revFileLines fileName1 fileName2
revFileLines :: FilePath -> FilePath -> IO ()
revFileLines source dest = do
contents <- readFile source
writeFile dest $ unlines $ map reverse $ lines contents
Compile and run the program, which eat a 1000000 lines file:
$ghc -o NormalFileRead NormalFileRead.hs
$time ./NormalFileRead input.txt input_rev.txt
In my mechine, the time consumption:
real 0m3.947s
user 0m3.644s
sys 0m0.300s
It takes almost 4 seconds to read the file, reverse each line and write new lines out.
What about the lazy version? Check it out:
import System.Environment
import qualified Data.ByteString.Lazy.Char8 as C
main = do
(fileName1:fileName2:_) <- getArgs
revFileLines fileName1 fileName2
revFileLines :: FilePath -> FilePath -> IO ()
revFileLines source dest = do
contents <- C.readFile source
let lines = C.lines contents
revLines = map C.reverse lines
newstr = C.unlines revLines
C.writeFile dest newstr
The scketch of two program is same, but time consumption of the lazy version:
real 0m1.454s
user 0m1.172s
sys 0m0.281s
Just 1.45 seconds!!!
Do the same testing with 5000000 lines file:
NormalFileRead:
real 0m24.005s
user 0m21.856s
sys 0m1.693s
ByteStringFileRead:
real 0m8.849s
user 0m7.133s
sys 0m1.321s
Always fast!!!