Data Extraction

Taro includes a few high-level functions that extract data from various document formats.

Text extraction

The Taro.extract method retrieves the metadata and body text of a document using Apache Tika. Formats supported by Tika include MS Office and OpenOffice documents, as well as PDF files.

The function returns a Tuple of a Dict and a String. The Dict contains name/value pairs of metadata from the document, while the String contains its body text.


julia> testfile = joinpath(Pkg.dir(),"Taro","test","WhyJulia.docx");

julia> meta, text = Taro.extract(testfile);

julia> meta["Last-Save-Date"]
"2013-12-28T00:17:00Z"

julia> typeof(text)
UTF8String

julia> text[1:53]
"Why we created Julia\n\nIn short, because we are greedy"

Read Excel files into a DataFrame

The Taro.readxl method reads a rectangular region from an Excel sheet and returns a DataFrame. Its first argument is the path of the Excel file to read. A sheet name (or number) can optionally be supplied; if no sheet information is given, the first sheet (index 0) is read. The final argument is the rectangular region from which data is extracted, specified as an Excel range. As the examples below show, the header keyword controls whether the first row of the range supplies the column names, and nastrings lists string values that should be read as NA.

This function is similar to, and inspired by, the readtable function in DataFrames.
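
To make the range notation concrete, here is a minimal sketch of how a range such as "B2:F10" maps onto row and column bounds. It is not Taro's implementation: the helper names column_number and parse_range are hypothetical, and the code assumes a recent Julia version (1.x), which differs from the older version shown in the transcripts in this section.

```julia
# Hypothetical helper (not part of Taro): convert Excel column letters
# such as "B" or "AA" to a 1-based column number.
function column_number(letters::AbstractString)
    n = 0
    for c in uppercase(letters)
        n = 26n + (c - 'A' + 1)
    end
    return n
end

# Hypothetical helper (not part of Taro): parse an A1-style range such as
# "B2:F10" into inclusive row and column bounds.
function parse_range(ref::AbstractString)
    m = match(r"^([A-Za-z]+)(\d+):([A-Za-z]+)(\d+)$", ref)
    m === nothing && throw(ArgumentError("not an A1-style range: $ref"))
    return (first_row = parse(Int, m.captures[2]),
            last_row  = parse(Int, m.captures[4]),
            first_col = column_number(m.captures[1]),
            last_col  = column_number(m.captures[3]))
end

r = parse_range("B2:F10")
nrows = r.last_row - r.first_row + 1   # 9 sheet rows
ncols = r.last_col - r.first_col + 1   # 5 sheet columns
println("rows: ", nrows, ", columns: ", ncols)
```

The range "B2:F10" therefore covers 9 sheet rows and 5 columns. When the first row is used as the header, 8 data rows remain, which matches the 8×5 DataFrame in the first example below.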


julia> testfile = joinpath(Pkg.dir(),"Taro","test","df-test.xlsx");

julia> Taro.readxl(testfile, "Sheet1", "B2:F10")
8×5 DataFrames.DataFrame
│ Row │ H1  │ H2  │ H3  │ H4  │ H5    │
├─────┼─────┼─────┼─────┼─────┼───────┤
│ 1   │ "a" │ 1.0 │ 1.0 │ 1.0 │ "a a" │
│ 2   │ "b" │ 2.0 │ 2.0 │ 1.0 │ "b b" │
│ 3   │ "c" │ NA  │ 3.0 │ 0.0 │ "c c" │
│ 4   │ "d" │ 4.0 │ NA  │ NA  │ "d d" │
│ 5   │ "e" │ 5.0 │ 5.0 │ 1.0 │ "e e" │
│ 6   │ NA  │ 6.0 │ 6.0 │ 1.0 │ " "   │
│ 7   │ "g" │ 7.0 │ 7.0 │ 1.0 │ "g g" │
│ 8   │ "h" │ 8.0 │ 8.0 │ 1.0 │ "h h" │

julia> Taro.readxl(testfile, "Sheet1", "B3:F10"; header=false)
8×5 DataFrames.DataFrame
│ Row │ x1  │ x2  │ x3  │ x4  │ x5    │
├─────┼─────┼─────┼─────┼─────┼───────┤
│ 1   │ "a" │ 1.0 │ 1.0 │ 1.0 │ "a a" │
│ 2   │ "b" │ 2.0 │ 2.0 │ 1.0 │ "b b" │
│ 3   │ "c" │ NA  │ 3.0 │ 0.0 │ "c c" │
│ 4   │ "d" │ 4.0 │ NA  │ NA  │ "d d" │
│ 5   │ "e" │ 5.0 │ 5.0 │ 1.0 │ "e e" │
│ 6   │ NA  │ 6.0 │ 6.0 │ 1.0 │ " "   │
│ 7   │ "g" │ 7.0 │ 7.0 │ 1.0 │ "g g" │
│ 8   │ "h" │ 8.0 │ 8.0 │ 1.0 │ "h h" │

julia> Taro.readxl(testfile, "Sheet1", "B3:F10"; header=false, nastrings=[" "])
8×5 DataFrames.DataFrame
│ Row │ x1  │ x2  │ x3  │ x4  │ x5    │
├─────┼─────┼─────┼─────┼─────┼───────┤
│ 1   │ "a" │ 1.0 │ 1.0 │ 1.0 │ "a a" │
│ 2   │ "b" │ 2.0 │ 2.0 │ 1.0 │ "b b" │
│ 3   │ "c" │ NA  │ 3.0 │ 0.0 │ "c c" │
│ 4   │ "d" │ 4.0 │ NA  │ NA  │ "d d" │
│ 5   │ "e" │ 5.0 │ 5.0 │ 1.0 │ "e e" │
│ 6   │ NA  │ 6.0 │ 6.0 │ 1.0 │ NA    │
│ 7   │ "g" │ 7.0 │ 7.0 │ 1.0 │ "g g" │
│ 8   │ "h" │ 8.0 │ 8.0 │ 1.0 │ "h h" │
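
The effect of nastrings in the last example can be sketched in plain Julia. This is an illustration only, not Taro's code: apply_nastrings is a hypothetical helper, and it uses missing (the current Julia convention) where the transcripts above show NA.

```julia
# Hypothetical helper (not part of Taro): replace every string cell that
# appears in `nastrings` with `missing`, leaving other cells unchanged.
function apply_nastrings(column::AbstractVector, nastrings::AbstractVector{<:AbstractString})
    return [(x isa AbstractString && x in nastrings) ? missing : x for x in column]
end

# The last column of the example sheet, as read without nastrings.
h5 = ["a a", "b b", "c c", "d d", "e e", " ", "g g", "h h"]

# Treat a single space as a missing value, as in nastrings=[" "].
cleaned = apply_nastrings(h5, [" "])
println(cleaned)
```

Only the sixth entry, the single space, is replaced; every other value passes through untouched.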