temp_ascii_reader.py does not properly read most current data files #192

@butlerpd

Description

The current reader has several issues.

First Issue
The guessing is assumed to be the same for both 1D and 2D, and it is not obvious to me that it should be. For 1D, however, SasView spent some time long ago optimizing the universality of the guesses, which the new code for some reason does not follow. This leads to several issues:

  • start position - the current code looks for the first line beginning with a number. This is not correct. Headers often include rows that start with a number. Moreover, some header rows are all numbers, and sometimes there are even several rows with numbers only. After trial and error over a few years and a number of formats, the recipe for finding the starting line is: find the first 3 consecutive rows that contain only numbers and have exactly the same number of numbers (same number of columns).
  • number of columns - a minor issue. Currently there is some code that counts the columns on all rows after the first (what happens when you hit a footer row that no longer has numbers?) and chooses the most frequently encountered number of columns. Data rows should all have exactly the same number of columns, and in fact that is what is used above to determine that one is inside the data block. This whole method should be removed and the logic checking the number of columns moved to the start position method IMO.
  • ending - currently the assumption seems to be that everything from the "starting" row onward is valid data. This is mostly true. However, there used to be some data formats which used footers instead of headers. The way around that was to define the number of rows as the number of rows from the starting row until either EOF or a line that is not a row of numbers with the same number of columns as the rest.
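
For illustration, the three points above could be combined into a single pass along these lines. This is only a sketch, not the existing SasView or sasdata code: the function names `parse_numeric_row` and `find_data_block` are hypothetical, and it assumes columns are separated by whitespace, commas, or semicolons.

```python
import re


def parse_numeric_row(line):
    """Return the floats on a line, or None if the line is empty or has a non-number."""
    # Assumption: columns are separated by whitespace, commas, or semicolons.
    tokens = [t for t in re.split(r"[\s,;]+", line.strip()) if t]
    if not tokens:
        return None
    try:
        return [float(t) for t in tokens]
    except ValueError:
        return None


def find_data_block(lines, min_rows=3):
    """Return (start, end, ncols) with end exclusive, or None if no block is found.

    The block starts at the first run of `min_rows` consecutive all-numeric rows
    with the same column count, and ends at EOF or at the first row that is not
    numeric or has a different column count (for example a footer).
    """
    rows = [parse_numeric_row(line) for line in lines]
    for start in range(len(rows) - min_rows + 1):
        first = rows[start]
        if first is None:
            continue
        ncols = len(first)
        window = rows[start:start + min_rows]
        if all(r is not None and len(r) == ncols for r in window):
            end = start + min_rows
            while end < len(rows) and rows[end] is not None and len(rows[end]) == ncols:
                end += 1
            return start, end, ncols
    return None


example = [
    "Sample: test",
    "2024 run 5",          # header row that starts with a number
    "1 2 3 4",             # header row that is all numbers
    "Q I dI",
    "0.01 100.0 1.0",
    "0.02 90.0 0.9",
    "0.03 80.0 0.8",
    "0.04 70.0 0.7",
    "End of file footer",
]
block = find_data_block(example)
```

With this approach the column count falls out of the start-position search, so no separate "most frequent column count" step is needed. Note that a blank line inside the data would end the block in this sketch.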

Second Issue
The assumed column order in onedim and twodim is incorrect for most data out there, I believe. Almost all existing onedim data follows the order q, I, dI, sigmaQ, "mean Q", and shadow factor (where mean Q corrects Q for the shadow factor). So the current order is right for 2 and 3 column data but not for 4. The last 2 columns were proposed at a noBUGS meeting years ago but only implemented in the NIST ABS data format as far as I know.
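
As a rough sketch of the 1D order described above, the column labels could be taken as a prefix of one fixed list. The names `ONEDIM_COLUMNS` and `label_onedim_columns` are hypothetical, and treating every column count from 2 to 6 as a simple prefix is an assumption on top of the order given here.

```python
# Order as described in this issue: q, I, dI, sigmaQ, "mean Q", shadow factor.
ONEDIM_COLUMNS = ["Q", "I", "dI", "sigmaQ", "meanQ", "shadow_factor"]


def label_onedim_columns(ncols):
    """Return column labels for a 1D file with `ncols` columns (assumed prefix rule)."""
    if not 2 <= ncols <= len(ONEDIM_COLUMNS):
        raise ValueError(f"unsupported number of 1D columns: {ncols}")
    return ONEDIM_COLUMNS[:ncols]


four_column_labels = label_onedim_columns(4)
```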

There are rather few 2D ASCII data formats in Q space out there. The NIST *.DAT format is the main one I know of (unless GRASP has one, but I believe its 2D data is in pixel format?), and it served as the basis for the first 2D reduced data used by SasView. That format follows the order Qx, Qy, I(Qx,Qy,Qz), dI(Qx,Qy,Qz), Qz, sigmaQ parallel (to Q), sigmaQ perpendicular (to Q), beamstop shadow factor, mask (only in some cases). However, I note that we have 3, 4, and 7 column *.dat formats in our 2D example data. What they are and how they were being treated needs to be investigated. I note that the ASCII extension has been deprecated. I also note that Q data can be put on a grid in many ways, so it is hard to set a default for that.
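
A minimal sketch of the NIST *.DAT order listed above, with the mask column treated as optional. The names `NIST_2D_COLUMNS` and `label_twodim_columns` are hypothetical. Since the 3, 4, and 7 column files still need investigation, the sketch returns `None` for them rather than guessing.

```python
# Order as described in this issue for the NIST 2D *.DAT format.
NIST_2D_COLUMNS = [
    "Qx", "Qy", "I", "dI", "Qz",
    "sigmaQ_parallel", "sigmaQ_perpendicular",
    "shadow_factor", "mask",
]


def label_twodim_columns(ncols):
    """Return labels for 8 columns (no mask) or 9 (with mask), else None."""
    if ncols in (8, 9):
        return NIST_2D_COLUMNS[:ncols]
    # Other layouts (e.g. 3, 4, or 7 columns) are not covered by this sketch.
    return None


labels_without_mask = label_twodim_columns(8)
```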

Third Issue
Only two datatypes currently appear to be envisioned: onedim and twodim, with *.DAT being the only 2D ASCII format envisioned (possibly true). The ABS 1D type is unique and should be addressed separately. Other 1D types might also need to be addressed separately if one wants to extract metadata from the headers (not currently done in SasView 6.x).

In fact, all extensions should be handled by sasdata, I think, as discussed in sasview#3899. Moreover, I believe xml and hdf should not require those extensions, as the files themselves are self-describing, so the reader should be able to decide whether a file is in xml or hdf format, if I'm not mistaken?
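
To illustrate the self-describing point, a reader could look at the first bytes of a file instead of its extension. The sketch below is not sasdata code, and the names `sniff_bytes` and `sniff_file` are hypothetical. It relies on the HDF5 format signature and on the optional XML declaration, so an XML file without `<?xml` would be reported as unknown here. It also only checks offset 0, whereas HDF5 allows the signature at later offsets when a user block is present.

```python
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"  # first 8 bytes of an HDF5 file without a user block
UTF8_BOM = b"\xef\xbb\xbf"


def sniff_bytes(head):
    """Guess 'hdf5', 'xml', or 'unknown' from the leading bytes of a file."""
    if head.startswith(HDF5_SIGNATURE):
        return "hdf5"
    text = head
    if text.startswith(UTF8_BOM):
        text = text[len(UTF8_BOM):]
    # Conservative check: only accept files that carry an XML declaration.
    if text.lstrip().startswith(b"<?xml"):
        return "xml"
    return "unknown"


def sniff_file(path, nbytes=512):
    """Read the first `nbytes` bytes of `path` and classify them."""
    with open(path, "rb") as fh:
        return sniff_bytes(fh.read(nbytes))


kind = sniff_bytes(b'<?xml version="1.0"?>\n<SASroot></SASroot>')
```

Anything reported as unknown would then fall through to the ASCII guessing logic.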
