Tab Separated Value parser to supplement CSV?

I was experimenting with transferring my Place hierarchy using the CSV view export & the Import Text Gramplet. But one common step is to clean up the data a bit in Excel.

I was hoping to be able to just copy filtered chunks of the spreadsheet & have them parsed in a variant of the Import Text Gramplet without messing with double-quoting Place Titles & names. And without applying an Excel formula to concatenate each row with comma delimiters or having to write the rows to a subsequently importable CSV file.

But the Import Text Gramplet chokes on tab separated text. And our importer doesn’t recognize .tsv file extensions nor tab delimited content.

There’s a thread on StackOverflow that says the python csv module can delimit on tabs instead of commas.

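import csv

# newline="" lets the csv module handle line endings itself
with open("file.tsv", newline="") as fd:
    # Split on tabs instead of commas
    rd = csv.reader(fd, delimiter="\t", quotechar='"')
    for row in rd:
        print(row)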

Is there a way to support a different delimiter in our text parser?
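For the copy & paste case, here is a minimal sketch of what a delimiter-aware parse could look like using only the standard csv module. The function name `parse_delimited` and the sample row are made up for illustration; this is not how the Import Text Gramplet is actually written.

```python
import csv
import io


def parse_delimited(text, delimiter="\t"):
    """Parse pasted text into a list of rows, splitting on `delimiter`.

    Hypothetical helper for illustration only, not part of Gramps.
    """
    # newline="" stops StringIO from translating line endings,
    # so the csv module sees the text exactly as pasted.
    buffer = io.StringIO(text, newline="")
    return list(csv.reader(buffer, delimiter=delimiter, quotechar='"'))


# Made-up sample: a title containing a comma needs no quoting
# when the delimiter is a tab.
pasted = "Place\tTitle\tName\nP0001\tParis, France\tParis\n"
rows = parse_delimited(pasted)
for row in rows:
    print(row)
```

Passing `delimiter=","` to the same function would give the current comma behavior, which is why a single delimiter option might be enough.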

It looks like you have found such a way. Since the Import Text Gramplet is apparently an addon, why don’t you try modifying it to work the way you want, and then submit the changes as a PR?

Thanks for the encouragement. (That’s genuine, not sarcasm.) I’ve been slogging through doing that since posting.

Forked the add-on and doing a first pass as brute force… with a dedicated TSV version that ignores commas as the delimiter. Then I’ll try to figure out how to integrate a delimiter selector.

I think I’ve gotten most of the way to having Gramps recognize a .tsv mime type for import too.

If I get REALLY ambitious, I’ll find a way to have the Text Importer apply a selected Tag or Citation too. I really found those features useful for cleaning up a GEDCOM import. Then discover how to interface that too. (Arrghh!)

I did discover that the CSV header labels from the Export View and what the Import Text will accept are slightly different. (The labels are more compatible when Exporting a Tree than a View.)

Gramps view exports label the ID column as ‘ID’, whereas the CSV exporter, importer and Import Text Gramplet all expect that column to be labeled with the primary object type (Person, Marriage, Event) instead.

I wonder which should be preferred. Either way, it seems like they should be compatible.
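As a stopgap, the header row could be patched before import. This is only a sketch under the assumption that the ID column is literally labeled "ID" in the view export; `rename_id_column` and the other column labels in the sample are hypothetical.

```python
def rename_id_column(header, object_type):
    """Return a copy of `header` with an 'ID' label replaced by `object_type`.

    Hypothetical helper; `object_type` would be e.g. "Person",
    "Marriage" or "Event", as the importer expects.
    """
    return [object_type if label == "ID" else label for label in header]


# Made-up header row from a view export
view_header = ["ID", "Name", "Birth Date"]
import_header = rename_id_column(view_header, "Person")
print(import_header)
```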

Patching eight 5.1.x files with Serge’s changes for Gramps 5.2 gave the TSV functionality for the Text Import Gramplet and CSV Import.

See:

The labelings on the CSV dialect options are ambiguous to me.

  • excel
  • excel-tab
  • unix
  • Custom
    – ,
    – ;
    – :
    – |
    – Tab

Here are some more explicit labels. But are they correct/accurate? And can we use a MS branded product name without Trademark infringement?

  • Excel comma separated values (CSV)
  • Excel tab separated values (TSV)
  • Unix (CSV with LF end-of-line)
  • Custom
    – , (Comma; aka CSV)
    – ; (Semi-colon)
    – : (Colon)
    – | (Vertical Bar aka pipe-delimited)
    – Tab (TSV)
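The three named options match the dialect names registered by Python's csv module, so one way to check the proposed labels is to ask the module what each dialect does. This assumes the dialog passes those names straight through to the csv module, which I have not confirmed; `describe` is just a helper for this sketch.

```python
import csv


def describe(name):
    """Collect the settings that distinguish one registered csv dialect."""
    dialect = csv.get_dialect(name)
    return {
        "delimiter": dialect.delimiter,
        "lineterminator": dialect.lineterminator,
        "quote_all": dialect.quoting == csv.QUOTE_ALL,
    }


for name in ("excel", "excel-tab", "unix"):
    # repr() makes the tab and line-ending characters visible
    print(name, repr(describe(name)))
```

In Python's definitions, "excel" and "excel-tab" differ only in the delimiter (comma versus tab) and both end lines with CRLF, while "unix" uses a comma, ends lines with LF and quotes every field. So the "LF end-of-line" label is accurate as far as it goes but leaves out the quoting difference.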

As for the Unix option, here’s what Perplexity AI describes as the difference:

The main differences between a CSV file created by Microsoft Excel on Windows and a Unix CSV file are:

  1. Line endings:
  • Windows Excel CSV files typically use CRLF (\r\n) for line endings
  • Unix CSV files typically use just LF (\n) for line endings
  2. Default field delimiter:
  • Windows Excel often uses semicolon ( ; ) as the default field delimiter, especially in regions that use comma as the decimal separator
  • Unix CSV files typically use comma ( , ) as the field delimiter
  3. Character encoding:
  • Windows Excel CSV files are often saved with Windows-1252 or UTF-8 with BOM encoding
  • Unix CSV files typically use UTF-8 without BOM encoding
  4. Quoting:
  • Excel may add quotes around fields containing commas or line breaks inconsistently
  • Unix CSV files tend to follow RFC 4180 more strictly for quoting
  5. Handling of special characters:
  • Excel may handle certain special characters differently than standard Unix tools
  6. Metadata:
  • Excel CSV files may contain hidden metadata or formatting information
  • Unix CSV files are typically plain text without extra metadata
  7. Decimal separator:
  • Excel CSV files may use comma or period as decimal separator depending on regional settings
  • Unix CSV files typically use period as decimal separator

To ensure compatibility when working with CSV files across platforms, it’s often recommended to use a standardized format like RFC 4180 and explicitly specify encoding, delimiters, and line endings when creating or processing CSV files.
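A rough sketch of what "explicitly specify" could mean in Python: every choice (encoding, delimiter, line ending) is passed in rather than left to a platform default. The function names, the file name and the sample rows are made up for illustration.

```python
import csv
import os
import tempfile


def write_rows(path, rows, delimiter=",", lineterminator="\n", encoding="utf-8"):
    """Write rows with the delimiter, line ending and encoding spelled out."""
    # newline="" keeps Python from rewriting the line endings we chose
    with open(path, "w", newline="", encoding=encoding) as fd:
        writer = csv.writer(fd, delimiter=delimiter, lineterminator=lineterminator)
        writer.writerows(rows)


def read_rows(path, delimiter=",", encoding="utf-8"):
    """Read rows back using the same explicit settings."""
    with open(path, newline="", encoding=encoding) as fd:
        return list(csv.reader(fd, delimiter=delimiter))


# Made-up data: a field containing a comma survives a tab-delimited round trip
sample = [["Place", "Title"], ["P0001", "Paris, France"]]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "places.tsv")
    write_rows(path, sample, delimiter="\t")
    round_trip = read_rows(path, delimiter="\t")
print(round_trip)
```

For files saved by Excel as UTF-8 with a BOM, reading with `encoding="utf-8-sig"` strips the BOM so it does not end up glued to the first column label.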