
A Key Lesson from 40 Years of Parsing Free Text Data

Last Updated On: 2025-09-01 04:31:51 -0400

I’m putting on my grey beard attitude in writing this blog post.

The year is 2025, and I've been writing data parsers for free text data since roughly 1990. I've been working with free text data even longer, since 1987, when I started my first software company, NTERGAID.

During that time I've learned A LOT about parsing text, but one lesson stands out above all the others.

That lesson is the key one, and it's what this post is about. Let me give you an illustration. These days I write a lot of parsers for data that I get via Google Sheets. So I was running data recently, and one of the columns was ADA Accessible, which they had as a Yes / No / y / YES set of values. Let's ignore the crazy inconsistency in the values and just talk about the fact that this was a valid and valuable data attribute. My data schema did NOT allow for it. And this was mostly a case of cultural blindness: I'm only 57 and still have full mobility, so I don't naturally think in terms of accessibility.

So I processed thousands of these rows and then I was like “OH CRAP” when I saw this column. Now you might think that I needed to reload the data but – nope!

Here’s the magic trick – DON’T EVER DO LOSSY PARSING.

A LOSSY data operation is one where you lose data as it enters your system. The trick is to keep a copy of the original data in context with the processed data.
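In ActiveRecord terms (which is what my examples assume), that just means one extra column on the table. Here is a minimal sketch of what such a migration can look like; I'm assuming a Postgres jsonb column, but a :json column or a serialized text column works the same way:

class AddSourceDataToEvents < ActiveRecord::Migration[7.1]
  def change
    # Keep the raw, unparsed row right next to the parsed columns.
    # jsonb assumes Postgres; use :json or a serialized text column elsewhere.
    add_column :events, :source_data, :jsonb, default: {}, null: false
  end
end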

Here’s my event model:

 Event.new
#<Event:0x000000012c7ac740> {
                           :id => nil,
                         :name => nil,
                :event_type_id => nil,
               :event_start_at => Wed, 20 Aug 2025,
                 :event_end_at => nil,
                     :address1 => nil,
                     :address2 => nil,
                    :city_name => nil,
                     :state_id => nil,
              :organization_id => nil,
             :organization_fid => nil,
                         :slug => nil,
                          :fid => nil,
    :congressional_district_id => nil,
                    :recurring => false,
                   :recurrence => nil,
                          :url => nil,
                 :mobilize_url => nil,
           :is_suggested_event => nil,
           :pasted_description => nil,
                   :created_at => nil,
                   :updated_at => nil,
                  :county_name => nil,
                     :location => nil,
             :organizing_group => nil,
              :event_type_name => nil,
                        :notes => nil,
                :date_start_at => nil,
                :time_start_at => nil,
                  :time_end_at => nil,
                  :source_data => {},
                      :city_id => nil,
                    :county_id => nil
}

That third-from-last attribute, source_data, is a JSON-serialized column. Because I load my data from a CSV routine, when I write my event instance I can just say:

event.source_data = row.to_h

where row is a row object from my CSV parser and to_h is the method that turns the CSV library's internal representation of a row into a plain Ruby hash.
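For completeness, here is a simplified sketch of that import loop. The header names and the load_events method are made up for illustration, and a real loader does more normalization, but the non-lossy part is just the one assignment:

require "csv"

# Simplified sketch of a CSV import that keeps the raw row alongside the
# parsed fields. The header names here are illustrative, not a real schema.
def load_events(csv_path)
  CSV.foreach(csv_path, headers: true) do |row|
    event = Event.new(
      name:           row["Name"],
      city_name:      row["City"],
      event_start_at: row["Start"]
    )
    # The non-lossy part: stash the entire original row as a plain hash.
    event.source_data = row.to_h
    event.save!
  end
end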

Now when I want to add a column to my database for ada_accessible, all I need to do is:

  1. Add that column.
  2. Write a method that extracts the ADA data from the source_data column, which is only a matter of a few lines.
  3. Iterate over all the rows in the database and execute that method (see the sketch below).
  4. Save the row.
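Here is a sketch of those four steps. It assumes ActiveRecord, and the "ADA Accessible" header name and the truthy-value list should be whatever your source data actually contains:

# Step 1: add the column (run as a normal migration).
#   add_column :events, :ada_accessible, :boolean

# Step 2: a small extractor that also normalizes the Yes / No / y / YES mess.
class Event < ApplicationRecord
  ADA_TRUTHY = ["yes", "y", "true", "1"].freeze

  def extract_ada_accessible
    raw = source_data["ADA Accessible"].to_s.strip.downcase
    self.ada_accessible = ADA_TRUTHY.include?(raw)
  end
end

# Steps 3 and 4: walk the table in batches and persist each row.
Event.find_each do |event|
  event.extract_ada_accessible
  event.save!
end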

I've now been doing this for well over a decade – storing source data in the context of the parsed data – and I've never regretted it. The overall efficiency gains, and the fact that it lets me fearlessly write data parsers, have been incredible. And, yes, I've done this even in the context of multiple millions of pages of web crawl data.