TheStoryGraph Import Thoughts

Please note that all opinions are those of the author.

This write-up collects my thoughts on how I’d modify TheStoryGraph’s import process, plus a couple of related ideas:

Allow for using date_added as date_read when date_read doesn’t exist
Allow for multiple imports of Goodreads data
Replace Goodreads’ scanning feature with a TheStoryGraph scanning feature
Offer something new that might get you press attention

All thoughts are given to TheStoryGraph with no expectations of anything. I simply love your founding story and want you to succeed. And please know that I don’t understand your architecture save what you disclosed to Adam on CoRecursive. I’m just a book freak who writes code and has some renown for loving books: https://oxide.computer/podcasts/oxide-and-friends/955244.

The other thing to know is I’ve been really fortunate in my career and I’ve done well. I’m now 55 and trying to make a concerted effort to give back to people and products that I think are amazing.

The Date Added Thing

Note: After I wrote all of this below, something occurred to me – you likely have a huge bucket of Goodreads CSV files sitting around. Don’t even bother reading the next section; just use the Goodreads data to find out if the way I use Goodreads is normal or not. I could just be a very vocal minority. Smart people with loud voices often bias data.

As we discussed, my usage of GoodReads is nothing more than “I read this”. I have 2 friends there and I don’t use it for anything other than bar code scanning so I can check that I already read something when I’m shopping at Barnes and Noble or my local used book store. I don’t know but I suspect that I’m not the only person who uses it that way.

My assumption is that your data set works better if there is a date associated with a book.

There could be a few options for handling this:

1. Assume that date_added maps to date_read IFF (if and only if) there are no date_read values.
2. Use a preferences setting to allow people to set this post-import.

The problem with option 2 is twofold:

1. It requires looping back over (in whatever fashion) all of a user’s records and updating them.
2. The user may not find the setting.

From my perspective (and it’s my perspective; I don’t know your context), if you know that the CSV doesn’t have any date_read values at all, then you only benefit from making this assumption. I have found that data-centric products always benefit from the data getting better.

The problem is that you don’t know that the CSV doesn’t have the field set until you’ve looped through it in full – and, at that point, you have to loop back over the records as per #1 above.
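To make the two-pass shape of the problem concrete, here is a minimal Ruby sketch of option 1. It is only an illustration: the "Date Read" and "Date Added" column names are my assumption about the Goodreads export headers, and the method names are made up for this example. I have no idea what your actual importer looks like.

```ruby
require "csv"

# Assumed Goodreads export column names (hypothetical for this sketch).
DATE_READ  = "Date Read"
DATE_ADDED = "Date Added"

# First pass: true only when no row in the file has a Date Read value.
def no_read_dates?(rows)
  rows.none? { |row| row[DATE_READ].to_s.strip != "" }
end

# Second pass: returns one hash per row with a :date_read key. Date Added is
# used as the read date only when the whole file has no Date Read values.
def rows_with_read_dates(csv_text)
  rows = CSV.parse(csv_text, headers: true).map(&:to_h)
  fallback = no_read_dates?(rows)

  rows.map do |row|
    date_read = row[DATE_READ].to_s.strip
    date_read = row[DATE_ADDED].to_s.strip if fallback
    row.merge(date_read: date_read.empty? ? nil : date_read)
  end
end

sample = <<~CSV
  Book Id,Title,Date Read,Date Added
  1,Worm,,2021/03/04
  2,Dune,,2020/01/15
CSV

rows_with_read_dates(sample).each do |row|
  puts "#{row['Title']}: #{row[:date_read]}"
end
```

Note that the sketch has to hold every row in memory (or read the file twice) before it can decide, which is exactly the cost described above.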

A possible approach is to use a fast command line tool as a pre-processor on the file to see if any date_read fields are set. This approach:

keeps Ruby out of the way
gives you something that scales across cores
lets you route the result into a Sidekiq queue with an option to set_date_read_from_date_added (or not)

Allowing for Multiple Import

If you don’t end up replacing Goodreads with your own scanning, then one way that you could allow for multiple imports is to move to an idempotent approach to BookUser records. Idempotency is just a fancy term which says “don’t create it if it already exists”. The way I build Rails apps is with a find_or_create class method and a class constant which identifies what makes a record unique. In the case of import, it would likely be the user’s id and the Goodreads Book Id column. If you simply do an existence query against those two columns, then you could allow a user to keep scanning with Goodreads and upload over and over.
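Here is a rough sketch of the find_or_create idea, written as plain Ruby so it runs on its own. Everything in it is an assumption for illustration: the UNIQUE_KEYS constant, the goodreads_book_id attribute name, and the in-memory hash that stands in for your database table. In a real Rails model the lookup would be an existence query against those two columns.

```ruby
# In-memory stand-in for a BookUser table (hypothetical; not your real model).
class BookUser
  # The attributes that together make a record unique (assumed names).
  UNIQUE_KEYS = %i[user_id goodreads_book_id].freeze

  attr_reader :attributes

  # Class-level store keyed by the unique attribute values.
  @records = {}

  class << self
    attr_reader :records

    # Returns [record, created]. A record is created only when none exists
    # for the same unique key, so repeated imports do not add duplicates.
    def find_or_create(attrs)
      key = attrs.values_at(*UNIQUE_KEYS)
      existing = @records[key]
      return [existing, false] if existing

      record = new(attrs)
      @records[key] = record
      [record, true]
    end
  end

  def initialize(attrs)
    @attributes = attrs
  end
end

row = { user_id: 7, goodreads_book_id: "12345", title: "Worm" }
_, created_first  = BookUser.find_or_create(row)
_, created_second = BookUser.find_or_create(row)
puts "first import created a record:  #{created_first}"
puts "second import created a record: #{created_second}"
```

Importing the same row twice leaves one record behind, which is the property that makes re-uploading the same CSV safe.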

And since database load is absolutely your bottleneck (at least based on your CoRecursive interview), you could easily populate a Redis key that represents this and then check that first.
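A sketch of the "check the cache first" idea, again in plain Ruby. To keep it runnable without any services, a Ruby Set stands in for Redis and another Set stands in for the database; the class name, key format, and query counter are all made up for this example. With a real Redis, a set-membership check would play the role of the cache lookup.

```ruby
require "set"

# Checks a cheap key cache before touching the (simulated) database.
class ImportDeduper
  attr_reader :db_queries

  def initialize(existing_keys = [])
    @database   = Set.new(existing_keys) # stand-in for the BookUser table
    @seen_cache = Set.new                # stand-in for a Redis set
    @db_queries = 0
  end

  # Returns true when the record was new and got created, false otherwise.
  def import(user_id, goodreads_book_id)
    key = "#{user_id}:#{goodreads_book_id}"
    return false if @seen_cache.include?(key) # cache hit: skip the database

    @db_queries += 1                          # cache miss: one existence query
    created = !@database.include?(key)
    @database.add(key) if created
    @seen_cache.add(key)
    created
  end
end

deduper = ImportDeduper.new(["7:12345"])
deduper.import(7, "12345") # already in the database; the key is now cached
deduper.import(7, "12345") # answered from the cache, no database query
deduper.import(7, "67890") # new book, created
puts "database queries: #{deduper.db_queries}"
```

The point of the design is that a repeat upload of a mostly unchanged CSV hits the cache for nearly every row, so the database only sees the rows that are actually new.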

Just a thought, but you’re a data-centric app and your overall value is a function of the amount of data you have. Multiple imports get you more data.

Replacing GoodReads Scanning with Your Own

Looking at TheStoryGraph today, the big thing that stood out to me was the lack of an easy way to add the book I’m reading. Here’s an interesting article about implementing barcode scanning in an HTML context:

https://blog.classycode.com/scanning-barcodes-in-progressive-web-apps-using-angular-5d5a9d50bd8?gi=814c723e7167

The very, very interesting thing that we talked about yesterday was that you might be able to bootstrap your own barcode reading since you have large numbers of unique books.

Another approach might be to let users upload a picture of the cover of the book they are reading, and then either crowdsource the identification or use something like Amazon Textract (or an OCR approach powered by Docker; I had great luck with this a number of years ago) and cross-reference the result against an Internet search.

My guess is that even if you used the less certain OCR approach against a picture of the book cover, you could likely then use your own ML model to check whether the candidate book is a plausible match for that user.

Note: I spent about 18 months doing ML work identifying online hate. I don’t think I’m wrong that you could use your ML models in this way.

A Possible New Offering: Your Read Data as Trends

There is always a market for trend data and market research reports. Back in the early days of blogging, my biggest competitor was Technorati (I was the founder of Feedster). Technorati used to issue “The State of the Blogosphere” reports and got huge accolades for this.

I suspect your underlying reading pattern data could power this type of report. There is also a market for selling that data back to publishers.

My Final Idea - I Promise

I’ve recently been finding good long-form fiction just on the web. Here’s an example:

https://parahumans.wordpress.com/

I’m a pretty astute reader and this is as good as most published genre fiction (Mike Chen’s works come to mind).

I don’t know if you’ve considered tracking online reading, but I think there’s something here.

Last Updated On: 2025-10-13 07:27:40 -0400