Scraped Content
In the Tagged Index blog entry, I talked a bit about deciding which books to include in the Haskell books index. There are many different kinds of Haskell books as well as books that are not about Haskell in particular but are usually of interest to Haskell programmers. Deciding which books to include inevitably introduces bias.
The meta page includes a section on book selection, which states the criteria for inclusion in the index. This makes it clear which books should be included in the index. If I feel that a book does not belong in the index even though it passes the criteria, then I will update the criteria to make the reason/bias explicit.
There are currently three criteria for book selection:
- The topic of the book should be programming in Haskell or the implementation of Haskell-like languages.
  - The new site that I am currently developing will have more than one index, so there can be a separate index for Haskell-adjacent topics.
- The index only includes English books.
  - The new site that I am currently developing will have separate indexes for other languages.
- The index only includes books that are complete.
  - I considered adding a “wip” (work in progress) tag for incomplete books, but I decided against it. Some books that are viewable while they are still being written are of high quality, but many are not. Some books are abandoned before they are completed.
I recently received a request to add two “books” (free PDFs) that consist of content scraped from StackOverflow. They both pass all of the current criteria. It was difficult to decide whether such books should be included in the index or whether the criteria should be changed so that such content is not included.
On one hand, the bar for creating such books is very low. Scraping content and assembling it into a document (using a GUI, Pandoc, and/or LaTeX) is not difficult and can be done in a short time, compared to writing a decent book. The books also seem to be a marketing hack, produced by tutorial websites that I would normally never visit. At worst, they could be attempts at profiting off of others’ efforts.
On the other hand, reading StackOverflow questions and answers can be educational. Collections of top content could be a convenient way for somebody to increase their Haskell knowledge after finishing beginner books. The PDF format makes it easy to read on an ebook reader or tablet during a commute or trip without reliable internet access.
I decided to include those two books, prioritizing their utility over my bias against them. If more such content pops up, however, I would not want such low-bar content to fill the index. If it becomes a problem, I plan on updating the criteria to exclude scraped content.
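To make the difference between the current criteria and the possible stricter rule concrete, here is a minimal sketch in Haskell. It is only an illustration: the `Book` record, its fields, and both predicate names are hypothetical and do not describe how the index is actually implemented.

```haskell
-- Hypothetical model of an index entry; none of these names come from the site.
data Topic = Haskell | HaskellLikeImplementation | HaskellAdjacent
  deriving (Eq, Show)

data Language = English | OtherLanguage String
  deriving (Eq, Show)

data Status = Complete | InProgress | Abandoned
  deriving (Eq, Show)

data Book = Book
  { bookTitle    :: String
  , bookTopic    :: Topic
  , bookLanguage :: Language
  , bookStatus   :: Status
  , bookScraped  :: Bool  -- True if the book was assembled from scraped content
  } deriving (Eq, Show)

-- The three current criteria: topic, language, and completeness.
passesCurrentCriteria :: Book -> Bool
passesCurrentCriteria b =
  bookTopic b `elem` [Haskell, HaskellLikeImplementation]
    && bookLanguage b == English
    && bookStatus b == Complete

-- A possible future rule (an assumption): additionally exclude scraped content.
passesStricterCriteria :: Book -> Bool
passesStricterCriteria b = passesCurrentCriteria b && not (bookScraped b)
```

Loaded in GHCi, a complete English Haskell book with `bookScraped = True` satisfies `passesCurrentCriteria` but not `passesStricterCriteria`, which is exactly the situation of the two PDFs.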
The index on my personal website is an experiment, after all. I shall see if I receive any complaints or suggestions, and I will hopefully have learned a lot when I release the new site!