u/Combinatorilliance

You can do basic web search with just two simple CLI tools

Hi! I was looking at the web search options available in the pi ecosystem, and most of them wrap some API or require config.

I just want my tool to be able to:

  1. Run a search query via a search provider
  2. Fetch pages preferably as markdown

For this I found that there exist two boring tools that work well together:

  1. The DuckDuckGo command-line tool, ddgr. It's just one sudo apt install ddgr away.
  2. The super weirdly named trafilatura. It's a Python tool that extracts the text content from a URL, with lots of options for presentation and for what to include or exclude. Install it with pip install trafilatura, I suppose? I use NixOS, so I don't know how to install it globally with Python. Python is hell.
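
The search half can be sketched in a few lines of Python. This is a minimal sketch, not the code from the gist: the function names are made up for illustration, and it assumes ddgr is on your PATH and that its --json output is a list of objects with title, url and abstract fields.

```python
import json
import subprocess


def build_ddgr_command(query: str, max_results: int = 5) -> list[str]:
    """Build the ddgr call: JSON output, capped number of results."""
    # --json prints results as JSON instead of opening the interactive prompt.
    # --num limits how many results come back.
    return ["ddgr", "--json", "--num", str(max_results), query]


def parse_ddgr_json(raw: str) -> list[dict]:
    """Keep only the fields a model needs from ddgr's JSON output."""
    # Assumption: each result is an object with title, url and abstract keys.
    results = json.loads(raw) if raw.strip() else []
    return [
        {
            "title": r.get("title", ""),
            "url": r.get("url", ""),
            "abstract": r.get("abstract", ""),
        }
        for r in results
    ]


def web_search(query: str, max_results: int = 5, timeout: int = 30) -> list[dict]:
    """Run ddgr and return parsed results (needs ddgr installed and network access)."""
    proc = subprocess.run(
        build_ddgr_command(query, max_results),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise if ddgr exits with an error
    )
    return parse_ddgr_json(proc.stdout)
```

Calling web_search("nixos install python package globally") should hand back a short list of dicts that an agent can pick URLs from.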

What is trafilatura?

It's a command-line tool that extracts the meaningful content from a web page. It's been actively maintained for over 9 years (probably longer?), and its primary use case is academic research. I suppose researchers often need to do scraping.

Anyway, it is rich, mature, old and just a CLI tool. It supports markdown output, regular output, a mode that shows very little content, and a mode that shows more content. You can choose to include or exclude links, etc.


Anyway, if you wrap these in a simple extension, you get search that runs entirely through local tools, with no API key or config, and that works for the common use case of "just quickly look something up on a forum, documentation, Wikipedia or GitHub".

I haven't looked into how to publish this as an extension, but if people like it I could package it up.

This is the extension as a gist if anyone wants to try it.

https://gist.github.com/Azeirah/9375fb67c5aee6ca1b7e046f8b7cf0cd

In the gist, trafilatura is configured to:

  1. Show links
  2. Output markdown
  3. Use the concise output rather than the verbose one, to save tokens
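
As a rough sketch, those three settings could map onto trafilatura's CLI like this. The flag choice is an assumption, not taken from the gist: --markdown and --links are the switches for markdown output and link targets, and "concise" is guessed to mean --precision (less noise, possibly less text). The helper names are hypothetical, and --markdown needs a reasonably recent trafilatura.

```python
import subprocess


def build_trafilatura_command(url: str, links: bool = True, concise: bool = True) -> list[str]:
    """Build a trafilatura call that prints one page as markdown."""
    cmd = ["trafilatura", "--markdown"]
    if links:
        cmd.append("--links")  # keep link targets in the output
    if concise:
        # Assumption: "concise" corresponds to favoring precision over recall.
        cmd.append("--precision")
    cmd += ["-u", url]  # -u takes the URL to download and extract
    return cmd


def fetch_page(url: str, timeout: int = 60) -> str:
    """Run trafilatura and return the extracted markdown (needs network access)."""
    proc = subprocess.run(
        build_trafilatura_command(url),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise if trafilatura exits with an error
    )
    return proc.stdout
```

Swapping --precision for --recall would be the way to get the more verbose output, at the cost of more tokens.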
u/Combinatorilliance — 2 days ago