Compare commits

...

3 Commits

Author SHA1 Message Date
5363547637 new blogpost 2025-08-23 20:28:44 -04:00
7b3715039a update styling 2025-08-23 20:22:49 -04:00
1061b92d28 make my site slightly more accessible 2025-08-21 22:18:15 -04:00
3 changed files with 674 additions and 1 deletions

@@ -0,0 +1,332 @@
<!DOCTYPE HTML>
<html lang="en">
<title>'Writing my own web crawler in go'</title>
<meta name="date" content="2025/08/21">
<link rel="stylesheet" href="/style.css">
<style>
.-spell {}
.Number {color: #e29eca}
.Function {color: #c1c0d4}
.String {color: #90b99f}
.Character {color: #90b99f}
.PreCondit {color: #ea83a5}
.-property {color: #c1c0d4}
.Comment {color: #757581; font-style: italic}
.Macro {color: #ea83a5}
.-type-builtin {color: #e29eca}
.Type {color: #b9aeda}
.Keyword {color: #aca1cf}
.Constant {color: #ea83a5}
.-variable-parameter {color: #e29eca}
.-punctuation-delimiter {color: #9998a8}
.-punctuation-bracket {color: #9998a8}
.Include {color: #aca1cf}
.Structure {color: #e6b99d}
.-variable {color: #c9c7cd}
.Operator {color: #e6b99d}
</style>
<body id="blog">
<h1>Writing my own web crawler in go</h1>
<p>
I got bored; it happens to everyone (especially software developers).
So, like every software developer, I've started a new side project: a web
crawler. This isn't for any actual use case, I kinda just wanna learn how
to use Go and (hopefully) sharpen my SQL skills in the process. At this
point in the writing process I've only just started, so let me show you
what I've gotten working so far.
</p>
<p>
To start I'm just searching through the hrefs of all &lt;a&gt; tags on a
site and printing them out. On my site that looks like this:
</p>
<pre>
/
mailto:me@zacharyscheiman.com
https://github.com/squibid
https://codeberg.org/squibid
https://git.squi.bid/squibid/wiz
https://git.squi.bid/squibid
/blog/rss.xml
/blog/New-Keyboard!
/blog/Serializing-data-in-C
/blog/Why-"suckless"-software-is-important
/blog/What-is-a-squibid
/blog/librex-and-dots
/?all_blog
https://lunarflame.dev
https://eggbert.xyz/
</pre>
<p>
Here's the code which fetched this list:
</p>
<pre>
<span class="Statement"><span class="Keyword">package</span></span> <span class="Structure">main</span>
<span class="Statement"><span class="Keyword">import</span></span> <span class="-punctuation-bracket">(</span>
<span class="String"><span class="String">&quot;fmt&quot;</span></span>
<span class="String"><span class="String">&quot;io&quot;</span></span>
<span class="String"><span class="String">&quot;net/http&quot;</span></span>
<span class="String"><span class="String">&quot;golang.org/x/net/html&quot;</span></span>
<span class="-punctuation-bracket">)</span>
<span class="Keyword"><span class="-keyword-function">func</span></span> <span class="-variable"><span class="Function">deal_html</span></span><span class="-punctuation-bracket">(</span><span class="-variable"><span class="-variable-parameter"><span class="DiagnosticUnderlineInfo">site</span></span><span class="DiagnosticUnderlineInfo"></span></span><span class="DiagnosticUnderlineInfo"> <span class="Type"><span class="Type"><span class="-type-builtin">string</span></span></span></span><span class="Type"><span class="Type"><span class="-type-builtin"></span></span></span><span class="-punctuation-delimiter">,</span> <span class="-variable"><span class="-variable-parameter">reader</span></span> <span class="Structure">io</span><span class="-punctuation-delimiter">.</span><span class="Type">Reader</span><span class="-punctuation-bracket">)</span> <span class="-punctuation-bracket">{</span>
<span class="-variable">doc</span><span class="-punctuation-delimiter">,</span> <span class="-variable">err</span> <span class="Operator">:=</span> <span class="-variable">html</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Parse</span></span><span class="-punctuation-bracket">(</span><span class="-variable">reader</span><span class="-punctuation-bracket">)</span>
<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">err</span> <span class="Operator">!=</span> <span class="Constant"><span class="-constant-builtin">nil</span></span> <span class="-punctuation-bracket">{</span>
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String"><span class="-spell">&quot;Error parsing HTML:&quot;</span></span></span><span class="-punctuation-delimiter">,</span> <span class="-variable">err</span><span class="-punctuation-bracket">)</span>
<span class="Statement"><span class="Keyword">return</span></span>
<span class="-punctuation-bracket">}</span>
<span class="Keyword"><span class="Keyword">var</span></span><span class="goSingleDecl"> </span><span class="-variable">walk</span> <span class="Keyword"><span class="-keyword-function">func</span></span><span class="-punctuation-bracket">(</span><span class="-variable"><span class="-variable-parameter">n</span></span> <span class="Operator">*</span><span class="Structure">html</span><span class="-punctuation-delimiter">.</span><span class="Type">Node</span><span class="-punctuation-bracket">)</span>
<span class="-variable">walk</span> <span class="Operator">=</span> <span class="Keyword"><span class="-keyword-function">func</span></span><span class="-punctuation-bracket">(</span><span class="-variable"><span class="-variable-parameter">n</span></span> <span class="Operator">*</span><span class="Structure">html</span><span class="-punctuation-delimiter">.</span><span class="Type">Node</span><span class="-punctuation-bracket">)</span> <span class="-punctuation-bracket">{</span>
<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">n</span><span class="-punctuation-delimiter">.</span><span class="-property">Type</span> <span class="Operator">==</span> <span class="-variable">html</span><span class="-punctuation-delimiter">.</span><span class="-property">ElementNode</span> <span class="Operator">&amp;&amp;</span> <span class="-variable">n</span><span class="-punctuation-delimiter">.</span><span class="-property">Data</span> <span class="Operator">==</span> <span class="String"><span class="String"><span class="-spell">&quot;a&quot;</span></span></span> <span class="-punctuation-bracket">{</span>
<span class="Repeat"><span class="Keyword">for</span></span> <span class="-variable">_</span><span class="-punctuation-delimiter">,</span> <span class="-variable">attr</span> <span class="Operator">:=</span> <span class="Repeat"><span class="Keyword">range</span></span> <span class="-variable">n</span><span class="-punctuation-delimiter">.</span><span class="-property">Attr</span> <span class="-punctuation-bracket">{</span>
<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Key</span> <span class="Operator">==</span> <span class="String"><span class="String"><span class="-spell">&quot;href&quot;</span></span></span> <span class="-punctuation-bracket">{</span>
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">)</span>
<span class="-punctuation-bracket">}</span>
<span class="-punctuation-bracket">}</span>
<span class="-punctuation-bracket">}</span>
<span class="Repeat"><span class="Keyword">for</span></span> <span class="-variable">c</span> <span class="Operator">:=</span> <span class="-variable">n</span><span class="-punctuation-delimiter">.</span><span class="-property">FirstChild</span><span class="-punctuation-delimiter">;</span> <span class="-variable">c</span> <span class="Operator">!=</span> <span class="Constant"><span class="-constant-builtin">nil</span></span><span class="-punctuation-delimiter">;</span> <span class="-variable">c</span> <span class="Operator">=</span> <span class="-variable">c</span><span class="-punctuation-delimiter">.</span><span class="-property">NextSibling</span> <span class="-punctuation-bracket">{</span>
<span class="-variable"><span class="Function">walk</span></span><span class="-punctuation-bracket">(</span><span class="-variable">c</span><span class="-punctuation-bracket">)</span>
<span class="-punctuation-bracket">}</span>
<span class="-punctuation-bracket">}</span>
<span class="-variable"><span class="Function">walk</span></span><span class="-punctuation-bracket">(</span><span class="-variable">doc</span><span class="-punctuation-bracket">)</span>
<span class="-punctuation-bracket">}</span>
<span class="Keyword"><span class="-keyword-function">func</span></span> <span class="-variable"><span class="Function">main</span></span><span class="-punctuation-bracket">(</span><span class="-punctuation-bracket">)</span> <span class="-punctuation-bracket">{</span>
<span class="-variable">site</span> <span class="Operator">:=</span> <span class="String"><span class="String"><span class="-spell">&quot;https://squi.bid/&quot;</span></span></span>
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String"><span class="-spell">&quot;fetching &quot;</span></span></span> <span class="Operator">+</span> <span class="-variable">site</span><span class="-punctuation-bracket">)</span>
<span class="-variable">resp</span><span class="-punctuation-delimiter">,</span> <span class="-variable">err</span> <span class="Operator">:=</span> <span class="-variable">http</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Get</span></span><span class="-punctuation-bracket">(</span><span class="-variable">site</span><span class="-punctuation-bracket">)</span>
<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">err</span> <span class="Operator">!=</span> <span class="Constant"><span class="-constant-builtin">nil</span></span> <span class="-punctuation-bracket">{</span>
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String"><span class="-spell">&quot;Error getting the website&quot;</span></span></span><span class="-punctuation-bracket">)</span>
<span class="Statement"><span class="Keyword">return</span></span>
<span class="-punctuation-bracket">}</span>
<span class="Statement"><span class="Keyword">defer</span></span> <span class="-variable">resp</span><span class="-punctuation-delimiter">.</span><span class="-property">Body</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Close</span></span><span class="-punctuation-bracket">(</span><span class="-punctuation-bracket">)</span>
<span class="-variable"><span class="Function">deal_html</span></span><span class="-punctuation-bracket">(</span><span class="-variable">site</span><span class="-punctuation-delimiter">,</span> <span class="-variable">resp</span><span class="-punctuation-delimiter">.</span><span class="-property">Body</span><span class="-punctuation-bracket">)</span>
<span class="-punctuation-bracket">}</span>
</pre>
<p>
After a quick look at this output, it's clear we need to handle
several different "url formats", like the mailto: and / links. For right
now I'm going to respect people's privacy and not index their email
addresses.
</p>
<pre>
<span class="Conditional"><span class="Keyword">if</span></span> <span class="Identifier"><span class="-variable"><span class="Function"><span class="Special">len</span></span></span></span><span class="-punctuation-bracket">(</span><span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">)</span> <span class="Operator">&gt;</span> <span class="Number"><span class="Number">1</span></span> <span class="Operator">&amp;&amp;</span> <span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">[</span><span class="-punctuation-delimiter">:</span><span class="Number"><span class="Number">2</span></span><span class="-punctuation-bracket">]</span> <span class="Operator">==</span> <span class="String"><span class="String"><span class="-spell">&quot;//&quot;</span></span></span> <span class="-punctuation-bracket">{</span>
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String"><span class="-spell">&quot;https:&quot;</span></span></span> <span class="Operator">+</span> <span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">)</span>
<span class="-punctuation-bracket">}</span> <span class="Conditional"><span class="Keyword">else</span></span> <span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">[</span><span class="-punctuation-delimiter">:</span><span class="Number"><span class="Number">1</span></span><span class="-punctuation-bracket">]</span> <span class="Operator">==</span> <span class="String"><span class="String"><span class="-spell">&quot;/&quot;</span></span></span> <span class="-punctuation-bracket">{</span>
<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">site</span><span class="-punctuation-bracket">[</span><span class="Identifier"><span class="-variable"><span class="Function"><span class="Special">len</span></span></span></span><span class="-punctuation-bracket">(</span><span class="-variable">site</span><span class="-punctuation-bracket">)</span> <span class="Operator">-</span> <span class="Number"><span class="Number">1</span></span><span class="-punctuation-delimiter">:</span><span class="-punctuation-bracket">]</span> <span class="Operator">==</span> <span class="String"><span class="String"><span class="-spell">&quot;/&quot;</span></span></span> <span class="-punctuation-bracket">{</span>
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="-variable">site</span> <span class="Operator">+</span> <span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">[</span><span class="Number"><span class="Number">1</span></span><span class="-punctuation-delimiter">:</span><span class="-punctuation-bracket">]</span><span class="-punctuation-bracket">)</span>
<span class="-punctuation-bracket">}</span> <span class="Conditional"><span class="Keyword">else</span></span> <span class="-punctuation-bracket">{</span>
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="-variable">site</span> <span class="Operator">+</span> <span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">)</span>
<span class="-punctuation-bracket">}</span>
<span class="-punctuation-bracket">}</span> <span class="Conditional"><span class="Keyword">else</span></span> <span class="Conditional"><span class="Keyword">if</span></span> <span class="Identifier"><span class="-variable"><span class="Function"><span class="Special">len</span></span></span></span><span class="-punctuation-bracket">(</span><span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">)</span> <span class="Operator">&gt;</span> <span class="Number"><span class="Number">4</span></span> <span class="Operator">&amp;&amp;</span> <span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">[</span><span class="-punctuation-delimiter">:</span><span class="Number"><span class="Number">4</span></span><span class="-punctuation-bracket">]</span> <span class="Operator">==</span> <span class="String"><span class="String"><span class="-spell">&quot;http&quot;</span></span></span> <span class="-punctuation-bracket">{</span>
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">)</span>
<span class="-punctuation-bracket">}</span>
</pre>
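<p>
For comparison, here's a minimal sketch of the same idea built on Go's
net/url package instead of manual string slicing. The resolveLink helper
and the sample hrefs are assumptions for illustration, not the crawler's
actual code; it resolves each href against the page it was found on and
drops anything that isn't http or https (which covers mailto: links).
</p>

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLink turns an href found on the page `base` into an absolute URL.
// It returns ok=false for anything that is not http(s), such as mailto: links.
// resolveLink is a hypothetical helper, not part of the crawler above.
func resolveLink(base, href string) (string, bool) {
	baseURL, err := url.Parse(base)
	if err != nil {
		return "", false
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", false
	}
	// ResolveReference handles "/", "//host/..." and plain relative paths.
	abs := baseURL.ResolveReference(ref)
	if abs.Scheme != "http" && abs.Scheme != "https" {
		return "", false
	}
	abs.Fragment = "" // "#section" links point at the same page
	return abs.String(), true
}

func main() {
	base := "https://squi.bid/blog/"
	hrefs := []string{
		"/",
		"//eggbert.xyz/",
		"librex-and-dots",
		"mailto:me@zacharyscheiman.com",
		"https://lunarflame.dev",
	}
	for _, href := range hrefs {
		if abs, ok := resolveLink(base, href); ok {
			fmt.Println(abs)
		}
	}
}
```

<p>
The upside of leaning on net/url is that relative paths and query strings
get resolved by the standard library's rules rather than by string
concatenation.
</p>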
<p>
Now that we've gotten the actual links from the website, it's time to
store them and get the links from their sites too. Because this is
already just a toy project, I won't be storing all the info I would if
this were a real project. Instead I'll only store the link to the site
and a boolean representing whether I've fetched its contents yet. So,
let's go impl...
</p>
<p>
Step #1 is to find a SQL library. I just went with Go's built-in
database/sql package, and then visited
<a href="https://golang.org/s/sqldrivers">golang.org/s/sqldrivers</a>
and decided on
<a href="https://github.com/mattn/go-sqlite3">github.com/mattn/go-sqlite3</a>
because sqlite is a name I'm familiar with, and I really don't want to go
through the hassle of looking into different dbs for a toy project.
<a href="#footnote-1">[1]</a>
</p>
<p>
Now that we've chosen our db I'll set up our table like I mentioned
earlier, with one string and one boolean:
</p>
<pre>
<span class="-variable">db</span><span class="-punctuation-delimiter">,</span> <span class="-variable">err</span> <span class="Operator">=</span> <span class="-variable">sql</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Open</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String"><span class="-spell">&quot;sqlite3&quot;</span></span></span><span class="-punctuation-delimiter">,</span> <span class="String"><span class="String"><span class="-spell">&quot;./sites.db&quot;</span></span></span><span class="-punctuation-bracket">)</span>
<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">err</span> <span class="Operator">!=</span> <span class="Constant"><span class="-constant-builtin">nil</span></span> <span class="-punctuation-bracket">{</span>
<span class="-variable">log</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Fatal</span></span><span class="-punctuation-bracket">(</span><span class="-variable">err</span><span class="-punctuation-bracket">)</span>
<span class="-punctuation-bracket">}</span>
<span class="Statement"><span class="Keyword">defer</span></span> <span class="-variable">db</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Close</span></span><span class="-punctuation-bracket">(</span><span class="-punctuation-bracket">)</span>
<span class="-variable">db</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Exec</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String">`</span>
<span class="String"> create table if not exists</span>
<span class="String"> urls (url text not null primary key, indexed boolean not null);</span>
<span class="String"> `</span></span><span class="-punctuation-bracket">)</span>
</pre>
<p>
Now we need to start adding entries to the db. To make sure I wouldn't
end up shooting myself in the foot, I decided to go with a small
function to make it a teensy tiny bit safer:
</p>
<pre>
<span class="Keyword"><span class="-keyword-function">func</span></span> <span class="-variable"><span class="Function">db_insert_url</span></span><span class="-punctuation-bracket">(</span><span class="-variable"><span class="-variable-parameter">url</span></span> <span class="Type"><span class="Type"><span class="-type-builtin">string</span></span></span><span class="-punctuation-delimiter">,</span> <span class="-variable"><span class="-variable-parameter">seen</span></span> <span class="Type"><span class="Type"><span class="-type-builtin">bool</span></span></span><span class="-punctuation-bracket">)</span> <span class="-punctuation-bracket">{</span>
<span class="-variable">db</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Exec</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String">`insert into urls values (?, ?)`</span></span><span class="-punctuation-delimiter">,</span> <span class="-variable">url</span><span class="-punctuation-delimiter">,</span> <span class="-variable">seen</span><span class="-punctuation-bracket">)</span>
<span class="-punctuation-bracket">}</span>
</pre>
<p>
It could use some error handling, but if you look at section 3 part 4 of
the software engineer's manual it reads "side projects aren't stable
because if it's stable it's not a side project".
</p>
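<p>
For anyone who does want the error handling, here's a rough sketch of
what it could look like. Everything in it is an assumption for
illustration: insertURL is a hypothetical variant of db_insert_url, the
execer interface exists only so the example runs without a real database,
and brokenDB is a fake that always fails. The "insert or ignore" form is
SQLite syntax that skips a row when its primary key already exists.
</p>

```go
package main

import (
	"database/sql"
	"errors"
	"fmt"
)

// execer is the single method the helper needs; *sql.DB satisfies it.
type execer interface {
	Exec(query string, args ...any) (sql.Result, error)
}

// insertURL is a hypothetical variant of db_insert_url that reports failures
// to the caller instead of silently dropping them.
func insertURL(db execer, url string, seen bool) error {
	_, err := db.Exec(`insert or ignore into urls values (?, ?)`, url, seen)
	if err != nil {
		return fmt.Errorf("inserting %q: %w", url, err)
	}
	return nil
}

// brokenDB is a stand-in used only to show the error path.
type brokenDB struct{}

func (brokenDB) Exec(string, ...any) (sql.Result, error) {
	return nil, errors.New("database is locked")
}

func main() {
	if err := insertURL(brokenDB{}, "https://squi.bid/", false); err != nil {
		fmt.Println("insert failed:", err)
	}
}
```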
<p>
Now that we've gotten all the easy stuff out of the way it's time to work
on making this run forever, or close to it at least. For now I'm going to
keep this project in an inefficient state, so we're not going to use
worker pools or anything fancy like that. To get started we first need
to make a decision: depth-first or breadth-first searching? In case you're
not sure what I mean by this, let me give you an example:
</p>
<p>
Let's say we have site example-a.com which contains the following links:
</p>
<ul>
<li>example-a.com/blog</li>
<li>example-b.com</li>
<li>example-c.com</li>
</ul>
<p>
With a breadth-first search we would first go to either example-b.com or
example-c.com, whereas with a depth-first search we would go with
example-a.com/blog. For my use case I want to find as many sites as
possible, so I will be targeting sites with other base urls.
</p>
<p>
Now that we know how we want to decide the next url to fetch let's impl
the loop which handles this.
</p>
<pre>
<span class="Repeat"><span class="Keyword">for</span></span> <span class="-variable">i</span> <span class="Operator">:=</span> <span class="Number"><span class="Number">0</span></span><span class="-punctuation-delimiter">;</span><span class="-punctuation-delimiter">;</span> <span class="-variable">i</span><span class="Operator">++</span> <span class="-punctuation-bracket">{</span>
<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">i</span> <span class="Operator">&gt;</span> <span class="Number"><span class="Number">0</span></span> <span class="-punctuation-bracket">{</span>
<span class="-variable">rows</span><span class="-punctuation-delimiter">,</span> <span class="-variable">err</span> <span class="Operator">:=</span> <span class="-variable">db</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Query</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String">`select url from urls where indexed is false`</span></span><span class="-punctuation-bracket">)</span>
<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">err</span> <span class="Operator">!=</span> <span class="Constant"><span class="-constant-builtin">nil</span></span> <span class="-punctuation-bracket">{</span>
<span class="Statement"><span class="Keyword">return</span></span>
<span class="-punctuation-bracket">}</span>
<span class="Repeat"><span class="Keyword">for</span></span> <span class="-variable">rows</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Next</span></span><span class="-punctuation-bracket">(</span><span class="-punctuation-bracket">)</span> <span class="-punctuation-bracket">{</span>
<span class="Keyword"><span class="Keyword">var</span></span><span class="goSingleDecl"> </span><span class="-variable">test</span> <span class="Type"><span class="Type"><span class="-type-builtin">string</span></span></span>
<span class="-variable">rows</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Scan</span></span><span class="-punctuation-bracket">(</span><span class="Operator">&amp;</span><span class="-variable">test</span><span class="-punctuation-bracket">)</span>
<span class="-variable">site</span> <span class="Operator">=</span> <span class="-variable">test</span>
<span class="Comment"><span class="Comment"><span class="-spell">/* we can't just check if the site is the same because then when we're</span></span><span class="-spell">
<span class="Comment"> * checking squi.bid/example it won't register squi.bid as the same</span>
<span class="Comment"> * domain, although maybe that's what we want.</span>
<span class="Comment"> */</span></span><span class="Comment"></span></span>
<span class="Conditional"><span class="Keyword">if</span></span> <span class="Operator">!</span><span class="-variable">strings</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Contains</span></span><span class="-punctuation-bracket">(</span><span class="-variable">test</span><span class="-punctuation-delimiter">,</span> <span class="-variable">site</span><span class="-punctuation-bracket">)</span> <span class="-punctuation-bracket">{</span>
<span class="Statement"><span class="Keyword">break</span></span>
<span class="-punctuation-bracket">}</span>
<span class="-punctuation-bracket">}</span>
<span class="-variable">rows</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Close</span></span><span class="-punctuation-bracket">(</span><span class="-punctuation-bracket">)</span>
<span class="-punctuation-bracket">}</span>
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String"><span class="-spell">&quot;fetching &quot;</span></span></span> <span class="Operator">+</span> <span class="-variable">site</span><span class="-punctuation-bracket">)</span>
<span class="-variable">resp</span><span class="-punctuation-delimiter">,</span> <span class="-variable">err</span> <span class="Operator">:=</span> <span class="-variable">http</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Get</span></span><span class="-punctuation-bracket">(</span><span class="-variable">site</span><span class="-punctuation-bracket">)</span>
<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">err</span> <span class="Operator">!=</span> <span class="Constant"><span class="-constant-builtin">nil</span></span> <span class="-punctuation-bracket">{</span>
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String"><span class="-spell">&quot;Error getting&quot;</span></span></span><span class="-punctuation-delimiter">,</span> <span class="-variable">site</span><span class="-punctuation-bracket">)</span>
<span class="-variable">os</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Exit</span></span><span class="-punctuation-bracket">(</span><span class="Number"><span class="Number">1</span></span><span class="-punctuation-bracket">)</span>
<span class="-punctuation-bracket">}</span>
<span class="-variable"><span class="Function">deal_html</span></span><span class="-punctuation-bracket">(</span><span class="-variable">site</span><span class="-punctuation-delimiter">,</span> <span class="-variable">resp</span><span class="-punctuation-delimiter">.</span><span class="-property">Body</span><span class="-punctuation-bracket">)</span>
<span class="-variable">resp</span><span class="-punctuation-delimiter">.</span><span class="-property">Body</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Close</span></span><span class="-punctuation-bracket">(</span><span class="-punctuation-bracket">)</span>
<span class="-punctuation-bracket">}</span>
</pre>
<p>
If you read through my code you might've seen the comment about how our
check doesn't actually prevent accessing the same site. The solution I'm
currently thinking of is to add a column to the db which keeps the
highest point in the site. For example, squi.bid/example/1/2/3/4 would
have a highest point of squi.bid. This isn't something I'm too concerned
about right now, so we'll just leave it as is and deal with
another issue you might've spotted.
</p>
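<p>
One way the "highest point" could be computed is with net/url, sketched
below. The hostOf and sameSite helpers are hypothetical names, not
something the crawler has yet, and the sketch assumes every stored url
includes its scheme (without one, net/url treats the whole string as a
path and the host comes back empty).
</p>

```go
package main

import (
	"fmt"
	"net/url"
)

// hostOf returns the host part of a URL, for example "squi.bid" for
// "https://squi.bid/example/1/2/3/4". hostOf is a hypothetical helper.
func hostOf(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	return u.Hostname(), nil
}

// sameSite reports whether two URLs share a (non-empty) host.
func sameSite(a, b string) bool {
	hostA, errA := hostOf(a)
	hostB, errB := hostOf(b)
	return errA == nil && errB == nil && hostA != "" && hostA == hostB
}

func main() {
	fmt.Println(sameSite("https://squi.bid/example/1/2/3/4", "https://squi.bid/"))
	fmt.Println(sameSite("https://squi.bid/", "https://eggbert.xyz/"))
}
```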
<p>
We don't modify the db: after fetching a site successfully, at no point do
we actually record that we fetched it. So whenever we try to fetch
a new site the program will query the db and find the same route as
before. Thankfully this is a simple fix which just takes adding this line
right after where we index a new site:
</p>
<pre>
<span class="-variable">db</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Exec</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String">`update urls set indexed = true where url == ?`</span></span><span class="-punctuation-delimiter">,</span> <span class="-variable">site</span><span class="-punctuation-bracket">)</span>
</pre>
<p>
Remember when I referenced section 3 part 4 of the software engineer's
manual? Well I regret it:
</p>
<pre>
fetching https://squi.bid/
fetching https://github.com/EggbertFluffle/signup?ref_cta=Sign+up&ref_loc=header+logged+out&ref_page=%2F%3Cuser-name%3E&source=header
panic: runtime error: slice bounds out of range [:1] with length 0
goroutine 1 [running]:
main.deal_html.func1(0xc000315b90)
/home/squibid/documents/coding/go/scraper/main.go:40 +0x2fe
main.deal_html.func1(0x0?)
/home/squibid/documents/coding/go/scraper/main.go:55 +0x83
main.deal_html.func1(0xc0002dd110?)
/home/squibid/documents/coding/go/scraper/main.go:55 +0x83
main.deal_html.func1(0x0?)
/home/squibid/documents/coding/go/scraper/main.go:55 +0x83
main.deal_html.func1(0xc00032e000?)
/home/squibid/documents/coding/go/scraper/main.go:55 +0x83
main.deal_html.func1(0x0?)
/home/squibid/documents/coding/go/scraper/main.go:55 +0x83
main.deal_html.func1(0x8c36e0?)
/home/squibid/documents/coding/go/scraper/main.go:55 +0x83
main.deal_html.func1(0x7fd180fd9db8?)
/home/squibid/documents/coding/go/scraper/main.go:55 +0x83
main.deal_html({0xc00002a280, 0x7c}, {0x7fd180fd9db8?, 0xc0004bd200?})
/home/squibid/documents/coding/go/scraper/main.go:58 +0x11e
main.main()
/home/squibid/documents/coding/go/scraper/main.go:105 +0x12d
exit status 2
</pre>
<p>
Turns out we need some stability if we want to actually use the code.
This, like most bugs, is another simple fix: guard the
url handler with a call to len to make sure we're not doing anything
stupid on empty strings.
</p>
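<p>
A minimal sketch of what that guard could look like, assuming the same
prefix rules as the checks earlier in the post. normalizeHref is a made-up
name for illustration, not the function from the repo, and it swaps the
manual slicing for strings.HasPrefix, which is safe on empty strings:
</p>

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeHref is a hypothetical helper (not taken from the crawler's repo).
// It roughly mirrors the prefix checks from earlier in the post and returns
// the absolute URL plus whether the href is worth keeping.
func normalizeHref(site, href string) (string, bool) {
	switch {
	case href == "":
		// An empty href is what "slice bounds out of range [:1] with
		// length 0" points at, so skip it explicitly.
		return "", false
	case strings.HasPrefix(href, "//"):
		return "https:" + href, true // protocol-relative link
	case strings.HasPrefix(href, "/"):
		// Avoid a double slash when site already ends in "/".
		return strings.TrimSuffix(site, "/") + href, true
	case strings.HasPrefix(href, "http"):
		return href, true
	default:
		return "", false // mailto:, #fragments and anything else
	}
}

func main() {
	hrefs := []string{"", "/blog/rss.xml", "//example.com/x", "mailto:me@example.com"}
	for _, href := range hrefs {
		u, ok := normalizeHref("https://squi.bid/", href)
		fmt.Printf("%q -> %q %v\n", href, u, ok)
	}
}
```

<p>
strings.HasPrefix compares lengths before it looks at any bytes, so there
is nothing left to slice out of range.
</p>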
<p>
And now it works! With a small exception, but here's a clean run showing
my lil web crawler doing its thing... and failing pretty fast.
</p>
<pre>
fetching https://squi.bid/
fetching https://eggbert.xyz/
fetching https://www.linkedin.com/in/harrison-diambrosio-505443229/
fetching https://github.com/EggbertFluffle/
fetching https://support.github.com?tags=dotcom-footer
fetching https://docs.github.com/
fetching https://services.github.com
fetching https://github.com/github
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&type=repositories
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&type=repositories/git-guides
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&type=repositories/git-guides/git-guides
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&type=repositories/git-guides/git-guides/git-guides
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&type=repositories/git-guides/git-guides/git-guides/git-guides
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&type=repositories/git-guides/git-guides/git-guides/git-guides/git-guides
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&type=repositories/git-guides/git-guides/git-guides/git-guides/git-guides/git-guides
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&type=repositories/git-guides/git-guides/git-guides/git-guides/git-guides/git-guides/git-guides
...
</pre>
<p>
I'm sure you can use your imagination to figure out how long that was
going to go on for. This bug would be partially fixed by switching which
site we're searching, but ultimately we won't make it far if we keep
falling for these redirections that just keep going. For now that's fine
though, and I have a semi-working web crawler. All code can be found
here:
<a href="https://git.squi.bid/squibid/web-crawler">git.squi.bid/squibid/web-crawler</a>.
Thank you for reading, I'll probably write a follow-up when I find some
time.
</p>
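<p>
One possible way out, sketched under an assumption: the ever-growing URLs
in the log above look like what you'd get from gluing each relative href
onto the full URL of the current page, query string and all. That's a
guess from the output, not something checked against the repo. If it
holds, resolving hrefs with Go's net/url package instead of string
concatenation might avoid it. resolveHref is a hypothetical name, and the
"/git-guides" href is inferred from the log:
</p>

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveHref is a hypothetical helper: it resolves href against the page
// it was found on using the standard relative-reference rules, instead of
// appending the href to the page URL as a string.
func resolveHref(page, href string) (string, error) {
	base, err := url.Parse(page)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

func main() {
	page := "https://github.com/github/search?q=topic%3Aactions&type=repositories"
	next, err := resolveHref(page, "/git-guides")
	if err != nil {
		fmt.Println("bad url:", err)
		return
	}
	fmt.Println(next) // https://github.com/git-guides
}
```

<p>
Resolving the same href from the resulting page gives the same URL again,
so the primary key on the urls table would catch the repeat.
</p>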
<p id="footnote-1">
[1] Yes, I'm just including more links to make my site a good starting
point. Why do you ask?
</p>
</body>
</html>

View File

@@ -11,6 +11,336 @@
<!-- LB --> <!-- LB -->
<item>
<title> 'Writing my own web crawler in go'</title>
<guid>https://squi.bid/blog/Writing-my-own-web-crawler-in-go/index.html</guid>
<link>https://squi.bid/blog/Writing-my-own-web-crawler-in-go/index.html</link>
<pubDate>Sat, 23 Aug 2025 20:25:26 -0400</pubDate>
<description><![CDATA[<!DOCTYPE HTML>
<html lang="en">
<title>'Writing my own web crawler in go'</title>
<meta name="date" content="2025/08/21">
<link rel="stylesheet" href="/style.css">
<style>
.-spell {}
.Number {color: #e29eca}
.Function {color: #c1c0d4}
.String {color: #90b99f}
.Character {color: #90b99f}
.PreCondit {color: #ea83a5}
.-property {color: #c1c0d4}
.Comment {color: #757581; font-style: italic}
.Macro {color: #ea83a5}
.-type-builtin {color: #e29eca}
.Type {color: #b9aeda}
.Keyword {color: #aca1cf}
.Constant {color: #ea83a5}
.-variable-parameter {color: #e29eca}
.-punctuation-delimiter {color: #9998a8}
.-punctuation-bracket {color: #9998a8}
.Include {color: #aca1cf}
.Structure {color: #e6b99d}
.-variable {color: #c9c7cd}
.Operator {color: #e6b99d}
</style>
<body id="blog">
<h1>Writing my own web crawler in go</h1>
<p>
I got bored, it happens to everyone (especially software developers).
So like every software developer I've started a new side-project: a web
crawler. This isn't for any actual use case, I kinda just wanna learn how
to use Go and (hopefully) sharpen my SQL skills in the process. At this
point in the writing process I've just started, so let me show you what
I've currently gotten working.
</p>
<p>
To start I'm just searching through the hrefs of all &lt;a&gt; tags on a
site and printing them out. On my site that looks like this:
</p>
<pre>
/
mailto:me@zacharyscheiman.com
https://github.com/squibid
https://codeberg.org/squibid
https://git.squi.bid/squibid/wiz
https://git.squi.bid/squibid
/blog/rss.xml
/blog/New-Keyboard!
/blog/Serializing-data-in-C
/blog/Why-"suckless"-software-is-important
/blog/What-is-a-squibid
/blog/librex-and-dots
/?all_blog
https://lunarflame.dev
https://eggbert.xyz/
</pre>
<p>
Here's the code which fetched this list:
</p>
<pre>
<span class="Statement"><span class="Keyword">package</span></span> <span class="Structure">main</span>
<span class="Statement"><span class="Keyword">import</span></span> <span class="-punctuation-bracket">(</span>
<span class="String"><span class="String">&quot;fmt&quot;</span></span>
<span class="String"><span class="String">&quot;io&quot;</span></span>
<span class="String"><span class="String">&quot;net/http&quot;</span></span>
<span class="String"><span class="String">&quot;golang.org/x/net/html&quot;</span></span>
<span class="-punctuation-bracket">)</span>
<span class="Keyword"><span class="-keyword-function">func</span></span> <span class="-variable"><span class="Function">deal_html</span></span><span class="-punctuation-bracket">(</span><span class="-variable"><span class="-variable-parameter"><span class="DiagnosticUnderlineInfo">site</span></span><span class="DiagnosticUnderlineInfo"></span></span><span class="DiagnosticUnderlineInfo"> <span class="Type"><span class="Type"><span class="-type-builtin">string</span></span></span></span><span class="Type"><span class="Type"><span class="-type-builtin"></span></span></span><span class="-punctuation-delimiter">,</span> <span class="-variable"><span class="-variable-parameter">reader</span></span> <span class="Structure">io</span><span class="-punctuation-delimiter">.</span><span class="Type">Reader</span><span class="-punctuation-bracket">)</span> <span class="-punctuation-bracket">{</span>
<span class="-variable">doc</span><span class="-punctuation-delimiter">,</span> <span class="-variable">err</span> <span class="Operator">:=</span> <span class="-variable">html</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Parse</span></span><span class="-punctuation-bracket">(</span><span class="-variable">reader</span><span class="-punctuation-bracket">)</span>
<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">err</span> <span class="Operator">!=</span> <span class="Constant"><span class="-constant-builtin">nil</span></span> <span class="-punctuation-bracket">{</span>
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String"><span class="-spell">&quot;Error parsing HTML:&quot;</span></span></span><span class="-punctuation-delimiter">,</span> <span class="-variable">err</span><span class="-punctuation-bracket">)</span>
<span class="Statement"><span class="Keyword">return</span></span>
<span class="-punctuation-bracket">}</span>
<span class="Keyword"><span class="Keyword">var</span></span><span class="goSingleDecl"> </span><span class="-variable">walk</span> <span class="Keyword"><span class="-keyword-function">func</span></span><span class="-punctuation-bracket">(</span><span class="-variable"><span class="-variable-parameter">n</span></span> <span class="Operator">*</span><span class="Structure">html</span><span class="-punctuation-delimiter">.</span><span class="Type">Node</span><span class="-punctuation-bracket">)</span>
<span class="-variable">walk</span> <span class="Operator">=</span> <span class="Keyword"><span class="-keyword-function">func</span></span><span class="-punctuation-bracket">(</span><span class="-variable"><span class="-variable-parameter">n</span></span> <span class="Operator">*</span><span class="Structure">html</span><span class="-punctuation-delimiter">.</span><span class="Type">Node</span><span class="-punctuation-bracket">)</span> <span class="-punctuation-bracket">{</span>
<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">n</span><span class="-punctuation-delimiter">.</span><span class="-property">Type</span> <span class="Operator">==</span> <span class="-variable">html</span><span class="-punctuation-delimiter">.</span><span class="-property">ElementNode</span> <span class="Operator">&amp;&amp;</span> <span class="-variable">n</span><span class="-punctuation-delimiter">.</span><span class="-property">Data</span> <span class="Operator">==</span> <span class="String"><span class="String"><span class="-spell">&quot;a&quot;</span></span></span> <span class="-punctuation-bracket">{</span>
<span class="Repeat"><span class="Keyword">for</span></span> <span class="-variable">_</span><span class="-punctuation-delimiter">,</span> <span class="-variable">attr</span> <span class="Operator">:=</span> <span class="Repeat"><span class="Keyword">range</span></span> <span class="-variable">n</span><span class="-punctuation-delimiter">.</span><span class="-property">Attr</span> <span class="-punctuation-bracket">{</span>
<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Key</span> <span class="Operator">==</span> <span class="String"><span class="String"><span class="-spell">&quot;href&quot;</span></span></span> <span class="-punctuation-bracket">{</span>
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">)</span>
<span class="-punctuation-bracket">}</span>
<span class="-punctuation-bracket">}</span>
<span class="-punctuation-bracket">}</span>
<span class="Repeat"><span class="Keyword">for</span></span> <span class="-variable">c</span> <span class="Operator">:=</span> <span class="-variable">n</span><span class="-punctuation-delimiter">.</span><span class="-property">FirstChild</span><span class="-punctuation-delimiter">;</span> <span class="-variable">c</span> <span class="Operator">!=</span> <span class="Constant"><span class="-constant-builtin">nil</span></span><span class="-punctuation-delimiter">;</span> <span class="-variable">c</span> <span class="Operator">=</span> <span class="-variable">c</span><span class="-punctuation-delimiter">.</span><span class="-property">NextSibling</span> <span class="-punctuation-bracket">{</span>
<span class="-variable"><span class="Function">walk</span></span><span class="-punctuation-bracket">(</span><span class="-variable">c</span><span class="-punctuation-bracket">)</span>
<span class="-punctuation-bracket">}</span>
<span class="-punctuation-bracket">}</span>
<span class="-variable"><span class="Function">walk</span></span><span class="-punctuation-bracket">(</span><span class="-variable">doc</span><span class="-punctuation-bracket">)</span>
<span class="-punctuation-bracket">}</span>
<span class="Keyword"><span class="-keyword-function">func</span></span> <span class="-variable"><span class="Function">main</span></span><span class="-punctuation-bracket">(</span><span class="-punctuation-bracket">)</span> <span class="-punctuation-bracket">{</span>
<span class="-variable">site</span> <span class="Operator">:=</span> <span class="String"><span class="String"><span class="-spell">&quot;https://squi.bid/&quot;</span></span></span>
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String"><span class="-spell">&quot;fetching &quot;</span></span></span> <span class="Operator">+</span> <span class="-variable">site</span><span class="-punctuation-bracket">)</span>
<span class="-variable">resp</span><span class="-punctuation-delimiter">,</span> <span class="-variable">err</span> <span class="Operator">:=</span> <span class="-variable">http</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Get</span></span><span class="-punctuation-bracket">(</span><span class="-variable">site</span><span class="-punctuation-bracket">)</span>
<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">err</span> <span class="Operator">!=</span> <span class="Constant"><span class="-constant-builtin">nil</span></span> <span class="-punctuation-bracket">{</span>
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String"><span class="-spell">&quot;Error getting the website&quot;</span></span></span><span class="-punctuation-bracket">)</span>
<span class="Statement"><span class="Keyword">return</span></span>
<span class="-punctuation-bracket">}</span>
<span class="Statement"><span class="Keyword">defer</span></span> <span class="-variable">resp</span><span class="-punctuation-delimiter">.</span><span class="-property">Body</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Close</span></span><span class="-punctuation-bracket">(</span><span class="-punctuation-bracket">)</span>
<span class="-variable"><span class="Function">deal_html</span></span><span class="-punctuation-bracket">(</span><span class="-variable">site</span><span class="-punctuation-delimiter">,</span> <span class="-variable">resp</span><span class="-punctuation-delimiter">.</span><span class="-property">Body</span><span class="-punctuation-bracket">)</span>
<span class="-punctuation-bracket">}</span>
</pre>
<p>
After taking a short look at this output, it's clear we need to handle
multiple different "url formats" like the mailto: and / links. For right
now I'm going to respect people's privacy and not index their email
addresses.
</p>
<pre>
<span class="Conditional"><span class="Keyword">if</span></span> <span class="Identifier"><span class="-variable"><span class="Function"><span class="Special">len</span></span></span></span><span class="-punctuation-bracket">(</span><span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">)</span> <span class="Operator">&gt;</span> <span class="Number"><span class="Number">1</span></span> <span class="Operator">&amp;&amp;</span> <span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">[</span><span class="-punctuation-delimiter">:</span><span class="Number"><span class="Number">2</span></span><span class="-punctuation-bracket">]</span> <span class="Operator">==</span> <span class="String"><span class="String"><span class="-spell">&quot;//&quot;</span></span></span> <span class="-punctuation-bracket">{</span>
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String"><span class="-spell">&quot;https:&quot;</span></span></span> <span class="Operator">+</span> <span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">)</span>
<span class="-punctuation-bracket">}</span> <span class="Conditional"><span class="Keyword">else</span></span> <span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">[</span><span class="-punctuation-delimiter">:</span><span class="Number"><span class="Number">1</span></span><span class="-punctuation-bracket">]</span> <span class="Operator">==</span> <span class="String"><span class="String"><span class="-spell">&quot;/&quot;</span></span></span> <span class="-punctuation-bracket">{</span>
<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">site</span><span class="-punctuation-bracket">[</span><span class="Identifier"><span class="-variable"><span class="Function"><span class="Special">len</span></span></span></span><span class="-punctuation-bracket">(</span><span class="-variable">site</span><span class="-punctuation-bracket">)</span> <span class="Operator">-</span> <span class="Number"><span class="Number">1</span></span><span class="-punctuation-delimiter">:</span><span class="-punctuation-bracket">]</span> <span class="Operator">==</span> <span class="String"><span class="String"><span class="-spell">&quot;/&quot;</span></span></span> <span class="-punctuation-bracket">{</span>
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="-variable">site</span> <span class="Operator">+</span> <span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">[</span><span class="Number"><span class="Number">1</span></span><span class="-punctuation-delimiter">:</span><span class="-punctuation-bracket">]</span><span class="-punctuation-bracket">)</span>
<span class="-punctuation-bracket">}</span> <span class="Conditional"><span class="Keyword">else</span></span> <span class="-punctuation-bracket">{</span>
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="-variable">site</span> <span class="Operator">+</span> <span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">)</span>
<span class="-punctuation-bracket">}</span>
<span class="-punctuation-bracket">}</span> <span class="Conditional"><span class="Keyword">else</span></span> <span class="Conditional"><span class="Keyword">if</span></span> <span class="Identifier"><span class="-variable"><span class="Function"><span class="Special">len</span></span></span></span><span class="-punctuation-bracket">(</span><span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">)</span> <span class="Operator">&gt;</span> <span class="Number"><span class="Number">4</span></span> <span class="Operator">&amp;&amp;</span> <span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">[</span><span class="-punctuation-delimiter">:</span><span class="Number"><span class="Number">4</span></span><span class="-punctuation-bracket">]</span> <span class="Operator">==</span> <span class="String"><span class="String"><span class="-spell">&quot;http&quot;</span></span></span> <span class="-punctuation-bracket">{</span>
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">)</span>
<span class="-punctuation-bracket">}</span>
</pre>
<p>
Now that we've gotten the actual links from the website, it's time to
store them and get the links from their sites too. For now I've decided
that because this is already just a toy project I will not be storing all
the info I would if this were a real project. Instead I will only be
storing the link to the site, and a boolean representing whether I've
fetched its contents yet. So, let's go impl...
</p>
<p>
Step #1 is to find a SQL library. I just went with Go's built-in
database/sql package, and then visited
<a href="https://golang.org/s/sqldrivers">golang.org/s/sqldrivers</a>
and decided on
<a href="https://github.com/mattn/go-sqlite3">github.com/mattn/go-sqlite3</a>
because sqlite is a name I'm familiar with, and I really don't want to go
through the hassle of looking into different dbs for a toy project.
<a href="#footnote-1">[1]</a>
</p>
<p>
Now that we've chosen our db I'll set up our table like I mentioned
earlier, with one string and one boolean:
</p>
<pre>
<span class="-variable">db</span><span class="-punctuation-delimiter">,</span> <span class="-variable">err</span> <span class="Operator">=</span> <span class="-variable">sql</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Open</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String"><span class="-spell">&quot;sqlite3&quot;</span></span></span><span class="-punctuation-delimiter">,</span> <span class="String"><span class="String"><span class="-spell">&quot;./sites.db&quot;</span></span></span><span class="-punctuation-bracket">)</span>
<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">err</span> <span class="Operator">!=</span> <span class="Constant"><span class="-constant-builtin">nil</span></span> <span class="-punctuation-bracket">{</span>
<span class="-variable">log</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Fatal</span></span><span class="-punctuation-bracket">(</span><span class="-variable">err</span><span class="-punctuation-bracket">)</span>
<span class="-punctuation-bracket">}</span>
<span class="Statement"><span class="Keyword">defer</span></span> <span class="-variable">db</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Close</span></span><span class="-punctuation-bracket">(</span><span class="-punctuation-bracket">)</span>
<span class="-variable">db</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Exec</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String">`</span>
<span class="String"> create table if not exists</span>
<span class="String"> urls (url text not null primary key, indexed boolean not null);</span>
<span class="String"> `</span></span><span class="-punctuation-bracket">)</span>
</pre>
<p>
Now we need to start adding entries to the db. To do this I wanted to
ensure I wouldn't end up shooting myself in the foot, so I decided to go
with a tiny function to make it a teensy bit safer:
</p>
<pre>
<span class="Keyword"><span class="-keyword-function">func</span></span> <span class="-variable"><span class="Function">db_insert_url</span></span><span class="-punctuation-bracket">(</span><span class="-variable"><span class="-variable-parameter">url</span></span> <span class="Type"><span class="Type"><span class="-type-builtin">string</span></span></span><span class="-punctuation-delimiter">,</span> <span class="-variable"><span class="-variable-parameter">seen</span></span> <span class="Type"><span class="Type"><span class="-type-builtin">bool</span></span></span><span class="-punctuation-bracket">)</span> <span class="-punctuation-bracket">{</span>
<span class="-variable">db</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Exec</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String">`insert into urls values (?, ?)`</span></span><span class="-punctuation-delimiter">,</span> <span class="-variable">url</span><span class="-punctuation-delimiter">,</span> <span class="-variable">seen</span><span class="-punctuation-bracket">)</span>
<span class="-punctuation-bracket">}</span>
</pre>
<p>
It could use some error handling, but if you look at section 3 part 4 of
the software engineer's manual it reads "side projects aren't stable
because if it's stable it's not a side project".
</p>
<p>
Now that we've gotten all the easy stuff out of the way it's time to work
on making this run forever, or close to it at least. For now I'm going to
keep this project in an inefficient state and we're not going to use any
worker pools or anything fancy like that. To get started we first need
to make a decision: depth or breadth first searching? In case you're not
sure what I mean by this let me give you an example:
</p>
<p>
Let's say we have site example-a.com which contains the following links:
</p>
<ul>
<li>example-a.com/blog</li>
<li>example-b.com</li>
<li>example-c.com</li>
</ul>
<p>
With a breadth first search we would first go to either example-b or
example-c, whereas with a depth first search we would go with
example-a.com/blog. For my use case I want to find as many sites as
possible, therefore I will be targeting sites with other base urls.
</p>
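<p>
A rough sketch of that selection rule on the example above, assuming
"other base url" means "different host". pickNext and hostOf are
hypothetical names used only for illustration, and the https:// prefixes
are added so net/url can pick the host out:
</p>

```go
package main

import (
	"fmt"
	"net/url"
)

// hostOf returns the host part of a URL, or "" if it can't be parsed.
func hostOf(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		return ""
	}
	return u.Hostname()
}

// pickNext is a hypothetical helper: it returns the first candidate whose
// host differs from the current page's host, and falls back to the first
// candidate when every link stays on the same site.
func pickNext(current string, candidates []string) (string, bool) {
	if len(candidates) == 0 {
		return "", false
	}
	for _, c := range candidates {
		if hostOf(c) != hostOf(current) {
			return c, true
		}
	}
	return candidates[0], true
}

func main() {
	links := []string{
		"https://example-a.com/blog",
		"https://example-b.com",
		"https://example-c.com",
	}
	next, _ := pickNext("https://example-a.com", links)
	fmt.Println(next) // https://example-b.com
}
```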
<p>
Now that we know how we want to decide the next url to fetch let's impl
the loop which handles this.
</p>
<pre>
<span class="Repeat"><span class="Keyword">for</span></span> <span class="-variable">i</span> <span class="Operator">:=</span> <span class="Number"><span class="Number">0</span></span><span class="-punctuation-delimiter">;</span><span class="-punctuation-delimiter">;</span> <span class="-variable">i</span><span class="Operator">++</span> <span class="-punctuation-bracket">{</span>
<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">i</span> <span class="Operator">&gt;</span> <span class="Number"><span class="Number">0</span></span> <span class="-punctuation-bracket">{</span>
<span class="-variable">rows</span><span class="-punctuation-delimiter">,</span> <span class="-variable">err</span> <span class="Operator">:=</span> <span class="-variable">db</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Query</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String">`select url from urls where indexed is false`</span></span><span class="-punctuation-bracket">)</span>
<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">err</span> <span class="Operator">!=</span> <span class="Constant"><span class="-constant-builtin">nil</span></span> <span class="-punctuation-bracket">{</span>
<span class="Statement"><span class="Keyword">return</span></span>
<span class="-punctuation-bracket">}</span>
<span class="Repeat"><span class="Keyword">for</span></span> <span class="-variable">rows</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Next</span></span><span class="-punctuation-bracket">(</span><span class="-punctuation-bracket">)</span> <span class="-punctuation-bracket">{</span>
<span class="Keyword"><span class="Keyword">var</span></span><span class="goSingleDecl"> </span><span class="-variable">test</span> <span class="Type"><span class="Type"><span class="-type-builtin">string</span></span></span>
<span class="-variable">rows</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Scan</span></span><span class="-punctuation-bracket">(</span><span class="Operator">&amp;</span><span class="-variable">test</span><span class="-punctuation-bracket">)</span>
<span class="-variable">site</span> <span class="Operator">=</span> <span class="-variable">test</span>
<span class="Comment"><span class="Comment"><span class="-spell">/* we can't just check if the site is the same because then when we're</span></span><span class="-spell">
<span class="Comment"> * checking squi.bid/example it won't register squi.bid as the same</span>
<span class="Comment"> * domain, although maybe that's what we want.</span>
<span class="Comment"> */</span></span><span class="Comment"></span></span>
<span class="Conditional"><span class="Keyword">if</span></span> <span class="Operator">!</span><span class="-variable">strings</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Contains</span></span><span class="-punctuation-bracket">(</span><span class="-variable">test</span><span class="-punctuation-delimiter">,</span> <span class="-variable">site</span><span class="-punctuation-bracket">)</span> <span class="-punctuation-bracket">{</span>
<span class="Statement"><span class="Keyword">break</span></span>
<span class="-punctuation-bracket">}</span>
<span class="-punctuation-bracket">}</span>
<span class="-variable">rows</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Close</span></span><span class="-punctuation-bracket">(</span><span class="-punctuation-bracket">)</span>
<span class="-punctuation-bracket">}</span>
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String"><span class="-spell">&quot;fetching &quot;</span></span></span> <span class="Operator">+</span> <span class="-variable">site</span><span class="-punctuation-bracket">)</span>
<span class="-variable">resp</span><span class="-punctuation-delimiter">,</span> <span class="-variable">err</span> <span class="Operator">:=</span> <span class="-variable">http</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Get</span></span><span class="-punctuation-bracket">(</span><span class="-variable">site</span><span class="-punctuation-bracket">)</span>
<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">err</span> <span class="Operator">!=</span> <span class="Constant"><span class="-constant-builtin">nil</span></span> <span class="-punctuation-bracket">{</span>
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String"><span class="-spell">&quot;Error getting&quot;</span></span></span><span class="-punctuation-delimiter">,</span> <span class="-variable">site</span><span class="-punctuation-bracket">)</span>
<span class="-variable">os</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Exit</span></span><span class="-punctuation-bracket">(</span><span class="Number"><span class="Number">1</span></span><span class="-punctuation-bracket">)</span>
<span class="-punctuation-bracket">}</span>
<span class="-variable"><span class="Function">deal_html</span></span><span class="-punctuation-bracket">(</span><span class="-variable">site</span><span class="-punctuation-delimiter">,</span> <span class="-variable">resp</span><span class="-punctuation-delimiter">.</span><span class="-property">Body</span><span class="-punctuation-bracket">)</span>
<span class="-variable">resp</span><span class="-punctuation-delimiter">.</span><span class="-property">Body</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Close</span></span><span class="-punctuation-bracket">(</span><span class="-punctuation-bracket">)</span>
<span class="-punctuation-bracket">}</span>
</pre>
<p>
If you read through my code you might've seen the comment about how our
check doesn't actually prevent accessing the same site. The solution I'm
currently thinking of is to add a column to the db which keeps the
highest point in the site. For example, squi.bid/example/1/2/3/4 would
have a highest point of squi.bid. But currently this isn't something I'm
too concerned about, so for now we'll just leave it as is and deal with
another issue you might've spotted.
</p>
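<p>
If I do end up adding that column, pulling the highest point out of a link
could look roughly like the sketch below. It leans on Go's net/url package.
The function name highestPoint is made up for this example and isn't in the
crawler, and it only makes sense for http(s) links, not things like mailto:.
</p>

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// highestPoint returns the host part of a link, e.g. "squi.bid" for
// "squi.bid/example/1/2/3/4". Hypothetical helper, not part of the crawler.
func highestPoint(link string) (string, error) {
	// url.Parse only fills in the host when the link has a scheme, so
	// assume https if the link was written without one.
	if !strings.Contains(link, "://") {
		link = "https://" + link
	}
	u, err := url.Parse(link)
	if err != nil {
		return "", err
	}
	// Hostname drops any port, leaving just the host name.
	return u.Hostname(), nil
}

func main() {
	host, err := highestPoint("squi.bid/example/1/2/3/4")
	if err != nil {
		fmt.Println("Error parsing link:", err)
		return
	}
	fmt.Println(host)
}
```

<p>
Storing that value next to each url would let the check compare hosts
instead of full urls, at the cost of one extra column.
</p>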
<p>
We don't modify the db. After fetching a site successfully, at no point do
we actually record that we fetched it. So whenever we try to fetch a new
site, the program will query the db and find the same route as before.
Thankfully this is a simple fix which just takes adding this line right
after where we index a new site:
</p>
<pre>
<span class="-variable">db</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Exec</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String">`update urls set indexed = true where url == ?`</span></span><span class="-punctuation-delimiter">,</span> <span class="-variable">site</span><span class="-punctuation-bracket">)</span>
</pre>
<p>
Remember when I referenced section 3 part 4 of the software engineer's
manual? Well, I regret it:
</p>
<pre>
fetching https://squi.bid/
fetching https://github.com/EggbertFluffle/signup?ref_cta=Sign+up&ref_loc=header+logged+out&ref_page=%2F%3Cuser-name%3E&source=header
panic: runtime error: slice bounds out of range [:1] with length 0
goroutine 1 [running]:
main.deal_html.func1(0xc000315b90)
/home/squibid/documents/coding/go/scraper/main.go:40 +0x2fe
main.deal_html.func1(0x0?)
/home/squibid/documents/coding/go/scraper/main.go:55 +0x83
main.deal_html.func1(0xc0002dd110?)
/home/squibid/documents/coding/go/scraper/main.go:55 +0x83
main.deal_html.func1(0x0?)
/home/squibid/documents/coding/go/scraper/main.go:55 +0x83
main.deal_html.func1(0xc00032e000?)
/home/squibid/documents/coding/go/scraper/main.go:55 +0x83
main.deal_html.func1(0x0?)
/home/squibid/documents/coding/go/scraper/main.go:55 +0x83
main.deal_html.func1(0x8c36e0?)
/home/squibid/documents/coding/go/scraper/main.go:55 +0x83
main.deal_html.func1(0x7fd180fd9db8?)
/home/squibid/documents/coding/go/scraper/main.go:55 +0x83
main.deal_html({0xc00002a280, 0x7c}, {0x7fd180fd9db8?, 0xc0004bd200?})
/home/squibid/documents/coding/go/scraper/main.go:58 +0x11e
main.main()
/home/squibid/documents/coding/go/scraper/main.go:105 +0x12d
exit status 2
</pre>
<p>
Turns out we need some stability if we want to actually use the code.
This, like most bugs, is another simple fix: guard our url handler with a
call to len so we never slice into an empty string.
</p>
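<p>
As a rough illustration of that guard (not the exact code from the crawler),
here's a tiny helper that checks len before slicing. The name isRelative and
the check for a leading "/" are assumptions I'm making for the example. The
len check is the part that matters.
</p>

```go
package main

import "fmt"

// isRelative reports whether href starts with "/".
// Hypothetical helper used only to demonstrate the len guard.
func isRelative(href string) bool {
	// Without this check, href[:1] on an empty string panics with
	// "slice bounds out of range [:1] with length 0".
	if len(href) == 0 {
		return false
	}
	return href[:1] == "/"
}

func main() {
	for _, href := range []string{"/blog/rss.xml", "", "https://lunarflame.dev"} {
		fmt.Printf("%q relative: %v\n", href, isRelative(href))
	}
}
```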
<p>
And now it works! With a small exception, but here's a clean run showing
my lil web crawler doing its thing... and failing pretty fast.
</p>
<pre>
fetching https://squi.bid/
fetching https://eggbert.xyz/
fetching https://www.linkedin.com/in/harrison-diambrosio-505443229/
fetching https://github.com/EggbertFluffle/
fetching https://support.github.com?tags=dotcom-footer
fetching https://docs.github.com/
fetching https://services.github.com
fetching https://github.com/github
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&type=repositories
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&type=repositories/git-guides
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&type=repositories/git-guides/git-guides
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&type=repositories/git-guides/git-guides/git-guides
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&type=repositories/git-guides/git-guides/git-guides/git-guides
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&type=repositories/git-guides/git-guides/git-guides/git-guides/git-guides
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&type=repositories/git-guides/git-guides/git-guides/git-guides/git-guides/git-guides
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&type=repositories/git-guides/git-guides/git-guides/git-guides/git-guides/git-guides/git-guides
...
</pre>
<p>
I'm sure you can use your imagination to figure out how long that was
going to go on for. This bug would be partially fixed by switching
which site we're searching, but ultimately we won't make it far if we
keep falling for these redirections that just keep going. For now that's
fine though, and I have a semi-working web crawler. All code can be found
here:
<a href="https://git.squi.bid/squibid/web-crawler">git.squi.bid/squibid/web-crawler</a>.
Thank you for reading! I'll probably write a follow-up when I find some
time.
</p>
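<p>
A guess for the follow-up: judging only from the output above, the growing
urls look like a relative href (I'm assuming something like "git-guides")
being glued onto the end of whatever url we're currently on. If that's the
cause, one possible fix is to let net/url resolve links against the current
page. resolveLink is a made-up name for this sketch and I haven't wired it
into the crawler.
</p>

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLink turns an href found on page into an absolute url, the way
// a browser would. Hypothetical helper, not part of the crawler yet.
func resolveLink(page, href string) (string, error) {
	base, err := url.Parse(page)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	// ResolveReference replaces the last path segment of the base and
	// drops the base's query string instead of appending to it.
	return base.ResolveReference(ref).String(), nil
}

func main() {
	page := "https://github.com/github/search?q=topic%3Aactions&type=repositories"
	next, err := resolveLink(page, "git-guides")
	if err != nil {
		fmt.Println("Error resolving link:", err)
		return
	}
	fmt.Println(next)
}
```

<p>
Resolving the same href a second time from the new url gives the same
result, so the path stops growing.
</p>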
<p id="footnote-1">
[1] Yes, I'm just including more links to make my site a good starting
point. Why do you ask?
</p>
]]></description>
</item>
<item>
<title> 'New Keyboard!'</title>
<guid>https://squi.bid/blog/New-Keyboard!/index.html</guid>


@@ -35,13 +35,24 @@
html, body {
  max-height: 100%;
  background-color: var(--site-bg);
  max-width: 80ch;
  margin: auto;
}
@media (orientation: portrait) {
  html, body {
    max-width: 100% !important;
    padding: 5px !important;
  }
}
#font, p, ul, ol, h1, h2, h3, h4, h5, table {
  font-family: sans-serif;
  color: white;
}
pre {
  color: white;
}
h1 {
  font-size: 3em;
}