Compare commits
5 Commits
3c0c510a0d
...
master
Author | SHA1 | Date | |
---|---|---|---|
c0fefae605
|
|||
1b746ad8ec
|
|||
5363547637
|
|||
7b3715039a
|
|||
1061b92d28
|
332
blog/Writing-my-own-web-crawler-in-go/index.html
Normal file
@@ -0,0 +1,332 @@
|
||||
<!DOCTYPE HTML>
|
||||
<html lang="en">
|
||||
<title>'Writing my own web crawler in go'</title>
|
||||
<meta name="date" content="2025/08/23">
|
||||
<link rel="stylesheet" href="/style.css">
|
||||
<style>
|
||||
.-spell {}
|
||||
.Number {color: #e29eca}
|
||||
.Function {color: #c1c0d4}
|
||||
.String {color: #90b99f}
|
||||
.Character {color: #90b99f}
|
||||
.PreCondit {color: #ea83a5}
|
||||
.-property {color: #c1c0d4}
|
||||
.Comment {color: #757581; font-style: italic}
|
||||
.Macro {color: #ea83a5}
|
||||
.-type-builtin {color: #e29eca}
|
||||
.Type {color: #b9aeda}
|
||||
.Keyword {color: #aca1cf}
|
||||
.Constant {color: #ea83a5}
|
||||
.-variable-parameter {color: #e29eca}
|
||||
.-punctuation-delimiter {color: #9998a8}
|
||||
.-punctuation-bracket {color: #9998a8}
|
||||
.Include {color: #aca1cf}
|
||||
.Structure {color: #e6b99d}
|
||||
.-variable {color: #c9c7cd}
|
||||
.Operator {color: #e6b99d}
|
||||
</style>
|
||||
<body id="blog">
|
||||
<h1>Writing my own web crawler in go</h1>
|
||||
<p>
I got bored; it happens to everyone (especially software developers).
So like every software developer I've started a new side project: a web
crawler. This isn't for any actual use case, I kinda just wanna learn how
to use Go and (hopefully) sharpen my SQL skills in the process. At this
point in the writing process I've just started, so let me show you what
I've currently gotten working.
</p>
|
||||
<p>
To start I'm just searching through the hrefs of all &lt;a&gt; tags on a
site and printing them out. On my site that looks like this:
</p>
|
||||
<pre>
|
||||
/
|
||||
mailto:me@zacharyscheiman.com
|
||||
https://github.com/squibid
|
||||
https://codeberg.org/squibid
|
||||
https://git.squi.bid/squibid/wiz
|
||||
https://git.squi.bid/squibid
|
||||
/blog/rss.xml
|
||||
/blog/New-Keyboard!
|
||||
/blog/Serializing-data-in-C
|
||||
/blog/Why-"suckless"-software-is-important
|
||||
/blog/What-is-a-squibid
|
||||
/blog/librex-and-dots
|
||||
/?all_blog
|
||||
https://lunarflame.dev
|
||||
https://eggbert.xyz/
|
||||
</pre>
|
||||
<p>
|
||||
Here's the code which fetched this list:
|
||||
</p>
|
||||
<pre>
|
||||
<span class="Statement"><span class="Keyword">package</span></span> <span class="Structure">main</span>
|
||||
|
||||
<span class="Statement"><span class="Keyword">import</span></span> <span class="-punctuation-bracket">(</span>
|
||||
<span class="String"><span class="String">"fmt"</span></span>
|
||||
<span class="String"><span class="String">"io"</span></span>
|
||||
<span class="String"><span class="String">"net/http"</span></span>
|
||||
<span class="String"><span class="String">"golang.org/x/net/html"</span></span>
|
||||
<span class="-punctuation-bracket">)</span>
|
||||
|
||||
<span class="Keyword"><span class="-keyword-function">func</span></span> <span class="-variable"><span class="Function">deal_html</span></span><span class="-punctuation-bracket">(</span><span class="-variable"><span class="-variable-parameter"><span class="DiagnosticUnderlineInfo">site</span></span><span class="DiagnosticUnderlineInfo"></span></span><span class="DiagnosticUnderlineInfo"> <span class="Type"><span class="Type"><span class="-type-builtin">string</span></span></span></span><span class="Type"><span class="Type"><span class="-type-builtin"></span></span></span><span class="-punctuation-delimiter">,</span> <span class="-variable"><span class="-variable-parameter">reader</span></span> <span class="Structure">io</span><span class="-punctuation-delimiter">.</span><span class="Type">Reader</span><span class="-punctuation-bracket">)</span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="-variable">doc</span><span class="-punctuation-delimiter">,</span> <span class="-variable">err</span> <span class="Operator">:=</span> <span class="-variable">html</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Parse</span></span><span class="-punctuation-bracket">(</span><span class="-variable">reader</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">err</span> <span class="Operator">!=</span> <span class="Constant"><span class="-constant-builtin">nil</span></span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String"><span class="-spell">"Error parsing HTML:"</span></span></span><span class="-punctuation-delimiter">,</span> <span class="-variable">err</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="Statement"><span class="Keyword">return</span></span>
|
||||
<span class="-punctuation-bracket">}</span>
|
||||
|
||||
<span class="Keyword"><span class="Keyword">var</span></span><span class="goSingleDecl"> </span><span class="-variable">walk</span> <span class="Keyword"><span class="-keyword-function">func</span></span><span class="-punctuation-bracket">(</span><span class="-variable"><span class="-variable-parameter">n</span></span> <span class="Operator">*</span><span class="Structure">html</span><span class="-punctuation-delimiter">.</span><span class="Type">Node</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-variable">walk</span> <span class="Operator">=</span> <span class="Keyword"><span class="-keyword-function">func</span></span><span class="-punctuation-bracket">(</span><span class="-variable"><span class="-variable-parameter">n</span></span> <span class="Operator">*</span><span class="Structure">html</span><span class="-punctuation-delimiter">.</span><span class="Type">Node</span><span class="-punctuation-bracket">)</span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">n</span><span class="-punctuation-delimiter">.</span><span class="-property">Type</span> <span class="Operator">==</span> <span class="-variable">html</span><span class="-punctuation-delimiter">.</span><span class="-property">ElementNode</span> <span class="Operator">&&</span> <span class="-variable">n</span><span class="-punctuation-delimiter">.</span><span class="-property">Data</span> <span class="Operator">==</span> <span class="String"><span class="String"><span class="-spell">"a"</span></span></span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="Repeat"><span class="Keyword">for</span></span> <span class="-variable">_</span><span class="-punctuation-delimiter">,</span> <span class="-variable">attr</span> <span class="Operator">:=</span> <span class="Repeat"><span class="Keyword">range</span></span> <span class="-variable">n</span><span class="-punctuation-delimiter">.</span><span class="-property">Attr</span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Key</span> <span class="Operator">==</span> <span class="String"><span class="String"><span class="-spell">"href"</span></span></span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-punctuation-bracket">}</span>
|
||||
<span class="-punctuation-bracket">}</span>
|
||||
<span class="-punctuation-bracket">}</span>
|
||||
<span class="Repeat"><span class="Keyword">for</span></span> <span class="-variable">c</span> <span class="Operator">:=</span> <span class="-variable">n</span><span class="-punctuation-delimiter">.</span><span class="-property">FirstChild</span><span class="-punctuation-delimiter">;</span> <span class="-variable">c</span> <span class="Operator">!=</span> <span class="Constant"><span class="-constant-builtin">nil</span></span><span class="-punctuation-delimiter">;</span> <span class="-variable">c</span> <span class="Operator">=</span> <span class="-variable">c</span><span class="-punctuation-delimiter">.</span><span class="-property">NextSibling</span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="-variable"><span class="Function">walk</span></span><span class="-punctuation-bracket">(</span><span class="-variable">c</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-punctuation-bracket">}</span>
|
||||
<span class="-punctuation-bracket">}</span>
|
||||
<span class="-variable"><span class="Function">walk</span></span><span class="-punctuation-bracket">(</span><span class="-variable">doc</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-punctuation-bracket">}</span>
|
||||
|
||||
<span class="Keyword"><span class="-keyword-function">func</span></span> <span class="-variable"><span class="Function">main</span></span><span class="-punctuation-bracket">(</span><span class="-punctuation-bracket">)</span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="-variable">site</span> <span class="Operator">:=</span> <span class="String"><span class="String"><span class="-spell">"https://squi.bid/"</span></span></span>
|
||||
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String"><span class="-spell">"fetching "</span></span></span> <span class="Operator">+</span> <span class="-variable">site</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-variable">resp</span><span class="-punctuation-delimiter">,</span> <span class="-variable">err</span> <span class="Operator">:=</span> <span class="-variable">http</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Get</span></span><span class="-punctuation-bracket">(</span><span class="-variable">site</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">err</span> <span class="Operator">!=</span> <span class="Constant"><span class="-constant-builtin">nil</span></span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String"><span class="-spell">"Error getting the website"</span></span></span><span class="-punctuation-bracket">)</span>
|
||||
<span class="Statement"><span class="Keyword">return</span></span>
|
||||
<span class="-punctuation-bracket">}</span>
|
||||
|
||||
<span class="Statement"><span class="Keyword">defer</span></span> <span class="-variable">resp</span><span class="-punctuation-delimiter">.</span><span class="-property">Body</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Close</span></span><span class="-punctuation-bracket">(</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-variable"><span class="Function">deal_html</span></span><span class="-punctuation-bracket">(</span><span class="-variable">site</span><span class="-punctuation-delimiter">,</span> <span class="-variable">resp</span><span class="-punctuation-delimiter">.</span><span class="-property">Body</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-punctuation-bracket">}</span>
|
||||
</pre>
|
||||
<p>
After a quick look at this output, it's clear we need to handle
multiple different "url formats" like the mailto: and / links. For right
now I'm going to respect people's privacy and not index their email
addresses.
</p>
|
||||
<pre>
|
||||
<span class="Conditional"><span class="Keyword">if</span></span> <span class="Identifier"><span class="-variable"><span class="Function"><span class="Special">len</span></span></span></span><span class="-punctuation-bracket">(</span><span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">)</span> <span class="Operator">></span> <span class="Number"><span class="Number">1</span></span> <span class="Operator">&&</span> <span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">[</span><span class="-punctuation-delimiter">:</span><span class="Number"><span class="Number">2</span></span><span class="-punctuation-bracket">]</span> <span class="Operator">==</span> <span class="String"><span class="String"><span class="-spell">"//"</span></span></span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String"><span class="-spell">"https:"</span></span></span> <span class="Operator">+</span> <span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-punctuation-bracket">}</span> <span class="Conditional"><span class="Keyword">else</span></span> <span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">[</span><span class="-punctuation-delimiter">:</span><span class="Number"><span class="Number">1</span></span><span class="-punctuation-bracket">]</span> <span class="Operator">==</span> <span class="String"><span class="String"><span class="-spell">"/"</span></span></span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">site</span><span class="-punctuation-bracket">[</span><span class="Identifier"><span class="-variable"><span class="Function"><span class="Special">len</span></span></span></span><span class="-punctuation-bracket">(</span><span class="-variable">site</span><span class="-punctuation-bracket">)</span> <span class="Operator">-</span> <span class="Number"><span class="Number">1</span></span><span class="-punctuation-delimiter">:</span><span class="-punctuation-bracket">]</span> <span class="Operator">==</span> <span class="String"><span class="String"><span class="-spell">"/"</span></span></span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="-variable">site</span> <span class="Operator">+</span> <span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">[</span><span class="Number"><span class="Number">1</span></span><span class="-punctuation-delimiter">:</span><span class="-punctuation-bracket">]</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-punctuation-bracket">}</span> <span class="Conditional"><span class="Keyword">else</span></span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="-variable">site</span> <span class="Operator">+</span> <span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-punctuation-bracket">}</span>
|
||||
<span class="-punctuation-bracket">}</span> <span class="Conditional"><span class="Keyword">else</span></span> <span class="Conditional"><span class="Keyword">if</span></span> <span class="Identifier"><span class="-variable"><span class="Function"><span class="Special">len</span></span></span></span><span class="-punctuation-bracket">(</span><span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">)</span> <span class="Operator">></span> <span class="Number"><span class="Number">4</span></span> <span class="Operator">&&</span> <span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">[</span><span class="-punctuation-delimiter">:</span><span class="Number"><span class="Number">4</span></span><span class="-punctuation-bracket">]</span> <span class="Operator">==</span> <span class="String"><span class="String"><span class="-spell">"http"</span></span></span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-punctuation-bracket">}</span>
|
||||
</pre>
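<p>
The prefix checks above can also be handed off to Go's net/url package.
Here's a minimal sketch of that alternative, not the code the crawler
actually runs: resolveLink is a hypothetical helper name, and the
//example.com/x link is made up to exercise the protocol-relative case.
</p>

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLink turns an href found on page into an absolute http(s) URL.
// It reports false for links the crawler should skip: empty hrefs,
// unparsable hrefs, and non-web schemes such as mailto:.
func resolveLink(page, href string) (string, bool) {
	if href == "" {
		return "", false
	}
	base, err := url.Parse(page)
	if err != nil {
		return "", false
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", false
	}
	// ResolveReference handles "/path", "//host/path" and full URLs alike.
	abs := base.ResolveReference(ref)
	if abs.Scheme != "http" && abs.Scheme != "https" {
		return "", false
	}
	return abs.String(), true
}

func main() {
	page := "https://squi.bid/"
	hrefs := []string{
		"/",
		"/blog/rss.xml",
		"//example.com/x", // made-up protocol-relative link
		"mailto:me@zacharyscheiman.com",
		"https://eggbert.xyz/",
		"",
	}
	for _, href := range hrefs {
		if abs, ok := resolveLink(page, href); ok {
			fmt.Println(abs)
		}
	}
}
```

<p>
The upside of this route is that relative links, protocol-relative links
and empty strings all go through one code path, so there's no slicing to
get wrong.
</p>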
|
||||
<p>
Now that we've gotten the actual links from the website, it's time to
store them and get the links from their sites too. For now I've decided that
because this is already just a toy project I will not be storing all the
info I would if this were a real project. Instead I will only be storing
the link to the site, and a boolean representing whether I've fetched its
contents yet. So, let's go impl...
</p>
|
||||
<p>
Step #1 is to find a SQL library. I just went with Go's built-in
database/sql package, and then visited
<a href="https://golang.org/s/sqldrivers">golang.org/s/sqldrivers</a>
and decided on
<a href="https://github.com/mattn/go-sqlite3">github.com/mattn/go-sqlite3</a>
because SQLite is a name I'm familiar with, and I really don't want to go
through the hassle of looking into different dbs for a toy project.
<a href="#footnote-1">[1]</a>
</p>
|
||||
<p>
Now that we've chosen our db I'll set up our table like I mentioned
earlier, with one string and one boolean:
</p>
|
||||
<pre>
|
||||
<span class="-variable">db</span><span class="-punctuation-delimiter">,</span> <span class="-variable">err</span> <span class="Operator">=</span> <span class="-variable">sql</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Open</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String"><span class="-spell">"sqlite3"</span></span></span><span class="-punctuation-delimiter">,</span> <span class="String"><span class="String"><span class="-spell">"./sites.db"</span></span></span><span class="-punctuation-bracket">)</span>
|
||||
<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">err</span> <span class="Operator">!=</span> <span class="Constant"><span class="-constant-builtin">nil</span></span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="-variable">log</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Fatal</span></span><span class="-punctuation-bracket">(</span><span class="-variable">err</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-punctuation-bracket">}</span>
|
||||
<span class="Statement"><span class="Keyword">defer</span></span> <span class="-variable">db</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Close</span></span><span class="-punctuation-bracket">(</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-variable">db</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Exec</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String">`</span>
|
||||
<span class="String"> create table if not exists</span>
|
||||
<span class="String"> urls (url text not null primary key, indexed boolean not null);</span>
|
||||
<span class="String"> `</span></span><span class="-punctuation-bracket">)</span>
|
||||
</pre>
|
||||
<p>
Now we need to start adding entries to the db. I wanted to make sure I
wouldn't end up shooting myself in the foot, so I decided to go with a
tiny function to make it a teensy bit safer:
</p>
|
||||
<pre>
|
||||
<span class="Keyword"><span class="-keyword-function">func</span></span> <span class="-variable"><span class="Function">db_insert_url</span></span><span class="-punctuation-bracket">(</span><span class="-variable"><span class="-variable-parameter">url</span></span> <span class="Type"><span class="Type"><span class="-type-builtin">string</span></span></span><span class="-punctuation-delimiter">,</span> <span class="-variable"><span class="-variable-parameter">seen</span></span> <span class="Type"><span class="Type"><span class="-type-builtin">bool</span></span></span><span class="-punctuation-bracket">)</span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="-variable">db</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Exec</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String">`insert into urls values (?, ?)`</span></span><span class="-punctuation-delimiter">,</span> <span class="-variable">url</span><span class="-punctuation-delimiter">,</span> <span class="-variable">seen</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-punctuation-bracket">}</span>
|
||||
</pre>
|
||||
<p>
It could use some error handling, but if you look at section 3 part 4 of
the software engineer's manual it reads "side projects aren't stable
because if it's stable it's not a side project".
</p>
|
||||
<p>
Now that we've gotten all the easy stuff out of the way it's time to work
on making this run forever, or close to it at least. For now I'm going to
keep this project in an inefficient state and we're not going to use any
worker pools or anything fancy like that. To get started we first need
to make a decision: depth or breadth first searching? In case you're not
sure what I mean by this let me give you an example:
</p>
|
||||
<p>
|
||||
Let's say we have site example-a.com which contains the following links:
|
||||
</p>
|
||||
<ul>
|
||||
<li>example-a.com/blog</li>
|
||||
<li>example-b.com</li>
|
||||
<li>example-c.com</li>
|
||||
</ul>
|
||||
<p>
The way I'm using the terms here, going breadth first means we'd hop over
to example-b.com or example-c.com next, whereas going depth first means
we'd stay on the same site and fetch example-a.com/blog. For my use case I
want to find as many sites as possible, so I will be targeting sites with
other base urls.
</p>
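<p>
That preference for other sites can be sketched in a few lines. This is an
illustration rather than the loop the crawler uses: nextURL and hostOf are
hypothetical names, and I've added https:// to the example links because
net/url needs a scheme to find the host.
</p>

```go
package main

import (
	"fmt"
	"net/url"
)

// hostOf returns the host part of a URL, or "" if it cannot be parsed.
func hostOf(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		return ""
	}
	return u.Hostname()
}

// nextURL picks the first candidate that lives on a different host than
// current. If every candidate is on the same host it falls back to the
// first one, and it reports false when there is nothing left to fetch.
func nextURL(current string, candidates []string) (string, bool) {
	if len(candidates) == 0 {
		return "", false
	}
	for _, c := range candidates {
		if hostOf(c) != hostOf(current) {
			return c, true
		}
	}
	return candidates[0], true
}

func main() {
	links := []string{
		"https://example-a.com/blog",
		"https://example-b.com",
		"https://example-c.com",
	}
	next, _ := nextURL("https://example-a.com", links)
	fmt.Println(next) // https://example-b.com
}
```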
|
||||
<p>
Now that we know how we want to decide the next url to fetch, let's impl
the loop which handles this.
</p>
|
||||
<pre>
|
||||
<span class="Repeat"><span class="Keyword">for</span></span> <span class="-variable">i</span> <span class="Operator">:=</span> <span class="Number"><span class="Number">0</span></span><span class="-punctuation-delimiter">;</span><span class="-punctuation-delimiter">;</span> <span class="-variable">i</span><span class="Operator">++</span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">i</span> <span class="Operator">></span> <span class="Number"><span class="Number">0</span></span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="-variable">rows</span><span class="-punctuation-delimiter">,</span> <span class="-variable">err</span> <span class="Operator">:=</span> <span class="-variable">db</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Query</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String">`select url from urls where indexed is false`</span></span><span class="-punctuation-bracket">)</span>
|
||||
<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">err</span> <span class="Operator">!=</span> <span class="Constant"><span class="-constant-builtin">nil</span></span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="Statement"><span class="Keyword">return</span></span>
|
||||
<span class="-punctuation-bracket">}</span>
|
||||
|
||||
<span class="Repeat"><span class="Keyword">for</span></span> <span class="-variable">rows</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Next</span></span><span class="-punctuation-bracket">(</span><span class="-punctuation-bracket">)</span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="Keyword"><span class="Keyword">var</span></span><span class="goSingleDecl"> </span><span class="-variable">test</span> <span class="Type"><span class="Type"><span class="-type-builtin">string</span></span></span>
|
||||
<span class="-variable">rows</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Scan</span></span><span class="-punctuation-bracket">(</span><span class="Operator">&</span><span class="-variable">test</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-variable">site</span> <span class="Operator">=</span> <span class="-variable">test</span>
|
||||
<span class="Comment"><span class="Comment"><span class="-spell">/* we can't just check if the site is the same because then when we're</span></span><span class="-spell">
|
||||
<span class="Comment"> * checking squi.bid/example it won't register squi.bid as the same</span>
|
||||
<span class="Comment"> * domain, although maybe that's what we want.</span>
|
||||
<span class="Comment"> */</span></span><span class="Comment"></span></span>
|
||||
<span class="Conditional"><span class="Keyword">if</span></span> <span class="Operator">!</span><span class="-variable">strings</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Contains</span></span><span class="-punctuation-bracket">(</span><span class="-variable">test</span><span class="-punctuation-delimiter">,</span> <span class="-variable">site</span><span class="-punctuation-bracket">)</span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="Statement"><span class="Keyword">break</span></span>
|
||||
<span class="-punctuation-bracket">}</span>
|
||||
<span class="-punctuation-bracket">}</span>
|
||||
<span class="-variable">rows</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Close</span></span><span class="-punctuation-bracket">(</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-punctuation-bracket">}</span>
|
||||
|
||||
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String"><span class="-spell">"fetching "</span></span></span> <span class="Operator">+</span> <span class="-variable">site</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-variable">resp</span><span class="-punctuation-delimiter">,</span> <span class="-variable">err</span> <span class="Operator">:=</span> <span class="-variable">http</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Get</span></span><span class="-punctuation-bracket">(</span><span class="-variable">site</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">err</span> <span class="Operator">!=</span> <span class="Constant"><span class="-constant-builtin">nil</span></span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String"><span class="-spell">"Error getting"</span></span></span><span class="-punctuation-delimiter">,</span> <span class="-variable">site</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-variable">os</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Exit</span></span><span class="-punctuation-bracket">(</span><span class="Number"><span class="Number">1</span></span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-punctuation-bracket">}</span>
|
||||
|
||||
<span class="-variable"><span class="Function">deal_html</span></span><span class="-punctuation-bracket">(</span><span class="-variable">site</span><span class="-punctuation-delimiter">,</span> <span class="-variable">resp</span><span class="-punctuation-delimiter">.</span><span class="-property">Body</span><span class="-punctuation-bracket">)</span>
|
||||
|
||||
<span class="-variable">resp</span><span class="-punctuation-delimiter">.</span><span class="-property">Body</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Close</span></span><span class="-punctuation-bracket">(</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-punctuation-bracket">}</span>
|
||||
</pre>
|
||||
<p>
If you read through my code you might've seen the comment about how our
check doesn't actually prevent accessing the same site. The solution I'm
currently thinking of is to add a column to the db which keeps the
highest point in the site, for example: squi.bid/example/1/2/3/4 would
have a highest point of squi.bid. But currently this isn't something I'm
too concerned about, so for now we'll just leave it as is and deal with
another issue you might've spotted.
</p>
|
||||
<p>
We don't modify the db: after fetching a site successfully, at no point do
we actually say that we fetched it. Therefore whenever we try to fetch
a new site the program will query the db and find the same route as
before. Thankfully this is a simple fix which just takes adding this line
right after where we index a new site:
</p>
|
||||
<pre>
|
||||
<span class="-variable">db</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Exec</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String">`update urls set indexed = true where url == ?`</span></span><span class="-punctuation-delimiter">,</span> <span class="-variable">site</span><span class="-punctuation-bracket">)</span>
|
||||
</pre>
|
||||
<p>
|
||||
Remember when I referenced section 3 part 4 of the software engineer's
manual? Well, I regret it:
|
||||
</p>
|
||||
<pre>
|
||||
fetching https://squi.bid/
|
||||
fetching https://github.com/EggbertFluffle/signup?ref_cta=Sign+up&ref_loc=header+logged+out&ref_page=%2F%3Cuser-name%3E&source=header
|
||||
panic: runtime error: slice bounds out of range [:1] with length 0
|
||||
|
||||
goroutine 1 [running]:
|
||||
main.deal_html.func1(0xc000315b90)
|
||||
/home/squibid/documents/coding/go/scraper/main.go:40 +0x2fe
|
||||
main.deal_html.func1(0x0?)
|
||||
/home/squibid/documents/coding/go/scraper/main.go:55 +0x83
|
||||
main.deal_html.func1(0xc0002dd110?)
|
||||
/home/squibid/documents/coding/go/scraper/main.go:55 +0x83
|
||||
main.deal_html.func1(0x0?)
|
||||
/home/squibid/documents/coding/go/scraper/main.go:55 +0x83
|
||||
main.deal_html.func1(0xc00032e000?)
|
||||
/home/squibid/documents/coding/go/scraper/main.go:55 +0x83
|
||||
main.deal_html.func1(0x0?)
|
||||
/home/squibid/documents/coding/go/scraper/main.go:55 +0x83
|
||||
main.deal_html.func1(0x8c36e0?)
|
||||
/home/squibid/documents/coding/go/scraper/main.go:55 +0x83
|
||||
main.deal_html.func1(0x7fd180fd9db8?)
|
||||
/home/squibid/documents/coding/go/scraper/main.go:55 +0x83
|
||||
main.deal_html({0xc00002a280, 0x7c}, {0x7fd180fd9db8?, 0xc0004bd200?})
|
||||
/home/squibid/documents/coding/go/scraper/main.go:58 +0x11e
|
||||
main.main()
|
||||
/home/squibid/documents/coding/go/scraper/main.go:105 +0x12d
|
||||
exit status 2
|
||||
</pre>
|
||||
<p>
|
||||
Turns out we need some stability if we want to actually use the code.
This, like most bugs, is another simple fix which just takes guarding our
url handler with a call to len to make sure we're not doing anything
stupid on empty strings.
|
||||
</p>
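<p>
As an alternative to sprinkling len checks around, strings.HasPrefix from
the standard library never panics on an empty string. Below is a small
sketch of the same href handling written that way. The absolute function
and its return values are made up for this example; the crawler itself
prints the links inline.
</p>

```go
package main

import (
	"fmt"
	"strings"
)

// absolute is a hypothetical helper mirroring the href handling from the
// post. It turns an href found on site into a full link. ok is false for
// hrefs that are skipped (empty strings, mailto: links, and so on).
func absolute(site, href string) (link string, ok bool) {
	switch {
	case strings.HasPrefix(href, "//"):
		// protocol-relative link, assume https
		return "https:" + href, true
	case strings.HasPrefix(href, "/"):
		// avoid a double slash when site already ends in one
		return strings.TrimSuffix(site, "/") + href, true
	case strings.HasPrefix(href, "http"):
		return href, true
	default:
		return "", false
	}
}

func main() {
	for _, href := range []string{"", "/blog/rss.xml", "//example.com", "mailto:someone@example.com"} {
		link, ok := absolute("https://squi.bid/", href)
		fmt.Printf("%q %q %v\n", href, link, ok)
	}
}
```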
|
||||
<p>
|
||||
And now it works! With a small exception, but here's a clean run showing
my lil web crawler doing its thing... and failing pretty fast.
|
||||
</p>
|
||||
<pre>
|
||||
fetching https://squi.bid/
|
||||
fetching https://eggbert.xyz/
|
||||
fetching https://www.linkedin.com/in/harrison-diambrosio-505443229/
|
||||
fetching https://github.com/EggbertFluffle/
|
||||
fetching https://support.github.com?tags=dotcom-footer
|
||||
fetching https://docs.github.com/
|
||||
fetching https://services.github.com
|
||||
fetching https://github.com/github
|
||||
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&type=repositories
|
||||
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&type=repositories/git-guides
|
||||
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&type=repositories/git-guides/git-guides
|
||||
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&type=repositories/git-guides/git-guides/git-guides
|
||||
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&type=repositories/git-guides/git-guides/git-guides/git-guides
|
||||
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&type=repositories/git-guides/git-guides/git-guides/git-guides/git-guides
|
||||
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&type=repositories/git-guides/git-guides/git-guides/git-guides/git-guides/git-guides
|
||||
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&type=repositories/git-guides/git-guides/git-guides/git-guides/git-guides/git-guides/git-guides
|
||||
...
|
||||
</pre>
|
||||
<p>
|
||||
I'm sure you can use your imagination to figure out how long that was
going to go on for. This bug would be partially fixed by switching
which site we're searching, but ultimately we wouldn't make it far if we
keep falling for these links that just keep growing. For now that's
fine though, and I have a semi-working web crawler. All code can be found
here:
<a href="https://git.squi.bid/squibid/web-crawler">git.squi.bid/squibid/web-crawler</a>.
Thank you for reading, I'll probably write a followup when I find some
time.
|
||||
</p>
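<p>
Looking at that log, each new url is the previous one with /git-guides
stuck on the end, which is what you'd get from appending an href to the
full page url, query string and all. One possible way around it, sketched
here with the standard net/url package, is to resolve every href against
the page it was found on. The resolve name is made up for this example,
and I'm assuming the href on that page was /git-guides.
</p>

```go
package main

import (
	"fmt"
	"net/url"
)

// resolve is a hypothetical helper: it resolves href relative to the page
// it was found on, the way a browser would, instead of concatenating
// the two strings.
func resolve(page, href string) (string, error) {
	base, err := url.Parse(page)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

func main() {
	page := "https://github.com/github/search?q=topic%3Aactions&type=repositories"
	link, err := resolve(page, "/git-guides")
	if err != nil {
		fmt.Println("Error resolving link:", err)
		return
	}
	fmt.Println(link)
}
```

<p>
Because the result no longer carries the old path and query along, the
same href always resolves to the same url, and the primary key on the urls
table gets a chance to reject the duplicate.
</p>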
|
||||
<p id="footnote-1">
|
||||
[1] Yes, I'm just including more links to make my site a good starting
point. Why do you ask?
|
||||
</p>
|
||||
</body>
|
||||
</html>
|
330
blog/rss.xml
@@ -11,6 +11,336 @@
|
||||
|
||||
<!-- LB -->
|
||||
|
||||
<item>
|
||||
<title> 'Writing my own web crawler in go'</title>
|
||||
<guid>https://squi.bid/blog/Writing-my-own-web-crawler-in-go/index.html</guid>
|
||||
<link>https://squi.bid/blog/Writing-my-own-web-crawler-in-go/index.html</link>
|
||||
<pubDate>Sat, 23 Aug 2025 20:25:26 -0400</pubDate>
|
||||
<description><![CDATA[<!DOCTYPE HTML>
|
||||
<html lang="en">
|
||||
<title>'Writing my own web crawler in go'</title>
|
||||
<meta name="date" content="2025/08/23">
|
||||
<link rel="stylesheet" href="/style.css">
|
||||
<style>
|
||||
.-spell {}
|
||||
.Number {color: #e29eca}
|
||||
.Function {color: #c1c0d4}
|
||||
.String {color: #90b99f}
|
||||
.Character {color: #90b99f}
|
||||
.PreCondit {color: #ea83a5}
|
||||
.-property {color: #c1c0d4}
|
||||
.Comment {color: #757581; font-style: italic}
|
||||
.Macro {color: #ea83a5}
|
||||
.-type-builtin {color: #e29eca}
|
||||
.Type {color: #b9aeda}
|
||||
.Keyword {color: #aca1cf}
|
||||
.Constant {color: #ea83a5}
|
||||
.-variable-parameter {color: #e29eca}
|
||||
.-punctuation-delimiter {color: #9998a8}
|
||||
.-punctuation-bracket {color: #9998a8}
|
||||
.Include {color: #aca1cf}
|
||||
.Structure {color: #e6b99d}
|
||||
.-variable {color: #c9c7cd}
|
||||
.Operator {color: #e6b99d}
|
||||
</style>
|
||||
<body id="blog">
|
||||
<h1>Writing my own web crawler in go</h1>
|
||||
<p>
|
||||
I got bored, it happens to everyone (especially software developers).
|
||||
So like every software developer I've started a new side-project: a web
|
||||
crawler. This isn't for any actual usecase, I kinda just wanna learn how
|
||||
to use go and (hopefully) sharpen my SQL skills in the process. At this
|
||||
point in the writing process I've just started so let me show you what
|
||||
I've currently gotten working.
|
||||
</p>
|
||||
<p>
|
||||
To start I'm just searching through the hrefs of all <a> tags on a
|
||||
site and printing them out. On my site that looks like this:
|
||||
</p>
|
||||
<pre>
|
||||
/
|
||||
mailto:me@zacharyscheiman.com
|
||||
https://github.com/squibid
|
||||
https://codeberg.org/squibid
|
||||
https://git.squi.bid/squibid/wiz
|
||||
https://git.squi.bid/squibid
|
||||
/blog/rss.xml
|
||||
/blog/New-Keyboard!
|
||||
/blog/Serializing-data-in-C
|
||||
/blog/Why-"suckless"-software-is-important
|
||||
/blog/What-is-a-squibid
|
||||
/blog/librex-and-dots
|
||||
/?all_blog
|
||||
https://lunarflame.dev
|
||||
https://eggbert.xyz/
|
||||
</pre>
|
||||
<p>
|
||||
Here's the code which fetched this list:
|
||||
</p>
|
||||
<pre>
|
||||
<span class="Statement"><span class="Keyword">package</span></span> <span class="Structure">main</span>
|
||||
<span class="Statement"><span class="Keyword">import</span></span> <span class="-punctuation-bracket">(</span>
|
||||
<span class="String"><span class="String">"fmt"</span></span>
|
||||
<span class="String"><span class="String">"io"</span></span>
|
||||
<span class="String"><span class="String">"net/http"</span></span>
|
||||
<span class="String"><span class="String">"golang.org/x/net/html"</span></span>
|
||||
<span class="-punctuation-bracket">)</span>
|
||||
<span class="Keyword"><span class="-keyword-function">func</span></span> <span class="-variable"><span class="Function">deal_html</span></span><span class="-punctuation-bracket">(</span><span class="-variable"><span class="-variable-parameter"><span class="DiagnosticUnderlineInfo">site</span></span><span class="DiagnosticUnderlineInfo"></span></span><span class="DiagnosticUnderlineInfo"> <span class="Type"><span class="Type"><span class="-type-builtin">string</span></span></span></span><span class="Type"><span class="Type"><span class="-type-builtin"></span></span></span><span class="-punctuation-delimiter">,</span> <span class="-variable"><span class="-variable-parameter">reader</span></span> <span class="Structure">io</span><span class="-punctuation-delimiter">.</span><span class="Type">Reader</span><span class="-punctuation-bracket">)</span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="-variable">doc</span><span class="-punctuation-delimiter">,</span> <span class="-variable">err</span> <span class="Operator">:=</span> <span class="-variable">html</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Parse</span></span><span class="-punctuation-bracket">(</span><span class="-variable">reader</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">err</span> <span class="Operator">!=</span> <span class="Constant"><span class="-constant-builtin">nil</span></span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String"><span class="-spell">"Error parsing HTML:"</span></span></span><span class="-punctuation-delimiter">,</span> <span class="-variable">err</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="Statement"><span class="Keyword">return</span></span>
|
||||
<span class="-punctuation-bracket">}</span>
|
||||
<span class="Keyword"><span class="Keyword">var</span></span><span class="goSingleDecl"> </span><span class="-variable">walk</span> <span class="Keyword"><span class="-keyword-function">func</span></span><span class="-punctuation-bracket">(</span><span class="-variable"><span class="-variable-parameter">n</span></span> <span class="Operator">*</span><span class="Structure">html</span><span class="-punctuation-delimiter">.</span><span class="Type">Node</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-variable">walk</span> <span class="Operator">=</span> <span class="Keyword"><span class="-keyword-function">func</span></span><span class="-punctuation-bracket">(</span><span class="-variable"><span class="-variable-parameter">n</span></span> <span class="Operator">*</span><span class="Structure">html</span><span class="-punctuation-delimiter">.</span><span class="Type">Node</span><span class="-punctuation-bracket">)</span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">n</span><span class="-punctuation-delimiter">.</span><span class="-property">Type</span> <span class="Operator">==</span> <span class="-variable">html</span><span class="-punctuation-delimiter">.</span><span class="-property">ElementNode</span> <span class="Operator">&&</span> <span class="-variable">n</span><span class="-punctuation-delimiter">.</span><span class="-property">Data</span> <span class="Operator">==</span> <span class="String"><span class="String"><span class="-spell">"a"</span></span></span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="Repeat"><span class="Keyword">for</span></span> <span class="-variable">_</span><span class="-punctuation-delimiter">,</span> <span class="-variable">attr</span> <span class="Operator">:=</span> <span class="Repeat"><span class="Keyword">range</span></span> <span class="-variable">n</span><span class="-punctuation-delimiter">.</span><span class="-property">Attr</span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Key</span> <span class="Operator">==</span> <span class="String"><span class="String"><span class="-spell">"href"</span></span></span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-punctuation-bracket">}</span>
|
||||
<span class="-punctuation-bracket">}</span>
|
||||
<span class="-punctuation-bracket">}</span>
|
||||
<span class="Repeat"><span class="Keyword">for</span></span> <span class="-variable">c</span> <span class="Operator">:=</span> <span class="-variable">n</span><span class="-punctuation-delimiter">.</span><span class="-property">FirstChild</span><span class="-punctuation-delimiter">;</span> <span class="-variable">c</span> <span class="Operator">!=</span> <span class="Constant"><span class="-constant-builtin">nil</span></span><span class="-punctuation-delimiter">;</span> <span class="-variable">c</span> <span class="Operator">=</span> <span class="-variable">c</span><span class="-punctuation-delimiter">.</span><span class="-property">NextSibling</span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="-variable"><span class="Function">walk</span></span><span class="-punctuation-bracket">(</span><span class="-variable">c</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-punctuation-bracket">}</span>
|
||||
<span class="-punctuation-bracket">}</span>
|
||||
<span class="-variable"><span class="Function">walk</span></span><span class="-punctuation-bracket">(</span><span class="-variable">doc</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-punctuation-bracket">}</span>
|
||||
<span class="Keyword"><span class="-keyword-function">func</span></span> <span class="-variable"><span class="Function">main</span></span><span class="-punctuation-bracket">(</span><span class="-punctuation-bracket">)</span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="-variable">site</span> <span class="Operator">:=</span> <span class="String"><span class="String"><span class="-spell">"https://squi.bid/"</span></span></span>
|
||||
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String"><span class="-spell">"fetching "</span></span></span> <span class="Operator">+</span> <span class="-variable">site</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-variable">resp</span><span class="-punctuation-delimiter">,</span> <span class="-variable">err</span> <span class="Operator">:=</span> <span class="-variable">http</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Get</span></span><span class="-punctuation-bracket">(</span><span class="-variable">site</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">err</span> <span class="Operator">!=</span> <span class="Constant"><span class="-constant-builtin">nil</span></span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String"><span class="-spell">"Error getting the website"</span></span></span><span class="-punctuation-bracket">)</span>
|
||||
<span class="Statement"><span class="Keyword">return</span></span>
|
||||
<span class="-punctuation-bracket">}</span>
|
||||
<span class="Statement"><span class="Keyword">defer</span></span> <span class="-variable">resp</span><span class="-punctuation-delimiter">.</span><span class="-property">Body</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Close</span></span><span class="-punctuation-bracket">(</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-variable"><span class="Function">deal_html</span></span><span class="-punctuation-bracket">(</span><span class="-variable">site</span><span class="-punctuation-delimiter">,</span> <span class="-variable">resp</span><span class="-punctuation-delimiter">.</span><span class="-property">Body</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-punctuation-bracket">}</span>
|
||||
</pre>
|
||||
<p>
|
||||
After taking a short look at the output, it's clear we need to handle
multiple different "url formats" like the mailto: and / links. For right
now I'm going to respect people's privacy and not index their email
addresses.
|
||||
</p>
|
||||
<pre>
|
||||
<span class="Conditional"><span class="Keyword">if</span></span> <span class="Identifier"><span class="-variable"><span class="Function"><span class="Special">len</span></span></span></span><span class="-punctuation-bracket">(</span><span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">)</span> <span class="Operator">></span> <span class="Number"><span class="Number">1</span></span> <span class="Operator">&&</span> <span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">[</span><span class="-punctuation-delimiter">:</span><span class="Number"><span class="Number">2</span></span><span class="-punctuation-bracket">]</span> <span class="Operator">==</span> <span class="String"><span class="String"><span class="-spell">"//"</span></span></span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String"><span class="-spell">"https:"</span></span></span> <span class="Operator">+</span> <span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-punctuation-bracket">}</span> <span class="Conditional"><span class="Keyword">else</span></span> <span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">[</span><span class="-punctuation-delimiter">:</span><span class="Number"><span class="Number">1</span></span><span class="-punctuation-bracket">]</span> <span class="Operator">==</span> <span class="String"><span class="String"><span class="-spell">"/"</span></span></span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">site</span><span class="-punctuation-bracket">[</span><span class="Identifier"><span class="-variable"><span class="Function"><span class="Special">len</span></span></span></span><span class="-punctuation-bracket">(</span><span class="-variable">site</span><span class="-punctuation-bracket">)</span> <span class="Operator">-</span> <span class="Number"><span class="Number">1</span></span><span class="-punctuation-delimiter">:</span><span class="-punctuation-bracket">]</span> <span class="Operator">==</span> <span class="String"><span class="String"><span class="-spell">"/"</span></span></span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="-variable">site</span> <span class="Operator">+</span> <span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">[</span><span class="Number"><span class="Number">1</span></span><span class="-punctuation-delimiter">:</span><span class="-punctuation-bracket">]</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-punctuation-bracket">}</span> <span class="Conditional"><span class="Keyword">else</span></span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="-variable">site</span> <span class="Operator">+</span> <span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-punctuation-bracket">}</span>
|
||||
<span class="-punctuation-bracket">}</span> <span class="Conditional"><span class="Keyword">else</span></span> <span class="Conditional"><span class="Keyword">if</span></span> <span class="Identifier"><span class="-variable"><span class="Function"><span class="Special">len</span></span></span></span><span class="-punctuation-bracket">(</span><span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">)</span> <span class="Operator">></span> <span class="Number"><span class="Number">4</span></span> <span class="Operator">&&</span> <span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">[</span><span class="-punctuation-delimiter">:</span><span class="Number"><span class="Number">4</span></span><span class="-punctuation-bracket">]</span> <span class="Operator">==</span> <span class="String"><span class="String"><span class="-spell">"http"</span></span></span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-punctuation-bracket">}</span>
|
||||
</pre>
|
||||
<p>
|
||||
Now that we've gotten the actual links from the website, it's time to
store and get the links from their sites too. For now I've decided that
because this is already just a toy project I will not be storing all the
info I would if this were a real project. Instead I will only be storing
the link to the site, and a boolean representing whether I've fetched its
contents yet. So, let's go impl...
|
||||
</p>
|
||||
<p>
|
||||
Step #1 is to find a SQL library. I just went with Go's built-in
|
||||
database/sql mappings, and then visited
|
||||
<a href="https://golang.org/s/sqldrivers">golang.org/s/sqldrivers</a>
|
||||
and decided on
|
||||
<a href="https://github.com/mattn/go-sqlite3">github.com/mattn/go-sqlite3</a>
|
||||
because sqlite is a name I'm familiar with, and I really don't want to go
|
||||
through the hassle of looking into different dbs for a toy project.
|
||||
<a href="#footnote-1">[1]</a>
|
||||
</p>
|
||||
<p>
|
||||
Now that we've chosen our db I'll set up our table like I mentioned
|
||||
earlier, with one string and one boolean:
|
||||
</p>
|
||||
<pre>
|
||||
<span class="-variable">db</span><span class="-punctuation-delimiter">,</span> <span class="-variable">err</span> <span class="Operator">=</span> <span class="-variable">sql</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Open</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String"><span class="-spell">"sqlite3"</span></span></span><span class="-punctuation-delimiter">,</span> <span class="String"><span class="String"><span class="-spell">"./sites.db"</span></span></span><span class="-punctuation-bracket">)</span>
|
||||
<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">err</span> <span class="Operator">!=</span> <span class="Constant"><span class="-constant-builtin">nil</span></span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="-variable">log</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Fatal</span></span><span class="-punctuation-bracket">(</span><span class="-variable">err</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-punctuation-bracket">}</span>
|
||||
<span class="Statement"><span class="Keyword">defer</span></span> <span class="-variable">db</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Close</span></span><span class="-punctuation-bracket">(</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-variable">db</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Exec</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String">`</span>
|
||||
<span class="String"> create table if not exists</span>
|
||||
<span class="String"> urls (url text not null primary key, indexed boolean not null);</span>
|
||||
<span class="String"> `</span></span><span class="-punctuation-bracket">)</span>
|
||||
</pre>
|
||||
<p>
|
||||
Now we need to start adding entries to the db. To do this I wanted to
ensure I wouldn't end up shooting myself in the foot, therefore I decided
to go with a small function to make it a teensy tiny bit safer:
|
||||
</p>
|
||||
<pre>
|
||||
<span class="Keyword"><span class="-keyword-function">func</span></span> <span class="-variable"><span class="Function">db_insert_url</span></span><span class="-punctuation-bracket">(</span><span class="-variable"><span class="-variable-parameter">url</span></span> <span class="Type"><span class="Type"><span class="-type-builtin">string</span></span></span><span class="-punctuation-delimiter">,</span> <span class="-variable"><span class="-variable-parameter">seen</span></span> <span class="Type"><span class="Type"><span class="-type-builtin">bool</span></span></span><span class="-punctuation-bracket">)</span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="-variable">db</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Exec</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String">`insert into urls values (?, ?)`</span></span><span class="-punctuation-delimiter">,</span> <span class="-variable">url</span><span class="-punctuation-delimiter">,</span> <span class="-variable">seen</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-punctuation-bracket">}</span>
|
||||
</pre>
|
||||
<p>
|
||||
It could use some error handling, but if you look at section 3 part 4 of
the software engineer's manual it reads "side projects aren't stable
because if it's stable it's not a side project".
|
||||
</p>
|
||||
<p>
|
||||
Now that we've gotten all the easy stuff out of the way it's time to work
on making this run forever, or close to it at least. For now I'm going to
keep this project in an inefficient state and we're not going to use any
worker pools or anything fancy like that. To get started we first need
to make a decision: depth or breadth first searching? In case you're not
sure what I mean by this let me give you an example:
|
||||
</p>
|
||||
<p>
|
||||
Let's say we have site example-a.com which contains the following links:
|
||||
</p>
|
||||
<ul>
|
||||
<li>example-a.com/blog</li>
|
||||
<li>example-b.com</li>
|
||||
<li>example-c.com</li>
|
||||
</ul>
|
||||
<p>
|
||||
With a breadth first search we would first go to either example-b or
example-c, whereas with a depth first search we would go with
example-a.com/blog. For my use case I want to find as many sites as
possible, therefore I will be targeting sites with other base urls.
|
||||
</p>
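<p>
For comparison, here's a tiny sketch of the textbook versions of the two
searches, where the only difference is which end of the list of pending
urls you take from. Preferring other base urls, like I want to, is a
separate choice layered on top of this. Everything in it (the links map,
the crawlOrder function, the extra example urls) is made up for
illustration and nothing is actually fetched.
</p>

```go
package main

import "fmt"

// links is a hypothetical in-memory link graph standing in for real fetches.
var links = map[string][]string{
	"example-a.com":      {"example-a.com/blog", "example-b.com", "example-c.com"},
	"example-a.com/blog": {"example-a.com/blog/post"},
	"example-b.com":      {"example-d.com"},
}

// crawlOrder returns the order in which pages are visited from start.
// With depthFirst false the frontier is used as a queue (oldest first),
// with depthFirst true it is used as a stack (newest first).
func crawlOrder(start string, depthFirst bool) []string {
	frontier := []string{start}
	seen := map[string]bool{start: true}
	var order []string
	for len(frontier) > 0 {
		var page string
		if depthFirst {
			page = frontier[len(frontier)-1]
			frontier = frontier[:len(frontier)-1]
		} else {
			page = frontier[0]
			frontier = frontier[1:]
		}
		order = append(order, page)
		// queue every link we have not seen before
		for _, next := range links[page] {
			if !seen[next] {
				seen[next] = true
				frontier = append(frontier, next)
			}
		}
	}
	return order
}

func main() {
	fmt.Println("breadth first:", crawlOrder("example-a.com", false))
	fmt.Println("depth first:  ", crawlOrder("example-a.com", true))
}
```

<p>
In the breadth first run every link found on example-a.com is visited
before anything found on those pages, while the depth first run keeps
following the most recently found link until it runs out.
</p>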
|
||||
<p>
|
||||
Now that we know how we want to decide the next url to fetch let's impl
|
||||
the loop which handles this.
|
||||
</p>
|
||||
<pre>
|
||||
<span class="Repeat"><span class="Keyword">for</span></span> <span class="-variable">i</span> <span class="Operator">:=</span> <span class="Number"><span class="Number">0</span></span><span class="-punctuation-delimiter">;</span><span class="-punctuation-delimiter">;</span> <span class="-variable">i</span><span class="Operator">++</span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">i</span> <span class="Operator">></span> <span class="Number"><span class="Number">0</span></span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="-variable">rows</span><span class="-punctuation-delimiter">,</span> <span class="-variable">err</span> <span class="Operator">:=</span> <span class="-variable">db</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Query</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String">`select url from urls where indexed is false`</span></span><span class="-punctuation-bracket">)</span>
|
||||
<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">err</span> <span class="Operator">!=</span> <span class="Constant"><span class="-constant-builtin">nil</span></span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="Statement"><span class="Keyword">return</span></span>
|
||||
<span class="-punctuation-bracket">}</span>
|
||||
<span class="Repeat"><span class="Keyword">for</span></span> <span class="-variable">rows</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Next</span></span><span class="-punctuation-bracket">(</span><span class="-punctuation-bracket">)</span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="Keyword"><span class="Keyword">var</span></span><span class="goSingleDecl"> </span><span class="-variable">test</span> <span class="Type"><span class="Type"><span class="-type-builtin">string</span></span></span>
|
||||
<span class="-variable">rows</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Scan</span></span><span class="-punctuation-bracket">(</span><span class="Operator">&</span><span class="-variable">test</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-variable">site</span> <span class="Operator">=</span> <span class="-variable">test</span>
|
||||
<span class="Comment"><span class="Comment"><span class="-spell">/* we can't just check if the site is the same because then when we're</span></span><span class="-spell">
|
||||
<span class="Comment"> * checking squi.bid/example it won't register squi.bid as the same</span>
|
||||
<span class="Comment"> * domain, although maybe that's what we want.</span>
|
||||
<span class="Comment"> */</span></span><span class="Comment"></span></span>
|
||||
<span class="Conditional"><span class="Keyword">if</span></span> <span class="Operator">!</span><span class="-variable">strings</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Contains</span></span><span class="-punctuation-bracket">(</span><span class="-variable">test</span><span class="-punctuation-delimiter">,</span> <span class="-variable">site</span><span class="-punctuation-bracket">)</span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="Statement"><span class="Keyword">break</span></span>
|
||||
<span class="-punctuation-bracket">}</span>
|
||||
<span class="-punctuation-bracket">}</span>
|
||||
<span class="-variable">rows</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Close</span></span><span class="-punctuation-bracket">(</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-punctuation-bracket">}</span>
|
||||
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String"><span class="-spell">"fetching "</span></span></span> <span class="Operator">+</span> <span class="-variable">site</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-variable">resp</span><span class="-punctuation-delimiter">,</span> <span class="-variable">err</span> <span class="Operator">:=</span> <span class="-variable">http</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Get</span></span><span class="-punctuation-bracket">(</span><span class="-variable">site</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">err</span> <span class="Operator">!=</span> <span class="Constant"><span class="-constant-builtin">nil</span></span> <span class="-punctuation-bracket">{</span>
|
||||
<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String"><span class="-spell">"Error getting"</span></span></span><span class="-punctuation-delimiter">,</span> <span class="-variable">site</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-variable">os</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Exit</span></span><span class="-punctuation-bracket">(</span><span class="Number"><span class="Number">1</span></span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-punctuation-bracket">}</span>
|
||||
<span class="-variable"><span class="Function">deal_html</span></span><span class="-punctuation-bracket">(</span><span class="-variable">site</span><span class="-punctuation-delimiter">,</span> <span class="-variable">resp</span><span class="-punctuation-delimiter">.</span><span class="-property">Body</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-variable">resp</span><span class="-punctuation-delimiter">.</span><span class="-property">Body</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Close</span></span><span class="-punctuation-bracket">(</span><span class="-punctuation-bracket">)</span>
|
||||
<span class="-punctuation-bracket">}</span>
|
||||
</pre>
|
||||
<p>
|
||||
If you read through my code you might've seen the comment about how our
|
||||
check doesn't actually prevent accessing the same site. The solution I'm
|
||||
currently thinking of is to add a column to the db which keeps the
|
||||
highest point in the site, for example: squi.bid/example/1/2/3/4 would
|
||||
have a highest point of squi.bid. But currently this isn't something I'm
|
||||
too concerned about so for now we'll just leave it as is and deal with
|
||||
another issue you might've spotted.
|
||||
</p>
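<p>
A minimal sketch of how that "highest point" could be computed, assuming it
is simply the host part of the url. The helper name hostOf is made up for
this example and isn't part of the crawler; it only relies on net/url from
the standard library.
</p>

```go
package main

import (
	"fmt"
	"net/url"
)

// hostOf is a hypothetical helper: it returns the host of rawURL without
// the port, e.g. "squi.bid" for "https://squi.bid/example/1/2/3/4".
// URLs without a host (like mailto: links) are reported as errors.
func hostOf(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	if u.Host == "" {
		return "", fmt.Errorf("no host in %q", rawURL)
	}
	return u.Hostname(), nil
}

func main() {
	links := []string{
		"https://squi.bid/example/1/2/3/4",
		"https://squi.bid/",
		"mailto:me@zacharyscheiman.com",
	}
	for _, link := range links {
		host, err := hostOf(link)
		if err != nil {
			fmt.Println("skipping:", err)
			continue
		}
		fmt.Println(link, "->", host)
	}
}
```

<p>
Storing that value in its own column would let the "same site" check compare
hosts directly instead of leaning on strings.Contains.
</p>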
|
||||
<p>
|
||||
We don't modify the db: after fetching a site successfully, at no point do
|
||||
we actually say that we fetched it. Therefore whenever we try and fetch
|
||||
a new site the program will query the db and find the same route as
|
||||
before. Thankfully this is a simple fix which just takes adding this line
|
||||
right after where we index a new site:
|
||||
</p>
|
||||
<pre>
|
||||
<span class="-variable">db</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Exec</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String">`update urls set indexed = true where url == ?`</span></span><span class="-punctuation-delimiter">,</span> <span class="-variable">site</span><span class="-punctuation-bracket">)</span>
|
||||
</pre>
|
||||
<p>
|
||||
Remember when I referenced section 3 part 4 of the software engineers
|
||||
manual? Well I regret it:
|
||||
</p>
|
||||
<pre>
|
||||
fetching https://squi.bid/
|
||||
fetching https://github.com/EggbertFluffle/signup?ref_cta=Sign+up&ref_loc=header+logged+out&ref_page=%2F%3Cuser-name%3E&source=header
|
||||
panic: runtime error: slice bounds out of range [:1] with length 0
|
||||
goroutine 1 [running]:
|
||||
main.deal_html.func1(0xc000315b90)
|
||||
/home/squibid/documents/coding/go/scraper/main.go:40 +0x2fe
|
||||
main.deal_html.func1(0x0?)
|
||||
/home/squibid/documents/coding/go/scraper/main.go:55 +0x83
|
||||
main.deal_html.func1(0xc0002dd110?)
|
||||
/home/squibid/documents/coding/go/scraper/main.go:55 +0x83
|
||||
main.deal_html.func1(0x0?)
|
||||
/home/squibid/documents/coding/go/scraper/main.go:55 +0x83
|
||||
main.deal_html.func1(0xc00032e000?)
|
||||
/home/squibid/documents/coding/go/scraper/main.go:55 +0x83
|
||||
main.deal_html.func1(0x0?)
|
||||
/home/squibid/documents/coding/go/scraper/main.go:55 +0x83
|
||||
main.deal_html.func1(0x8c36e0?)
|
||||
/home/squibid/documents/coding/go/scraper/main.go:55 +0x83
|
||||
main.deal_html.func1(0x7fd180fd9db8?)
|
||||
/home/squibid/documents/coding/go/scraper/main.go:55 +0x83
|
||||
main.deal_html({0xc00002a280, 0x7c}, {0x7fd180fd9db8?, 0xc0004bd200?})
|
||||
/home/squibid/documents/coding/go/scraper/main.go:58 +0x11e
|
||||
main.main()
|
||||
/home/squibid/documents/coding/go/scraper/main.go:105 +0x12d
|
||||
exit status 2
|
||||
</pre>
|
||||
<p>
|
||||
Turns out we need some stability if we want to actually use the code.
|
||||
This, like most bugs, is another simple fix which just takes guarding our
|
||||
url handler with a call to len to make sure we're not doing anything
|
||||
stupid on empty strings.
|
||||
</p>
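<p>
Roughly, the guard could look like the sketch below. The panic message
(slice bounds out of range [:1] with length 0) suggests slicing the first
character off an empty href, so the check here happens before the slice.
The function name isRootRelative is an assumption for illustration, not the
name used in the crawler.
</p>

```go
package main

import "fmt"

// isRootRelative is a hypothetical helper: it reports whether href starts
// with "/". The len check runs first, so href[:1] is never evaluated on an
// empty string and can't panic.
func isRootRelative(href string) bool {
	return len(href) > 0 && href[:1] == "/"
}

func main() {
	for _, href := range []string{"", "/", "/blog/", "mailto:me@zacharyscheiman.com"} {
		fmt.Printf("%q root-relative: %v\n", href, isRootRelative(href))
	}
}
```

<p>
strings.HasPrefix(href, "/") from the standard library does the same job and
handles the empty string on its own.
</p>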
|
||||
<p>
|
||||
And now it works! With a small exception, but here's a clean run showing
|
||||
my lil web crawler doing its thing... and failing pretty fast.
|
||||
</p>
|
||||
<pre>
|
||||
fetching https://squi.bid/
|
||||
fetching https://eggbert.xyz/
|
||||
fetching https://www.linkedin.com/in/harrison-diambrosio-505443229/
|
||||
fetching https://github.com/EggbertFluffle/
|
||||
fetching https://support.github.com?tags=dotcom-footer
|
||||
fetching https://docs.github.com/
|
||||
fetching https://services.github.com
|
||||
fetching https://github.com/github
|
||||
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&type=repositories
|
||||
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&type=repositories/git-guides
|
||||
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&type=repositories/git-guides/git-guides
|
||||
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&type=repositories/git-guides/git-guides/git-guides
|
||||
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&type=repositories/git-guides/git-guides/git-guides/git-guides
|
||||
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&type=repositories/git-guides/git-guides/git-guides/git-guides/git-guides
|
||||
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&type=repositories/git-guides/git-guides/git-guides/git-guides/git-guides/git-guides
|
||||
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&type=repositories/git-guides/git-guides/git-guides/git-guides/git-guides/git-guides/git-guides
|
||||
...
|
||||
</pre>
|
||||
<p>
|
||||
I'm sure you can use your imagination to figure out how long that was
|
||||
going to go on for. This bug would be partially fixed by switching
|
||||
which site we're searching, but ultimately we wouldn't make it far if we
|
||||
keep falling for these redirections that just keep going. For now that's
|
||||
fine though, and I have a semi-working web crawler. All code can be found
|
||||
here:
|
||||
<a href="https://git.squi.bid/squibid/web-crawler">git.squi.bid/squibid/web-crawler</a>.
|
||||
Thank you for reading, I'll probably write a followup when I find some
|
||||
time.
|
||||
</p>
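<p>
One guess at the ever-growing urls in that run: if a relative href gets
glued onto the end of the current url as a plain string, every fetch adds
another /git-guides. A minimal sketch of resolving links with net/url
instead is below. The function name resolve and the shortened example url
are assumptions for this sketch, and I haven't confirmed this is the actual
cause in the crawler.
</p>

```go
package main

import (
	"fmt"
	"net/url"
)

// resolve is a hypothetical helper: it turns href into an absolute url
// relative to base, following the usual reference resolution rules rather
// than string concatenation.
func resolve(base, href string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	r, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return b.ResolveReference(r).String(), nil
}

func main() {
	// Shortened version of the search url from the run above.
	base := "https://github.com/github/search?q=topic&type=repositories"
	for _, href := range []string{"/git-guides", "git-guides", "https://docs.github.com/"} {
		abs, err := resolve(base, href)
		if err != nil {
			fmt.Println("skipping:", err)
			continue
		}
		fmt.Println(href, "->", abs)
	}
}
```

<p>
Resolved this way, the query string of the page being crawled is dropped
instead of carried into every link found on it, so the same link resolves to
the same url no matter how many times it's seen.
</p>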
|
||||
<p id="footnote-1">
|
||||
[1] Yes, I'm just including more links to make my site a good starting point,
|
||||
why do you ask?
|
||||
</p>
|
||||
|
||||
]]></description>
|
||||
</item>
|
||||
|
||||
|
||||
<item>
|
||||
<title> 'New Keyboard!'</title>
|
||||
<guid>https://squi.bid/blog/New-Keyboard!/index.html</guid>
|
||||
|
@@ -18,8 +18,9 @@
|
||||
Welcome to my website. I do a bunch of coding, I try to lean towards
|
||||
lower level languages (mostly C) as I find it more fun when there's a
|
||||
challenge. As for the content of this website: I put blog posts up
|
||||
when I've got something interesting to talk about and host my own git
|
||||
server with stuff I make (you can find the things I'm proud of below).
|
||||
when I've got something interesting to talk about. In addition to my
|
||||
blog I host my own git server with stuff I make (you can find links
|
||||
to the things I'm proud of below).
|
||||
</p>
|
||||
<p>
|
||||
Thank you for visiting, if you've got something you wanna say to me,
|
||||
|
13
style.css
@@ -35,13 +35,24 @@
|
||||
html, body {
|
||||
max-height: 100%;
|
||||
background-color: var(--site-bg);
|
||||
max-width: 80ch;
|
||||
max-width: 80ch;
|
||||
margin: auto;
|
||||
}
|
||||
|
||||
@media (orientation: portrait) {
|
||||
html, body {
|
||||
max-width: 100% !important;
|
||||
padding: 5px !important;
|
||||
}
|
||||
}
|
||||
|
||||
#font, p, ul, ol, h1, h2, h3, h4, h5, table {
|
||||
font-family: sans-serif;
|
||||
color: white;
|
||||
}
|
||||
pre {
|
||||
color: white;
|
||||
}
|
||||
h1 {
|
||||
font-size: 3em;
|
||||
}
|
||||
|