<!DOCTYPE HTML>
<html lang="en">
<title>'Writing my own web crawler in go'</title>
<meta name="date" content="2025/08/21">
<link rel="stylesheet" href="/style.css">
<style>
.-spell {}
.Number {color: #e29eca}
.Function {color: #c1c0d4}
.String {color: #90b99f}
.Character {color: #90b99f}
.PreCondit {color: #ea83a5}
.-property {color: #c1c0d4}
.Comment {color: #757581; font-style: italic}
.Macro {color: #ea83a5}
.-type-builtin {color: #e29eca}
.Type {color: #b9aeda}
.Keyword {color: #aca1cf}
.Constant {color: #ea83a5}
.-variable-parameter {color: #e29eca}
.-punctuation-delimiter {color: #9998a8}
.-punctuation-bracket {color: #9998a8}
.Include {color: #aca1cf}
.Structure {color: #e6b99d}
.-variable {color: #c9c7cd}
.Operator {color: #e6b99d}
</style>
<body id="blog">
<h1>Writing my own web crawler in go</h1>
<p>
I got bored; it happens to everyone (especially software developers).
So, like every software developer, I've started a new side project: a
web crawler. This isn't for any actual use case, I kinda just wanna learn
how to use Go and (hopefully) sharpen my SQL skills in the process. At
this point in the writing process I've only just started, so let me show
you what I've gotten working so far.
</p>
<p>
To start I'm just searching through the hrefs of all &lt;a&gt; tags on a
site and printing them out. On my site that looks like this:
</p>
<pre>
/
mailto:me@zacharyscheiman.com
https://github.com/squibid
https://codeberg.org/squibid
https://git.squi.bid/squibid/wiz
https://git.squi.bid/squibid
/blog/rss.xml
/blog/New-Keyboard!
/blog/Serializing-data-in-C
/blog/Why-"suckless"-software-is-important
/blog/What-is-a-squibid
/blog/librex-and-dots
/?all_blog
https://lunarflame.dev
https://eggbert.xyz/
</pre>
<p>
Here's the code which fetched this list:
</p>
<pre>
<span class="Statement"><span class="Keyword">package</span></span> <span class="Structure">main</span>

<span class="Statement"><span class="Keyword">import</span></span> <span class="-punctuation-bracket">(</span>
	<span class="String"><span class="String">"fmt"</span></span>
	<span class="String"><span class="String">"io"</span></span>
	<span class="String"><span class="String">"net/http"</span></span>
	<span class="String"><span class="String">"golang.org/x/net/html"</span></span>
<span class="-punctuation-bracket">)</span>

<span class="Keyword"><span class="-keyword-function">func</span></span> <span class="-variable"><span class="Function">deal_html</span></span><span class="-punctuation-bracket">(</span><span class="-variable"><span class="-variable-parameter"><span class="DiagnosticUnderlineInfo">site</span></span><span class="DiagnosticUnderlineInfo"></span></span><span class="DiagnosticUnderlineInfo"> <span class="Type"><span class="Type"><span class="-type-builtin">string</span></span></span></span><span class="Type"><span class="Type"><span class="-type-builtin"></span></span></span><span class="-punctuation-delimiter">,</span> <span class="-variable"><span class="-variable-parameter">reader</span></span> <span class="Structure">io</span><span class="-punctuation-delimiter">.</span><span class="Type">Reader</span><span class="-punctuation-bracket">)</span> <span class="-punctuation-bracket">{</span>
	<span class="-variable">doc</span><span class="-punctuation-delimiter">,</span> <span class="-variable">err</span> <span class="Operator">:=</span> <span class="-variable">html</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Parse</span></span><span class="-punctuation-bracket">(</span><span class="-variable">reader</span><span class="-punctuation-bracket">)</span>
	<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">err</span> <span class="Operator">!=</span> <span class="Constant"><span class="-constant-builtin">nil</span></span> <span class="-punctuation-bracket">{</span>
		<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String"><span class="-spell">"Error parsing HTML:"</span></span></span><span class="-punctuation-delimiter">,</span> <span class="-variable">err</span><span class="-punctuation-bracket">)</span>
		<span class="Statement"><span class="Keyword">return</span></span>
	<span class="-punctuation-bracket">}</span>

	<span class="Keyword"><span class="Keyword">var</span></span><span class="goSingleDecl"> </span><span class="-variable">walk</span> <span class="Keyword"><span class="-keyword-function">func</span></span><span class="-punctuation-bracket">(</span><span class="-variable"><span class="-variable-parameter">n</span></span> <span class="Operator">*</span><span class="Structure">html</span><span class="-punctuation-delimiter">.</span><span class="Type">Node</span><span class="-punctuation-bracket">)</span>
	<span class="-variable">walk</span> <span class="Operator">=</span> <span class="Keyword"><span class="-keyword-function">func</span></span><span class="-punctuation-bracket">(</span><span class="-variable"><span class="-variable-parameter">n</span></span> <span class="Operator">*</span><span class="Structure">html</span><span class="-punctuation-delimiter">.</span><span class="Type">Node</span><span class="-punctuation-bracket">)</span> <span class="-punctuation-bracket">{</span>
		<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">n</span><span class="-punctuation-delimiter">.</span><span class="-property">Type</span> <span class="Operator">==</span> <span class="-variable">html</span><span class="-punctuation-delimiter">.</span><span class="-property">ElementNode</span> <span class="Operator">&amp;&amp;</span> <span class="-variable">n</span><span class="-punctuation-delimiter">.</span><span class="-property">Data</span> <span class="Operator">==</span> <span class="String"><span class="String"><span class="-spell">"a"</span></span></span> <span class="-punctuation-bracket">{</span>
			<span class="Repeat"><span class="Keyword">for</span></span> <span class="-variable">_</span><span class="-punctuation-delimiter">,</span> <span class="-variable">attr</span> <span class="Operator">:=</span> <span class="Repeat"><span class="Keyword">range</span></span> <span class="-variable">n</span><span class="-punctuation-delimiter">.</span><span class="-property">Attr</span> <span class="-punctuation-bracket">{</span>
				<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Key</span> <span class="Operator">==</span> <span class="String"><span class="String"><span class="-spell">"href"</span></span></span> <span class="-punctuation-bracket">{</span>
					<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">)</span>
				<span class="-punctuation-bracket">}</span>
			<span class="-punctuation-bracket">}</span>
		<span class="-punctuation-bracket">}</span>
		<span class="Repeat"><span class="Keyword">for</span></span> <span class="-variable">c</span> <span class="Operator">:=</span> <span class="-variable">n</span><span class="-punctuation-delimiter">.</span><span class="-property">FirstChild</span><span class="-punctuation-delimiter">;</span> <span class="-variable">c</span> <span class="Operator">!=</span> <span class="Constant"><span class="-constant-builtin">nil</span></span><span class="-punctuation-delimiter">;</span> <span class="-variable">c</span> <span class="Operator">=</span> <span class="-variable">c</span><span class="-punctuation-delimiter">.</span><span class="-property">NextSibling</span> <span class="-punctuation-bracket">{</span>
			<span class="-variable"><span class="Function">walk</span></span><span class="-punctuation-bracket">(</span><span class="-variable">c</span><span class="-punctuation-bracket">)</span>
		<span class="-punctuation-bracket">}</span>
	<span class="-punctuation-bracket">}</span>
	<span class="-variable"><span class="Function">walk</span></span><span class="-punctuation-bracket">(</span><span class="-variable">doc</span><span class="-punctuation-bracket">)</span>
<span class="-punctuation-bracket">}</span>

<span class="Keyword"><span class="-keyword-function">func</span></span> <span class="-variable"><span class="Function">main</span></span><span class="-punctuation-bracket">(</span><span class="-punctuation-bracket">)</span> <span class="-punctuation-bracket">{</span>
	<span class="-variable">site</span> <span class="Operator">:=</span> <span class="String"><span class="String"><span class="-spell">"https://squi.bid/"</span></span></span>
	<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String"><span class="-spell">"fetching "</span></span></span> <span class="Operator">+</span> <span class="-variable">site</span><span class="-punctuation-bracket">)</span>
	<span class="-variable">resp</span><span class="-punctuation-delimiter">,</span> <span class="-variable">err</span> <span class="Operator">:=</span> <span class="-variable">http</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Get</span></span><span class="-punctuation-bracket">(</span><span class="-variable">site</span><span class="-punctuation-bracket">)</span>
	<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">err</span> <span class="Operator">!=</span> <span class="Constant"><span class="-constant-builtin">nil</span></span> <span class="-punctuation-bracket">{</span>
		<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String"><span class="-spell">"Error getting the website"</span></span></span><span class="-punctuation-bracket">)</span>
		<span class="Statement"><span class="Keyword">return</span></span>
	<span class="-punctuation-bracket">}</span>

	<span class="Statement"><span class="Keyword">defer</span></span> <span class="-variable">resp</span><span class="-punctuation-delimiter">.</span><span class="-property">Body</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Close</span></span><span class="-punctuation-bracket">(</span><span class="-punctuation-bracket">)</span>
	<span class="-variable"><span class="Function">deal_html</span></span><span class="-punctuation-bracket">(</span><span class="-variable">site</span><span class="-punctuation-delimiter">,</span> <span class="-variable">resp</span><span class="-punctuation-delimiter">.</span><span class="-property">Body</span><span class="-punctuation-bracket">)</span>
<span class="-punctuation-bracket">}</span>
</pre>
<p>
A short look at this output shows that we need to handle a few different
"url formats", like the mailto: and / links. For now I'm going to respect
people's privacy and not index their email addresses.
</p>
<pre>
<span class="Conditional"><span class="Keyword">if</span></span> <span class="Identifier"><span class="-variable"><span class="Function"><span class="Special">len</span></span></span></span><span class="-punctuation-bracket">(</span><span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">)</span> <span class="Operator">&gt;</span> <span class="Number"><span class="Number">1</span></span> <span class="Operator">&amp;&amp;</span> <span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">[</span><span class="-punctuation-delimiter">:</span><span class="Number"><span class="Number">2</span></span><span class="-punctuation-bracket">]</span> <span class="Operator">==</span> <span class="String"><span class="String"><span class="-spell">"//"</span></span></span> <span class="-punctuation-bracket">{</span>
	<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String"><span class="-spell">"https:"</span></span></span> <span class="Operator">+</span> <span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">)</span>
<span class="-punctuation-bracket">}</span> <span class="Conditional"><span class="Keyword">else</span></span> <span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">[</span><span class="-punctuation-delimiter">:</span><span class="Number"><span class="Number">1</span></span><span class="-punctuation-bracket">]</span> <span class="Operator">==</span> <span class="String"><span class="String"><span class="-spell">"/"</span></span></span> <span class="-punctuation-bracket">{</span>
	<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">site</span><span class="-punctuation-bracket">[</span><span class="Identifier"><span class="-variable"><span class="Function"><span class="Special">len</span></span></span></span><span class="-punctuation-bracket">(</span><span class="-variable">site</span><span class="-punctuation-bracket">)</span> <span class="Operator">-</span> <span class="Number"><span class="Number">1</span></span><span class="-punctuation-delimiter">:</span><span class="-punctuation-bracket">]</span> <span class="Operator">==</span> <span class="String"><span class="String"><span class="-spell">"/"</span></span></span> <span class="-punctuation-bracket">{</span>
		<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="-variable">site</span> <span class="Operator">+</span> <span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">[</span><span class="Number"><span class="Number">1</span></span><span class="-punctuation-delimiter">:</span><span class="-punctuation-bracket">]</span><span class="-punctuation-bracket">)</span>
	<span class="-punctuation-bracket">}</span> <span class="Conditional"><span class="Keyword">else</span></span> <span class="-punctuation-bracket">{</span>
		<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="-variable">site</span> <span class="Operator">+</span> <span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">)</span>
	<span class="-punctuation-bracket">}</span>
<span class="-punctuation-bracket">}</span> <span class="Conditional"><span class="Keyword">else</span></span> <span class="Conditional"><span class="Keyword">if</span></span> <span class="Identifier"><span class="-variable"><span class="Function"><span class="Special">len</span></span></span></span><span class="-punctuation-bracket">(</span><span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">)</span> <span class="Operator">&gt;</span> <span class="Number"><span class="Number">4</span></span> <span class="Operator">&amp;&amp;</span> <span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">[</span><span class="-punctuation-delimiter">:</span><span class="Number"><span class="Number">4</span></span><span class="-punctuation-bracket">]</span> <span class="Operator">==</span> <span class="String"><span class="String"><span class="-spell">"http"</span></span></span> <span class="-punctuation-bracket">{</span>
	<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="-variable">attr</span><span class="-punctuation-delimiter">.</span><span class="-property">Val</span><span class="-punctuation-bracket">)</span>
<span class="-punctuation-bracket">}</span>
</pre>
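<p>
For comparison, here is a rough sketch of how the standard library's
net/url package could do the same normalisation. This is not what the
crawler does; the helper name resolveLink is made up for this example.
It assumes the page url is absolute, and it drops anything that doesn't
end up as http or https (which covers the mailto: links).
</p>

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLink is a hypothetical helper: it turns an href found on page
// into an absolute http(s) URL, or reports false when the link should be
// skipped (mailto:, javascript:, values that fail to parse, ...).
func resolveLink(page, href string) (string, bool) {
	base, err := url.Parse(page)
	if err != nil {
		return "", false
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", false
	}
	// ResolveReference follows RFC 3986, so "/path", "//host/path",
	// "relative/path" and already absolute URLs are all handled here.
	abs := base.ResolveReference(ref)
	if abs.Scheme != "http" && abs.Scheme != "https" {
		return "", false
	}
	return abs.String(), true
}

func main() {
	page := "https://squi.bid/"
	hrefs := []string{
		"/blog/rss.xml",
		"//eggbert.xyz/",
		"mailto:me@zacharyscheiman.com",
		"https://lunarflame.dev",
		"", // an empty href must not panic
	}
	for _, href := range hrefs {
		if link, ok := resolveLink(page, href); ok {
			fmt.Println(link)
		}
	}
}
```

<p>
Because the reference is resolved against the parsed base url instead of
being glued onto the end of a string, a base that carries a query string
(like a search page) doesn't leak that query into the resolved link.
</p>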
<p>
Now that we've gotten the actual links from the website, it's time to
store them and fetch the links from their sites too. Because this is just
a toy project, I've decided not to store all the info I would if this
were a real project. Instead I will only be storing the link to the site
and a boolean representing whether I've fetched its contents yet. So,
let's go impl...
</p>
<p>
Step #1 is to find a SQL library. I just went with Go's built-in
database/sql package, then visited
<a href="https://golang.org/s/sqldrivers">golang.org/s/sqldrivers</a>
and decided on
<a href="https://github.com/mattn/go-sqlite3">github.com/mattn/go-sqlite3</a>
because SQLite is a name I'm familiar with, and I really don't want to go
through the hassle of looking into different dbs for a toy project.
<a href="#footnote-1">[1]</a>
</p>
<p>
Now that we've chosen our db I'll set up our table like I mentioned
earlier, with one string and one boolean:
</p>
<pre>
<span class="-variable">db</span><span class="-punctuation-delimiter">,</span> <span class="-variable">err</span> <span class="Operator">=</span> <span class="-variable">sql</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Open</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String"><span class="-spell">"sqlite3"</span></span></span><span class="-punctuation-delimiter">,</span> <span class="String"><span class="String"><span class="-spell">"./sites.db"</span></span></span><span class="-punctuation-bracket">)</span>
<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">err</span> <span class="Operator">!=</span> <span class="Constant"><span class="-constant-builtin">nil</span></span> <span class="-punctuation-bracket">{</span>
	<span class="-variable">log</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Fatal</span></span><span class="-punctuation-bracket">(</span><span class="-variable">err</span><span class="-punctuation-bracket">)</span>
<span class="-punctuation-bracket">}</span>
<span class="Statement"><span class="Keyword">defer</span></span> <span class="-variable">db</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Close</span></span><span class="-punctuation-bracket">(</span><span class="-punctuation-bracket">)</span>
<span class="-variable">db</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Exec</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String">`</span>
<span class="String"> create table if not exists</span>
<span class="String"> urls (url text not null primary key, indexed boolean not null);</span>
<span class="String"> `</span></span><span class="-punctuation-bracket">)</span>
</pre>
<p>
Now we need to start adding entries to the db. I wanted to make sure I
wouldn't end up shooting myself in the foot, so I decided to go with a
tiny function to make it a teensy tiny bit safer:
</p>
<pre>
<span class="Keyword"><span class="-keyword-function">func</span></span> <span class="-variable"><span class="Function">db_insert_url</span></span><span class="-punctuation-bracket">(</span><span class="-variable"><span class="-variable-parameter">url</span></span> <span class="Type"><span class="Type"><span class="-type-builtin">string</span></span></span><span class="-punctuation-delimiter">,</span> <span class="-variable"><span class="-variable-parameter">seen</span></span> <span class="Type"><span class="Type"><span class="-type-builtin">bool</span></span></span><span class="-punctuation-bracket">)</span> <span class="-punctuation-bracket">{</span>
	<span class="-variable">db</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Exec</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String">`insert into urls values (?, ?)`</span></span><span class="-punctuation-delimiter">,</span> <span class="-variable">url</span><span class="-punctuation-delimiter">,</span> <span class="-variable">seen</span><span class="-punctuation-bracket">)</span>
<span class="-punctuation-bracket">}</span>
</pre>
<p>
It could use some error handling, but if you look at section 3 part 4 of
the software engineer's manual it reads "side projects aren't stable
because if it's stable it's not a side project".
</p>
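<p>
For anyone who does want the error handling, a minimal sketch might look
like the following. Everything here is an assumption layered on top of
the post: insertURL, execer and failingDB are made-up names, and "insert
or ignore" is SQLite syntax that skips a row whose primary key already
exists instead of failing. The small interface is only there so the
sketch runs without a real database; a *sql.DB satisfies it.
</p>

```go
package main

import (
	"database/sql"
	"errors"
	"fmt"
)

// execer is the subset of *sql.DB this sketch needs. It is declared only
// so the example can run without a database driver.
type execer interface {
	Exec(query string, args ...any) (sql.Result, error)
}

// insertURL is a hypothetical variant of db_insert_url that reports
// failures to the caller instead of silently dropping them.
func insertURL(db execer, url string, indexed bool) error {
	_, err := db.Exec(`insert or ignore into urls values (?, ?)`, url, indexed)
	if err != nil {
		return fmt.Errorf("inserting %q: %w", url, err)
	}
	return nil
}

// failingDB stands in for a broken connection in the demo below.
type failingDB struct{}

func (failingDB) Exec(string, ...any) (sql.Result, error) {
	return nil, errors.New("database is closed")
}

func main() {
	if err := insertURL(failingDB{}, "https://squi.bid/", false); err != nil {
		fmt.Println("could not store url:", err)
	}
}
```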
<p>
Now that we've gotten all the easy stuff out of the way it's time to work
on making this run forever, or close to it at least. For now I'm going to
keep this project in an inefficient state, so we're not going to use
worker pools or anything fancy like that. To get started we first need
to make a decision: depth or breadth first searching? In case you're not
sure what I mean by this, let me give you an example:
</p>
<p>
Let's say we have site example-a.com which contains the following links:
</p>
<ul>
<li>example-a.com/blog</li>
<li>example-b.com</li>
<li>example-c.com</li>
</ul>
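<p>
With a breadth first search we would visit all three of these links
before following anything they link to, whereas with a depth first search
we would go with example-a.com/blog and keep following its links before
coming back for example-b and example-c. I'm going to use the terms a bit
loosely: for my use case I want to find as many sites as possible, so I
will be targeting sites with other base urls, like example-b or
example-c, first.
</p>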
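<p>
In the textbook sense the two orders only differ in whether the list of
pages still to visit behaves like a queue or like a stack. The sketch
below shows that on a made-up, in-memory link graph; the extra pages
(example-a.com/blog/post and example-d.com) and the crawl function are
invented for the illustration and nothing is fetched over the network.
</p>

```go
package main

import "fmt"

// links is a made-up link graph used only for this illustration.
var links = map[string][]string{
	"example-a.com":      {"example-a.com/blog", "example-b.com", "example-c.com"},
	"example-a.com/blog": {"example-a.com/blog/post"},
	"example-b.com":      {"example-d.com"},
}

// crawl visits every page reachable from start and returns the order in
// which pages were visited. With depthFirst false the frontier is used as
// a queue (breadth first); with depthFirst true it is used as a stack.
func crawl(start string, depthFirst bool) []string {
	frontier := []string{start}
	seen := map[string]bool{start: true}
	var order []string

	for len(frontier) > 0 {
		var page string
		if depthFirst {
			// Pop the most recently added page.
			page, frontier = frontier[len(frontier)-1], frontier[:len(frontier)-1]
		} else {
			// Take the oldest page.
			page, frontier = frontier[0], frontier[1:]
		}
		order = append(order, page)

		children := links[page]
		for i := range children {
			next := children[i]
			if depthFirst {
				// Push in reverse so the first link on the page is popped first.
				next = children[len(children)-1-i]
			}
			if !seen[next] {
				seen[next] = true
				frontier = append(frontier, next)
			}
		}
	}
	return order
}

func main() {
	fmt.Println("breadth first:", crawl("example-a.com", false))
	fmt.Println("depth first:  ", crawl("example-a.com", true))
}
```

<p>
Breadth first reaches example-b.com and example-c.com before
example-a.com/blog/post, while depth first finishes everything under
example-a.com/blog before it touches example-b.com.
</p>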
<p>
Now that we know how we want to decide the next url to fetch let's impl
the loop which handles this.
</p>
<pre>
<span class="Repeat"><span class="Keyword">for</span></span> <span class="-variable">i</span> <span class="Operator">:=</span> <span class="Number"><span class="Number">0</span></span><span class="-punctuation-delimiter">;</span><span class="-punctuation-delimiter">;</span> <span class="-variable">i</span><span class="Operator">++</span> <span class="-punctuation-bracket">{</span>
	<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">i</span> <span class="Operator">&gt;</span> <span class="Number"><span class="Number">0</span></span> <span class="-punctuation-bracket">{</span>
		<span class="-variable">rows</span><span class="-punctuation-delimiter">,</span> <span class="-variable">err</span> <span class="Operator">:=</span> <span class="-variable">db</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Query</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String">`select url from urls where indexed is false`</span></span><span class="-punctuation-bracket">)</span>
		<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">err</span> <span class="Operator">!=</span> <span class="Constant"><span class="-constant-builtin">nil</span></span> <span class="-punctuation-bracket">{</span>
			<span class="Statement"><span class="Keyword">return</span></span>
		<span class="-punctuation-bracket">}</span>

		<span class="Repeat"><span class="Keyword">for</span></span> <span class="-variable">rows</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Next</span></span><span class="-punctuation-bracket">(</span><span class="-punctuation-bracket">)</span> <span class="-punctuation-bracket">{</span>
			<span class="Keyword"><span class="Keyword">var</span></span><span class="goSingleDecl"> </span><span class="-variable">test</span> <span class="Type"><span class="Type"><span class="-type-builtin">string</span></span></span>
			<span class="-variable">rows</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Scan</span></span><span class="-punctuation-bracket">(</span><span class="Operator">&amp;</span><span class="-variable">test</span><span class="-punctuation-bracket">)</span>
			<span class="-variable">site</span> <span class="Operator">=</span> <span class="-variable">test</span>
			<span class="Comment"><span class="Comment"><span class="-spell">/* we can't just check if the site is the same because then when we're</span></span><span class="-spell">
			<span class="Comment"> * checking squi.bid/example it won't register squi.bid as the same</span>
			<span class="Comment"> * domain, although maybe that's what we want.</span>
			<span class="Comment"> */</span></span><span class="Comment"></span></span>
			<span class="Conditional"><span class="Keyword">if</span></span> <span class="Operator">!</span><span class="-variable">strings</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Contains</span></span><span class="-punctuation-bracket">(</span><span class="-variable">test</span><span class="-punctuation-delimiter">,</span> <span class="-variable">site</span><span class="-punctuation-bracket">)</span> <span class="-punctuation-bracket">{</span>
				<span class="Statement"><span class="Keyword">break</span></span>
			<span class="-punctuation-bracket">}</span>
		<span class="-punctuation-bracket">}</span>
		<span class="-variable">rows</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Close</span></span><span class="-punctuation-bracket">(</span><span class="-punctuation-bracket">)</span>
	<span class="-punctuation-bracket">}</span>

	<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String"><span class="-spell">"fetching "</span></span></span> <span class="Operator">+</span> <span class="-variable">site</span><span class="-punctuation-bracket">)</span>
	<span class="-variable">resp</span><span class="-punctuation-delimiter">,</span> <span class="-variable">err</span> <span class="Operator">:=</span> <span class="-variable">http</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Get</span></span><span class="-punctuation-bracket">(</span><span class="-variable">site</span><span class="-punctuation-bracket">)</span>
	<span class="Conditional"><span class="Keyword">if</span></span> <span class="-variable">err</span> <span class="Operator">!=</span> <span class="Constant"><span class="-constant-builtin">nil</span></span> <span class="-punctuation-bracket">{</span>
		<span class="-variable">fmt</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Println</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String"><span class="-spell">"Error getting"</span></span></span><span class="-punctuation-delimiter">,</span> <span class="-variable">site</span><span class="-punctuation-bracket">)</span>
		<span class="-variable">os</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Exit</span></span><span class="-punctuation-bracket">(</span><span class="Number"><span class="Number">1</span></span><span class="-punctuation-bracket">)</span>
	<span class="-punctuation-bracket">}</span>

	<span class="-variable"><span class="Function">deal_html</span></span><span class="-punctuation-bracket">(</span><span class="-variable">site</span><span class="-punctuation-delimiter">,</span> <span class="-variable">resp</span><span class="-punctuation-delimiter">.</span><span class="-property">Body</span><span class="-punctuation-bracket">)</span>

	<span class="-variable">resp</span><span class="-punctuation-delimiter">.</span><span class="-property">Body</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Close</span></span><span class="-punctuation-bracket">(</span><span class="-punctuation-bracket">)</span>
<span class="-punctuation-bracket">}</span>
</pre>
<p>
If you read through my code you might've seen the comment about how our
check doesn't actually prevent accessing the same site. The solution I'm
currently thinking of is to add a column to the db which keeps the
highest point in the site; for example, squi.bid/example/1/2/3/4 would
have a highest point of squi.bid. But this isn't something I'm too
concerned about right now, so we'll just leave it as is and deal with
another issue you might've spotted.
</p>
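<p>
As a rough idea of what that "highest point" could be, the sketch below
uses the host part of the url as reported by net/url. The helpers
siteRoot and pickNext are hypothetical names, not part of the crawler,
and the sketch assumes the stored urls are absolute (they include a
scheme). One thing to note about the loop above: site is overwritten
with test before the strings.Contains check, so that check always passes
and the loop never breaks early. The sketch compares against the
current site instead.
</p>

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// siteRoot is a hypothetical helper that returns the host of an absolute
// URL, e.g. "squi.bid" for "https://squi.bid/example/1/2/3/4".
func siteRoot(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	host := strings.ToLower(u.Hostname())
	if host == "" {
		return "", fmt.Errorf("no host in %q (missing scheme?)", raw)
	}
	return host, nil
}

// pickNext returns the first candidate whose host differs from the host
// of current. If every candidate lives on the same host (or cannot be
// parsed) it falls back to the first candidate; ok is false only when
// there are no candidates at all.
func pickNext(current string, candidates []string) (next string, ok bool) {
	if len(candidates) == 0 {
		return "", false
	}
	cur, _ := siteRoot(current)
	for _, c := range candidates {
		if root, err := siteRoot(c); err == nil && root != cur {
			return c, true
		}
	}
	return candidates[0], true
}

func main() {
	root, _ := siteRoot("https://squi.bid/example/1/2/3/4")
	fmt.Println(root) // squi.bid

	next, _ := pickNext("https://squi.bid/", []string{
		"https://squi.bid/blog/rss.xml",
		"https://eggbert.xyz/",
	})
	fmt.Println(next) // https://eggbert.xyz/
}
```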
<p>
We don't modify the db: after fetching a site successfully, at no point
do we actually say that we fetched it. Therefore whenever we try to fetch
a new site the program will query the db and find the same route as
before. Thankfully this is a simple fix which just takes adding this line
right after where we index a new site:
</p>
<pre>
<span class="-variable">db</span><span class="-punctuation-delimiter">.</span><span class="-property"><span class="Function">Exec</span></span><span class="-punctuation-bracket">(</span><span class="String"><span class="String">`update urls set indexed = true where url == ?`</span></span><span class="-punctuation-delimiter">,</span> <span class="-variable">site</span><span class="-punctuation-bracket">)</span>
</pre>
<p>
Remember when I referenced section 3 part 4 of the software engineer's
manual? Well, I regret it:
</p>
<pre>
fetching https://squi.bid/
fetching https://github.com/EggbertFluffle/signup?ref_cta=Sign+up&amp;ref_loc=header+logged+out&amp;ref_page=%2F%3Cuser-name%3E&amp;source=header
panic: runtime error: slice bounds out of range [:1] with length 0

goroutine 1 [running]:
main.deal_html.func1(0xc000315b90)
	/home/squibid/documents/coding/go/scraper/main.go:40 +0x2fe
main.deal_html.func1(0x0?)
	/home/squibid/documents/coding/go/scraper/main.go:55 +0x83
main.deal_html.func1(0xc0002dd110?)
	/home/squibid/documents/coding/go/scraper/main.go:55 +0x83
main.deal_html.func1(0x0?)
	/home/squibid/documents/coding/go/scraper/main.go:55 +0x83
main.deal_html.func1(0xc00032e000?)
	/home/squibid/documents/coding/go/scraper/main.go:55 +0x83
main.deal_html.func1(0x0?)
	/home/squibid/documents/coding/go/scraper/main.go:55 +0x83
main.deal_html.func1(0x8c36e0?)
	/home/squibid/documents/coding/go/scraper/main.go:55 +0x83
main.deal_html.func1(0x7fd180fd9db8?)
	/home/squibid/documents/coding/go/scraper/main.go:55 +0x83
main.deal_html({0xc00002a280, 0x7c}, {0x7fd180fd9db8?, 0xc0004bd200?})
	/home/squibid/documents/coding/go/scraper/main.go:58 +0x11e
main.main()
	/home/squibid/documents/coding/go/scraper/main.go:105 +0x12d
exit status 2
</pre>
<p>
Turns out we need some stability if we want to actually use the code.
This, like most bugs, is another simple fix which just takes guarding our
url handler with a call to len to make sure we're not doing anything
stupid on empty strings.
</p>
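<p>
One way such a guard could look is sketched below. It swaps the manual
len checks and slicing for strings.HasPrefix, which simply returns false
when the string is shorter than the prefix, so an empty href can't
trigger a slice out of range. The function linkKind and its labels are
made up for this example.
</p>

```go
package main

import (
	"fmt"
	"strings"
)

// linkKind is a hypothetical classifier for the cases the url handler
// distinguishes. No slicing is involved, so any input is safe.
func linkKind(href string) string {
	switch {
	case href == "":
		return "empty"
	case strings.HasPrefix(href, "//"):
		return "protocol-relative"
	case strings.HasPrefix(href, "/"):
		return "site-relative"
	case strings.HasPrefix(href, "http://"), strings.HasPrefix(href, "https://"):
		return "absolute"
	default:
		return "other"
	}
}

func main() {
	for _, href := range []string{"", "/", "//eggbert.xyz/", "https://lunarflame.dev", "mailto:me@zacharyscheiman.com"} {
		fmt.Printf("%q is %s\n", href, linkKind(href))
	}
}
```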
<p>
And now it works! With a small exception, but here's a clean run showing
my lil web crawler doing its thing... and failing pretty fast.
</p>
<pre>
fetching https://squi.bid/
fetching https://eggbert.xyz/
fetching https://www.linkedin.com/in/harrison-diambrosio-505443229/
fetching https://github.com/EggbertFluffle/
fetching https://support.github.com?tags=dotcom-footer
fetching https://docs.github.com/
fetching https://services.github.com
fetching https://github.com/github
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&amp;type=repositories
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&amp;type=repositories/git-guides
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&amp;type=repositories/git-guides/git-guides
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&amp;type=repositories/git-guides/git-guides/git-guides
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&amp;type=repositories/git-guides/git-guides/git-guides/git-guides
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&amp;type=repositories/git-guides/git-guides/git-guides/git-guides/git-guides
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&amp;type=repositories/git-guides/git-guides/git-guides/git-guides/git-guides/git-guides
fetching https://github.com/github/search?q=topic%3Aactions+org%3Agithub+fork%3Atrue&amp;type=repositories/git-guides/git-guides/git-guides/git-guides/git-guides/git-guides/git-guides
...
</pre>
<p>
I'm sure you can use your imagination to figure out how long that was
going to go on for. This bug would be partially fixed by switching
which site we're searching, but ultimately we wouldn't make it far if we
keep falling for these urls that just keep growing. For now that's
fine though, and I have a semi-working web crawler. All code can be found
here:
<a href="https://git.squi.bid/squibid/web-crawler">git.squi.bid/squibid/web-crawler</a>.
Thank you for reading, I'll probably write a followup when I find some
time.
</p>
<p id="footnote-1">
[1] Yes I'm just including more links to make my site a good starting
point, why do you ask?
</p>
</body>
</html>