A crawler allows you to extract similar data from each page of a website. For example, you might want all of the movie data from a popular movie database such as IMDB. With a crawler, you map the data you are interested in on one page (with a few examples) and get all the data from similar pages.
Step 1 - Start extraction
Once signed in, you will land on your data page. Here you can see all of the APIs that you have already made. To begin extraction, click on the “New” drop-down menu in the top left-hand corner and select “Crawler”.
Step 2 - Find Data
Use the URL bar or the Google-powered search box to navigate to the page you want data from.
In this example we are going to use movies from IMDB.
When you have arrived at your data, press “I’m there!”
Step 3 - Multi or Single Row
Depending on what your data looks like, you will need to select either single or multi row.
For our example extractor, because there are multiple results on one page, we will choose multi.
Hint: In general, one example of data = one row.
Step 4 - Detect Optimal Settings
Step 5 - Train Rows and Columns
When you have all your data, click “I’ve got what I need!”
For this example we will choose single row, because there is only one movie per page.
Note: If you selected single row in Step 3, you will not need to map rows (because there is only one) and you can move straight to mapping columns.
Step 6 - Add Pages
You will now need to add at least 5 examples to make sure that the crawler has understood your data correctly.
Click “Add another page”.
Navigate to another example of the type of page you want data from, and press “I’m there”.
The data should be pulled into the table automatically; however, it may need extra training.
When you are satisfied that the crawler is returning data correctly, and you have at least 5 examples, press “I’m done training”.
Step 7 - Upload Crawler Configuration
Upload your crawler’s configuration to our cloud server by pressing the “Upload to import.io” button and giving your crawler a name.
Step 8 - Run Crawler
Start crawling by clicking on “Run Crawler”.
You’ll arrive at the crawler screen.
Step 9 - Crawler Settings
Before running the crawler you should check the settings in the right-hand pane.
Where to start: by default, the crawler will start from the pages you gave as examples. However, it is sometimes more efficient to start from somewhere more central to the site (like the homepage).
To change where the crawler starts, click in the box and type in the URL you would like the crawler to start from.
Page depth: this is the maximum number of clicks from the start URL that the crawler will travel to find data. To learn about page depth, please look at the diagram below:
In this example, we have taken one of the crawler’s start URLs (orange circle). A page depth of 1 looks at all of the links on this page (the arrows) and goes to the resulting pages (yellow circles); in this instance there are 5. A page depth of 2 then looks at all of the links on these 5 pages and goes to those pages (green circles). You can see that even with a page depth of just 2, the crawler has returned 25 results from one page. By default, page depth is set to 10 (the maximum allowed) so that you can get all the data. However, the fewer clicks the crawler needs to travel, the quicker your data will be returned, so lower the page depth if possible.
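The page depth idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how the import.io crawler is implemented: the site is a hypothetical in-memory link graph (`links`), and `crawl_depths` is a made-up helper that walks it breadth-first and records how many clicks each page is from the start URL.

```python
from collections import deque


def crawl_depths(links, start, max_depth):
    """Breadth-first walk of a link graph.

    Returns {page: depth} for every page reachable within max_depth clicks
    of the start page. `links` maps a page to the pages it links to.
    """
    depth_of = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depth_of[page] == max_depth:
            continue  # do not follow links beyond the page depth limit
        for target in links.get(page, []):
            if target not in depth_of:  # skip pages already discovered
                depth_of[target] = depth_of[page] + 1
                queue.append(target)
    return depth_of


# Hypothetical site: the start page links to 5 pages,
# and each of those links to 5 further pages.
links = {"start": [f"a{i}" for i in range(5)]}
for i in range(5):
    links[f"a{i}"] = [f"a{i}-b{j}" for j in range(5)]

depths = crawl_depths(links, "start", max_depth=2)
pages_at_depth_1 = [p for p, d in depths.items() if d == 1]
pages_at_depth_2 = [p for p, d in depths.items() if d == 2]
print(len(pages_at_depth_1), len(pages_at_depth_2))  # 5 25
```

With `max_depth=1` the same walk stops at the 5 first-level pages, which is why a smaller page depth finishes sooner.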
You can also switch the toggle to Advanced to have even more control over your crawler!
Hint: you can always change your crawler’s settings back to the default by pressing “Defaults” (just below the big pink “Go!” button), or load a previously saved setting by pressing “Reload settings”.
Step 10 - Crawl!
When you are satisfied with your crawler’s settings, click the “Go” button to begin crawling!
The crawler will begin pulling data into the table. If you did not change “Where to Crawl” from the default, it will begin with the pages you mapped when building the crawler.
Step 11 - Monitor the Crawler
Crawling can sometimes be a slow process, so be patient with your crawler and wait for it to return data.
While the crawler is running there will be a number of information graphics displayed above the table. From left to right they are as follows:
Estimated time: tells you how long the crawler has left.
Pages/hour: indicates how fast your crawler is returning data. The higher this number, the faster your crawler is running.
Queued: shows how many pages that match your data pattern are left to be converted into data.
Requests: the number of pages retrieved from the web.
Unavailable: indicates the number of pages the crawler traveled to that are no longer available (i.e. 404 errors).
Converted: shows how many pages that match your data pattern have been successfully converted into data.
Failed: shows how many pages that match your data pattern failed to be converted into data.
Rows: the number of rows now in your table.
Blocked: the number of pages that the crawler has been blocked from visiting.
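To make the relationship between some of these counters concrete, here is a rough sketch. The formulas are assumptions made for illustration (the exact calculations import.io uses are not documented here), and `pages_per_hour` and `estimated_hours_left` are hypothetical helper names.

```python
def pages_per_hour(pages_done, elapsed_seconds):
    """Average rate so far; an assumed formula, not import.io's own."""
    if elapsed_seconds <= 0:
        return 0.0
    return pages_done * 3600 / elapsed_seconds


def estimated_hours_left(queued, rate):
    """Assumed estimate: pages still queued divided by the current rate."""
    if rate <= 0:
        return None  # no rate yet, so no meaningful estimate
    return queued / rate


# Hypothetical crawl: 150 pages handled in 30 minutes, 600 still queued.
rate = pages_per_hour(150, 30 * 60)
remaining = estimated_hours_left(600, rate)
print(rate, remaining)  # 300.0 2.0
```

Under these assumptions, a growing “Queued” count or a falling “Pages/hour” figure would both push the estimated time up.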
Step 12 - Upload Data
When you are finished crawling, or you want to stop for any reason, press the “Stop” button.
Hint: This will stop the crawl entirely and you will not be able to pick up where you left off. Once you have confirmed the stop, the only way to continue crawling is to press the “Cancel” button, which takes you back to the “Go” button in Step 10. You will then have to re-crawl.
If you are satisfied with your crawled data, you can upload it to our cloud server by pressing “Upload data”. If not, press “Cancel”, adjust your settings and try crawling again.
Step 13 - Open in Dataset
Your crawled data will now open in a new dataset.
What is a Crawler Anyway?
A crawler is a software program which navigates web pages and domains, following hypertext links to discover and access web page content. Once it has accessed these pages, it can perform tasks such as indexing the content, analysing the data, or otherwise interacting with the information. Common synonyms for a crawler are bot, spider, robot, indexer, and ‘web crawler’; these terms are used interchangeably.
A crawler is defined by its ability to move between web pages, and its ability to ‘visit’ pages without needing user input. This allows it to automatically traverse and consume a much larger number of pages than a human could ever hope to.
The most common crawlers operate on the web to index or scrape content. The most notable example of a crawler is ‘Googlebot’, which is used to create a large index of web data used to provide search services.
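The “following hypertext links” part of that definition can be illustrated with Python’s standard library. This is a minimal sketch, not the import.io crawler: `LinkCollector` is a hypothetical class, and the HTML snippet and `example.com` URLs are invented for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL
                self.links.append(urljoin(self.base_url, value))


# Invented page content standing in for a downloaded web page
html = '<p><a href="/title/1">One</a> <a href="https://example.com/title/2">Two</a></p>'
collector = LinkCollector("https://example.com/index")
collector.feed(html)
print(collector.links)
# ['https://example.com/title/1', 'https://example.com/title/2']
```

A crawler repeats this step on each page it visits, adding the newly found links to its queue of pages to fetch.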
How import.io crawling is different
An import.io Crawler accesses web pages like other crawlers, but once it has found these pages it uses import.io extractions to turn the information from pages into structured data. These crawlers are used to create APIs of web data which enable consumers to access the data, and regularly update it by triggering a targeted re-crawl of the sites.
The mechanism to control a crawler’s access to a website is a server config file called robots.txt. Website owners use the /robots.txt file to give instructions about their site to web robots; this is called the Robots Exclusion Protocol. Some crawlers obey robots.txt; this is at the discretion of the crawler's owner.
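As an illustration of how a crawler that chooses to obey robots.txt might check a URL, here is a short sketch using Python’s standard `urllib.robotparser` module. The robots.txt content, the `ExampleBot` user agent, and the `example.com` URLs are all made up for the example; this does not describe how import.io itself handles robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt: every robot is asked to stay out of /private/
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

allowed = parser.can_fetch("ExampleBot", "https://example.com/title/1")
blocked = parser.can_fetch("ExampleBot", "https://example.com/private/report")
print(allowed, blocked)  # True False
```

In a real crawl the file would be fetched from the site’s /robots.txt path rather than supplied as a string.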