A crawler allows you to extract similar data from each page of a website. For example, you might want all of the movie data from a popular movie database such as IMDB. A crawler allows you to map the data you are interested in on one page (with a few examples) and then collect the same data from all similar pages.
Step 1 - Add New Data Source
Open the app.
Click “+ New Data Source”.
This will open a new tab and launch you into the workflow. The bottom right-hand pane is where instructions will appear throughout the workflow; familiarize yourself with them, and when you're ready press "Let's get cracking!".
Step 2 - Choose Crawler
Select the crawler picture in the middle (the one that looks like a bug) to start building your crawler.
Step 3 - Find Data
Use the URL bar or the Google-powered search box to navigate to the page you want data from.
In this example we are going to use movies from IMDB.
When you have arrived at your data, press "I'm there!"
Step 4 - Multi or Single Row
Depending on what your data looks like, you will need to select either single or multi row.
For our example crawler we will choose single row, because each IMDB movie page contains only one movie.
Hint: In general, one example of data = one row.
Step 5 - Detect Optimal Settings
Step 6 - Train Rows and Columns
When you have all your data, click "I've got what I need!"
Note: If you selected single row in step 4, you will not need to map rows (because there is only one) and you can move straight to mapping columns.
Step 7 - Add Pages
You will now need to add at least 5 examples to make sure that the crawler has understood your data correctly.
Click “Add another page”
Navigate to another example of the type of page you want data from, and press “I’m there”.
The data should be pulled into the table automatically; however, it may need extra training.
When you are satisfied that the crawler is returning data correctly, and you have at least 5 examples press “I’m done training”.
Step 8 - Upload Crawler Configuration
Upload your crawler's configuration to our cloud server by pressing the "Upload to import.io" button and giving your crawler a name.
Step 9 - Run Crawler
Start crawling by clicking on “Run Crawler”.
You'll arrive at this screen:
Step 10 - Crawler Settings
Before running the crawler you should check the settings in the right-hand pane.
Where to start: by default, the crawler will start from the pages you gave as examples. However, it is sometimes more efficient to start from somewhere more central to the site (like the homepage).
To change where the crawler starts, click in the box and type the URL you would like the crawler to start from.
Page depth: this is the maximum number of clicks from the start URL that the crawler will travel to find data. By default it is set to 10 (the maximum allowed) so that you can get all the data. However, the fewer clicks the crawler needs to travel, the faster your data will be returned, so it is a good idea to set this to a lower number if possible.
To change the page depth, click the drop down menu and select the number you want.
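Page depth can be pictured as a breadth-first traversal with a cut-off. The sketch below is plain Python over a made-up link graph, not the import.io crawler itself; it shows how raising the depth limit expands the set of pages a crawl can reach, and why a lower limit finishes sooner.

```python
from collections import deque

# A toy link graph standing in for a website: each page lists the pages it
# links to. The paths are hypothetical, chosen only to illustrate page depth.
SITE = {
    "/": ["/movies", "/about"],
    "/movies": ["/movies/page2", "/title/tt0111161"],
    "/movies/page2": ["/title/tt0068646"],
    "/title/tt0111161": [],
    "/title/tt0068646": [],
    "/about": [],
}

def reachable(start, max_depth):
    """Return the set of pages within max_depth clicks of start."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # don't follow links beyond the page-depth limit
        for link in SITE.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen

print(len(reachable("/", 1)))  # start page plus its directly linked pages: 3
print(len(reachable("/", 3)))  # a deeper limit reaches all 6 pages
```

With depth 1 the crawl stops at the pages one click from the start; each extra level of depth can multiply the number of pages visited, which is why the crawl slows down as the limit grows.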
You can also switch the toggle to Advanced to have even more control over your crawler!
Hint: you can always change your crawler's settings back to the default by pressing "Defaults" (just below the big pink "Go!" button), or load previously saved settings by pressing "Reload settings".
Step 11 - Crawl!
When you are satisfied with your crawler’s settings click the “Go” button to begin crawling!
The crawler will begin pulling data into the table. If you did not change “Where to Crawl” from the default, it will begin with the pages you mapped when building the crawler.
Step 12 - Monitor the Crawler
Crawling can sometimes be a slow process; be patient with your crawler and wait for it to return data.
While the crawler is running, a number of status indicators are displayed above the table. From left to right they are as follows:
Estimated time: This tells you how long the crawler has left.
Pages/hour: indicates how fast your crawler is returning data. The higher this number, the faster your crawler is running.
Queued: shows how many pages that match your data pattern are left to be converted into data.
Requests: the number of pages retrieved from the web.
Unavailable: indicates the number of pages the crawler traveled to that are no longer available (i.e. 404 errors).
Converted: shows how many pages that match your data pattern have been successfully converted into data.
Failed: shows how many pages that match your data pattern failed to be converted into data.
Rows: the number of rows now in your table.
Blocked: the number of pages that the crawler has been blocked from visiting.
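The rate and time figures are related by simple arithmetic. The sketch below shows one plausible way such numbers could be derived from the counters above; the function names and formulas are assumptions for illustration, not import.io's actual code.

```python
# Hypothetical illustration of how the crawl-progress figures relate.
def pages_per_hour(converted, elapsed_seconds):
    """Crawl rate, extrapolated from the pages converted so far."""
    return converted / elapsed_seconds * 3600

def estimated_seconds_left(queued, converted, elapsed_seconds):
    """Time to drain the remaining queue at the current rate."""
    rate = converted / elapsed_seconds  # pages per second
    return queued / rate

# Example: 120 pages converted in 10 minutes, 480 pages still queued.
print(pages_per_hour(120, 600))               # 720.0 pages/hour
print(estimated_seconds_left(480, 120, 600))  # 2400.0 seconds, i.e. 40 minutes
```

This is also why a higher Pages/hour figure directly shortens the estimated time: the estimate is essentially the queue divided by the rate.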
Step 13 - Upload Data
When you are finished crawling, or you want to stop for any reason, press the “Stop” button.
Hint: stopping ends the crawl entirely; you will not be able to pick up where you left off. If you confirm the stop and then want to continue crawling, press "Cancel" to return to the "Go" screen from step 10 and crawl again from the beginning.
If you are satisfied with your crawled data, you can upload it to our cloud server by pressing “Upload data”. If not, press “Cancel”, adjust your settings and try crawling again.
Step 14 - Open in Dataset
Your crawled data will now open in a new dataset.
What is a Crawler Anyway?
A crawler is a software program that navigates web pages and domains, following hypertext links to discover and access web page content. Once it has accessed these pages, it can perform tasks such as indexing the content, analysing the data, or otherwise interacting with the information. Common synonyms for a web crawler are bot, spider, robot, and indexer; these terms are used interchangeably.
A crawler is defined by its ability to move between web pages, and its ability to 'visit' pages without needing user input. This allows it to automatically traverse and consume a much larger number of pages than a human could ever hope to.
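The core mechanic, following hypertext links, boils down to finding the `<a href>` targets on each fetched page. Here is a minimal sketch using only Python's standard library (the HTML is invented for the example; a real crawler would fetch pages over HTTP and queue each discovered link for a later visit):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag: the links a crawler would follow."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# A made-up page standing in for a fetched document.
page = '<html><body><a href="/movies">Movies</a> <a href="/about">About</a></body></html>'
collector = LinkCollector()
collector.feed(page)
print(collector.links)  # ['/movies', '/about']
```

Repeating this fetch-and-extract loop on each discovered link, without any user input, is what lets a crawler cover far more pages than a human could.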
The most common crawlers operate on the web to index or scrape content. The most notable example of a crawler is ‘Google Bot’, which is used to create a large index of web data used to provide search services.
How import.io crawling is different
An import.io crawler accesses web pages like other crawlers, but once it has found these pages it uses import.io extractions to turn the information on them into structured data. These crawlers are used to create APIs of web data, which enable consumers to access the data and regularly update it by triggering a targeted re-crawl of the sites.
The mechanism for controlling a crawler's access to a website is a server configuration file called robots.txt. Website owners use the /robots.txt file to give instructions about their site to web robots; this is called the Robots Exclusion Protocol. Some crawlers obey robots.txt, but this is at the discretion of the crawler's owner.
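Python's standard library ships a parser for this file, which shows how a well-behaved crawler checks permission before fetching a page. The rules and URLs below are an invented example, not any real site's robots.txt:

```python
from urllib.robotparser import RobotFileParser

# An example robots.txt: everything under /private/ is off limits to all bots.
rules = """
User-agent: *
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# A polite crawler asks before every fetch and skips disallowed pages.
print(rp.can_fetch("MyCrawler", "http://example.com/movies"))     # True
print(rp.can_fetch("MyCrawler", "http://example.com/private/x"))  # False
```

As the text notes, nothing technically enforces this check; obeying robots.txt is a convention that reputable crawlers choose to follow.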