Scraping website data without software for absolute beginners

How to scrape webpages without subscription fees
Posted on Fri, 2023-04-14 04:42 | by Mr Davido, Teacher

Introduction

You can do this. Scroll up and down this page for a moment and notice how intimidating it looks (to most of you, at least). Still, I say: YOU CAN DO THIS. This Added Value Post was written with an absolute beginner in mind. These are the steps:

1 Focus.  2 Try.  3 Fail. 4 Fail again. 5 One more time. 6 Succeed. 

I got your back!
You with me?

For this example, we want to get the names of all the restaurants in Palo Alto, California, along with their addresses and contact numbers, from this page. But we don’t want to do that manually.

I will use a little HTML, CSS and jQuery, but don’t worry if you don't know what those are. This is a video for absolute beginners: anyone should be able to follow the steps and recreate exactly what I do. If you can copy things correctly, you’re golden!

Inspecting the DOM

Open the "web inspector" (Option + Command + I on a Mac), or simply right-click and choose “Inspect Element”.

If you do this, you’ll see how the content on this page is structured. You can also click the target icon in Safari and then simply hover over the elements. Click one, et voilà, there you have the HTML of that specific element.

As you can see, this is a div element with the class attribute “v-card”. Think of a div as a “division”, a “container”, or a “box” with some stuff inside. Let’s open that box and see what’s in it. There are 2 divs inside: one with the class “media-thumbnail”, which we don’t need, and one with the class “info”, which holds the content we want to extract.

Most HTML elements have an opening and a closing tag: <div> </div>

Some elements, like the image element, don't have a closing tag: <img src="" />

But we need to drill down a bit further, into the div with the classes “info-section” and “info-primary”. Inside it is the h2 (heading 2) element, which contains an A element (a hyperlink). Inside that link is a span element, and inside the span is the text we want to capture. We’ll come back to this in a minute when we’re writing the code.

We also want to go inside the div with the classes “info-section” and “info-secondary”. And inside that div we see a div with the classes: “phones phone primary” which contains the phone number that we want to extract.

There is also a sibling (or brother or sister element) with the class “adr” which holds 2 more divs: one with the class: “street-address” and another with the class “locality”. We'll have to combine these 2 divs to get the complete address we want to extract.

And usually that is the same structure for all the cards you see on websites like these. 

 

Writing our jQuery script

So now we need to write some code to extract all that info and paste it in our spreadsheet.

For this, we’re using jQuery, a JavaScript library that is easier to read, use and learn than vanilla JavaScript. I use Sublime Text 2 on a Mac, and also BBEdit. You can also use Notepad++ on a PC. There are lots of free editors out there that you can use for this. Just don’t use Pages or MS Word. That simply won’t work.

A jQuery file always uses the extension: .js

So that would then be: my-jQuery-file.js

If jQuery is loaded on the page, the jQuery file opens with:

(function ($) { 

and closes off with:

})(jQuery);

And in between you write your code. 
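If the wrapper looks odd, it may help to see the same pattern without jQuery. The sketch below is plain JavaScript: a function is written and called on the spot, and whatever sits in the last pair of parentheses is handed to it as its parameter. The names `result` and `word` are made up for this illustration.

```javascript
// A function that is defined and immediately called.
// The value in the final parentheses ("hello") becomes the parameter `word`,
// in the same way that jQuery becomes `$` in the wrapper above.
var result = (function (word) {
  return word.toUpperCase() + "!";
})("hello");

console.log(result); // prints "HELLO!"
```

So inside the jQuery wrapper, `$` is simply a short nickname for jQuery.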

Many webpages load jQuery by default because it’s so popular. But if it isn’t loaded, you will get an error. Don’t worry, I have a fix for that. I’ll explain it in another video called “Loading jQuery for data scraping without software for absolute beginners”.

Here's an easy way to determine if jQuery is loaded on a page that you want to extract data from:
Try pasting this code below in the console of that page. Use the shortcut Option + Command + C to open the console and paste this code at the bottom of the console:

(function ($) {
$("body").css("background-color","yellow");
})(jQuery);

The background should turn yellow.

Did it change to yellow?
If so, jQuery is loaded and you're good to go.
If not, you might want to check out my other video/tutorial mentioned above or try it on another website, mine for example.
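If you'd rather not repaint the page, here is a gentler check you could paste in the console instead. It is only a sketch: `isJQueryLoaded` is a helper name I made up, and it simply looks whether a function called jQuery exists, so it prints a message instead of throwing an error when jQuery is missing.

```javascript
// Returns true when a jQuery function is available on the given object.
// `isJQueryLoaded` is a made-up helper name for this sketch.
function isJQueryLoaded(globalObject) {
  return typeof globalObject.jQuery === "function";
}

// In a browser console, `globalThis` is the page's window object.
if (isJQueryLoaded(globalThis)) {
  // jQuery.fn.jquery holds the version string of the loaded library.
  console.log("jQuery is loaded, version " + globalThis.jQuery.fn.jquery);
} else {
  console.log("jQuery is not loaded on this page");
}
```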

Querying the DOM

So the 3 things we need to collect for each of the v-cards are: 
The name, the phone number, and the address

We are going to create an “array” for each of these 3. The simplest way to understand or visualise an array is to think of it as a container that holds a list of stuff. We can give our array a name, just like you would name your child. It can be “Bob” or “Mindy”, or whatever you want it to be, but I recommend starting your array name with a letter.

And to declare an array you must use: 
the name of the array followed by a SPACE, an "=" EQUAL SIGN, a SPACE, "[ ]" SQUARE BRACKETS and close off with a SEMICOLON.

So for our purpose we are going to call them: 

arrayNames, for the list of names we want to extract
arrayPhoneNumbers, for the list of phone numbers we want to extract 
arrayAddresses, for the list of addresses we want to extract 

arrayNames = [];
arrayPhoneNumbers = [];
arrayAddresses = [];

At this point in our code, these arrays are empty.

Now we need to write the code to grab all the relevant data and push it into one of these arrays. That’s the point of creating the arrays. And later on we're going to display the content of the arrays and then copy and paste it in our spreadsheet. You with me?

So what the heck is this? 

$(".v-card .info-section.info-primary h2 > a > span")

The "$" DOLLAR sign is used to access the jQuery library which allows us to use jQuery and all its methods and functions. Mostly we will be “querying the DOM”, in simple terms this means SEARCHING the HTML structure for “stuff”. 

This code drills down to the span element, just like we did earlier in the video.
First, we find the div with the class “v-card” that contains a div with 2 classes: “info-section” and “info-primary”.

Note:
A class in CSS and jQuery is preceded by a "." DOT, a PERIOD sign, a FULL STOP.
Multiple HTML elements can share the same class name, and that is exactly what we want because there are so many elements with the class “v-card” on this page.

An ID is preceded by a '#' POUND sign.
An ID is unique: there should be only 1 of its kind in the entire HTML document. If you find more than 1 of the same ID name, you're dealing with a baaaaad coder. And yeah, that sometimes happens to me as well.

Both the Class and the ID are attributes of an element. 

When you see that there is no space between .info-section and .info-primary, it means that this div MUST have both classes to match our query (or for jQuery to say: "Yes, found you!"), okay?
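To make the "no space means both classes" rule concrete, here is a small plain JavaScript sketch that mimics the check. It is not how jQuery works internally, and `hasAllClasses` is a hypothetical helper I invented for this illustration.

```javascript
// Checks whether a class attribute contains every class in `required`.
// `hasAllClasses` is a hypothetical helper, not a jQuery function.
function hasAllClasses(classAttribute, required) {
  var present = classAttribute.split(/\s+/);
  return required.every(function (name) {
    return present.indexOf(name) !== -1;
  });
}

var wanted = ["info-section", "info-primary"];

console.log(hasAllClasses("info-section info-primary", wanted));   // true
console.log(hasAllClasses("info-section info-secondary", wanted)); // false
```

The second div shares one class with our query but not both, so it is skipped.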

We drill down further and look for the h2 element.

The '>' GREATER THAN sign indicates that one element is a direct child of the other.

In this case, the A element is a DIRECT child of the h2 element. 

Or also: the h2 element is the DIRECT parent of the A element.

Note:
If there is only a space between two selectors, like here for example, the second element can be anywhere further down in the structure of the first element.

.v-card .info-section

So in this case it doesn't have to be a direct child of the element with the class "v-card", and the structure could just as well be:

<div class="v-card"> 
 <div> 
  <div> 
   <div class="info-section">
     Here is the data I want to extract
   </div>
  </div>
 </div>
</div>
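The difference between a space and the '>' sign can be tried out with a toy model of the nested divs above. This is plain JavaScript, not jQuery, and the object `card` and both function names are made up for the sketch: one function only looks at direct children, the other looks at every level below.

```javascript
// A toy model of the nested divs above: each node has a class name
// (empty string when it has none) and a list of children.
var card = {
  className: "v-card",
  children: [
    { className: "", children: [
      { className: "", children: [
        { className: "info-section", children: [] }
      ] }
    ] }
  ]
};

// Like ".v-card > .info-section": looks only at direct children.
function findDirectChildren(node, className) {
  return node.children.filter(function (child) {
    return child.className === className;
  });
}

// Like ".v-card .info-section": looks at every level below the node.
function findDescendants(node, className) {
  var found = [];
  node.children.forEach(function (child) {
    if (child.className === className) {
      found.push(child);
    }
    found = found.concat(findDescendants(child, className));
  });
  return found;
}

console.log(findDirectChildren(card, "info-section").length); // 0
console.log(findDescendants(card, "info-section").length);    // 1
```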

As you can see the A element is also followed by the GREATER THAN sign and that means that the span element is a direct child of the A element.

<div class="v-card">
 <div class="info">
  <div class="info-section info-primary">
    <h2 class="n">
        <a class="business-name">
        <span>Garden Fresh</span>
        </a>
    </h2>
  </div>
 </div>
</div>

So that is the kind of structure you should look for and work out, to see whether it matches our query:

$(".v-card .info-section.info-primary h2 > a > span")

 

Scraping the data

Now that we have defined the exact query, we need to perform an action on its results, and that is going to be an "each function".

An .each(function) loops over each of the matching results that were found from the query code.

In our case that simply means that for each of these matching search queries that resulted in a span, we want to do something. Okay?  What we need it to do is to "push it" in the array called arrayNames.

Pushing via the array.push() method is used to add STUFF to the end of a list.

In our example we are adding it to the end of our array arrayNames. What we are going to push is the text of THIS element.
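Here is push on its own, without any jQuery, so you can see the list grow. The second restaurant name is made up purely for the example.

```javascript
var arrayNames = [];               // starts out empty
arrayNames.push("Garden Fresh");   // the list now holds 1 name
arrayNames.push("Another Place");  // made-up name, added after the first one

console.log(arrayNames.length); // 2
console.log(arrayNames[1]);     // "Another Place"
```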

$(this) refers to the matching element that was found, or is acted upon.

.text() gets the text of that element.

So $(this).text() is the text of the SPAN element that was found, and in our case that’s going to be “Garden Fresh”.

Closing a jQuery statement is done by using the ; SEMICOLON.

Closing a function is writing the reverse of opening the function and ending with a ; SEMICOLON. 

Example: .each(function () { /* ...your code goes here... */ });

Now we close off our query and our .each function and we run a console.log function to print out the list called arrayNames in the console. This means “all the names of restaurants” that were found using this very specific query.

console.log(arrayNames);
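Put together, the query, the .each loop, the push and the console.log could look roughly like this. Treat it as a sketch and not necessarily the exact script from the video: I wrapped the steps in a function called `collectNames` (my own name) that receives jQuery as `$`, and I only run it when jQuery is actually present so it never throws an error.

```javascript
// Collects the text of every span matched by the query into a list.
// `collectNames` is a made-up name; `$` is expected to be jQuery.
function collectNames($) {
  var arrayNames = [];
  $(".v-card .info-section.info-primary h2 > a > span").each(function () {
    // `this` is the span that was found; .text() reads its text.
    arrayNames.push($(this).text());
  });
  return arrayNames;
}

// Only run when jQuery is loaded on the page.
if (typeof jQuery !== "undefined") {
  console.log(collectNames(jQuery));
}
```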

And it’s exactly the same process for the arrayPhoneNumbers.
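For the phone numbers, a sketch could look like the one below. The selector is my assumption, built from the classes we saw in the inspector (“info-section info-secondary” and “phones phone primary”), so check it against the page you are scraping. The helper name `collectText` and the extra .trim(), which removes stray spaces around the text, are my own additions.

```javascript
// Collects the trimmed text of every element matched by `selector`.
// `collectText` is a made-up helper; `$` is expected to be jQuery.
function collectText($, selector) {
  var results = [];
  $(selector).each(function () {
    results.push($(this).text().trim());
  });
  return results;
}

// Assumed selector, based on the classes described earlier.
var phoneSelector = ".v-card .info-section.info-secondary .phones.phone.primary";

// Only run when jQuery is loaded on the page.
if (typeof jQuery !== "undefined") {
  var arrayPhoneNumbers = collectText(jQuery, phoneSelector);
  console.log(arrayPhoneNumbers);
}
```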

And because the full address is split between 2 divs, street-address and locality, we first need to capture each of them in a temporary variable and then concatenate or combine these variables in a 3rd temporary variable.

You declare a variable before you use it, or you declare it on the go and give it a value right away, like this one for example.

var mySillyVariable123 = $("#something").text();

This variable will store the text of the element with the ID "something" (only 1, remember?).

As you can see, I gave my variables completely ludicrous names just to show you that you can name a variable whatever you want.

Note: Make sure you start your variable with a LETTER and don't use spaces.

And from there it’s the same thing as with the arrayNames and the arrayPhoneNumbers.
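The address step could be sketched like this. It assumes the “adr” div sits inside each v-card as described earlier, and it joins the two parts with a comma and a space, which is my choice and not necessarily what the video script does. The name `collectAddresses` and the variable names are made up as well.

```javascript
// Builds one full address per ".adr" block from its two inner divs.
// `collectAddresses` is a made-up name; `$` is expected to be jQuery.
function collectAddresses($) {
  var arrayAddresses = [];
  $(".v-card .adr").each(function () {
    var street = $(this).find(".street-address").text();   // 1st temporary variable
    var locality = $(this).find(".locality").text();       // 2nd temporary variable
    var fullAddress = street + ", " + locality;            // 3rd: the two combined
    arrayAddresses.push(fullAddress);
  });
  return arrayAddresses;
}

// Only run when jQuery is loaded on the page.
if (typeof jQuery !== "undefined") {
  console.log(collectAddresses(jQuery));
}
```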

Now all you need to do is copy the code and paste it in the console. If there is stuff in the console and you want to clear it, just click the trash bin icon. Then paste the code in, hit enter, et voilà, there you have it.

Twirl the little triangle down to see the list of all the fetched matches. It’s a bit tricky to select all results by dragging with your mouse, but there is an easier way: right-click the list icon and choose “Copy Selected”. However, pasting that in Numbers won’t work right away. You have to "treat it" a little. So paste it in a blank text document, for example in BBEdit. What I like to do is get rid of the "log info" at the top and the "prototype" at the bottom. Then press Command + F on a Mac, replace all quotation marks with a semicolon, hit Replace All, select all, copy, and finally paste the result in a blank Numbers sheet. Boom!

And we can do this with the other results too. 

Twirl this one down, right click, copy selected, paste in BBEdit, remove the top and the bottom info, Command + F hit replace all, Command + A to select all, Command + C to copy, go to Numbers and Command + V to paste.

Now for the last one, the arrayAddresses, the street address and the locality, it’s a bit tricky. We must adjust the import settings to only use the semicolon as the separator.

Boom. 

And there you have your data. 
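If the manual clean-up feels fiddly, one possible shortcut is to let JavaScript glue the three lists into semicolon-separated lines before you copy anything. This is a sketch of my own and not part of the video: `toRows` is a made-up name, and the phone number and address in the sample are placeholders. It lines the lists up by position, so if one card has no phone number the columns can drift, and you should check the result.

```javascript
// Turns three lists into text with one "name;phone;address" line per item.
// `toRows` is a made-up helper name for this sketch.
function toRows(names, phones, addresses) {
  var rows = [];
  for (var i = 0; i < names.length; i++) {
    // Missing values become empty cells instead of the word "undefined".
    rows.push([names[i], phones[i] || "", addresses[i] || ""].join(";"));
  }
  return rows.join("\n");
}

// Placeholder sample data, only to show the shape of the output.
var sample = toRows(["Garden Fresh"], ["555-0100"], ["123 Example St, Palo Alto, CA"]);
console.log(sample); // Garden Fresh;555-0100;123 Example St, Palo Alto, CA
```

Because the address itself contains commas, the semicolon is the safer separator here, the same reason we set it in the import settings.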

Clean it up and present it to your boss, or start calling if you're in sales.

Let me know if this was valuable to you. 

Stay curious, stay hungry.
Check ya later!
Peace!

Download the script (note: when using Instagram, use the system browser to download):
http://5starteachers.net/sites/default/files/upload/free-downloads/video_script_-_viral_video_-_webscraping_without_subscription.js_.zip