Dig into XML and HTML data to find useful information with PHP
Published on July 26, 2011
Data mining and its importance
Frequently used acronyms
- API: Application programming interface
- CDATA: Character data
- DOM: Document Object Model
- FTP: File Transfer Protocol
- HTML: HyperText Markup Language
- HTTP: Hypertext Transfer Protocol
- REST: Representational State Transfer
- URL: Uniform Resource Locator
- W3C: World Wide Web Consortium
- XML: Extensible Markup Language
Wikipedia defines data mining as “the process of extracting patterns from large data sets by combining methods from statistics and artificial intelligence with database management.” This is a very deep definition and very likely goes beyond the typical use case for most people. Few people work with artificial intelligence; most commonly, data mining simply entails ingesting large data sets and searching through them to find useful information.
Given how the Internet has grown, with so much information available, it is important to be able to aggregate large amounts of data and make some sense of it. A primary objective is to take datasets much larger than a single person can read and boil them down to useful data. This type of data mining is the focus of this article, specifically how to collect and parse this data.
Practical uses of data mining
Data mining has many practical uses. You might want to scour a website for information that it provides (such as attendance records for movies or concerts). You might have more serious information, such as voter records, to retrieve and make sense of. Or, more commonly, you might look at social network data and attempt to parse trends from it, such as how often your company is mentioned and whether it’s mentioned in a positive or negative light.
Precautions before mining a website
Before you proceed, I should mention that I assume you will pull this data from another website. If you already have the data at your disposal, that’s a very different situation. When you pull data from a website, you need to make sure that you are following its terms of service, regardless of whether you are web scraping (more on this later) or using an API. If you are scraping, you also need to follow the site’s robots.txt file, which describes what parts of the website scripts can access. Finally, make sure that you are respectful of the site’s bandwidth. You should not write scripts that access the site’s data as fast as your script can run. Not only might you cause hosting problems, but you run the risk of being banned or blocked from the site for being too aggressive.
Understanding XML data structure
Regardless of the way that you pull data in, chances are that you will receive it in XML (or HTML) format. XML has become a standard language of the Internet when it comes to sharing data. It’s important to briefly consider XML structure and how to handle it in PHP before you look at methods to retrieve it.
The basic structure of an XML document is very straightforward, especially if you have previously worked with HTML. All data in an XML document is stored in one of two ways. The primary way to store the data is inside nested tags. For an example of the simplest form, suppose that you have an address, which can be stored in a document such as this:
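As a minimal sketch of that simplest form, the PHP snippet below holds a one-tag address document in a string and parses it with simplexml_load_string() (introduced later in this article). The address itself is invented for illustration.

```php
<?php
// Hypothetical sample data: the address is invented for illustration.
$xmlText = '<address>1234 Main Street, Baltimore, MD</address>';

// simplexml_load_string() parses XML that is already in memory.
$address = simplexml_load_string($xmlText);

// Casting the element to a string yields the text stored inside the tag.
$addressText = (string) $address;
echo $addressText, PHP_EOL;
```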
You can nest these XML data points to create a list of multiple addresses. You can put all of these addresses inside another tag, in this case called locations (see Listing 1).
Listing 1. Multiple addresses in XML
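A document of this shape might look like the sketch below, which assumes two invented addresses wrapped in a locations tag and uses PHP only to confirm that both were nested as expected.

```php
<?php
// Hypothetical sample data: both addresses are invented.
$xmlText = <<<XML
<locations>
    <address>1234 Main Street, Baltimore, MD</address>
    <address>567 High Street, Burlington, VT</address>
</locations>
XML;

$locations = simplexml_load_string($xmlText);

// count() on a child name reports how many <address> tags were found.
$addressCount = count($locations->address);
echo $addressCount, PHP_EOL;
```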
To expand this approach further, you might want to break the addresses into their constituent parts of street, city, and state, which makes processing of the data easier. In that case, you have a more typical XML file, as in Listing 2.
Listing 2. Fully broken-down addresses in XML
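One possible layout for such a file is sketched here. The tag names street, city, and state come from the text; the values are invented, and the PHP around the XML simply reads one field back.

```php
<?php
// Hypothetical sample data: the address values are invented.
$xmlText = <<<XML
<locations>
    <address>
        <street>1234 Main Street</street>
        <city>Baltimore</city>
        <state>MD</state>
    </address>
    <address>
        <street>567 High Street</street>
        <city>Burlington</city>
        <state>VT</state>
    </address>
</locations>
XML;

$locations = simplexml_load_string($xmlText);

// Each constituent part is now its own tag and can be read on its own.
$firstCity = (string) $locations->address[0]->city;
echo $firstCity, PHP_EOL;
```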
As mentioned, you can store XML data in two main ways. You’ve now seen one of them. The other method is through attributes. Each tag can have a number of attributes assigned to it. While less common, this approach can be a very useful tool. Sometimes it gives extra information, such as a unique ID or an event date. Quite often, it adds metadata; in your address example, a type attribute indicates whether the address is a home or work address, as in Listing 3.
Listing 3. Attributes added to XML tags
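A sketch of the same address data with a type attribute on each address tag follows. The attribute values home and work come from the text; everything else is invented sample data.

```php
<?php
// Hypothetical sample data: the address values are invented.
$xmlText = <<<XML
<locations>
    <address type="home">
        <street>1234 Main Street</street>
        <city>Baltimore</city>
        <state>MD</state>
    </address>
    <address type="work">
        <street>567 High Street</street>
        <city>Burlington</city>
        <state>VT</state>
    </address>
</locations>
XML;

$locations = simplexml_load_string($xmlText);

// The attribute travels with the tag as metadata about that address.
$secondType = (string) $locations->address[1]['type'];
echo $secondType, PHP_EOL;
```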
Note that XML documents always have a parent root tag/node that all other tags/nodes are children of. XML can also include other declarations and definitions at the beginning of the document and a few other complications (such as CDATA blocks). I highly recommend that you read more about XML in Related topics.
Parsing XML data in PHP
Now that you understand what XML looks like and how it’s structured, you need to know how to parse and programmatically access that data in PHP. A number of libraries created for PHP allow XML parsing, and each library has its own benefits and drawbacks. There are DOM, XMLReader/Writer, XML Parser, SimpleXML, and others. For the purposes of this article, I focus on SimpleXML as it is one of the most commonly used libraries and one of my favorites.
SimpleXML, as its name suggests, was created to provide a very simple interface for accessing XML. It takes an XML document and converts it into an internal PHP object format. Accessing data points becomes as easy as accessing object variables. Parsing an XML document with SimpleXML is as easy as using the simplexml_load_file() function (see Listing 4).
Listing 4. Parsing a document with SimpleXML
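A minimal sketch of loading a document with simplexml_load_file() is shown below. To stay self-contained it first writes invented sample XML to a temporary file; in practice you would pass the name of a file you already have.

```php
<?php
// Hypothetical sample data, written to a temporary file so that the
// sketch can run on its own.
$xmlText = <<<XML
<locations>
    <address type="home">
        <street>1234 Main Street</street>
        <city>Baltimore</city>
        <state>MD</state>
    </address>
    <address type="work">
        <street>567 High Street</street>
        <city>Burlington</city>
        <state>VT</state>
    </address>
</locations>
XML;
$path = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($path, $xmlText);

// simplexml_load_file() returns a SimpleXMLElement, or false on failure.
$xml = simplexml_load_file($path);
unlink($path);

if ($xml === false) {
    exit("Could not parse the XML document\n");
}

// The returned object represents the root tag of the document.
$rootName = $xml->getName();
echo $rootName, PHP_EOL;
```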
That’s really all that is required. Do note that thanks to PHP’s file stream integration, you can insert a filename or a URL here and the file stream integration automatically fetches it. You can also use simplexml_load_string() if you already have the XML loaded into memory. If you run this code on the XML in Listing 3 and use print_r() to see the rough structure of the data, you get the output in Listing 5.
Listing 5. Output of parsed XML
You can then access the data using standard PHP object access and methods. For example, to echo out every state that someone lived in, you can iterate over the addresses to do just that (see Listing 6).
Listing 6. Iterating over addresses
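The iteration could be sketched as follows, assuming invented address data loaded with simplexml_load_string(). The $states array is only there to make the result easy to inspect.

```php
<?php
// Hypothetical sample data: the address values are invented.
$xml = simplexml_load_string(<<<XML
<locations>
    <address type="home"><street>1234 Main Street</street><city>Baltimore</city><state>MD</state></address>
    <address type="work"><street>567 High Street</street><city>Burlington</city><state>VT</state></address>
</locations>
XML
);

$states = [];
// Iterating over $xml->address visits every <address> tag in turn.
foreach ($xml->address as $address) {
    echo $address->state, "\n";
    $states[] = (string) $address->state;
}
```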
Accessing the attributes is a little different. Rather than reference them as you do an object property, you access them like array values. You can change that last code sample to display the type attribute by using the code in Listing 7.
Listing 7. Adding attributes
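A sketch of the same loop extended with array-style attribute access follows. The sample data and the output format are assumptions made for illustration.

```php
<?php
// Hypothetical sample data: the address values are invented.
$xml = simplexml_load_string(<<<XML
<locations>
    <address type="home"><street>1234 Main Street</street><city>Baltimore</city><state>MD</state></address>
    <address type="work"><street>567 High Street</street><city>Burlington</city><state>VT</state></address>
</locations>
XML
);

$lines = [];
foreach ($xml->address as $address) {
    // Child tags use property syntax; attributes use array syntax.
    $line = (string) $address->state . ' (' . (string) $address['type'] . ')';
    echo $line, "\n";
    $lines[] = $line;
}
```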
While all the current examples involved iteration, you can reach directly into the data and use a specific piece of information that you want, such as grabbing the street address of the second address with the code $xml->address[1]->street.
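As a short illustration of that direct access, again assuming invented sample data:

```php
<?php
// Hypothetical sample data: the address values are invented.
$xml = simplexml_load_string(<<<XML
<locations>
    <address type="home"><street>1234 Main Street</street><state>MD</state></address>
    <address type="work"><street>567 High Street</street><state>VT</state></address>
</locations>
XML
);

// Indexes start at zero, so [1] selects the second <address> tag.
$secondStreet = (string) $xml->address[1]->street;
echo $secondStreet, PHP_EOL;
```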
You should now have the basic tools to begin playing with XML data. I do recommend that you read the SimpleXML documentation and other links listed in Related topics to learn more.
Data mining in PHP: Possible ways
As mentioned, you can access data in multiple ways. The two primary methods are web scraping and API use.
Web scraping is the act of literally downloading entire web pages programmatically and extracting data from them. There are entire books written on this subject (see Related topics). I briefly list the tools needed to do this. First of all, PHP makes it very easy to read a web page in as a string. There are many ways to do this, including using file_get_contents() with a URL, but in this case you want to be able to parse the HTML in a meaningful manner.
Given that HTML is at its heart a tag-based markup language that closely resembles XML, it is useful to convert HTML into a SimpleXML structure. You can’t just load an HTML page using simplexml_load_file(), however, as even valid HTML isn’t XML. A good workaround is to use the DOM extension to load the HTML page as a DOM document and then convert it to SimpleXML, as in Listing 8.
Listing 8. Using DOM methods to get a SimpleXML version of a web page
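The conversion might be sketched as below. The page is an invented HTML string rather than a fetched URL so that the example runs without network access; with a real site you would obtain $html from something like file_get_contents().

```php
<?php
// Hypothetical page content, invented for illustration.
$html = <<<HTML
<html>
<head><title>Example page</title></head>
<body>
<div><div><div><h4>First heading</h4></div></div></div>
<p>Some text with an <a href="/about">about link</a>.</p>
</body>
</html>
HTML;

// Real-world HTML is often imperfect; collect libxml warnings quietly
// instead of printing them.
libxml_use_internal_errors(true);

// Load the page with the DOM extension, which tolerates HTML syntax.
$dom = new DOMDocument();
$dom->loadHTML($html);

// Convert the DOM tree into a SimpleXMLElement rooted at <html>.
$xml = simplexml_import_dom($dom);

$title = (string) $xml->head->title;
$heading = (string) $xml->body->div->div->div->h4;
echo $title, "\n", $heading, "\n";
```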
After you’ve done this, you can traverse the HTML page just as you might any other XML document. You can therefore access the title of the page using $xml->head->title or go deep into the page with references such as $xml->body->div->div->div->h4.
As you might expect from that last example, however, it can get very unwieldy at times to attempt to find data in the midst of an HTML page, which often isn’t nearly as organized as an XML file is. The preceding reference looks for the first h4 that exists inside three nested divs; in each case, it looks for the first div inside each parent.
Fortunately, if you want to find only the first h4 on the page, or other such “direct data,” XPath is a much easier way to do so. XPath is essentially a way to search through XML documents using a query language, and SimpleXML exposes this. XPath is a very powerful tool and can be the subject of an entire series of articles, including some listed in Related topics. In basic terms, you use ‘/’ to describe hierarchical relationships; therefore, you can rewrite the preceding references as the following XPath search (see Listing 9).
Listing 9. Using XPath directly
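A sketch of that XPath search, run against an invented page, could look like this. Note that the path expression returns every h4 that matches the hierarchy, as an array.

```php
<?php
// Hypothetical page content, invented for illustration.
$html = <<<HTML
<html>
<head><title>Example page</title></head>
<body>
<div><div><div><h4>First heading</h4></div></div></div>
</body>
</html>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xml = simplexml_import_dom($dom);

// Each '/' steps one level down the hierarchy, starting at the root.
$matches = $xml->xpath('/html/body/div/div/div/h4');
$firstMatch = (string) $matches[0];
echo $firstMatch, PHP_EOL;
```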
Or you could just use the ‘//’ option with XPath, which causes it to search all of the document for the tags you are looking for. Therefore, you could find all the h4 elements as an array, then access the first one, using XPath:
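For instance, assuming an invented page with h4 tags at two different depths, the search might be sketched as:

```php
<?php
// Hypothetical page content, invented for illustration.
$html = <<<HTML
<html>
<head><title>Example page</title></head>
<body>
<div><h4>Top story</h4></div>
<div><div><div><h4>Nested story</h4></div></div></div>
</body>
</html>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xml = simplexml_import_dom($dom);

// '//' matches h4 tags anywhere in the document, in document order.
$headings = $xml->xpath('//h4');
$firstHeading = (string) $headings[0];
echo $firstHeading, PHP_EOL;
```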
Walking an HTML hierarchy
The main reason to talk about these conversions and XPath is that one of the common required tasks when you do web scraping is to automatically find other links on the web page and follow them, allowing you to “walk” the website, finding out as much information as possible.
This task is made fairly trivial using XPath. Listing 10 gives you an array of all the <a> links with “href” attributes, allowing you to handle them.
Listing 10. Combining techniques to find all links on a page
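Combining the DOM conversion with an XPath attribute test could look like the following sketch. The page content is invented; the expression //a[@href] selects only those a tags that carry an href attribute.

```php
<?php
// Hypothetical page content, invented for illustration.
$html = <<<HTML
<html>
<head><title>Example page</title></head>
<body>
<p><a name="top">A named anchor without a link</a></p>
<p><a href="/about">About</a> and <a href="http://example.com/page">a page</a></p>
</body>
</html>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xml = simplexml_import_dom($dom);

// Find every <a> tag that has an href attribute.
$links = $xml->xpath('//a[@href]');

$hrefs = [];
foreach ($links as $link) {
    $hrefs[] = (string) $link['href'];
}
print_r($hrefs);
```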
An easier way is to iterate over the links using PHP’s built-in parse_url() function, which handles a lot of the sanity checks for you. Listing 11 looks something like this.
Listing 11. A more robust site walker
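The link-filtering step of such a walker might be sketched as below. The function name extractSiteLinks() and the sample page are hypothetical, and the sketch makes simplifying assumptions: it only keeps http and https links on the same host, and it skips document-relative paths rather than resolving them.

```php
<?php
/**
 * Hypothetical helper: return absolute, same-host URLs linked from $html,
 * where $baseUrl is the address the page was fetched from.
 */
function extractSiteLinks(string $html, string $baseUrl): array
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xml = simplexml_import_dom($dom);

    $base = parse_url($baseUrl);
    $found = [];
    foreach ($xml->xpath('//a[@href]') as $link) {
        $parts = parse_url((string) $link['href']);
        if ($parts === false) {
            continue; // seriously malformed URL
        }
        if (!isset($parts['host']) && !isset($parts['path'])) {
            continue; // fragment-only or query-only link
        }
        // Skip schemes such as mailto: or javascript:
        $scheme = $parts['scheme'] ?? $base['scheme'];
        if ($scheme !== 'http' && $scheme !== 'https') {
            continue;
        }
        // Only keep links that stay on the site being walked.
        $host = $parts['host'] ?? $base['host'];
        if ($host !== $base['host']) {
            continue;
        }
        // Simplification: document-relative paths are not resolved.
        $path = $parts['path'] ?? '/';
        if ($path === '' || $path[0] !== '/') {
            continue;
        }
        $url = $scheme . '://' . $host . $path;
        if (isset($parts['query'])) {
            $url .= '?' . $parts['query'];
        }
        $found[$url] = true; // array keys remove duplicates
    }
    return array_keys($found);
}

// Hypothetical page content, invented for illustration.
$html = <<<HTML
<html><body>
<a href="/about">About</a>
<a href="http://example.com/news?id=3">News</a>
<a href="http://other.example.org/">Elsewhere</a>
<a href="mailto:someone@example.com">Mail</a>
<a href="/about">About again</a>
</body></html>
HTML;

$toVisit = extractSiteLinks($html, 'http://example.com/index.html');
print_r($toVisit);
```

A full walker would fetch each returned URL in turn, remember which pages it has already visited, and pause between requests to respect the site’s bandwidth.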
You should now have the tools that you need to begin scraping data from web pages. Once you are familiar with the techniques detailed previously in this article, you can read any information from the web page, not just the links that you can follow. We hope that you don’t need to do this task because an API or other data source exists instead.
Using XML APIs and data
At this point, you have the basic skills to access and use a majority of the XML data APIs on the Internet. They are often REST-based and therefore require only a simple HTTP access to retrieve the data and parse it using the preceding techniques.
Every API is different in the end. You certainly can’t cover how to access every single one, so let’s walk through some basic examples of XML APIs. One of the most common sources of data, and already in XML format, is the RSS feed. RSS stands for Really Simple Syndication and is a mostly standardized format for sharing frequently updated data, such as blog posts, news headlines, or podcasts. To learn more about the RSS format, see Related topics. Note that RSS is an XML file, with a parent tag called <channel> that can have any number of <item> tags in it, each providing a bevy of data points.
As an example, use SimpleXML to read in the RSS feed of the headlines of The New York Times (see Related topics for a link to the RSS feed) and format a list of headlines with links to the stories (see Listing 12).
Listing 12. Reading The New York Times RSS feed
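The approach can be sketched as follows. To run without network access, the sketch parses a small invented RSS 2.0 document with simplexml_load_string(); for a live feed you would instead pass the feed URL to simplexml_load_file().

```php
<?php
// Hypothetical feed content, invented for illustration.
$rssText = <<<XML
<rss version="2.0">
  <channel>
    <title>Example Headlines</title>
    <link>http://example.com/</link>
    <description>Invented feed for illustration</description>
    <item>
      <title>First story</title>
      <link>http://example.com/stories/1</link>
    </item>
    <item>
      <title>Tom &amp; Jerry return</title>
      <link>http://example.com/stories/2</link>
    </item>
  </channel>
</rss>
XML;

$rss = simplexml_load_string($rssText);

$items = [];
// Each <item> inside <channel> is one headline.
foreach ($rss->channel->item as $item) {
    // Escape the values before placing them into HTML.
    $title = htmlspecialchars((string) $item->title);
    $link = htmlspecialchars((string) $item->link);
    $items[] = "<li><a href=\"{$link}\">{$title}</a></li>";
}

$listHtml = "<ul>\n" . implode("\n", $items) . "\n</ul>";
echo $listHtml, PHP_EOL;
```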
Figure 1 shows the output from The New York Times feed.
Figure 1. Output from The New York Times feed
Now, let’s explore an example of a more fully featured REST-based API. A good one to start with is the Flickr API because it offers lots of data without the need to authenticate with it. Many APIs require you to authenticate with them, using OAuth or other mechanisms, to act on behalf of a web user. This step might apply to the entire API, or just part of it. Check the documentation of each API for how to do this.
To demonstrate using the Flickr API for a non-authenticated request, you can use its search API. For an example, search Flickr for all public photos of crossbows. While you don’t need to authenticate, as you might with many APIs, you do need to generate an API key to use when accessing the data. Learn to do that task directly from Flickr’s API documentation itself. After you have an API key, you can explore using their search feature as in Listing 13.
Listing 13. Searching for “crossbow” using the Flickr API
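A sketch of such a search is shown below. It builds a request URL for Flickr’s flickr.photos.search REST method and then parses a response. The API key is a placeholder, and the response here is an invented sample that is only assumed to resemble the XML Flickr returns; check Flickr’s API documentation for the exact parameters and fields.

```php
<?php
// Placeholder key: replace it with the key generated for your account.
$params = [
    'method'   => 'flickr.photos.search',
    'api_key'  => 'YOUR_API_KEY',
    'text'     => 'crossbow',
    'per_page' => 5,
];
$url = 'https://api.flickr.com/services/rest/?' . http_build_query($params);

// A live script would fetch the response, for example with
// simplexml_load_file($url). The response below is invented sample data.
$responseText = <<<XML
<rsp stat="ok">
  <photos page="1" pages="1" perpage="5" total="2">
    <photo id="1001" owner="11111111@N00" title="Crossbow at the fair" />
    <photo id="1002" owner="22222222@N00" title="Antique crossbow" />
  </photos>
</rsp>
XML;
$rsp = simplexml_load_string($responseText);

$titles = [];
// Only read the photos if the response reports success.
if ((string) $rsp['stat'] === 'ok') {
    foreach ($rsp->photos->photo as $photo) {
        // In this sample, photo details are stored as attributes.
        $titles[] = (string) $photo['title'];
    }
}
print_r($titles);
```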
Figure 2 shows the output of the Flickr program. The results of your search for crossbows include photos plus information about each photo (title, user, location, date the photo was taken).
Figure 2. Example output of the Flickr program from Listing 13
You can see how powerful APIs like this are and how you can combine various calls in the same API to get the data that you need. With these basic techniques, you can mine the data of any website or information source.
Simply discover how you can get programmatic access to the data through an API or web scraping. Then use the methods shown to access and iterate over all the target data.
Storing and reporting on extracted data
The final point, storing and reporting on the data, is in many ways the easiest part, and perhaps the most fun. The sky is the limit here as you decide how to handle this aspect for your own situation.
Typically, take all of the information that you gather and store it in a database. Then structure the data in a way that matches how you plan to access it later. When doing this, don’t be shy about storing more information than you think you might need. While you can always delete data, going back to retrieve extra information can be a painful process once you have lots of it. It’s better to overestimate in the beginning. After all, you never know what piece of data might turn out to be interesting.
Then, after the data is stored in a database or similar data store, you can create reports. Reporting might be as simple as running some basic SQL queries against a database to see the number of times that a piece of data exists, or it might involve very complicated web user interfaces designed to let someone dive in and find their own correlations.
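As a minimal sketch of the simple end of that range, the code below stores a few invented records in an in-memory SQLite database through PDO and counts how often each value occurs. It assumes the pdo_sqlite extension is available; the table name, column names, and rows are all hypothetical.

```php
<?php
// Assumes the pdo_sqlite extension; an in-memory database keeps the
// sketch self-contained.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Hypothetical table: extra columns (source, fetched_at) are stored
// even though the first report does not use them.
$db->exec('CREATE TABLE mentions (source TEXT, state TEXT, fetched_at TEXT)');

// Invented rows standing in for data gathered by a scraper or API client.
$rows = [
    ['feed-a', 'MD', '2011-07-01'],
    ['feed-a', 'VT', '2011-07-01'],
    ['feed-b', 'MD', '2011-07-02'],
];
$insert = $db->prepare(
    'INSERT INTO mentions (source, state, fetched_at) VALUES (?, ?, ?)'
);
foreach ($rows as $row) {
    $insert->execute($row);
}

// A basic report: how many times does each state appear?
$report = $db->query(
    'SELECT state, COUNT(*) AS total FROM mentions
     GROUP BY state ORDER BY total DESC, state'
)->fetchAll(PDO::FETCH_KEY_PAIR);

print_r($report);
```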
After you do the hard work of cataloging all the data, you can imagine creative ways to display it.
Through the course of this article, you looked at the basic structure of XML documents and an easy method to parse those in PHP using SimpleXML. You also added the ability to handle HTML in a similar manner and touched on the basics of walking a website to scrape data not available in an XML format. Using these tools, and following some of the examples that have been given, you now have a good base level of knowledge so you can start to work on data mining a website. There is much more to learn than a single article can convey. For additional ways to increase your knowledge about data mining, plan to check the Related topics.
- Source code (datamining_source.zip, 10KB)
Related topics
- XML as described on Wikipedia: Read a description of the XML specification.
- Extensible Markup Language (XML) 1.0 (Fifth Edition) (W3C Recommendation, November 2008): Visit this source for specific details about XML features.
- Introduction to XML (Doug Tidwell, developerWorks, August 2002): Look at what XML is, why it was developed, and how it shapes electronic commerce. Review a variety of important XML programming interfaces and standards, and two case studies of how companies solve business problems with XML.
- XML Tutorial (W3Schools): Read a lesson about XML and how it can transport and store data.
- SimpleXML documentation: Browse and learn about a tool set to convert XML to an object that you can process with normal PHP property selectors and array iterators.
- php|architect’s Guide to Web Scraping with PHP (Matthew Turland): Get more information on web scraping with a variety of technologies and frameworks.
- XML Path Language (XPath) Version 1.0 (W3C Recommendation, November 1999): Familiarize yourself with the specification for a common syntax and semantics for functionality shared between XSLT and XPointer.
- RSS specification: Explore the details of the RSS web content syndication format.
- Flickr Services: Look into the Flickr API, an online photo management and sharing application.
- PHP.net: Visit and explore the central resource for PHP developers.
- Recommended PHP reading list (Daniel Krook and Carlos Hoyos, developerWorks, March 2006): Learn about PHP (Hypertext Preprocessor) with this reading list compiled for programmers and administrators by IBM web application developers.
- PHP and more: Browse all the PHP content on developerWorks.
- Zend Core for IBM: Using a database with PHP? Check out a seamless, out-of-the-box, easy-to-install PHP development and production environment that supports IBM DB2 V9.
- RSS feed of the headlines of The New York Times: Experiment with an RSS feed of headlines from The New York Times.
- PHP: Get this general-purpose scripting language for web development.
- XML area on developerWorks: Find the resources you need to advance your skills in the XML arena. See the XML technical library for a broad range of technical articles and tips, tutorials, standards, and IBM Redbooks.
- Expand your PHP skills by checking out the IBM developerWorks PHP project resources.
- IBM certification: Find out how you can become an IBM-Certified Developer.
- IBM product evaluation versions: Get your hands on application development tools and middleware products.