With more than two billion pages created by millions of Web page authors and organizations, the World Wide Web is a tremendously rich knowledge base. The knowledge comes not only from the content of the pages themselves, but also from the unique characteristics of the Web, such as its hyperlink structure and its diversity of content and languages.
Analysis of these characteristics often reveals interesting patterns and new knowledge. Such knowledge can be used to improve users’ efficiency and effectiveness in searching for information on the Web, and also for applications unrelated to the Web, such as support for decision making or business management.
The Web’s size and its unstructured and dynamic content, as well as its multilingual nature, make the extraction of useful knowledge a challenging research problem. Furthermore, the Web generates a large amount of data in other formats that contain valuable information. For example, Web server logs contain information about user access patterns that can be used for information personalization or improving Web page design.
Machine learning techniques represent one possible approach to addressing the problem. Artificial intelligence and machine learning techniques have been applied in many important applications in both scientific and business domains, and data mining research has become a significant subfield in this area. Machine learning techniques also have been used in information retrieval (IR) and text mining applications. The various activities and efforts in this area are referred to as Web mining. The term Web mining was coined by Etzioni (1996) to denote the use of data mining techniques to automatically discover Web documents and services, extract information from Web resources, and uncover general patterns on the Web. Over the years, Web mining research has been extended to cover the use of data mining and similar techniques to discover resources, patterns, and knowledge from the Web and Web-related data (such as Web usage data or Web server logs).
Web mining has also been defined as “the discovery and analysis of useful information from the World Wide Web.”
Web mining research overlaps substantially with other areas, including data mining, text mining, information retrieval, and Web retrieval.
A classification of retrieval and mining techniques and applications
A possible classification of research in these areas is shown in the table above. The classification is based on two aspects: the purpose and the data sources.
Web servers, proxies, and client applications can quite easily capture data about Web usage. Web server logs contain information about every visit to the pages hosted on a server. Some of the useful information includes what files have been requested from the server, when they were requested, the Internet Protocol (IP) address of the request, the error code, the number of bytes sent to the user, and the type of browser used. Web servers can also capture referrer logs, which show the page from
One of the major goals of Web usage mining is to reveal interesting trends and patterns. Such patterns and statistics can often provide important knowledge about a company’s customers or the users of a system. Srivastava, Cooley, Deshpande, and Tan (2000) provided a framework for Web usage mining, consisting of three major steps: preprocessing, pattern discovery, and pattern analysis. As in other data mining applications, preprocessing involves data cleansing. However, one of the major challenges faced by Web usage mining applications is that Web server log data are anonymous, making it difficult to identify users and user sessions from the data. Techniques like Web cookies and
Retrieval research focuses on retrieving relevant, existing data
or documents from a large database or document repository, while mining
research focuses on discovering new information or knowledge in the
data. For example, data retrieval techniques are mainly concerned with
improving the speed of retrieving data from a database, whereas data
mining techniques analyze the data and try to identify interesting patterns.
Machine learning is the basis for most data mining and
text mining techniques, and information retrieval research has largely
influenced the research directions of Web mining applications. In this
chapter, we review the field from the perspectives of machine learning
and information retrieval. The review emphasizes machine learning and
traditional information retrieval techniques and how they have been
applied in Web mining systems.
Many machine learning systems have been developed over the past decades. One classification identifies five major areas of machine learning research, namely neural networks, case-based learning, genetic algorithms, rule induction, and analytic learning.
Another identifies three classes of machine learning techniques: symbolic learning, neural networks, and evolution-based algorithms.
Drawing on these two classifications and a review of the field, we have adopted a similar framework and have identified the following five major paradigms: (1) probabilistic models, (2) symbolic learning and rule induction, (3) neural networks, (4) evolution-based models, and
(5) analytic learning and fuzzy logic.
Probabilistic Models
The most popular example is the Bayesian method. Originating in pattern recognition research (Duda & Hart, 1973), this method was often used to classify different objects into predefined classes based on a set of features. A Bayesian model stores the probability of each class, the probability of each feature, and the probability of each feature given each class, based on the training data. When a new instance is encountered, it can be classified according to these probabilities. A variation of the Bayesian model, called the naive Bayesian model, assumes that all features are mutually independent within each class. Because of its simplicity, the naive Bayesian model has been widely used in various applications in different domains.
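As an illustration of the stored probabilities described above, the following is a minimal sketch of a naive Bayesian classifier. It assumes a bag-of-features (multinomial) representation and add-one smoothing, neither of which is specified in the text; the function names and the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: list of (feature_list, class_label) pairs."""
    class_counts = Counter(label for _, label in examples)
    feature_counts = defaultdict(Counter)  # class -> feature -> count
    vocabulary = set()
    for features, label in examples:
        feature_counts[label].update(features)
        vocabulary.update(features)
    return class_counts, feature_counts, vocabulary


def classify(model, features):
    """Return the class with the highest posterior score for the features."""
    class_counts, feature_counts, vocabulary = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        # log P(class) + sum of log P(feature | class); the sum reflects the
        # independence assumption. Add-one smoothing avoids log(0).
        score = math.log(count / total)
        denom = sum(feature_counts[label].values()) + len(vocabulary)
        for feature in features:
            score += math.log((feature_counts[label][feature] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data, for illustration only.
training = [
    (["cheap", "offer", "win"], "spam"),
    (["win", "prize", "offer"], "spam"),
    (["meeting", "agenda", "report"], "ham"),
    (["project", "report", "review"], "ham"),
]
model = train_naive_bayes(training)
print(classify(model, ["win", "offer"]))
print(classify(model, ["report", "meeting"]))
```

Working in log space keeps the product of many small probabilities from underflowing, which is a common implementation choice rather than part of the model itself.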
Symbolic Learning and Rule Induction
Symbolic learning can be classified according to the underlying learning strategy, such as rote learning, learning by instruction, learning by analogy, learning from examples, and learning from discovery (Carbonell, Michalski, & Mitchell, 1983; Cohen & Feigenbaum, 1982). Among these,
learning from examples appears to be the most promising symbolic learning technique for knowledge discovery and data mining. It is implemented by applying an algorithm that attempts to induce the general concept description that best describes the different classes of the training examples. Numerous algorithms have been developed, each using one or more techniques to identify patterns that are helpful in generating a concept description. Quinlan’s ID3 decision-tree building algorithm (Quinlan, 1983), and variations such as C4.5 (Quinlan, 1993), have become some of the most widely used symbolic learning techniques. Given a set of objects, ID3 produces a decision tree that attempts to classify all the objects correctly. At each step, the algorithm finds the attribute that best divides the objects into the different classes by minimizing entropy (information uncertainty). After all objects have been classified, or all attributes have been used, the results can be represented by a decision tree or a set of production rules.
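The attribute-selection step described above can be sketched as follows. This is not a full ID3 implementation: it shows only how one split would be chosen by minimizing the weighted entropy of the resulting groups. The function names and the toy rows are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def split_entropy(rows, attribute, target):
    """Weighted average entropy of `target` after splitting rows on `attribute`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    return sum(len(group) / len(rows) * entropy(group)
               for group in groups.values())


def best_attribute(rows, attributes, target):
    """Pick the attribute whose split leaves the least entropy."""
    return min(attributes, key=lambda a: split_entropy(rows, a, target))


# Hypothetical toy data: "outlook" separates the classes perfectly.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]
print(best_attribute(rows, ["outlook", "windy"], "play"))
```

A tree builder would apply `best_attribute` recursively to each group until the group is pure or no attributes remain.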
Neural Networks
Artificial neural networks attempt to achieve human-like performance by modeling the human nervous system. A neural network is a graph of many active nodes (neurons), which are connected to each other by weighted links (synapses). Whereas symbolic learning represents knowledge by symbolic descriptions such as decision trees and production rules, in a neural network knowledge is learned and remembered by a network of interconnected neurons, weighted synapses, and threshold logic units. Based on training examples, learning algorithms can be used to adjust the connection weights in the network so that it can predict or classify unknown examples correctly. Activation algorithms over the nodes can then be used to retrieve concepts and knowledge from the network.
Many different types of neural networks have been developed, among which the feedforward/backpropagation model is the most widely used. Backpropagation networks are fully connected, layered, feed-forward networks in which activations flow from the input layer through the hidden layer and then to the output layer (Rumelhart, Hinton, & Williams, 1986). The network usually starts with a set of random weights and adjusts its weights according to each learning example. Each learning example is passed through the network to activate the nodes. The network’s actual output is then compared with the target output and the error estimates are propagated back to the hidden and input layers. The network updates its weights incrementally according to these error estimates until the network stabilizes. Other popular neural network models include Kohonen’s self-organizing map and the Hopfield network. Self-organizing maps have been widely used in unsupervised learning, clustering, and pattern recognition; Hopfield networks have been used mostly in search and optimization applications.
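The training loop described above (forward pass, comparison with the target, backward propagation of error, incremental weight update) can be sketched with a very small network. The sigmoid activation, squared-error loss, learning rate, network size, and the logical-OR toy task are all assumptions chosen for brevity; the class and method names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyNetwork:
    """One hidden layer and one output unit, trained by backpropagation."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        # Each weight row has one extra entry for the bias term.
        self.hidden = [[rng.uniform(-1, 1) for _ in range(n_inputs + 1)]
                       for _ in range(n_hidden)]
        self.output = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]

    def forward(self, inputs):
        x = list(inputs) + [1.0]
        h = [sigmoid(sum(w * v for w, v in zip(row, x))) for row in self.hidden]
        y = sigmoid(sum(w * v for w, v in zip(self.output, h + [1.0])))
        return x, h, y

    def train_example(self, inputs, target, rate=0.5):
        x, h, y = self.forward(inputs)
        # Error term at the output unit (squared error, sigmoid derivative).
        delta_out = (target - y) * y * (1.0 - y)
        # Propagate the error back to each hidden unit before any update.
        delta_hidden = [delta_out * self.output[j] * h[j] * (1.0 - h[j])
                        for j in range(len(h))]
        for j, value in enumerate(h + [1.0]):
            self.output[j] += rate * delta_out * value
        for j, row in enumerate(self.hidden):
            for i, value in enumerate(x):
                row[i] += rate * delta_hidden[j] * value

    def mean_squared_error(self, data):
        return sum((t - self.forward(x)[2]) ** 2 for x, t in data) / len(data)


data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]  # logical OR
net = TinyNetwork(n_inputs=2, n_hidden=2)
error_before = net.mean_squared_error(data)
for _ in range(2000):
    for inputs, target in data:
        net.train_example(inputs, target)
error_after = net.mean_squared_error(data)
print(error_before, error_after)
```

Here "until the network stabilizes" is approximated by a fixed number of passes over the data; a practical system would more likely stop when the error no longer improves.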
Evolution-Based Algorithms
Another class of machine learning algorithms consists of evolution-based algorithms that rely on analogies with natural processes and the Darwinian notion of survival of the fittest. Fogel (1994) identifies three categories of evolution-based algorithms: genetic algorithms, evolution strategies, and evolutionary programming. Genetic algorithms have proved popular and have been successfully applied to various optimization problems. They are based on genetic principles (Goldberg, 1989; Michalewicz, 1992). A population of individuals, in which each individual represents a potential solution, is first initialized. This population undergoes a set of genetic operations known as crossover and mutation. Crossover is a high-level process that aims at exploitation, and mutation is a unary process that aims at exploration. Individuals strive for survival based on a selection scheme that is biased toward selecting fitter individuals (individuals that represent better solutions). The selected
individuals form the next generation and the process continues. After a number of generations, the program converges and the optimum solution is represented by the best individual.
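The generational cycle described above can be sketched on a toy problem. The bit-string encoding, the "count the 1 bits" fitness function, tournament selection, single-point crossover, and all parameter values are assumptions made for illustration; the text does not prescribe any of them.

```python
import random


def fitness(individual):
    """Toy objective: the number of 1 bits in the individual."""
    return sum(individual)


def select(population, rng):
    """Tournament selection: the fitter of two random individuals is chosen."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b


def crossover(parent_a, parent_b, rng):
    """Single-point crossover combining material from two parents."""
    point = rng.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]


def mutate(individual, rng, rate=0.02):
    """Flip each bit independently with a small probability."""
    return [1 - bit if rng.random() < rate else bit for bit in individual]


def evolve(length=20, pop_size=30, generations=40, seed=1):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        population = [
            mutate(crossover(select(population, rng),
                             select(population, rng), rng), rng)
            for _ in range(pop_size)
        ]
        # Remember the best individual seen in any generation.
        best = max(population + [best], key=fitness)
    return best


best = evolve()
print(fitness(best), best)
```

The sketch runs for a fixed number of generations rather than testing for convergence, and it returns the best individual seen so far, which need not be the true optimum.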
Analytic Learning
Analytic learning represents knowledge as logical rules and performs reasoning on these rules to search for proofs. Proofs can be compiled into more complex rules to solve problems with a small number of searches required. For example, Samuelson and Rayner (1991) used analytic
learning to represent grammatical rules that improve the speed of a
parsing system. Although traditional analytic learning systems depend on hard computing
rules, usually no clear distinction exists between values and classes in the real world. To address this problem, fuzzy systems and fuzzy logic have been proposed. Fuzzy systems allow the values of
“false” or “true” to operate over the range of real numbers from zero to one (Zadeh, 1965). Fuzziness accommodates imprecision and approximate reasoning.
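A small sketch may make the idea of graded truth values concrete. It uses the minimum, maximum, and complement operators commonly associated with Zadeh's fuzzy sets; the `warm` membership function and its 15 to 25 degree ramp are hypothetical.

```python
def fuzzy_and(a, b):
    """Conjunction of two truth values in [0, 1]."""
    return min(a, b)


def fuzzy_or(a, b):
    """Disjunction of two truth values in [0, 1]."""
    return max(a, b)


def fuzzy_not(a):
    """Negation of a truth value in [0, 1]."""
    return 1.0 - a


def warm(temperature_c):
    """Hypothetical membership: 0 at or below 15 degrees C, 1 at or above 25."""
    return min(1.0, max(0.0, (temperature_c - 15.0) / 10.0))


# A temperature of 20 degrees is "warm" to degree 0.5 rather than true or false.
print(warm(20.0), fuzzy_and(warm(20.0), fuzzy_not(warm(30.0))))
```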
Hybrid Approaches
As Langley and Simon (1995, p. 56) have pointed out, the reasons for differentiating the paradigms are “more historical than scientific.” The boundaries between the different paradigms are usually unclear, and many systems combine different approaches. For example, fuzzy logic has been applied to rule induction and genetic algorithms (e.g., Mendes, Voznika, Freitas, & Nievola, 2001), genetic algorithms have been combined with neural networks (e.g., Maniezzo, 1994), and because the
neural network approach has a close resemblance to the probabilistic and fuzzy logic models, they can be easily combined.
Evaluation Methodologies
The accuracy of a learning system needs to be evaluated before it can be useful, and the limited availability of data often makes estimating accuracy a difficult task. A bad testing method could give a result of zero percent accuracy for a system with an estimated accuracy of 33 percent (Kohavi, 1995). Therefore, choosing a good methodology is very important to the evaluation of machine learning systems.
Several popular evaluation methods are in use, including holdout sampling, cross-validation, leave-one-out, and bootstrap sampling (Efron & Tibshirani, 1993; Stone, 1974). In the holdout method, the data are divided into a training set and a testing set. Usually two-thirds of the data are assigned to the training set and one-third to the testing set. After the system is trained by the training data, it needs to predict the output value of each instance in the testing set. These values are then compared with the real output values to determine accuracy.
In cross-validation, the data set is randomly divided into a number of subsets of roughly equal size. Ten-fold cross-validation, in which the data set is divided into ten subsets, is most commonly used. The system is then trained and tested for ten iterations, and in each iteration nine subsets of data are used as training data and the remaining subset as testing data. In rotation, each subset of data serves as the testing set in one iteration. The accuracy of the system is the average accuracy over the ten iterations. Leave-one-out is the extreme case of cross-validation, where the original data are split into n subsets, where n is the number of observations in the original data. The system is trained and tested for n iterations, in each of which n-1 instances are used for training and the remaining instance is used for testing.
In the bootstrap method, n independent random samples are taken from the original data set of size n. Because the samples are taken with replacement, the number of unique instances will generally be less than n. These samples are then used as the training set for the learning system, and the remaining data that have not been sampled are used to test the system.
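The data-splitting schemes described above can be sketched as three small functions. Only the splitting is shown; training and scoring a learning system on each split is left out. The function names and the fixed random seeds are assumptions.

```python
import random


def holdout_split(data, train_fraction=2 / 3, seed=0):
    """Shuffle the data and cut it into a training and a testing set."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def k_fold_splits(data, k=10, seed=0):
    """Yield (training, testing) pairs; each fold is the test set exactly once."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, folds[i]


def bootstrap_split(data, seed=0):
    """Draw n instances with replacement; unsampled instances form the test set."""
    rng = random.Random(seed)
    n = len(data)
    chosen = [rng.randrange(n) for _ in range(n)]
    chosen_set = set(chosen)
    training = [data[i] for i in chosen]
    testing = [data[i] for i in range(n) if i not in chosen_set]
    return training, testing


data = list(range(30))
for training, testing in k_fold_splits(data, k=10):
    print(len(training), len(testing))
```

Leave-one-out corresponds to calling `k_fold_splits(data, k=len(data))`, so that each testing set holds a single instance.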
Machine Learning for Information Retrieval: Pre-Web
Learning techniques had been applied in information retrieval applications long before the emergence of the Web. In their ARIST chapter, Cunningham, Witten, and Littin (1999) provided an extensive review of applications of machine learning techniques in IR. In this section, we briefly survey some of the research in this area, covering the use of machine learning in information extraction, relevance feedback, information filtering, text classification, and text clustering.
Information extraction is one area in which machine learning is applied in IR, by means of techniques designed to identify useful information from text documents automatically. Named-entity extraction is one of the most widely studied sub-fields. It refers to the automatic identification
from text documents of the names of entities of interest, such as persons (e.g., “John Doe”), locations (e.g., “Washington, D.C.”), and organizations (e.g., “National Science Foundation”). It also includes the identification of other patterns, such as dates, times, number expressions, dollar amounts, e-mail addresses, and Web addresses (URLs). The Message Understanding Conference (MUC) series has been the primary forum where researchers in this area meet and compare the performance
of their entity extraction systems (Chinchor, 1998).
Machine learning is one of the major approaches to entity extraction. Machine-learning-based entity extraction systems rely on algorithms rather than human-created rules to extract knowledge or identify patterns from texts. Examples of machine learning algorithms include neural networks, decision trees (Baluja, Mittal, & Sukthankar, 1999), hidden Markov models (Miller, Crystal, Fox, Ramshaw, Schwartz, Stone, et al., 1998), and entropy maximization (Borthwick, Sterling, Agichtein, & Grishman, 1998). Instead of relying on a single approach, most existing information extraction systems combine machine learning with other approaches (such as a rule-based or statistical approach). Many systems using a combined approach were evaluated at the MUC-7 conference. The best systems were able to achieve over 90 percent in both precision and recall rates in extracting persons, locations, organizations, dates, times, currencies, and percentages from a collection of New York Times news articles.
Relevance feedback is a well-known method used in IR systems to help
users conduct searches iteratively and reformulate search queries based on evaluation of previously retrieved documents (Ide, 1971; Rocchio, 1971). The main assumption is that documents relevant to a particular query are represented by a set of similar keywords (Salton, 1989). After a user rates the relevance of a set of retrieved documents, the query can be reformulated by adding terms from the relevant documents and subtracting terms from the irrelevant documents. It has been shown that a
single iteration of relevance feedback can significantly improve search precision and recall (Salton, 1989). Probabilistic techniques have been applied to relevance feedback by estimating the probability of relevance of a given document to a user. Using relevance feedback, a model can learn the common characteristics of a set of relevant documents in order to estimate the probability of relevance for the remaining documents in a collection (Fuhr & Buckley, 1991; Fuhr & Pfeifer, 1994). Various
machine learning algorithms, such as genetic algorithms, ID3, and simulated annealing, have been used in relevance feedback applications (Chen, Shankaranarayanan, Iyer, & She, 1998; Kraft, Petry, Buckles, & Sadasivan, 1995, 1997).
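The query reformulation step described above, adding terms from relevant documents and subtracting terms from irrelevant ones, can be sketched in the style of Rocchio's formula. The weights `alpha`, `beta`, and `gamma`, the raw term-count representation, and the toy documents are assumptions for illustration.

```python
from collections import Counter


def reformulate(query, relevant_docs, irrelevant_docs,
                alpha=1.0, beta=0.75, gamma=0.15):
    """Return new term weights for the query; non-positive weights are dropped.

    query and each document are lists of terms.
    """
    weights = Counter()
    for term, count in Counter(query).items():
        weights[term] += alpha * count
    for docs, signed_coeff in ((relevant_docs, beta), (irrelevant_docs, -gamma)):
        for doc in docs:
            for term, count in Counter(doc).items():
                # Average each group's contribution over its documents.
                weights[term] += signed_coeff * count / len(docs)
    return {term: w for term, w in weights.items() if w > 0}


# Hypothetical example: the user wants the animal, not the car.
new_query = reformulate(
    query=["jaguar"],
    relevant_docs=[["jaguar", "cat", "habitat"]],
    irrelevant_docs=[["jaguar", "car", "dealer"]],
)
print(new_query)
```

Terms that occur only in the irrelevant documents end up with negative weights and are removed, while terms shared by the relevant documents enter the reformulated query.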
Information filtering and recommendation techniques also apply user evaluation to improve IR system performance. The main difference is that, although relevance feedback helps users reformulate their search queries, information filtering techniques try to learn about users’ interests
from their evaluations and actions and then to use this information to analyze new documents. Information filtering systems are usually designed to alleviate the problem of information overload in IR systems. The Newsweeder system allows users to give an article a rating from one to five. After a user has rated a sufficient number of articles, the system learns the user’s interests from these examples and identifies Usenet news articles that the system predicts will be interesting to the user (Lang, 1995). Decision trees also have been used for news-article filtering (Green & Edwards, 1996). Another approach is collaborative filtering or recommender systems, in which collaboration is achieved as the system allows users to help one another perform filtering by recording their reactions to documents they read (Goldberg, Nichols, Oki, & Terry, 1992). One example is the GroupLens system, which performs collaborative filtering on Usenet news articles (Konstan, Miller, Maltz, Herlocker, Gordon, & Riedl, 1997). GroupLens recommends articles that may be of interest to a user based on the preferences of other users who have demonstrated similar interests. Many personalization and collaborative systems have been implemented as software agents to help users (Maes, 1994).
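The idea of recommending articles based on users with similar interests can be sketched as a user-based collaborative filter. This is not a description of how GroupLens or any other cited system is implemented: the use of Pearson correlation as the similarity measure, the prediction formula, and the toy ratings are assumptions.

```python
import math


def pearson(ratings_a, ratings_b):
    """Pearson correlation over the items both users have rated."""
    common = [item for item in ratings_a if item in ratings_b]
    if len(common) < 2:
        return 0.0
    mean_a = sum(ratings_a[i] for i in common) / len(common)
    mean_b = sum(ratings_b[i] for i in common) / len(common)
    cov = sum((ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in common)
    var_a = sum((ratings_a[i] - mean_a) ** 2 for i in common)
    var_b = sum((ratings_b[i] - mean_b) ** 2 for i in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def predict(user, item, all_ratings):
    """Predict a rating as the user's mean plus similarity-weighted deviations."""
    own = all_ratings[user]
    own_mean = sum(own.values()) / len(own)
    numerator = denominator = 0.0
    for other, ratings in all_ratings.items():
        if other == user or item not in ratings:
            continue
        similarity = pearson(own, ratings)
        other_mean = sum(ratings.values()) / len(ratings)
        numerator += similarity * (ratings[item] - other_mean)
        denominator += abs(similarity)
    return own_mean if denominator == 0 else own_mean + numerator / denominator


# Hypothetical ratings on a one-to-five scale.
ratings = {
    "ann": {"a1": 5, "a2": 1, "a3": 4},
    "bob": {"a1": 5, "a2": 2, "a3": 4, "a4": 5},
    "cat": {"a1": 1, "a2": 5, "a3": 2, "a4": 1},
}
print(predict("ann", "a4", ratings))
```

In this toy data "bob" agrees with "ann" and "cat" disagrees with her, so both push the prediction for article "a4" above her average rating.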
Text classification and text clustering studies have been reported extensively in the traditional IR literature. Text classification is the classification of textual documents into predefined categories (supervised learning), and text clustering groups documents into categories defined dynamically, based on their similarities (unsupervised learning). Although their usefulness continues to be debated (Hearst & Pedersen, 1996; Voorhees, 1985; Wu, Fuller, & Wilkinson, 2001), the use of classification and clustering is based on the cluster hypothesis: “closely associated documents tend to be relevant to the same requests.”
Machine learning is the basis of most text classification and clustering applications. Text classification has been extensively reported at the Association for Computing Machinery's (ACM) Special Interest Group on Information Retrieval (SIGIR) conferences and evaluated on standard test beds. For example, the naive Bayesian method has been widely used (e.g., Koller & Sahami, 1997; Lewis & Ringuette, 1994; McCallum, Nigam, Rennie, & Seymore, 1999). Using the joint probabilities of words and categories calculated by considering all documents, this method estimates the probability that a document belongs to a given category. Documents with a probability above a certain threshold are considered relevant. The k-nearest neighbor method is another widely used approach to text classification. For a given document, the k neighbors that are most similar to it are first identified (Iwayama & Tokunaga, 1995; Masand, Linoff, & Waltz, 1992). The categories of these neighbors are then used to categorize the given document. A threshold is used for each category. Neural network programs have also been applied to text classification, usually employing the feedforward/backpropagation neural network model (Lam & Lee, 1999; Ng, Goh, & Low, 1997; Wiener, Pedersen, & Weigend, 1995). Term frequencies, or tf×idf scores (term frequency multiplied by inverse document frequency), of the terms are used to form a vector (Salton, 1989), which can be used as the input to the network. Using learning examples, the network will be trained to predict the category of a document. Another technique used in text classification is the support vector machine (SVM), a statistical method that tries to find a hyperplane that best separates two classes (Vapnik, 1995, 1998). Joachims first applied SVM to text classification (Joachims, 1998). SVM achieved the best performance on the Reuters-21578 data set for document classification (Yang & Liu, 1999).
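As a sketch of two ideas from the paragraph above, the following builds tf×idf vectors and classifies a document by its k nearest neighbors. Several simplifications are assumptions: idf is computed as log(N/df) over a collection that includes the query document, similarity is the cosine measure, and a simple majority vote replaces the per-category thresholds mentioned in the text. The toy documents and category names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """documents: list of token lists. Returns one {term: tf*idf} dict each."""
    n = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    return [{term: tf * math.log(n / doc_freq[term])
             for term, tf in Counter(doc).items()}
            for doc in documents]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def knn_classify(query_vec, labeled_vecs, k=3):
    """labeled_vecs: list of (vector, category). Majority vote of k nearest."""
    nearest = sorted(labeled_vecs,
                     key=lambda pair: cosine(query_vec, pair[0]),
                     reverse=True)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


docs = [
    ["stock", "market", "shares"],
    ["market", "trading", "stock"],
    ["football", "match", "goal"],
    ["goal", "match", "league"],
    ["stock", "shares", "trading"],  # the document to classify
]
labels = ["finance", "finance", "sport", "sport"]
vectors = tfidf_vectors(docs)
predicted = knn_classify(vectors[4], list(zip(vectors[:4], labels)), k=3)
print(predicted)
```

The same vectors could serve as the input layer of a neural network or as the feature vectors of an SVM; only the nearest-neighbor step is specific to this sketch.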
As with text classification, text clustering tries to place documents into different categories based on their similarities. However, in text clustering no predefined categories are set; all categories are dynamically defined. Two types of clustering algorithms are generally used, namely hierarchical clustering and non-hierarchical clustering. The k-nearest neighbor method and Ward's algorithm (Ward, 1963) are the most widely used hierarchical clustering methods. Willett (1988) has provided an excellent review of hierarchical agglomerative clustering algorithms for document retrieval. For non-hierarchical clustering, one of the most common approaches is the K-means algorithm. It uses the technique of local optimization, in which a neighborhood of other partitions is defined for each partition. The algorithm starts with an initial set of clusters, examines each document, searches through the set of clusters, and moves the document to the cluster for which the distance between the document and the centroid is smallest. The centroid position is recalculated every time a document is added. The algorithm stops when all documents have been grouped into the final required number of clusters (Rocchio, 1966). The Single-Pass method (Hill, 1968) is also widely used. However, its performance depends on the order of the input vectors and it tends to produce large clusters (Rasmussen, 1992). Suffix Tree Clustering, a linear time clustering algorithm that identifies phrases common to groups of documents, is another incremental clustering technique (Zamir & Etzioni, 1998). Kraft, Bordogna, and Pasi (1999) and Chen, Mikulic, and Kraft (2000) also have proposed an approach to applying fuzzy clustering to information retrieval systems.
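The K-means procedure can be sketched as follows. Note one deliberate difference from the description above: this sketch uses the common batch variant, which reassigns all points and then recomputes every centroid, rather than updating a centroid each time a single document is added. The initialization from the first k points, the Euclidean distance, and the two-dimensional toy points are assumptions.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, k, max_iterations=100):
    centroids = list(points[:k])  # naive initialization: the first k points
    assignment = None
    for _ in range(max_iterations):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the partition is stable
        assignment = new_assignment
        # Recompute each centroid from its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = centroid(members)
    return assignment, centroids


points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.1), (0.9, 1.1), (9.2, 8.9)]
cluster_labels, centers = k_means(points, k=2)
print(cluster_labels)
```

For documents, the points would typically be term-weight vectors rather than two-dimensional coordinates, and the result depends on the initial choice of centroids.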
Another classification method much used in recent years is the neural network approach. For example, Kohonen’s self-organizing map (SOM), a type of neural network that produces a two-dimensional grid representation for n-dimensional features, has been widely applied in IR
(Kohonen, 1995; Lin, Soergel, & Marchionini, 1991; Orwig, Chen, & Nunamaker, 1997). The self-organizing map can be either multi-layered or single-layered. First, the input nodes, output nodes, and connection weights are initialized. Each element is then represented by a vector of
N terms and is presented to the system. The distance d_j between the input and each output node j is computed. A winning node with minimum d_j is then selected. After the network stabilizes, the top phrase from each node is selected as the label, and adjacent nodes with the same label are combined to form clusters.
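The steps listed above (initialize the weights, present each input vector, compute its distance to every output node, select the winning node) can be sketched as follows. The weight-update rule, the linearly decaying learning rate, the fixed one-step neighborhood, and the 3-by-3 grid are standard simplifications assumed here, not details given in the text; labeling nodes and merging them into clusters is omitted.

```python
import math
import random


def winning_node(weights, vector):
    """Return the grid position whose weight vector is closest to the input."""
    return min(weights, key=lambda pos: math.dist(weights[pos], vector))


def train_som(vectors, rows=3, cols=3, epochs=50, rate=0.5, seed=0):
    rng = random.Random(seed)
    dim = len(vectors[0])
    # One weight vector per output node on a rows-by-cols grid.
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        step = rate * (1.0 - epoch / epochs)  # learning rate decays to zero
        for vector in vectors:
            win_r, win_c = winning_node(weights, vector)
            for (r, c), w in weights.items():
                # Move the winner and its immediate grid neighbors
                # toward the input vector.
                if abs(r - win_r) <= 1 and abs(c - win_c) <= 1:
                    for i in range(dim):
                        w[i] += step * (vector[i] - w[i])
    return weights


vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0], [0.0, 0.1, 0.9]]
som = train_som(vectors)
print([winning_node(som, v) for v in vectors])
```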
Web Mining
Web mining research can be divided into three categories: Web content mining, Web structure mining, and Web usage mining.
Web content mining refers to the discovery of useful information from Web content, including text, images, audio, and video.
Web content mining research includes resource discovery from the Web (e.g., Chakrabarti, van den Berg, & Dom, 1999; Cho, Garcia-Molina, & Page, 1998), document categorization and clustering (e.g., Zamir & Etzioni, 1999; Kohonen, Kaski, Lagus, Salojarvi, Honkela, Paatero, et al., 2000), and information extraction from Web pages (e.g., Hurst, 2001). Web structure mining studies potential models underlying the link structures of the Web. It usually involves the analysis of in-links and out-links, and has been used for search engine result ranking and other Web applications (e.g., Brin & Page, 1998; Kleinberg, 1998). Web usage mining focuses on using data mining techniques to analyze search or other activity logs to find interesting patterns. One of the main applications of Web usage mining is to develop user profiles (e.g., Armstrong, Freitag, Joachims, & Mitchell, 1995; Wasfi, 1999).
Several major challenges apply to Web mining research. First, most Web documents are in HTML (HyperText Markup Language) format and contain many markup tags, mainly used for formatting. Although Web mining applications must parse HTML documents to deal with these markup tags, the tags can also provide additional information about the document. For example, a bold typeface markup (<b>) may indicate that a term is more important than other terms, which appear
in normal typeface. Such formatting cues have been widely used to determine the relevance of terms (Arasu, Cho, Garcia-Molina, Paepcke, & Raghavan, 2001).
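The use of markup as a formatting cue can be sketched with Python's standard `html.parser` module. The set of tags treated as emphasis and the boost factor of 2.0 are assumptions; the text says only that such cues have been used to determine the relevance of terms.

```python
from collections import Counter
from html.parser import HTMLParser


class WeightedTermCounter(HTMLParser):
    """Count terms, giving extra weight to text inside emphasis tags."""

    EMPHASIS = {"b", "strong", "h1", "h2", "title"}  # assumed tag set

    def __init__(self, boost=2.0):
        super().__init__()
        self.boost = boost
        self.depth = 0  # number of emphasis tags currently open
        self.weights = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in self.EMPHASIS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.EMPHASIS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        weight = self.boost if self.depth > 0 else 1.0
        for term in data.lower().split():
            self.weights[term] += weight


parser = WeightedTermCounter()
parser.feed("<p>Cheap <b>flights</b> and cheap hotels</p>")
parser.close()
print(parser.weights)
```

The sketch splits on whitespace only, so punctuation handling and stop-word removal would still be needed in practice.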
Second, traditional IR systems often contain structured and well-written documents (e.g., news articles, research papers, metadata), but this is not the case on the Web. Web documents are much more diverse in terms of length, structure, and writing style, and many Web pages contain grammatical and spelling errors. Web pages are also diverse in terms of language and subject matter; one can find almost any language and any topic on the Web. In addition, the Web has many different types of content, including text, image, audio, video, and executable files. Formats are equally numerous: HTML; Extensible Markup Language (XML); Portable Document Format (PDF); Microsoft Word; Moving Picture Experts Group, audio layer 3 (mp3); Waveform audio file (wav); RealAudio (ra); and Audio Video Interleaved (avi) animation file, to name just a few. Web applications have to deal with these different formats and retrieve the desired information.
Third, although most documents in traditional IR systems tend to remain static over time, Web pages are much more dynamic; they can be updated every day, every hour, or even every minute. Some Web pages do not in fact have a static form; they are dynamically generated on request, with content varying according to the user and the time of the request. This makes it much more difficult for retrieval systems such as search engines to generate an up-to-date search index of the Web. Another characteristic of the Web, perhaps the most important one, is the hyperlink structure. Web pages are hyperlinked to each other; it is through hyperlinking that a Web page author “cites” other Web pages.
Intuitively, the author of a Web page places a link to another Web page if he or she believes that it contains a relevant topic or is of good quality (Kleinberg, 1998). Anchor text, the underlined, clickable text of an outgoing link in a Web page, also provides a good description of the target page because it represents how other people linking to the page actually describe it. Several studies have tried to make use of anchor text or the adjacent text to predict the content of the target page (Amitay, 1998; Rennie & McCallum, 1999).
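Collecting anchor text together with the link target, as a first step toward describing the target page, can be sketched with Python's standard `html.parser` module. The class name and the sample page are hypothetical, and nested or malformed links are not handled.

```python
from html.parser import HTMLParser


class AnchorTextExtractor(HTMLParser):
    """Collect (target URL, anchor text) pairs from a page's outgoing links."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # URL of the link currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Join the text pieces and normalize internal whitespace.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


extractor = AnchorTextExtractor()
extractor.feed('<p>See the <a href="http://example.org/ml">machine '
               '<b>learning</b> tutorial</a> here.</p>')
extractor.close()
print(extractor.links)
```

Aggregating the anchor texts that many different pages use for the same target URL would give the kind of third-party description of a page that the paragraph above refers to.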
Lastly, the Web is larger than traditional data sources or document collections by orders of magnitude. The number of indexable Web pages exceeds two billion, and has been estimated to be growing at a rate of roughly one million pages per day (Lawrence & Giles, 1999; Lyman & Varian, 2000). Collecting, indexing, and analyzing these documents presents a great challenge. Similarly, the population of Web users is much larger than that of traditional information systems. Collaboration
among users is more feasible because of the availability of a large user base, but it can also be more difficult because of the heterogeneity of the user base.
In the next section, we review how machine learning techniques for traditional IR systems have been improved and adapted for Web mining applications, based on the characteristics of the Web. Significant work has been undertaken both in academia and industry. However, because most commercial applications do not disclose technical or algorithmic details, our review will focus largely on academic research.
Web Content Mining
Web content mining is mainly based on research in information retrieval and text mining, such as information extraction, text classification and clustering, and information visualization. However, it also includes some new applications, such as Web resource discovery. Some important Web content mining techniques and applications are reviewed in this subsection.
Text Mining for Web Documents
As discussed earlier, text mining is often considered a sub-field of data mining and refers to the extraction of knowledge from text documents (Chen, 2001; Hearst, 1999). Because the majority of documents on the Web are text documents, text mining for Web documents can be considered a sub-field of Web mining, or, more specifically, Web content mining. Information extraction, text classification, and text clustering are examples of text-mining applications that have been applied to Web documents.
Although information extraction techniques have been applied to plain text documents, extracting information from HTML Web pages can present a quite different problem. As has been mentioned, HTML documents contain many markup tags that can identify useful information. However, Web pages are also comparatively unstructured. Instead of a document consisting of paragraphs, a Web page can be a document composed of a sidebar with navigation links, tables with textual and numerical data, capitalized sentences, and repetitive words. The range of
formats and structures is very diverse across the Web. If a system could parse and understand such structures, it would effectively acquire additional information for each piece of text. For example, a set of links with a heading “Link to my friends’ homepages” may indicate a set of people’s
names and corresponding personal home page links. The header row of a table can also provide additional information about the text in the table cells. On the other hand, if these tags are not processed correctly but simply stripped off, the document may become much noisier.
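The idea of keeping markup instead of stripping it can be sketched with Python's standard `html.parser` module. This is only an illustration under stated assumptions: the class `LinkListParser`, the helper `extract_links`, and the sample page are hypothetical and are not taken from any system reviewed here. The parser pairs each link with the most recent heading, so the heading text remains available as context for the anchor text.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class LinkListParser(HTMLParser):
    """Collect (heading, anchor text, href) triples from an HTML page."""

    def __init__(self):
        super().__init__()
        self.heading = ""        # text of the most recent heading element
        self.links = []          # collected (heading, anchor_text, href)
        self._in_heading = False
        self._href = None        # href of the <a> element being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self._in_heading = True
            self.heading = ""
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_heading:
            self.heading += data
        elif self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in HEADINGS:
            self._in_heading = False
            self.heading = self.heading.strip()
        elif tag == "a" and self._href is not None:
            anchor = "".join(self._text).strip()
            self.links.append((self.heading, anchor, self._href))
            self._href = None


def extract_links(html):
    parser = LinkListParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample page, modeled on the example in the text.
PAGE = """
<h2>Links to my friends' homepages</h2>
<ul>
  <li><a href="http://example.org/~alice">Alice Smith</a></li>
  <li><a href="http://example.org/~bob">Bob Jones</a></li>
</ul>
"""
links = extract_links(PAGE)
```

Each triple keeps the heading next to the name and URL, which is the additional information that would be lost if the tags were simply removed.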
Chang and Lui (2001) used a PAT tree to automatically construct a set of rules for information extraction. The system, called IEPAD (Information Extraction Based on Pattern Discovery), reads an input Web page and looks for repetitive HTML markup patterns. After unwanted patterns have been filtered out, each pattern is used to form an extraction rule expressed as a regular expression. IEPAD has been tested in an experiment to extract search results from different search engines and achieved a high retrieval rate and accuracy. Wang and Hu (2002) used both decision tree and SVM to learn the patterns of table layouts in HTML documents. Layout features, content type, and word group features are combined and used as a document’s features. Experimental results show that both decision tree and SVM can detect tables in HTML documents with high accuracy. Bordogna and Pasi (2001) proposed a fuzzy indexing model that allows users to retrieve sections of structured documents such as HTML and XML. Doorenbos, Etzioni, and Weld (1997) also have applied machine learning in the ShopBot system to extract product information from Web pages.
Some commercial applications also extract useful information from Web pages. For instance, FlipDog (http://www.flipdog.com), developed by the Whizbang! Labs (http://www.inxight.com/whizbang), crawls the Web to identify job openings on employer Web sites. Lencom Software (http://www.lencom.com) also developed several products that can extract e-mail addresses and image information from the Web.
Although information extraction analyzes individual Web pages, text classification and text clustering analyze a set of Web pages. Again, Web pages consist mostly of HTML documents and are often noisier and less structured than traditional documents such as news articles and academic abstracts. In some applications the HTML tags are simply stripped from the Web documents and traditional algorithms are then applied to perform text classification and clustering. However, some useful characteristics of Web page design would be ignored. For example, Web page hyperlinks would be lost, but “Home,” “Click here,” and “Contact us” would be included as a document’s features. This creates a unique problem for performing text classification and clustering of Web documents because the format of HTML documents and the structure of the Web provide additional information for analysis. For example, text from neighboring documents has been used in an attempt to improve classification performance. However, experimental results show that this method does not improve performance because, often, too many neighbor terms and too many cross-linkages occur between different classes.
Likewise, text clustering algorithms have been applied to Web applications. The Suffix-Tree Clustering algorithm described earlier has been applied to the search results of the HuskySearch system. The self-organizing map (SOM) technique also has been applied to Web applications. Chen and colleagues (Chen, Fan, Chau, & Zeng, 2001; Chen, Chau, & Zeng, 2002) used a combination of noun phrasing and SOM to cluster the search results of search agents that collect Web pages by meta-searching popular search engines or performing a breadth-first search on particular Web sites. He, Zha, Ding, and Simon (2002) use a combination of content, hyperlink structure, and co-citation analysis in Web document clustering. Two Web pages are considered similar if they have similar content, they point to a similar set of pages, or many other pages point to both of them.
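The three similarity signals just described (content, shared out-links, and co-citation) can be combined in a single score. The sketch below is an illustration only and is not the method of He et al.: the function names, the use of cosine and Jaccard measures, and the weights are all assumptions made for the example.

```python
import math
from collections import Counter


def cosine(terms_a, terms_b):
    """Cosine similarity between two bags of words."""
    a, b = Counter(terms_a), Counter(terms_b)
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def jaccard(set_a, set_b):
    """Overlap between two sets of URLs (0 when both are empty)."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def page_similarity(p, q, weights=(0.5, 0.25, 0.25)):
    """Weighted mix of content, out-link, and in-link (co-citation) overlap.

    The weights are arbitrary illustrative values.
    """
    w_content, w_out, w_in = weights
    return (w_content * cosine(p["terms"], q["terms"])
            + w_out * jaccard(p["out_links"], q["out_links"])
            + w_in * jaccard(p["in_links"], q["in_links"]))


# Hypothetical pages: a and b share terms and links, c is unrelated.
a = {"terms": ["web", "mining", "links"], "out_links": {"u1", "u2"}, "in_links": {"v1"}}
b = {"terms": ["web", "mining", "graph"], "out_links": {"u1", "u3"}, "in_links": {"v1", "v2"}}
c = {"terms": ["cooking", "pasta"], "out_links": {"u9"}, "in_links": {"v7"}}
```

A clustering algorithm could then use `page_similarity` as its pairwise measure; which weighting works best would have to be determined experimentally.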
Intelligent Web Spiders
Web spiders, also known as crawlers, wanderers, or Webbots, have been defined as “software programs that traverse the World Wide Web information space by following hypertext links and retrieving Web documents by standard HTTP protocol.” Since the early days of the Web, spiders have been widely used to build the underlying databases of search engines, perform personal searches, archive particular Web sites or even the whole Web, or collect Web statistics (e.g., Broder, Kumar, Maghoul, Raghavan, Rajagopalan, Stata, et al., 2000). Chau and Chen (2003) provide a review of Web spider research.
Although most spiders use simple algorithms such as breadth-first search, some use more advanced algorithms. These spiders are very useful for Web resource discovery. For example, the Itsy Bitsy Spider searches the Web using a best-first search and a genetic algorithm approach. Each URL is modeled as an individual in the initial population. Crossover is defined as extracting the URLs that are pointed to by multiple starting URLs. Mutation is modeled by retrieving random URLs from Yahoo!. Because the genetic algorithm approach is an optimization process, it is well-suited to finding the best Web pages according to particular criteria.
Webnaut is another spider that uses a genetic algorithm (Zacharis & Panayiotopoulos, 2001).
Other advanced search algorithms have been used in personal spiders. Yang, Yen, and Chen (2000) applied hybrid simulated annealing in a personal spider application. Focused Crawler located Web pages relevant to a predefined set of topics based on example pages provided by the user (Chakrabarti, van den Berg, & Dom, 1999). It determined the relevance of each page using a naive Bayesian model and the analysis of the link structures among the Web pages collected using the HITS algorithm (discussed in more detail in the section on Web structure mining). These values are used to judge which URL links to follow. Another similar system, Context Focused Crawler, also uses a naive Bayesian classifier to guide the search process (Diligenti, Coetzee, Lawrence, Giles, & Gori, 2000).
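A spider that uses relevance values to decide which links to follow can be sketched as a best-first search over a priority queue. The example below is a simplified illustration, not the algorithm of any crawler cited above: it runs over a small in-memory graph instead of the real Web, and the scoring function is a hypothetical keyword-overlap measure standing in for a trained classifier. Links inherit the score of the page on which they were found.

```python
import heapq


def best_first_crawl(seeds, fetch, score, max_pages=10):
    """Visit pages in order of the relevance of the page that linked to them.

    fetch(url) must return (text, out_links); score(text) returns a number.
    """
    frontier = [(0.0, url) for url in seeds]   # (negated priority, url)
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        text, out_links = fetch(url)
        visited.append(url)
        priority = score(text)
        for link in out_links:
            if link not in seen:
                seen.add(link)
                # heapq is a min-heap, so the priority is negated.
                heapq.heappush(frontier, (-priority, link))
    return visited


# Hypothetical toy Web: url -> (page text, outgoing links).
WEB = {
    "seed": ("web mining overview", ["a", "b"]),
    "a": ("cooking recipes", ["c"]),
    "b": ("web mining and machine learning", ["d"]),
    "c": ("more recipes", []),
    "d": ("machine learning for web mining", []),
}
TOPIC = {"web", "mining", "learning"}


def keyword_score(text):
    """Fraction of topic terms that occur in the page text."""
    return len(TOPIC & set(text.split())) / len(TOPIC)


visited = best_first_crawl(["seed"], WEB.__getitem__, keyword_score)
```

In this toy run, page `d` (linked from the relevant page `b`) is fetched before page `c` (linked from the irrelevant page `a`), which is the behavior that distinguishes best-first from breadth-first crawling.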
Chau and Chen (in press) apply the Hopfield Net spreading activation to collect Web pages in particular domains. Each Web page is represented as a node in the network and hyperlinks are represented simply as links between the nodes. Each node is assigned an activation score, which is a weighted sum of content and link scores. The content score is calculated by comparing the content of the page with a domain-specific lexicon, and the link score is based on the number of outgoing links in a page. Each node also inherits the scores from its parent nodes. Nodes are then activated in parallel and activation values from different sources are combined for each individual node until the activation scores of nodes on the network reach a stable state (convergence). Relevance feedback also has been applied in spiders (Balabanovic & Shoham, 1995; Vrettos & Stafylopatis, 2001). These spiders determine the next URL to visit based on the user’s ratings of the relevance of the Web pages returned.
Google Webmaster Tools provides information about how the Googlebot spider crawls the pages you own. Its Crawl section includes the robots.txt Tester; a green indicator there means that Google's Web crawler, Googlebot, has crawled the given page. Clicking robots.txt Tester opens a page that shows the sitemap Googlebot uses to locate your site and whether the site is allowed for Googlebot. In the author's example, the site is allowed.
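The same kind of check can be reproduced offline with Python's standard `urllib.robotparser` module. This is a minimal sketch: the robots.txt content and the example.com URLs are made up for illustration, and the result reflects only how this module interprets the rules, not how Googlebot itself behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot may crawl everything except /private/,
# and all other spiders are excluded from the whole site.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

googlebot_home = parser.can_fetch("Googlebot", "http://example.com/index.html")
googlebot_private = parser.can_fetch("Googlebot", "http://example.com/private/x.html")
other_home = parser.can_fetch("OtherBot", "http://example.com/index.html")
```

A well-behaved spider would call `can_fetch` before requesting each URL and skip the pages for which it returns `False`.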
Multilingual Web Mining
The number of non-English documents on the Web continues to grow; more than 30 percent of Web pages are in a language other than English. In order to extract non-English knowledge from the Web, Web mining systems have to deal with issues in language-specific text processing. One might think that this would not be a problem because the base algorithms behind most machine learning systems are language-independent. Most algorithms, such as text classification and clustering, need only a set of features (a vector of keywords) for the learning process. However, the algorithms usually depend on some phrase segmentation and extraction programs to generate a set of features or keywords to represent Web documents. Many existing extraction programs, especially those employing a linguistic approach (e.g., Church, 1988), are language-dependent and work only with English texts. In order to perform analysis on non-English documents, Web mining systems must use the corresponding phrase extraction program for each language. Other learning algorithms, such as information extraction and entity extraction, also have to be tailored for different languages.
Some segmentation and extraction programs are language-independent. These programs usually employ a statistical or a machine learning approach. For example, the mutual-information-based PAT-Tree algorithm is a language-independent technique for key phrase extraction and has been tested on Chinese documents (Chien, 1997; Ong & Chen, 1999). Similarly, Church and Yamamoto (2001) use suffix arrays to perform phrase extraction. Because these programs do not rely on specific linguistic rules, they can be easily modified to work with different languages.
Web Visualization
Because it is often difficult to extract useful content from the Web, visualization tools have been used to help users maintain a “big picture” of a set of retrieval results from search engines, particular Web sites, a subset of the Web, or even the whole Web. Various techniques have been developed in the past decade. For example, many systems visualize the Web as a tree structure based on the outgoing links of a set of starting nodes (e.g., Huang, Eades, & Cohen, 1998). The best-known example of this approach is the hyperbolic tree developed by Xerox PARC (Lamping & Rao, 1996), which employs the “focus+context” technique to show Web sites as a tree structure using a hyperbolic view. Users can focus on the document they are looking at and maintain an overview of the context at the same time. A map is another metaphor widely used for Web visualization. The ET-Map provides a visualization of the manually cataloged Entertainment hierarchy of Yahoo! as a two-dimensional map (Chen, Schuffels, & Orwig, 1996). Some 110,000 Web pages are clustered into labeled regions based on the self-organizing map approach, in which larger regions represent more important topics, and regions close to each other represent topics that are similar (Lin, Chen, & Nunamaker, 2000). The WEBSOM system also utilizes the SOM algorithm to cluster over a million Usenet newsgroup documents (Kohonen, 1995; Lagus, Honkela, Kaski, & Kohonen, 1999). Other examples of Web visualization include WebQuery, which uses a bullseye’s view to visualize Web search results based on link structure (Carrière & Kazman, 1997), WebPath, which visualizes a user’s trail as he or she browses the Web (Frécon & Smith, 1998), and three-dimensional models such as Natto View (Shiozawa & Matsushita, 1997) and Narcissus (Hendley, Drew, Wood, & Beale, 1995). Dodge and Kitchin (2001) provide a comprehensive review of cybermaps generated since the inception of the Internet.
In these visualization systems, machine learning techniques are often used to determine how Web pages should be placed in the 2-D or 3-D space. One example is the SOM algorithm described in the section on pre-Web IR (Chen et al., 1996). Web pages are represented as vectors of keywords and used to train the network that contains a two-dimensional grid of output nodes. The distance between the input and each output node is then computed and the node with the least distance is selected. After the network is trained through repeated presentation of all inputs, the documents are submitted to the trained network and each region is labeled by a phrase, the key concept that best represents the cluster of documents in that region. Multidimensional scaling (MDS) is another method that can position documents on a map. It tries to map high dimensionality (e.g., document vectors) to low dimensionality (usually 2D) by solving a minimization problem (Cox & Cox, 1994). It has been tested with document mapping and the results are encouraging (McQuaid, Ong, Chen, & Nunamaker, 1999).
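The SOM steps described above (compute the distance from an input to every output node, select the closest node, and repeat over all inputs) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the ET-Map or WEBSOM implementation: the grid size, learning-rate schedule, neighborhood rule, and document vectors are all arbitrary choices made for the example.

```python
import random


def best_matching_unit(weights, vector):
    """Return the grid coordinate of the node closest to `vector`."""
    def squared_distance(node):
        return sum((w - v) ** 2 for w, v in zip(weights[node], vector))
    return min(weights, key=squared_distance)


def train_som(data, rows, cols, epochs=50, rate=0.5, seed=0):
    """Train a small self-organizing map on a list of equal-length vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        alpha = rate * (1 - epoch / epochs)         # decaying learning rate
        radius = 1 if epoch < epochs // 2 else 0    # shrinking neighborhood
        for vector in data:
            br, bc = best_matching_unit(weights, vector)
            for (r, c), w in weights.items():
                # Move the winner and its grid neighbors toward the input.
                if abs(r - br) + abs(c - bc) <= radius:
                    for i in range(dim):
                        w[i] += alpha * (vector[i] - w[i])
    return weights


# Hypothetical keyword vectors for four documents on two topics.
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0], [0.0, 0.1, 0.9]]
som = train_som(docs, rows=1, cols=2)
placement = [best_matching_unit(som, d) for d in docs]
```

After training, `placement` gives the map node assigned to each document; labeling each node with a representative phrase would be a separate step.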
The Semantic Web
A recent significant extension of the Web is the Semantic Web (Berners-Lee, Hendler, & Lassila, 2001), which seeks to add metadata to describe data and information, based on such standards as RDF (Resource Description Framework) and XML. The idea is that Web documents will no longer be unstructured text; they will be labeled with meaning that can be understood by computers. Machine learning can play three important roles in the Semantic Web. First, machine learning
can be used to automatically create the markup or metadata for existing unstructured textual documents on the Web. It is very difficult and time-consuming for Web page authors to generate Web pages manually, according to the Semantic Web representation. To address this problem, information extraction techniques, such as entity extraction, can be applied to automate or semi-automate tasks such as identifying entities in Web pages and generating the corresponding XML tags. Second,
machine learning techniques can be used to create, merge, update, and maintain ontologies. Ontology, the explicit representation of knowledge combined with domain theories, is one of the key elements in the Semantic Web (Berners-Lee et al., 2001; Fensel & Musen, 2001). Maedche and Staab (2001) propose a framework for knowledge acquisition using machine learning. In that framework, machine learning techniques, such as association rule mining or clustering, are used to extract knowledge from Web documents in order to create new ontologies or improve existing ones. Third, machine learning can understand and perform reasoning on the metadata provided by the Semantic Web in order to extract knowledge from the Web more effectively. The documents in the Semantic Web are much more precise, more structured, and less “noisy” than the general, syntactic Web. The Semantic Web also provides context and background information for analyzing Web pages. It is believed that the Semantic Web can greatly improve the performance of Web mining systems (Berendt, Hotho, & Stumme, 2002).
Web Structure Mining
A practical illustration on any Web page can be seen in Google Webmaster Tools, which can fetch a page's HTML code and locate structure within the page using tools such as the Structured Data Testing Tool and the Structured Data Markup Helper. The results appear in the Structured Data section of Google Webmaster Tools.
In recent years, Web link structure has been widely used to infer important information about Web pages. Web structure mining has been largely influenced by research in social network analysis and citation analysis (bibliometrics). Citations (linkages) among Web pages are usually indicators of high relevance or good quality. We use the term in-links to indicate the hyperlinks pointing to a page and the term out-links to indicate the hyperlinks found in a page. Usually, the larger the number
of in-links, the more useful a page is considered to be. The rationale is that a page referenced by many people is likely to be more important than a page that is seldom referenced. As in citation analysis, an often-cited article is presumed to be better than one that is never cited. In addition, it is reasonable to give a link from an authoritative source (such as Yahoo!) a higher weight than a link from an unimportant personal home page.
By analyzing the pages containing a URL, we can also obtain the anchor text that describes it. Anchor text shows how other Web page authors annotate a page and can be useful in predicting the content of the target page. Several algorithms have been developed to address this issue.
Among various Web-structure mining algorithms, PageRank and HITS (Hyperlink-Induced Topic Search) are the two most widely used. The PageRank algorithm is computed by weighting each in-link to a page proportionally to the quality of the page containing the in-link (Brin & Page, 1998). The qualities of these referring pages also are determined by PageRank. Thus, the PageRank of a page p is calculated recursively as follows:
$$PageRank(p) = (1 - d) + d \times \sum_{q \text{ linking to } p} \frac{PageRank(q)}{c(q)}$$
where d is a damping factor between 0 and 1 and c(q) is the number of outgoing links in page q.
A Web page has a high PageRank score if it is linked from many other pages, and the scores will be even higher if these referring pages are also good pages (pages that have high PageRank scores). It is also interesting to note that the PageRank algorithm follows a random walk model: the PageRank of a page is proportional to the probability that a random surfer clicking on random links will arrive at that page.
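The recursive definition can be evaluated by simple iteration. The sketch below assumes the formulation given by Brin and Page (1998), in which the score of a page is (1 - d) plus d times the sum, over every page q linking to it, of the score of q divided by the number of out-links of q. The damping factor 0.85, the iteration count, and the three-page graph are illustrative assumptions, and the sketch assumes every page has at least one out-link.

```python
def pagerank(out_links, d=0.85, iterations=100):
    """Iteratively compute PageRank scores for a dict {page: [linked pages]}."""
    pages = set(out_links) | {q for targets in out_links.values() for q in targets}
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # Each referring page q shares its score equally among its out-links.
            incoming = sum(rank[q] / len(out_links[q])
                           for q in pages
                           if p in out_links.get(q, ()))
            new_rank[p] = (1 - d) + d * incoming
        rank = new_rank
    return rank


# Hypothetical graph: A links to B and C, B links to C, C links back to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

In this graph, page C receives links from both A and B and ends up with the highest score, while B, which is linked only from A and receives half of A's score, ends up with the lowest.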
Many machine learning systems have been developed over the past decades. One classification identifies five major areas of machine learning research, namely neural networks, case-based learning, genetic algorithms, rule induction, and analytic learning. Another identifies three classes of machine learning techniques: symbolic learning, neural networks, and evolution-based algorithms. Drawing on these two classifications and a review of the field, we have adopted a similar framework and have identified the following five major paradigms: (1) probabilistic models, (2) symbolic learning and rule induction, (3) neural networks, (4) evolution-based models, and (5) analytic learning and fuzzy logic.
Probabilistic Models
The most popular example is the Bayesian method. Originating in pattern recognition research (Duda & Hart, 1973), this method was often used to classify different objects into predefined classes based on a set of features. A Bayesian model stores the probability of each class, the probability of each feature, and the probability of each feature given each class, based on the training data. When a new instance is encountered, it can be classified according to these probabilities. A variation of the Bayesian model, called the naive Bayesian model, assumes that all features are mutually independent within each class. Because of its simplicity, the naive Bayesian model has been widely used in various applications in different domains.
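A naive Bayesian classifier along these lines can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes a multinomial model over word features with add-one (Laplace) smoothing, which is one common variant, and the function names and training examples are made up.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: list of (list_of_features, class_label) pairs."""
    class_counts = Counter(label for _, label in examples)
    feature_counts = defaultdict(Counter)   # label -> feature -> count
    vocabulary = set()
    for features, label in examples:
        feature_counts[label].update(features)
        vocabulary.update(features)
    return class_counts, feature_counts, vocabulary


def classify(model, features):
    """Return the class with the highest log posterior for `features`."""
    class_counts, feature_counts, vocabulary = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        # Add-one smoothing: every vocabulary feature gets one extra count.
        denominator = sum(feature_counts[label].values()) + len(vocabulary)
        score = math.log(count / total)
        for f in features:
            if f in vocabulary:              # ignore features never seen in training
                score += math.log((feature_counts[label][f] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data.
model = train_naive_bayes([
    (["web", "mining", "link"], "tech"),
    (["crawler", "web", "spider"], "tech"),
    (["recipe", "cooking", "pasta"], "food"),
    (["cooking", "oven"], "food"),
])
```

Summing log probabilities over the features is exactly where the independence assumption enters: each feature contributes to the score without regard to the others.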
Symbolic Learning and Rule Induction
Symbolic learning can be classified according to the underlying learning strategy, such as rote learning, learning by instruction, learning by analogy, learning from examples, and learning from discovery (Carbonell, Michalski, & Mitchell, 1983; Cohen & Feigenbaum, 1982). Among these,
learning from examples appears to be the most promising symbolic learning technique for knowledge discovery and data mining. It is implemented by applying an algorithm that attempts to induce the general concept description, which best describes the different classes of the training examples. Numerous algorithms have been developed, each using one or more techniques to identify patterns that are helpful in generating a concept description. Quinlan’s ID3 decision-tree building algorithm (Quinlan, 1983), and variations such as C4.5 (Quinlan, 1993), have become some of the most widely used symbolic learning techniques. Given a set of objects, ID3 produces a decision tree that attempts to classify all the objects correctly. At each step, the algorithm finds the attribute that best divides the objects into the different classes by minimizing entropy (information uncertainty). After all objects have been classified, or all attributes have been used, the results can be represented by a decision tree or a set of production rules.
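The attribute-selection step of ID3 can be illustrated with a short sketch. It covers only the entropy calculation and the choice of the best attribute at one node, not the full recursive tree construction; the attribute names and the four training objects are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Reduction in entropy obtained by splitting on `attribute`."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(part) / len(rows) * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels, attributes):
    """Pick the attribute whose split leaves the least entropy."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Hypothetical objects: Web pages described by two Boolean attributes.
rows = [
    {"has_table": True, "many_links": False},
    {"has_table": True, "many_links": True},
    {"has_table": False, "many_links": True},
    {"has_table": False, "many_links": False},
]
labels = ["data", "data", "nav", "nav"]
chosen = best_attribute(rows, labels, ["has_table", "many_links"])
```

Here `has_table` separates the two classes perfectly, so it would be chosen as the root of the tree, while `many_links` provides no information gain.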
Neural Networks
Artificial neural networks attempt to achieve human-like performance by modeling the human nervous system. A neural network is a graph of many active nodes (neurons), which are connected to each other by weighted links (synapses). Although knowledge is represented by symbolic descriptions such as decision tree and production rules in symbolic learning, knowledge is learned and remembered by a network of interconnected neurons, weighted synapses, and threshold logic units. Based on training examples, learning algorithms can be used to adjust the connection weights in the network so that it can predict or classify unknown examples correctly. Activation algorithms over the nodes can then be used to retrieve concepts and knowledge from the network. Many different types of neural networks have been developed, among which the feedforward/backpropagation model is the most widely used. Backpropagation networks are fully connected, layered, feed-forward networks in which activations flow from the input layer through the hidden layer and then to the output layer (Rumelhart, Hinton, & Williams, 1986). The network usually starts with a set of random weights and adjusts its weights according to each learning example. Each learning example is passed through the network to activate the nodes. The network’s actual output is then compared with the target output and the error estimates are propagated back to the hidden and input layers. The network updates its weights incrementally according to these error estimates until the network stabilizes. Other popular neural network models include Kohonen’s self-organizing map and the Hopfield network. Self-organizing maps have been widely used in unsupervised learning, clustering, and pattern recognition; Hopfield networks have been used mostly in search and optimization applications.
Evolution-Based Algorithms
Another class of machine learning algorithms consists of evolution-based algorithms that rely on analogies with natural processes and the Darwinian notion of survival of the fittest. Fogel (1994) identifies three categories of evolution-based algorithms: genetic algorithms, evolution strategies, and evolutionary programming. Genetic algorithms have proved popular and have been successfully applied to various optimization problems. They are based on genetic principles (Goldberg, 1989; Michalewicz, 1992). A population of individuals in which each individual represents a potential solution is first initiated. This population undergoes a set of genetic operations known as crossover and mutation. Crossover is a high-level process that aims at exploitation, and mutation is a unary process that aims at exploration. Individuals strive for survival based on a selection scheme that is biased toward selecting fitter individuals (individuals that represent better solutions). The selected
individuals form the next generation and the process continues. After a number of generations, the program converges and the optimum solution is represented by the best individual.
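The cycle of selection, crossover, and mutation can be sketched on a toy problem. The example below is an illustration under stated assumptions: individuals are bit strings, selection is a two-way tournament, crossover is single-point, and the fitness function simply counts the ones in the string. None of these choices is prescribed by the works cited above.

```python
import random


def genetic_algorithm(fitness, length, pop_size=30, generations=60,
                      crossover_rate=0.9, mutation_rate=0.02, seed=0):
    """Evolve bit strings of `length` bits and return the fittest one seen."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]

    def select():
        # Tournament of size two: biased toward fitter individuals.
        a, b = rng.choice(population), rng.choice(population)
        return a if fitness(a) >= fitness(b) else b

    best = max(population, key=fitness)
    for _ in range(generations):
        next_generation = []
        while len(next_generation) < pop_size:
            mother, father = select(), select()
            if rng.random() < crossover_rate:
                point = rng.randrange(1, length)        # single-point crossover
                child = mother[:point] + father[point:]
            else:
                child = mother[:]
            # Mutation flips each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            next_generation.append(child)
        population = next_generation
        best = max(population + [best], key=fitness)
    return best


# Toy fitness function: the number of ones in the bit string.
best = genetic_algorithm(sum, length=20)
```

Keeping track of the best individual across generations is a convenience added here so that a good solution is not lost if it fails to survive into the final population.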
Analytic Learning
Analytic learning represents knowledge as logical rules and performs reasoning on these rules to search for proofs. Proofs can be compiled into more complex rules to solve problems with a small number of searches required. For example, Samuelson and Rayner (1991) used analytic learning to represent grammatical rules that improve the speed of a parsing system.
Although traditional analytic learning systems depend on hard computing rules, usually no clear distinction exists between values and classes in the real world. To address this problem, fuzzy systems and fuzzy logic have been proposed. Fuzzy systems allow the values of “false” or “true” to operate over the range of real numbers from zero to one (Zadeh, 1965). Fuzziness accommodates imprecision and approximate reasoning.
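Truth values between zero and one can be combined with the standard fuzzy operators, in which AND is the minimum, OR is the maximum, and NOT is the complement. The sketch below is a minimal illustration: the ramp-shaped membership function and its thresholds are hypothetical choices, not taken from any system discussed here.

```python
def fuzzy_and(a, b):
    """Fuzzy conjunction: the smaller of the two truth values."""
    return min(a, b)


def fuzzy_or(a, b):
    """Fuzzy disjunction: the larger of the two truth values."""
    return max(a, b)


def fuzzy_not(a):
    """Fuzzy negation: the complement of the truth value."""
    return 1.0 - a


def is_relevant(score, low=0.2, high=0.8):
    """Degree to which a page is 'relevant': 0 below low, 1 above high."""
    if score <= low:
        return 0.0
    if score >= high:
        return 1.0
    return (score - low) / (high - low)


# A page that is moderately relevant and fairly recent (made-up degrees).
relevant = is_relevant(0.5)
recent = 0.7
relevant_and_recent = fuzzy_and(relevant, recent)
```

With hard rules the page would have to be classified as either relevant or not; here it is relevant to a degree of about one half, and that degree carries through the combined rule.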
Hybrid Approaches
As Langley and Simon (1995, p. 56) have pointed out, the reasons for differentiating the paradigms are “more historical than scientific.” The boundaries between the different paradigms are usually unclear, and many systems combine different approaches. For example, fuzzy logic has been applied to rule induction and genetic algorithms (e.g., Mendes, Voznika, Freitas, & Nievola, 2001), genetic algorithms have been combined with neural networks (e.g., Maniezzo, 1994), and because the
neural network approach has a close resemblance to the probabilistic and fuzzy logic models, they can be easily combined.
Evaluation Methodologies
The accuracy of a learning system needs to be evaluated before it can be useful, and the limited availability of data often makes estimating accuracy a difficult task. A bad testing method could give a result of zero percent accuracy for a system with an estimated accuracy of 33 percent (Kohavi, 1995). Therefore, choosing a good methodology is very important to the evaluation of machine learning systems.
Several popular evaluation methods are in use, including holdout sampling, cross validation, leave-one-out, and bootstrap sampling (Efron & Tibshirani, 1993; Stone, 1974). In the holdout method, the data are divided into a training set and a testing set. Usually two-thirds of the data are assigned to the training set and one-third to the testing set. After the system is trained by the training data, it needs to predict the output value of each instance in the testing set. These values are then compared with the real output values to determine accuracy.
In cross-validation, the data set is randomly divided into a number of subsets of roughly equal size. Ten-fold cross validation, in which the data set is divided into ten subsets, is most commonly used. The system is then trained and tested for ten iterations, and in each iteration nine subsets of data are used as training data and the remaining set as testing data. In rotation, each subset of data serves as the testing set in one iteration. The accuracy of the system is the average accuracy over the ten iterations. Leave-one-out is the extreme case of cross-validation, where the original data are split into n subsets, where n is the number of observations in the original data. The system is trained and tested for n iterations, in each of which n-1 instances are used for training and the remaining instance is used for testing.
In the bootstrap method, n independent random samples are taken from the original data set of size n. Because the samples are taken with replacement, the number of unique instances will be less than n. These samples are then used as the training set for the learning system, and the remaining data that have not been sampled are used to test the system.
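The sampling schemes described above differ only in how instance indices are divided between training and testing. The sketch below generates those index splits; the function names and the fixed random seeds are assumptions made for illustration, and the learning system itself is left out.

```python
import random


def k_fold_splits(n, k=10, seed=0):
    """Yield (train_indices, test_indices) for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]     # k subsets of roughly equal size
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def leave_one_out_splits(n):
    """Leave-one-out is k-fold cross-validation with k equal to n."""
    return k_fold_splits(n, k=n)


def bootstrap_split(n, seed=0):
    """Sample n indices with replacement; unsampled indices form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]
    sampled = set(train)
    test = [i for i in range(n) if i not in sampled]
    return train, test


folds = list(k_fold_splits(25, k=5))
boot_train, boot_test = bootstrap_split(25)
```

To estimate accuracy, a system would be trained on each `train` list, evaluated on the matching `test` list, and the resulting accuracies averaged over all iterations.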
Machine Learning for Information Retrieval: Pre-Web
Learning techniques had been applied in information retrieval applications long before the emergence of the Web. In their ARIST chapter, Cunningham, Witten, and Littin (1999) provided an extensive review of applications of machine learning techniques in IR. In this section, we briefly survey some of the research in this area, covering the use of machine learning in information extraction, relevance feedback, information filtering, text classification, and text clustering.
Information extraction is one area in which machine learning is applied in IR, by means of techniques designed to identify useful information from text documents automatically. Named-entity extraction is one of the most widely studied sub-fields. It refers to the automatic identification
from text documents of the names of entities of interest, such as persons (e.g., “John Doe”), locations (e.g., “Washington, D.C.”), and organizations (e.g., “National Science Foundation”). It also includes the identification of other patterns, such as dates, times, number expressions, dollar amounts, e-mail addresses, and Web addresses (URLs). The Message Understanding Conference (MUC) series has been the primary forum where researchers in this area meet and compare the performance
of their entity extraction systems (Chinchor, 1998).
Machine learning is one of the major approaches. Machine-learning-based entity extraction systems rely on algorithms rather than human-created rules to extract knowledge or identify patterns from texts. Examples of machine learning algorithms include neural networks, decision trees (Baluja, Mittal, & Sukthankar, 1999), hidden Markov models (Miller, Crystal, Fox, Ramshaw, Schwartz, Stone, et al., 1998), and entropy maximization (Borthwick, Sterling, Agichtein, & Grishman, 1998). Instead of relying on a single approach, most existing information extraction systems combine machine learning with other approaches (such as a rule-based or statistical approach). Many systems using a combined approach were evaluated at the MUC-7 conference. The best systems were able to achieve over 90 percent in both precision and recall rates in extracting persons, locations, organizations, dates, times, currencies, and percentages from a collection of New York Times news articles.
Relevance feedback is a well known method used in IR systems to help
users conduct searches iteratively and reformulate search queries based on evaluation of previously retrieved documents (Ide, 1971; Rocchio, 1971). The main assumption is that documents relevant to a particular query are represented by a set of similar keywords (Salton, 1989). After a user rates the relevance of a set of retrieved documents, the query can be reformulated by adding terms from the relevant documents and subtracting terms from the irrelevant documents. It has been shown that a
single iteration of relevance feedback can significantly improve search precision and recall (Salton, 1989). Probabilistic techniques have been applied to relevance feedback by estimating the probability of relevance of a given document to a user. Using relevance feedback, a model can learn the common characteristics of a set of relevant documents in order to estimate the probability of relevance for the remaining documents in a collection (Fuhr & Buckley, 1991; Fuhr & Pfeifer, 1994). Various
machine learning algorithms, such as genetic algorithms, ID3, and simulated annealing, have been used in relevance feedback applications (Chen, Shankaranarayanan, Iyer, & She, 1998; Kraft, Petry, Buckles, & Sadasivan, 1995, 1997).
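Query reformulation by adding terms from relevant documents and subtracting terms from irrelevant ones is commonly written in the style of Rocchio (1971). The sketch below is an illustration under stated assumptions: queries and documents are sparse term-weight dictionaries, the weights alpha, beta, and gamma are conventional but arbitrary values, and terms whose weight becomes non-positive are dropped.

```python
from collections import defaultdict


def reformulate_query(query, relevant, irrelevant,
                      alpha=1.0, beta=0.75, gamma=0.15):
    """Return a new term-weight dict built from user relevance judgments."""
    new_query = defaultdict(float)
    for term, weight in query.items():
        new_query[term] += alpha * weight
    for doc in relevant:                       # add the average relevant document
        for term, weight in doc.items():
            new_query[term] += beta * weight / len(relevant)
    for doc in irrelevant:                     # subtract the average irrelevant one
        for term, weight in doc.items():
            new_query[term] -= gamma * weight / len(irrelevant)
    return {term: w for term, w in new_query.items() if w > 0}


# Hypothetical query and user-rated documents.
query = {"web": 1.0}
relevant_docs = [{"web": 1.0, "mining": 1.0}]
irrelevant_docs = [{"web": 1.0, "spider": 1.0}]
new_query = reformulate_query(query, relevant_docs, irrelevant_docs)
```

The reformulated query gains the term "mining" from the relevant document and does not pick up "spider", which appears only in the irrelevant one.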
Information filtering and recommendation techniques also apply user evaluation to improve IR system performance. The main difference is that, although relevance feedback helps users reformulate their search queries, information filtering techniques try to learn about users’ interests
from their evaluations and actions and then to use this information to analyze new documents. Information filtering systems are usually designed to alleviate the problem of information overload in IR systems. The Newsweeder system allows users to give an article a rating from one to five. After a user has rated a sufficient number of articles, the system learns the user’s interests from these examples and identifies Usenet news articles that the system predicts will be interesting to the user (Lang, 1995). Decision trees also have been used for news-article filtering (Green & Edwards, 1996). Another approach is collaborative filtering or recommender systems, in which collaboration is achieved as the system allows users to help one another perform filtering by recording their reactions to documents they read (Goldberg, Nichols, Oki, & Terry, 1992). One example is the GroupLens system, which performs collaborative filtering on Usenet news articles (Konstan, Miller, Maltz, Herlocker, Gordon, & Riedl, 1997). GroupLens recommends articles that may be of interest to a user based on the preferences of other users who have demonstrated similar interests. Many personalization and collaborative systems have been implemented as software agents to help users (Maes, 1994).
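Recommending articles from the preferences of similar users can be sketched as a similarity-weighted average of other users' ratings. This is a simplified illustration and not the GroupLens algorithm: the cosine similarity measure, the function names, and the ratings are assumptions made for the example.

```python
import math


def user_similarity(a, b):
    """Cosine similarity between two users' rating dicts (missing = 0)."""
    dot = sum(a[item] * b[item] for item in set(a) & set(b))
    norm = (math.sqrt(sum(r * r for r in a.values()))
            * math.sqrt(sum(r * r for r in b.values())))
    return dot / norm if norm else 0.0


def predict_rating(ratings, user, item):
    """Predict `user`'s rating of `item` from users who have rated it."""
    weighted_sum, total_weight = 0.0, 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        weight = user_similarity(ratings[user], other_ratings)
        weighted_sum += weight * other_ratings[item]
        total_weight += weight
    return weighted_sum / total_weight if total_weight else None


# Hypothetical ratings on a one-to-five scale.
ratings = {
    "ann": {"a1": 5, "a2": 4, "a3": 1},
    "bob": {"a1": 5, "a2": 5, "a3": 1, "a4": 5},
    "cat": {"a1": 1, "a2": 1, "a3": 5, "a4": 1},
}
predicted = predict_rating(ratings, "ann", "a4")
```

Because ann's past ratings resemble bob's far more than cat's, the predicted rating for article `a4` is pulled toward bob's high rating rather than cat's low one.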
Text classification and text clustering studies have been reported extensively in the traditional IR literature. Text classification is the classification of textual documents into predefined categories (supervised learning), and text clustering groups documents into categories defined dynamically, based on their similarities (unsupervised learning). Although their usefulness continues to be debated (Hearst & Pedersen, 1996; Voorhees, 1985; Wu, Fuller, & Wilkinson, 2001), the use of classification and clustering is based on the cluster hypothesis: “closely associated documents tend to be relevant to the same requests.”
Machine learning is the basis of most text classification and clustering applications. Text classification has been extensively reported at the Association for Computing Machinery's (ACM) Special Interest Group on Information Retrieval (SIGIR) conferences and evaluated on standard test beds. For example, the naive Bayesian method has been widely used (e.g., Koller & Sahami, 1997; Lewis & Ringuette, 1994; McCallum, Nigam, Rennie, & Seymore, 1999).
Using the joint probabilities of words and categories calculated by considering all documents, this method estimates the probability that a document belongs to a given category. Documents with a probability above a certain threshold are considered relevant. The k-nearest neighbor
method is another widely used approach to text classification. For a given document, the k neighbors that are most similar to a given document are first identified (Iwayama & Tokunaga, 1995; Masand, Linoff, & Waltz, 1992). The categories of these neighbors are then used to categorize categorize
the given document. A threshold is used for each category. Neural network programs have also been applied to text classification, usually employing the feedforwardhackpropagation neural network model (Lam & Lee, 1999; Ng, Goh, & Low, 1997; Wiener, Pedersen, & Weigend, 1995). Term frequencies, or tfXidf scores (term frequency multiplied by inverse document frequency), of the terms are used to form a vector (Salton, 19891, which can be used as the input to the network. Using
learning examples, the network will be trained to predict the category of a document. Another new technique used in text classification is support vector machine (SVM), a statistical method that tries to find a hyperplane that best separates two classes Wapnik, 1995, 1998). Joachims first applied SVM to text classification (Joachims, 1998). SVM achieved the best performance on the Reuters-21578 data set for document classification(Yang & Liu, 1999).
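As an illustration of the naive Bayesian method described above, the following minimal sketch estimates word and category probabilities from a few labeled documents and scores a new one. Several details are assumptions not stated in the text: add-one (Laplace) smoothing, log-probabilities for numerical stability, and picking the single best category rather than applying a per-category threshold. The training documents are hypothetical.

```python
from collections import Counter, defaultdict
from math import log

def train(docs):
    """docs: list of (tokens, category). Returns the counts used for scoring."""
    word_counts = defaultdict(Counter)
    cat_counts = Counter()
    vocab = set()
    for tokens, cat in docs:
        cat_counts[cat] += 1
        word_counts[cat].update(tokens)
        vocab.update(tokens)
    return word_counts, cat_counts, vocab

def classify(model, tokens):
    """Return the category with the highest log-probability for `tokens`."""
    word_counts, cat_counts, vocab = model
    total_docs = sum(cat_counts.values())
    best, best_score = None, float("-inf")
    for cat in cat_counts:
        score = log(cat_counts[cat] / total_docs)          # category prior
        denom = sum(word_counts[cat].values()) + len(vocab)
        for t in tokens:
            if t in vocab:                                  # ignore unseen words
                score += log((word_counts[cat][t] + 1) / denom)
        if score > best_score:
            best, best_score = cat, score
    return best

# Hypothetical training data.
model = train([
    (["goal", "match", "team"], "sport"),
    (["team", "goal", "win"], "sport"),
    (["stock", "market", "price"], "finance"),
    (["market", "bank", "price"], "finance"),
])
label = classify(model, ["goal", "match"])
```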
As with text classification, text clustering tries to place documents into different categories based on their similarities. However, in text clustering no predefined categories are set; all categories are dynamically defined. Two types of clustering algorithms are generally used, namely hierarchical clustering and non-hierarchical clustering. The k-nearest neighbor method and Ward's algorithm (Ward, 1963) are the most widely used hierarchical clustering methods. Willett (1988) has provided an excellent review of hierarchical agglomerative clustering algorithms for document retrieval. For non-hierarchical clustering, one of the most common approaches is the K-means algorithm. It uses the technique of local optimization, in which a neighborhood of other partitions is defined for each partition. The algorithm starts with an initial set of clusters, examines each document, searches through the set of clusters, and moves the document to the cluster for which the distance between the document and the centroid is smallest. The centroid position is recalculated every time a document is added. The algorithm stops when all documents have been grouped into the final required number of clusters (Rocchio, 1966). The Single-Pass method (Hill, 1968) is also widely used. However, its performance depends on the order of the input vectors and it tends to produce large clusters (Rasmussen, 1992). Suffix Tree Clustering, a linear time clustering algorithm that identifies phrases common to groups of documents, is another incremental clustering technique (Zamir & Etzioni, 1998). Kraft, Bordogna, and Pasi (1999) and Chen, Mikulic, and Kraft (2000) also have proposed an approach to applying fuzzy clustering to information retrieval systems.
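The assign-to-nearest-centroid loop behind K-means can be sketched as follows. This is the common batch variant, in which centroids are recomputed after every full pass rather than after each single document as described above; the initial centroids and the two-dimensional document vectors are hypothetical.

```python
def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))

def kmeans(points, centroids, max_iter=100):
    """Batch K-means starting from the given initial centroids."""
    centroids = list(centroids)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centroids)), key=lambda c: dist2(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:      # no document changed cluster
            break
        assignment = new_assignment
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                       # keep old centroid if cluster is empty
                centroids[c] = mean(members)
    return assignment, centroids

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
assignment, centroids = kmeans(points, [(0, 0), (10, 10)])
```

The result depends on the initial centroids, which is one reason the method is described as a local optimization.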
Another classification method much used in recent years is the neural network approach. For example, Kohonen’s self-organizing map (SOM), a type of neural network that produces a two-dimensional grid representation for n-dimensional features, has been widely applied in IR
(Kohonen, 1995; Lin, Soergel, & Marchionini, 1991; Orwig, Chen, & Nunamaker, 1997). The self-organizing map can be either multi-layered or single-layered. First, the input nodes, output nodes, and connection weights are initialized. Each element is then represented by a vector of
N terms and is presented to the system. The distance dj between the input and each output node j is computed. A winning node with minimum dj is then selected. After the network stabilizes, the top phrase from each node is selected as the label, and adjacent nodes with the same label are combined to form clusters.
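The winner-selection step can be sketched in a few lines. The weight update shown (moving the winner and its grid neighbors toward the input) is the usual SOM rule but is not spelled out in the text above, so it should be read as an assumption; the one-dimensional grid, learning rate, and radius are hypothetical simplifications.

```python
def best_matching_unit(weights, x):
    """Index j of the output node whose weight vector is closest to input x."""
    def d(j):
        return sum((xi - wi) ** 2 for xi, wi in zip(x, weights[j]))
    return min(range(len(weights)), key=d)

def train_step(weights, x, rate=0.5, radius=1):
    """Move the winner and its neighbours on a 1-D grid toward x."""
    win = best_matching_unit(weights, x)
    for j in range(len(weights)):
        if abs(j - win) <= radius:
            weights[j] = [w + rate * (xi - w) for w, xi in zip(weights[j], x)]
    return win

# Three output nodes with two-term weight vectors (hypothetical values).
weights = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
winner = train_step(weights, [0.9, 1.2], rate=0.5, radius=0)
```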
Web Mining
Web mining research can be divided into three categories: Web content mining, Web structure mining, and Web usage mining.
Web content mining refers to the discovery of useful information from Web content, including text, images, audio, and video. Web content mining research includes resource discovery from the Web (e.g., Chakrabarti, van den Berg, & Dom, 1999; Cho, Garcia-Molina, & Page, 1998), document categorization and clustering (e.g., Zamir & Etzioni, 1999; Kohonen, Kaski, Lagus, Salojarvi, Honkela, Paatero, et al., 2000), and information extraction from Web pages (e.g., Hurst, 2001). Web structure mining studies potential models underlying the link structures of the Web. It usually involves the analysis of in-links and out-links, and has been used for search engine result ranking and other Web applications (e.g., Brin & Page, 1998; Kleinberg, 1998). Web usage mining focuses on using data mining techniques to analyze search or other activity logs to find interesting patterns. One of the main applications of Web usage mining is to develop user profiles (e.g., Armstrong, Freitag, Joachims, & Mitchell, 1995; Wasfi, 1999).
Several major challenges apply to Web mining research. First, most Web documents are in HTML (HyperText Markup Language) format and contain many markup tags, mainly used for formatting. Although Web mining applications must parse HTML documents to deal with these markup tags, the tags can also provide additional information about the document. For example, a bold typeface markup (<b>) may indicate that a term is more important than other terms, which appear
in normal typeface. Such formatting cues have been widely used to determine the relevance of terms (Arasu, Cho, Garcia-Molina, Paepcke, & Raghavan, 2001).
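A minimal sketch of using such formatting cues: the parser below counts terms in an HTML fragment and gives extra weight to text inside bold markup. The weight of 2.0 and the decision to treat the `strong` tag like `b` are assumptions for illustration, not values taken from the cited work.

```python
from collections import Counter
from html.parser import HTMLParser

class WeightedTerms(HTMLParser):
    """Counts terms, weighting text inside <b> or <strong> more heavily."""

    def __init__(self, bold_weight=2.0):
        super().__init__()
        self.bold_weight = bold_weight
        self.depth = 0              # nesting depth of bold tags
        self.weights = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in ("b", "strong"):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in ("b", "strong") and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        w = self.bold_weight if self.depth else 1.0
        for term in data.lower().split():
            self.weights[term] += w

parser = WeightedTerms()
parser.feed("<p>web <b>mining</b> of web data</p>")
term_weights = parser.weights
```

Here "mining" appears once but receives the same weight as "web", which appears twice in normal typeface.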
Second, traditional IR systems often contain structured and well-written documents (e.g., news articles, research papers, metadata), but this is not the case on the Web. Web documents are much more diverse in terms of length, structure, and writing style, and many Web pages contain grammatical and spelling errors. Web pages are also diverse in terms of language and subject matter; one can find almost any language and any topic on the Web. In addition, the Web has many different types of content, including text, images, audio, video, and executables. Numerous formats are in use: HTML; Extensible Markup Language (XML); Portable Document Format (PDF); Microsoft Word; Moving Picture Experts Group, audio layer 3 (mp3); Waveform audio file (wav); RealAudio (ra); and Audio Video Interleaved (avi) animation file, to name just a few. Web applications have to deal with these different formats and retrieve the desired information.
Third, although most documents in traditional IR systems tend to remain static over time, Web pages are much more dynamic; they can be updated every day, every hour, or even every minute. Some Web pages do not in fact have a static form; they are dynamically generated on request, with content varying according to the user and the time of the request. This makes it much more difficult for retrieval systems such as search engines to generate an up-to-date search index of the Web. Another characteristic of the Web, perhaps the most important one, is the hyperlink structure. Web pages are hyperlinked to each other; it is through hyperlinking that a Web page author “cites” other Web pages.
Intuitively, the author of a Web page places a link to another Web page if he or she believes that it contains a relevant topic or is of good quality (Kleinberg, 1998). Anchor text, the underlined, clickable text of an outgoing link in a Web page, also provides a good description of the target page because it represents how other people linking to the page actually describe it. Several studies have tried to make use of anchor text or the adjacent text to predict the content of the target page (Amitay, 1998; Rennie & McCallum, 1999).
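Collecting anchor text is straightforward with a standard HTML parser. The sketch below records each link's target URL together with its clickable text; the sample page fragment and URL are hypothetical, and nested links or missing `href` attributes are handled only in the simplest way.

```python
from html.parser import HTMLParser

class AnchorText(HTMLParser):
    """Collects (href, anchor text) pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None    # href of the link currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())  # normalize whitespace
            self.links.append((self._href, text))
            self._href = None

parser = AnchorText()
parser.feed('<p>See the <a href="http://example.org/ir">IR <b>survey</b></a> page.</p>')
links = parser.links
```

Aggregating the collected text over all pages that link to the same URL gives a description of the target page in other authors' words.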
Lastly, the Web is larger than traditional data sources or document collections by orders of magnitude. The number of indexable Web pages exceeds two billion, and has been estimated to be growing at a rate of roughly one million pages per day (Lawrence & Giles, 1999; Lyman & Varian, 2000). Collecting, indexing, and analyzing these documents presents a great challenge. Similarly, the population of Web users is much larger than that of traditional information systems. Collaboration
among users is more feasible because of the availability of a large user base, but it can also be more difficult because of the heterogeneity of the user base.
In the next section, we review how machine learning techniques for traditional IR systems have been improved and adapted for Web mining applications, based on the characteristics of the Web. Significant work has been undertaken both in academia and industry. However, because most commercial applications do not disclose technical or algorithmic details, our review will focus largely on academic research.
Web Content Mining
Web content mining is mainly based on research in information retrieval and text mining, such as information extraction, text classification and clustering, and information visualization. However, it also includes some new applications, such as Web resource discovery. Some important Web content mining techniques and applications are reviewed in this subsection.
Text Mining for Web Documents
As discussed earlier, text mining is often considered a sub-field of data mining and refers to the extraction of knowledge from text documents (Chen, 2001; Hearst, 1999). Because the majority of documents on the Web are text documents, text mining for Web documents can be considered a sub-field of Web mining, or, more specifically, Web content mining. Information extraction, text classification, and text clustering are examples of text-mining applications that have been applied to Web documents.
Although information extraction techniques have been applied to plain text documents, extracting information from HTML Web pages can present a quite different problem. As has been mentioned, HTML documents contain many markup tags that can identify useful information. However, Web pages are also comparatively unstructured. Instead of a document consisting of paragraphs, a Web page can be a document composed of a sidebar with navigation links, tables with textual and numerical data, capitalized sentences, and repetitive words. The range of
formats and structures is very diverse across the Web. If a system could parse and understand such structures, it would effectively acquire additional information for each piece of text. For example, a set of links with a heading “Link to my friends’ homepages” may indicate a set of people’s
names and corresponding personal home page links. The header row of a table can also provide additional information about the text in the table cells. On the other hand, if these tags are not processed correctly but simply stripped off, the document may become much noisier.
Chang and Lui (2001) used a PAT tree to automatically construct a set of rules for information extraction. The system, called IEPAD (Information Extraction Based on Pattern Discovery), reads an input Web page and looks for repetitive HTML markup patterns. After unwanted patterns have been filtered out, each pattern is used to form an extraction rule expressed as a regular expression. IEPAD has been tested in an experiment to extract search results from different search engines and achieved a high retrieval rate and accuracy. Wang and Hu (2002) used both decision tree and SVM to learn the patterns of table layouts in HTML documents. Layout features, content type, and word group features are combined and used as a document’s features. Experimental results show that both decision tree and SVM can detect tables in HTML documents with high accuracy. Bordogna and Pasi (2001) proposed a fuzzy indexing model that allows users to retrieve sections of structured documents such as HTML and XML. Doorenbos, Etzioni, and Weld (1997) also have applied machine learning in the ShopBot system to extract product information from Web pages.
Some commercial applications also extract useful information from Web pages. For instance, FlipDog (http://www.flipdog.com), developed by WhizBang! Labs (http://www.inxight.com/whizbang), crawls the Web to identify job openings on employer Web sites. Lencom Software (http://www.lencom.com) also developed several products that can extract e-mail addresses and image information from the Web.
Although information extraction analyzes individual Web pages, text classification and text clustering analyze a set of Web pages. Again, Web pages consist mostly of HTML documents and are often noisier and less structured than traditional documents such as news articles and academic abstracts. In some applications the HTML tags are simply stripped from the Web documents and traditional algorithms are then applied to perform text classification and clustering. However, some useful characteristics of Web page design would be ignored. For example, Web page hyperlinks would be lost, but “Home,” “Click here,” and “Contact us” would be included as a document’s features. This creates a unique problem for performing text classification and clustering of Web documents because the format of HTML documents and the structure of the Web provide additional information for analysis. For example, text from neighboring documents has been used in an attempt to improve classification performance. However, experimental results show that this method does not improve performance because, often, too many neighbor terms and too many cross-linkages occur between different classes.
Likewise, text clustering algorithms have been applied to Web applications. The Suffix-Tree Clustering algorithm described earlier has been applied to the search results of the HuskySearch system. The self-organizing map (SOM) technique also has been applied to Web applications. Chen and colleagues (Chen, Fan, Chau, & Zeng, 2001; Chen, Chau, & Zeng, 2002) used a combination of noun phrasing and SOM to cluster the search results of search agents that collect Web pages by meta-searching popular search engines or performing a breadth-first search on particular Web sites. He, Zha, Ding, and Simon (2002) use a combination of content, hyperlink structure, and co-citation analysis in Web document clustering. Two Web pages are considered similar if they have similar content, they point to a similar set of pages, or many other pages point to both of them.
Intelligent Web Spiders
Web spiders, also known as crawlers, wanderers, or Webbots, have been defined as “software programs that traverse the World Wide Web information space by following hypertext links and retrieving Web documents by standard HTTP protocol.” Since the early days of the Web, spiders have been widely used to build the underlying databases of search engines, perform personal searches, archive particular Web sites or even the whole Web, or collect Web statistics (e.g., Broder, Kumar, Maghoul, Raghavan, Rajagopalan, Stata, et al., 2000). Chau and Chen (2003) provide a review of Web spider research.
Although most spiders use simple algorithms such as breadth-first search, some use more advanced algorithms. These spiders are very useful for Web resource discovery. For example, the Itsy Bitsy Spider searches the Web using a best-first search and a genetic algorithm approach. Each URL is modeled as an individual in the initial population. Crossover is defined as extracting the URLs that are pointed to by multiple starting URLs. Mutation is modeled by retrieving random URLs from Yahoo!. Because the genetic algorithm approach is an optimization process, it is well-suited to finding the best Web pages according to particular criteria. Webnaut is another spider that uses a genetic algorithm (Zacharis & Panayiotopoulos, 2001).
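The difference between breadth-first and best-first crawling lies in the order in which the frontier is expanded. The sketch below shows a best-first traversal over a small in-memory link graph; it is not the Itsy Bitsy Spider's algorithm, and the graph, the page scores, and the scoring function itself are hypothetical placeholders for whatever relevance estimate a real spider would compute.

```python
import heapq

def best_first_crawl(graph, scores, seeds, limit):
    """Visit up to `limit` pages, always expanding the best-scoring known URL."""
    # heapq is a min-heap, so scores are negated to pop the highest first.
    frontier = [(-scores.get(u, 0.0), u) for u in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < limit:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for nxt in graph.get(url, []):        # out-links of the fetched page
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-scores.get(nxt, 0.0), nxt))
    return visited

graph = {"seed": ["a", "b"], "a": ["c"], "b": ["d"], "c": [], "d": []}
scores = {"seed": 0.5, "a": 0.2, "b": 0.9, "c": 0.1, "d": 0.8}
order = best_first_crawl(graph, scores, ["seed"], limit=4)
```

A breadth-first spider would visit "a" and "b" before any of their children; the best-first order follows the promising branch through "b" first.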
Other advanced search algorithms have been used in personal spiders. Yang, Yen, and Chen (2000) applied hybrid simulated annealing in a personal spider application. Focused Crawler located Web pages relevant to a predefined set of topics based on example pages provided by the user (Chakrabarti, van den Berg, & Dom, 1999). It determined the relevance of each page using a naive Bayesian model and the analysis of the link structures among the Web pages collected using the HITS algorithm (discussed in more detail in the section on Web structure mining). These values are used to judge which URL links to follow. Another similar system, Context Focused Crawler, also uses a naive Bayesian classifier to guide the search process (Diligenti, Coetzee, Lawrence, Giles, & Gori, 2000).
Chau and Chen (in press) apply the Hopfield Net spreading activation to collect Web pages in particular domains. Each Web page is represented as a node in the network and hyperlinks are represented simply as links between the nodes. Each node is assigned an activation score, which is a weighted sum of a content score and a link score. The content score is calculated by comparing the content of the page with a domain-specific lexicon, and the link score is based on the number of outgoing links in a page. Each node also inherits the scores from its parent nodes. Nodes are then activated in parallel and activation values from different sources are combined for each individual node until the activation scores of nodes on the network reach a stable state (convergence). Relevance feedback also has been applied in spiders (Balabanovic & Shoham, 1995; Vrettos & Stafylopatis, 2001). These spiders determine the next URL to visit based on the user’s ratings of the relevance of the Web pages returned.
Google Webmaster Tools reports information about the pages you own that have been crawled by the Googlebot spider. Its Crawl section includes a robots.txt Tester; when the tester is shown in green, Google's Web crawler, Googlebot, has crawled the given page. Clicking robots.txt Tester opens the tester page, where you can see the sitemap that Googlebot uses to locate the pages of your site and whether Googlebot is allowed to crawl it. In the author's case, the site is allowed.
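The allow/disallow decision that a robots.txt file encodes can also be checked locally. The sketch below uses Python's standard `urllib.robotparser` module on an inline robots.txt; the rules and the `example.com` URLs are hypothetical and do not reproduce what the Google tool displays.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot may crawl everything except /private/,
# and all other robots are excluded entirely.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

allowed = parser.can_fetch("Googlebot", "http://example.com/index.html")
blocked = parser.can_fetch("Googlebot", "http://example.com/private/a.html")
other = parser.can_fetch("SomeOtherBot", "http://example.com/index.html")
```

A well-behaved spider performs this check before requesting each URL.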
Multilingual Web Mining
The number of non-English documents on the Web continues to grow; more than 30 percent of Web pages are in a language other than English. In order to extract non-English knowledge from the Web, Web mining systems have to deal with issues in language-specific text processing. One might think that this would not be a problem because the base algorithms behind most machine learning systems are language-independent. Most algorithms, such as text classification and clustering, need only a set of features (a vector of keywords) for the learning process. However, the algorithms usually depend on some phrase segmentation and extraction programs to generate a set of features or keywords to represent Web documents. Many existing extraction programs, especially those employing a linguistic approach (e.g., Church, 1988), are language-dependent and work only with English texts. In order to perform analysis on non-English documents, Web mining systems must use the corresponding phrase extraction program for each language. Other learning algorithms, such as information extraction and entity extraction, also have to be tailored for different languages.
Some segmentation and extraction programs are language-independent. These programs usually employ a statistical or a machine learning approach. For example, the mutual-information-based PAT-Tree algorithm is a language-independent technique for key phrase extraction and has been tested on Chinese documents (Chien, 1997; Ong & Chen, 1999). Similarly, Church and Yamamoto (2001) use suffix arrays to perform phrase extraction. Because these programs do not rely on specific linguistic rules, they can be easily modified to work with different languages.
Web Visualization
Because it is often difficult to extract useful content from the Web, visualization tools have been used to help users maintain a “big picture” of a set of retrieval results from search engines, particular Web sites, a subset of the Web, or even the whole Web. Various techniques have been developed in the past decade. For example, many systems visualize the Web as a tree structure based on the outgoing links of a set of starting nodes (e.g., Huang, Eades, & Cohen, 1998). The best-known example of this approach is the hyperbolic tree developed by Xerox PARC (Lamping & Rao, 1996), which employs the “focus+context” technique to show Web sites as a tree structure using a hyperbolic view. Users can focus on the document they are looking at and maintain an overview of the context at the same time. A map is another metaphor widely used for Web visualization. The ET-Map provides a visualization of the manually cataloged Entertainment hierarchy of Yahoo! as a two-dimensional map (Chen, Schuffels, & Orwig, 1996). Some 110,000 Web pages are clustered into labeled regions based on the self-organizing map approach, in which larger regions represent more important topics, and regions close to each other represent topics that are similar (Lin, Chen, & Nunamaker, 2000). The WEBSOM system also utilizes the SOM algorithm to cluster over a million Usenet newsgroup documents (Kohonen, 1995; Lagus, Honkela, Kaski, & Kohonen, 1999). Other examples of Web visualization include WebQuery, which uses a bull's-eye view to visualize Web search results based on link structure (Carrière & Kazman, 1997), WebPath, which visualizes a user’s trail as he or she browses the Web (Frécon & Smith, 1998), and three-dimensional models such as Natto View (Shiozawa & Matsushita, 1997) and Narcissus (Hendley, Drew, Wood, & Beale, 1995). Dodge and Kitchin (2001) provide a comprehensive review of cybermaps generated since the inception of the Internet.
In these visualization systems, machine learning techniques are often used to determine how Web pages should be placed in the 2-D or 3-D space. One example is the SOM algorithm described in the section on pre-Web IR (Chen et al., 1996). Web pages are represented as vectors of keywords and used to train the network that contains a two-dimensional grid of output nodes. The distance between the input and each output node is then computed and the node with the least distance is selected. After the network is trained through repeated presentation of all inputs, the documents are submitted to the trained network and each region is labeled by a phrase, the key concept that best represents the cluster of documents in that region. Multidimensional scaling (MDS) is another method that can position documents on a map. It tries to map high dimensionality (e.g., document vectors) to low dimensionality (usually 2-D) by solving a minimization problem (Cox & Cox, 1994). It has been tested with document mapping and the results are encouraging (McQuaid, Ong, Chen, & Nunamaker, 1999).
The Semantic Web
A recent significant extension of the Web is the Semantic Web (Berners-Lee, Hendler, & Lassila, 2001), which seeks to add metadata to describe data and information, based on such standards as RDF (Resource Description Framework) and XML. The idea is that Web documents will no longer be unstructured text; they will be labeled with meaning that can be understood by computers. Machine learning can play three important roles in the Semantic Web. First, machine learning
can be used to automatically create the markup or metadata for existing unstructured textual documents on the Web. It is very difficult and time-consuming for Web page authors to generate Web pages manually, according to the Semantic Web representation. To address this problem, information extraction techniques, such as entity extraction, can be applied to automate or semi-automate tasks such as identifying entities in Web pages and generating the corresponding XML tags. Second,
machine learning techniques can be used to create, merge, update, and maintain ontologies. Ontology, the explicit representation of knowledge combined with domain theories, is one of the key elements in the Semantic Web (Berners-Lee et al., 2001; Fensel & Musen, 2001). Maedche and Staab (2001) propose a framework for knowledge acquisition using machine learning. In that framework, machine learning techniques, such as association rule mining or clustering, are used to extract knowledge from Web documents in order to create new ontologies or improve existing ones. Third, machine learning can understand and perform reasoning on the metadata provided by the Semantic Web in order to extract knowledge from the Web more effectively. The documents in the Semantic Web are much more precise, more structured, and less “noisy” than the general, syntactic Web. The Semantic Web also provides context and background information for analyzing Web pages. It is believed that the Semantic Web can greatly improve the performance of Web mining systems (Berendt, Hotho, & Stumme, 2002).
Web Structure Mining
A practical implementation for any Web page can be seen in Google Webmaster Tools, which can fetch a page's HTML code and locate structure within the page using tools such as the Structured Data Testing Tool and the Structured Data Markup Helper. Google Webmaster Tools presents structured data as follows.
Structured Data
Promote Your Content with Structured Data Markup
"Structured data markup" is a standard way to annotate your content so machines can understand it. When your web pages include structured data markup, Google (and other search engines) can use that data to index your content better, present it more prominently in search results, and surface it in new experiences like voice answers, maps, and Google Now.
Structured data markup makes your content eligible for two kinds of Google features:
- Enhanced Presentation in Search Results: By including basic structured data appropriate to your content, your site can enhance its search results with Rich Snippets, Breadcrumbs, or a Sitelinks Search Box.
- Answers from the Knowledge Graph: If you're the authority for certain content, Google can treat the structured data on your site as factual and import it into the Knowledge Graph, where it can power prominent answers in Search and across Google properties. Features are available for authoritative data about organizations, events, movie reviews, and music/video play actions.
We now turn to the topic of Web structure mining in greater detail.
In recent years, Web link structure has been widely used to infer important information about Web pages. Web structure mining has been largely influenced by research in social network analysis and citation analysis (bibliometrics). Citations (linkages) among Web pages are usually indicators of high relevance or good quality. We use the term in-links to indicate the hyperlinks pointing to a page and the term out-links to indicate the hyperlinks found in a page. Usually, the larger the number
of in-links, the more useful a page is considered to be. The rationale is that a page referenced by many people is likely to be more important than a page that is seldom referenced. As in citation analysis, an often-cited article is presumed to be better than one that is never cited. In addition, it is reasonable to give a link from an authoritative source (such as Yahoo!) a higher weight than a link from an unimportant personal home page.
By analyzing the pages containing a URL, we can also obtain the anchor text that describes it. Anchor text shows how other Web page authors annotate a page and can be useful in predicting the content of the target page. Several algorithms have been developed to address this issue.
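Among various Web-structure mining algorithms, PageRank and HITS (Hyperlink-Induced Topic Search) are the two most widely used. The PageRank score is computed by weighting each in-link to a page proportionally to the quality of the page containing the in-link (Brin & Page, 1998). The qualities of these referring pages also are determined by PageRank. Thus, the PageRank of a page p is calculated recursively as follows:
$$\mathrm{PageRank}(p) = (1 - d) + d \times \sum_{\text{all } q \text{ linking to } p} \frac{\mathrm{PageRank}(q)}{c(q)}$$
where d is a damping factor between 0 and 1 and c(q) is the number of outgoing links in page q.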
A Web page has a high PageRank score if it is linked from many other pages, and the scores will be even higher if these referring pages are also good pages (pages that have high PageRank scores). It is also interesting to note that the PageRank algorithm follows a random walk model: the PageRank of a page is proportional to the probability that a random surfer clicking on random links will arrive at that page.
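The recursive definition can be evaluated by simple iteration. The sketch below assumes the commonly cited form PageRank(p) = (1 - d) + d × Σ PageRank(q) / c(q), summed over the pages q that link to p, where c(q) is the number of out-links of q. The damping factor d = 0.85, the fixed iteration count, and the three-page graph are assumptions chosen for illustration; every page in the example has at least one out-link, so pages without out-links are not handled.

```python
def pagerank(out_links, d=0.85, iterations=50):
    """Iteratively evaluate PageRank on a dict mapping page -> list of out-links."""
    pages = list(out_links)
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # Each referring page q passes on an equal share of its own rank.
            incoming = sum(
                rank[q] / len(out_links[q])
                for q in pages
                if p in out_links[q]
            )
            new_rank[p] = (1 - d) + d * incoming
        rank = new_rank
    return rank

# Hypothetical graph: a links to b and c, b links to c, c links back to a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

Page "c" ends up with the highest score because it is linked from both other pages, and "a" outranks "b" because its single in-link comes from the highly ranked "c".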
Kleinberg (1998) proposed the HITS algorithm, which is similar to PageRank. In the HITS algorithm, authority pages are defined as high-quality pages related to a particular topic or search query. Hub pages are those that are not necessarily authorities themselves but provide pointers to other authority pages. A page to which many others point should be a good authority, and a page that points to many others should be a good hub. Based on this intuition, two scores are calculated in the HITS
algorithm for each Web page: an authority score and a hub score, which are calculated as follows:
$$\mathrm{AuthorityScore}(p) = \sum_{\text{all } q \text{ linking to } p} \mathrm{HubScore}(q)$$
$$\mathrm{HubScore}(p) = \sum_{\text{all } r \text{ linked from } p} \mathrm{AuthorityScore}(r)$$
In other words, a page with a high authority score is one pointed to by many good hubs, and a page with a high hub score is one that points to many good authorities.
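The mutual reinforcement between hubs and authorities can be sketched as an iteration in which each authority score is the sum of the hub scores of the pages pointing to it, and each hub score is the sum of the authority scores of the pages it points to. Normalizing the scores after every round and running a fixed number of iterations are assumptions of this sketch, and the four-page graph is hypothetical.

```python
from math import sqrt

def hits(out_links, iterations=50):
    """Return (authority, hub) score dicts for a page -> out-links mapping."""
    pages = list(out_links)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that link to p.
        auth = {p: sum(hub[q] for q in pages if p in out_links[q]) for p in pages}
        # Hub: sum of authority scores of the pages that p links to.
        hub = {p: sum(auth[r] for r in out_links[p]) for p in pages}
        for scores in (auth, hub):
            norm = sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return auth, hub

# Hypothetical graph: two hub-like pages pointing at two content pages.
graph = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
authority, hub_score = hits(graph)
```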
Following the success of the PageRank and HITS algorithms, other similar algorithms also have been proposed. Examples include the Stochastic Approach to Link-Structure Analysis (SALSA) algorithm (Lempel & Moran, 2001) and the Probabilistic HITS (PHITS) algorithm (Cohn & Chang, 2000). Web structure mining techniques are often used to enhance the performance of Web applications. For instance, PageRank has been shown to be very effective for ranking search results in the commercial search engine Google (http://www.google.com) (Brin & Page, 1998). It also has been used as a measure to guide search engine spiders, where URLs with higher PageRank are visited first (Cho et al., 1998). The HITS algorithm also has been used in various Web applications. One example is the Clever search engine (Chakrabarti, Dom, Kumar, Raghavan, Rajagopalan, Tomkins, et al., 1999), which achieves a higher user evaluation than the manually compiled directory of Yahoo!. Bharat and Henzinger (1998) have added several extensions to the basic HITS algorithm, such as modifying how much a node influences its neighbors based on a relevance score. One of the major drawbacks shared by most Web structure analysis algorithms is their high computational requirement, because the scores often have to be calculated iteratively (Haveliwala, 1999; Kleinberg, 1998).
Another application of Web structure mining is to understand the structure of the Web as a whole. Broder et al. (2000) analyzed the graph structure of a collection of 200 million Web pages and 1.5 billion links. Their results suggest that the core of the Web is a strongly connected component and that the Web’s graph structure is shaped like a bow tie.
The strongly connected component (SCC) comprises around 28 percent of the Web. Another group that consists of 21 percent of Web pages is called IN, in which every Web page contains a direct path to the SCC. Another 21 percent of Web pages are in the group OUT. For every page
in OUT, a direct path from SCC links to it. Twenty-two percent of Web pages are in the group TENDRILS, which consists of pages hanging off IN and OUT but without a direct path to SCC. The remaining Web pages, accounting for around 8 percent of the Web, are isolated components
that are not connected to the other four groups.
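The bow-tie groups can be computed on a small graph with two reachability searches from a page known to lie in the core: pages reachable in both directions form the SCC, pages that can only reach the core form IN, and pages that can only be reached from it form OUT. The sketch below lumps tendrils and isolated components together as "OTHER" for brevity; the graph and the choice of core page are hypothetical.

```python
def reachable(graph, start):
    """Set of nodes reachable from `start` by following edges in `graph`."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def bow_tie(graph, core_page):
    """Split pages into SCC, IN, OUT, and OTHER relative to `core_page`."""
    reverse = {}
    for src, targets in graph.items():
        reverse.setdefault(src, [])
        for dst in targets:
            reverse.setdefault(dst, []).append(src)
    forward = reachable(graph, core_page)      # pages the core can reach
    backward = reachable(reverse, core_page)   # pages that can reach the core
    scc = forward & backward
    return {
        "SCC": scc,
        "IN": backward - scc,
        "OUT": forward - scc,
        "OTHER": set(reverse) - forward - backward,
    }

graph = {
    "in1": ["s1", "t1"],    # t1 hangs off IN without reaching the core
    "s1": ["s2"],
    "s2": ["s1", "out1"],
    "out1": [],
    "iso": [],
}
groups = bow_tie(graph, "s1")
```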
About schema.org
Google and other major search engines support the schema.org vocabulary for structured data. This vocabulary defines a standard set of type names and property names. For example, http://schema.org/MusicEvent indicates a concert performance, with startDate and location properties to specify the concert's key details.
Data in the schema.org vocabulary may be embedded in an HTML page using any of three alternative formats: microdata, RDFa, and JSON-LD.
- Microdata and RDFa define new HTML attributes that let you indicate which schema.org field names correspond to which user-visible text on the page.
- JSON-LD is the newest and simplest markup format: it lets you embed a block of JSON data inside a script tag anywhere in the HTML. Since the data does not have to be interleaved with the user-visible text, it's much easier to express nested data items (say, the Country of a PostalAddress of a MusicVenue of an Event). Also, Google can read JSON-LD data even when it is dynamically injected into the page's contents, such as by JavaScript code or embedded "widgets".
Google is in the process of adding JSON-LD support to more markup-powered features. So far, JSON-LD is supported for all Knowledge Graph features, sitelink search boxes, and Event Rich Snippets; Google recommends the use of JSON-LD for those features. For the remaining Rich Snippets types and breadcrumbs, Google recommends the use of microdata or RDFa.
Customizing Your Knowledge Graph
The Knowledge Graph is Google's system for organizing information about millions of well-known "entities": people, places, and organizations in the real world. Google's algorithms merge information about entities from many data sources. For some types of information, though, the best source of data is the entity itself. Specifically, companies and people can now customize their own data in the Knowledge Graph by adding structured data markup to their official website. The following types of data may be customized:
1. Logos
2. Company contact numbers
3. Social profile links
Each of these three items can then appear in the customized Knowledge Graph panel shown in Google Search results.
Specifying Your Organization's Logo
You can specify which image Google should use as your organization's logo in search results and the Knowledge Graph. To do this, add schema.org Organization markup to your official website that identifies the location of your preferred logo.
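As a rough illustration of what such markup could look like, the sketch below builds an Organization record in JSON-LD and wraps it in a script tag. The helper name `organization_logo_markup` and the example.com URLs are hypothetical; `url` and `logo` are properties of the schema.org Organization type, but check the exact requirements against Google's current documentation.

```python
import json

def organization_logo_markup(site_url, logo_url):
    """Build a JSON-LD script block naming an organization's preferred logo."""
    record = {
        "@context": "http://schema.org",
        "@type": "Organization",
        "url": site_url,    # the official website
        "logo": logo_url,   # location of the preferred logo image
    }
    body = json.dumps(record, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"

# Placeholder URLs, for illustration only.
logo_snippet = organization_logo_markup(
    "http://www.example.com",
    "http://www.example.com/images/logo.png",
)
print(logo_snippet)
```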
Corporate Contacts
Use corporate contact markup on your official website to add your company's contact information to the Google Knowledge panel in some searches. Knowledge panels can prominently display your customer service phone number. Get started quickly with these instructions for formatting and publishing the proper markup code if your company has service phone numbers that are national or global in scope.
Company phone numbers
Use structured data markup embedded in your public website to specify your preferred phone numbers. You can specify the following types of phone numbers:
- customer service
- technical support
- billing support
- bill payment
- sales
- reservations
- credit card support
- emergency
- baggage tracking
- roadside assistance
- package tracking
You can specify more categories of contact numbers than these, but they aren’t currently included in Google search results.
For each number, you can also specify:
- is the number toll-free?
- is the number for the hearing-impaired?
- does the number serve a specific country or countries?
Adding structured markup to your site
The schema.org vocabulary and JSON-LD markup format are an open standard for embedding structured data in web pages. In order for Google to recognize structured data as company contact numbers, make sure you fulfill these requirements:
- Publish markup on a page on your company’s official website
- Pages with markup must not be blocked from crawling by robots.txt directives
- Include an Organization record in your markup that includes both:
  - Your organization's official URL
  - One or more ContactPoint records
The Organization record is specified first. The only required properties on the Organization are url, which must be the home page of the company’s official site, and contactPoint. The value of contactPoint must be a list of nested ContactPoint records. Google considers these properties on each ContactPoint:
Property | Value specification | Example values |
---|---|---|
@type | Required to be "ContactPoint". | "ContactPoint" |
telephone | Required. An internationalized version of the phone number, starting with the “+” symbol and country code (+1 in the US and Canada). | "+1-800-555-1212" "+44-2078225951" |
contactType | Required to be one of the values listed at right. These values are not case sensitive. (Additional contact types may be supported later.) | "customer support" "technical support" "billing support" "bill payment" "sales" "reservations" "credit card support" "emergency" "baggage tracking" "roadside assistance" "package tracking" |
areaServed | Optional. The geographical region served by the number, specified as a Schema.org/AdministrativeArea. Countries may be specified concisely using just their standard ISO-3166 two-letter code, as in the examples at right. If omitted, the number is assumed to be global. | "US" "GB" ["US","CA","MX"] |
contactOption | Optional details about the phone number. Currently only the two values shown at right are supported. | "TollFree" "HearingImpairedSupported" |
availableLanguage | Optional details about the language spoken. Languages may be specified by their common English name. If omitted, the language defaults to English. | "English" "Spanish" ["French", "English"] |
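The table above can be turned into a small builder that rejects values the table does not allow. This is only a sketch: the helper names `contact_point` and `organization_with_contacts` are hypothetical, the phone number and URL are placeholders taken from or modeled on the examples above, and the checks cover only the rules stated in the table.

```python
import json

# Values taken from the contactType and contactOption rows of the table.
ALLOWED_CONTACT_TYPES = {
    "customer support", "technical support", "billing support",
    "bill payment", "sales", "reservations", "credit card support",
    "emergency", "baggage tracking", "roadside assistance", "package tracking",
}
ALLOWED_OPTIONS = {"TollFree", "HearingImpairedSupported"}

def contact_point(telephone, contact_type, area_served=None,
                  contact_option=None, available_language=None):
    """Build one ContactPoint record, checking the required fields."""
    if not telephone.startswith("+"):
        raise ValueError("telephone must start with '+' and a country code")
    if contact_type.lower() not in ALLOWED_CONTACT_TYPES:  # not case sensitive
        raise ValueError("unsupported contactType: " + contact_type)
    point = {"@type": "ContactPoint", "telephone": telephone,
             "contactType": contact_type}
    if area_served is not None:
        point["areaServed"] = area_served
    if contact_option is not None:
        if contact_option not in ALLOWED_OPTIONS:
            raise ValueError("unsupported contactOption: " + contact_option)
        point["contactOption"] = contact_option
    if available_language is not None:
        point["availableLanguage"] = available_language
    return point

def organization_with_contacts(url, points):
    """Wrap ContactPoint records in the Organization record described above."""
    return {"@context": "http://schema.org", "@type": "Organization",
            "url": url, "contactPoint": list(points)}

contact_record = organization_with_contacts(
    "http://www.example.com",
    [contact_point("+1-800-555-1212", "customer support",
                   area_served=["US", "CA"], contact_option="TollFree",
                   available_language=["English", "Spanish"])],
)
print(json.dumps(contact_record, indent=2))
```

The printed JSON would still need to be placed inside a script tag of type application/ld+json before being published on a page.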
Specify your social profiles to Google
Include your social profile in search results
Use markup on your official website to add your social profile information to the Google Knowledge panel in some searches. Knowledge panels can prominently display your social profile information. Get started quickly with these instructions for formatting and publishing the proper markup code.
Social Profiles
Use structured data markup embedded in your public website to specify your preferred social profiles. You can specify these types of social profiles:
- Google+
- YouTube
- Myspace
Google algorithms process the social profiles you specify and then display the most relevant ones in response to users' queries. (For sites that have a verification process, Google will only show verified profiles.) The social profiles in your markup must correspond to the ones that users can see on the same page.
Adding structured markup to your site
The schema.org vocabulary and JSON-LD markup format are an open standard for embedding structured data in web pages. In order for Google to recognize structured data as social profiles, make sure you fulfill these requirements:
- Publish markup on a page on your official website
- Pages with markup must not be blocked to the Googlebot by robots.txt
- Include a Person or Organization record in your markup with:
  - "url" = the url of your official website
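A record of this kind might be assembled as in the sketch below. Note the assumptions: the text above names only the "url" property, so listing the profile pages under `sameAs` (a standard schema.org property for pages that identify the same entity) is an assumption about how profiles would be expressed, and the helper name, organization name, and profile URLs are all hypothetical placeholders.

```python
import json

def social_profile_markup(record_type, name, url, profile_urls):
    """Build a Person or Organization record that lists social profile pages."""
    if record_type not in ("Person", "Organization"):
        raise ValueError("record_type must be 'Person' or 'Organization'")
    return {
        "@context": "http://schema.org",
        "@type": record_type,
        "name": name,
        "url": url,                      # the official website
        "sameAs": list(profile_urls),    # assumption: profiles go in sameAs
    }

# Placeholder profile URLs, for illustration only.
social_record = social_profile_markup(
    "Organization",
    "Example Company",
    "http://www.example.com",
    ["https://plus.google.com/+ExampleCompany",
     "https://www.youtube.com/user/examplecompany"],
)
print(json.dumps(social_record, indent=2))
```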
Testing and Publishing Your Markup
The block of structured data you produce, enclosed within the <script type="application/ld+json"> ... </script> tags, can be inserted into any HTML page on your company's official website that is crawled and indexed by Google. Within the page, it may be placed in either the <head> or <body> region. Either way, it won't affect how your document appears in users' web browsers.
To verify that your markup is well-formed and can be processed by Google, paste the HTML source of your marked-up page (or just the <script> block) into Google's Structured Data Testing Tool.
When Google next crawls the page, its indexing algorithms will process the profiles from your markup and make them eligible to be used in search results. You can ask Google to crawl the page.
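Before using the testing tool, a quick local check can catch malformed JSON. The sketch below uses only the Python standard library to pull every JSON-LD block out of a page and parse it; the class and function names are hypothetical, and the check only confirms that the JSON is syntactically valid, not that Google will accept the record.

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._chunks))
            self._in_jsonld = False

def parse_jsonld_blocks(html):
    """Return the parsed records; json.loads raises ValueError on bad JSON."""
    extractor = JsonLdExtractor()
    extractor.feed(html)
    extractor.close()
    return [json.loads(block) for block in extractor.blocks]

# Placeholder page with one record in the head region.
sample_page = (
    '<html><head><script type="application/ld+json">'
    '{"@context": "http://schema.org", "@type": "Organization", '
    '"url": "http://www.example.com"}'
    "</script></head><body><p>Welcome</p></body></html>"
)
jsonld_records = parse_jsonld_blocks(sample_page)
```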
Web Usage Mining
Web servers, proxies, and client applications can quite easily capture data about Web usage. Web server logs contain information about every visit to the pages hosted on a server. Some of the useful information includes what files have been requested from the server, when they were requested, the Internet Protocol (IP) address of the request, the error code, the number of bytes sent to the user, and the type of browser used. Web servers can also capture referrer logs, which show the page from which a visitor makes the next request. Client-side applications, such as Web browsers or personal agents, can also be designed to monitor and record a user’s actions. By performing analysis on Web usage data (sometimes referred to as clickstream analysis), Web mining systems can discover useful knowledge about a system’s usage characteristics and the users’ interests. This knowledge has various applications, such as personalization and collaboration in Web-based systems, marketing, Web site design, Web site evaluation, and decision support (Chen & Cooper, 2001; Marchionini, 2002).
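The fields listed above map naturally onto one line of a server log. The sketch below assumes the widely used combined log format; the text does not specify a format, so the regular expression, the function name `parse_log_line`, and the sample line are illustrative assumptions rather than a description of any particular server.

```python
import re
from datetime import datetime

# One entry in the combined log format (an assumption, see above):
# ip, two ignored fields, timestamp, request line, status, bytes,
# and optionally the referrer and the browser's user-agent string.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Month abbreviations are parsed with the current locale (English assumed).
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    return entry

# Made-up sample line using a documentation IP address.
sample_line = (
    '192.0.2.10 - - [10/Oct/2000:13:55:36 -0700] '
    '"GET /index.html HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"'
)
log_entry = parse_log_line(sample_line)
```

Once lines are parsed into records like this, grouping them by IP address and time gap is a common first step toward reconstructing user sessions, subject to the identification problems discussed in the next section.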
Pattern Discovery and Analysis
One of the major goals of Web usage mining is to reveal interesting trends and patterns. Such patterns and statistics can often provide important knowledge about a company’s customers or the users of a system. Srivastava, Cooley, Deshpande, and Tan (2000) provided a framework for Web usage mining, consisting of three major steps: preprocessing, pattern discovery, and pattern analysis. As in other data mining applications, preprocessing involves data cleansing. However, one of the major challenges faced by Web usage mining applications is that Web server log data are anonymous, making it difficult to identify users and user sessions from the data. Techniques like Web cookies and user registration have been used in some applications, but each method has its shortcomings (Pitkow, 1997). In pattern discovery and analysis, generic machine learning and data mining techniques, such as association rule mining, classification, and clustering, can often be applied. For instance, Yan, Jacobsen, Garcia-Molina, and Dayal (1996) performed clustering on Web log data to identify users who have accessed similar Web pages.
Web usage mining has been used for various purposes. For example, Buchner and Mulvenna (1998) proposed a knowledge discovery process for mining marketing intelligence from Web data. Data such as Web traffic patterns also can be extracted from Web usage logs in order to improve the performance of a Web site (Cohen, Krishnamurthy, & Rexford, 1998). Many commercial products have been developed to support analysis and mining of Web site usage and Web log data. Examples of these applications include WebTrends developed by NetIQ (http://www.netiq.com/webtrends), WebAnalyst by Megaputer (http://www.megaputer.com/products/wa), NetTracker by Sane Solutions (http://www.sane.com/products/NetTracker), and NetGenesis by Customercentric (http://www.customercentricsolutions.com/content/solutions/ent_web_analytics.cfm).
Although most Web usage analysis applications focus on single Web sites, the advertising company Doubleclick (http://www.doubleclick.com), which sells and administers two billion online advertisements per day, collects gigabytes of clickstream data across different Web sites.
Search engine transaction logs also provide valuable knowledge about user behavior in Web searching. Various analyses have been performed on the transaction logs of the Excite search engine (http://www.excite.com) (Jansen, Spink, & Saracevic, 2000; Spink & Xu, 2000; Spink, Wolfram, Jansen, & Saracevic, 2001). Silverstein, Henzinger, Marais, and Moricz (1999) also conducted a study of 153 million unique search queries collected from the AltaVista search engine (http://www.altavista.com). Some of the interesting findings from these analyses include the set of most popular words used by the public in Web search queries, the average length of a search query, the use of Boolean operators in queries, and the average number of result pages viewed by users. Such information is particularly useful to researchers trying to reach a better understanding of users’ Web searching and information-seeking behaviors and hoping to improve the design of Web search systems.
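Several of the statistics mentioned above can be computed with a few lines of code once the queries are available as plain strings. The sketch below is a simplified illustration, not the method of any of the cited studies: it assumes whitespace-separated terms, treats only uppercase AND, OR, and NOT as Boolean operators, and runs on a handful of made-up queries.

```python
from collections import Counter

BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}  # assumed operator syntax

def query_log_stats(queries, top_n=3):
    """Compute average query length, Boolean usage, and the most popular terms."""
    term_counts = Counter()
    total_terms = 0
    boolean_queries = 0
    for query in queries:
        tokens = query.split()
        if any(token in BOOLEAN_OPERATORS for token in tokens):
            boolean_queries += 1
        # Operators are not counted as search terms.
        terms = [t.lower() for t in tokens if t not in BOOLEAN_OPERATORS]
        total_terms += len(terms)
        term_counts.update(terms)
    count = len(queries)
    return {
        "average_length": total_terms / count if count else 0.0,
        "boolean_share": boolean_queries / count if count else 0.0,
        "top_terms": term_counts.most_common(top_n),
    }

# Made-up queries, for illustration only.
sample_queries = [
    "web mining",
    "cats AND dogs",
    "free music",
    "web search engines",
]
query_stats = query_log_stats(sample_queries)
```

For these four queries the average length is 2.25 terms, one query in four uses a Boolean operator, and "web" is the most frequent term.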