This document lists several useful tools and resources for Web mining, including tools for HTML/XML parsing, Web crawling, visualization, text extraction, classification, and clustering.
 

Maintainer(s):
   Noppadon (Koo^)  noppadon@cs.utexas.edu

==================
Web Tech Tutorials
==================
Contributor(s):
   Noppadon (Koo^)    noppadon@cs.utexas.edu
 

There are many Web technologies out there: too much to learn, too little time. These two Web sites provide concise, practical explanations and demonstrations of several important Web technologies, e.g. HTML, XML, CSS, JavaScript, WAP, DOM, ASP, etc.

http://www.w3schools.com/
http://www.w3scripts.com/

The W3C site, the most authoritative resource for Web technologies, also contains quite a few tutorial links:

http://www.w3c.org/
 

=========
HTML Tidy
=========
Contributor(s):
   Noppadon (Koo^)    noppadon@cs.utexas.edu
   Dr. Ray Mooney
 

HTML Tidy automatically tidies up messy (and possibly invalid) HTML documents and generates valid HTML or well-formed XML files. The cleaned-up result makes parsing and processing easier.

http://www.w3.org/People/Raggett/tidy/
 

Getting and Compiling HTML Tidy in Linux
----------------------------------------
1. Go to the URL above and scroll down to the source code section.
2. Download the file ending in .tgz (the gzipped tar file of the source code,
with Unix line ends), e.g. tidy4aug00.tgz.
3. Unpack it with gunzip and tar -xf:
      gunzip tidy4aug00.tgz
      tar -xf tidy4aug00.tar
4. Go to the directory with the Tidy source files and run:
      make all
5. The build produces an executable file called "tidy". Run it as described under "Running Tidy" below.

Running Tidy
------------
tidy [[options] filename]*

Examples:

   tidy -f index.err index.html > indextidy.html
   tidy -asxml -f indexxml.err index.html > indextidy.xml

   The first example produces a valid HTML file; the second produces a well-formed XML file.

Useful options:

-f [errorfile]  to write errors and warnings into a file
-asxml          to convert html to well-formed xml
-help           to get more options
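
When many pages need cleaning, the example commands above can be driven from a script. The following is a minimal Python sketch, not part of Tidy itself. It assumes the tidy executable is on the PATH; the function names (tidy_command, run_tidy) and the output naming scheme are made up for illustration. Only the -f and -asxml options listed above are used.

```python
import subprocess
import sys
from pathlib import Path


def tidy_command(src, as_xml=False, tidy="tidy"):
    """Build the argument list for one Tidy run (hypothetical helper)."""
    src = Path(src)
    err = src.with_suffix(".err")        # e.g. index.html -> index.err
    cmd = [tidy]
    if as_xml:
        cmd.append("-asxml")             # convert HTML to well-formed XML
    cmd += ["-f", str(err), str(src)]    # -f: write errors and warnings to a file
    return cmd


def run_tidy(src, as_xml=False):
    """Run Tidy on src and save its standard output next to the input file."""
    src = Path(src)
    suffix = ".xml" if as_xml else ".html"
    out = src.with_name(src.stem + "tidy" + suffix)   # index.html -> indextidy.html
    with open(out, "w") as fh:
        # Tidy may return a non-zero status when it reports problems,
        # so the return code is not treated as fatal here.
        subprocess.run(tidy_command(src, as_xml), stdout=fh)
    return out


if __name__ == "__main__":
    # Usage (assumed): python tidy_all.py page1.html page2.html ...
    for name in sys.argv[1:]:
        print(run_tidy(name))
```

Keeping the command construction in its own function makes it easy to check the arguments without actually running Tidy.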
 
 

======
XERCES
======
Contributor(s):
   Dr. Ray Mooney

XML parsers in Java, C++ (with Perl and COM bindings)

http://xml.apache.org/
 

====
WGET
====
Contributor(s):
   Noppadon (Koo^)         noppadon@cs.utexas.edu
   Puay                    puay@cs.utexas.edu

GNU Wget is a freely available network utility to retrieve files from the World Wide Web using HTTP and FTP, the two most widely used Internet protocols. It works non-interactively, thus enabling work in the background, after having logged off.

http://www.gnu.org/manual/wget/index.html
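
As a rough illustration of non-interactive use, the Python sketch below assembles a wget command line for a small recursive download that runs in the background. The helper name wget_command and its default values (depth, directory, log file, wait time) are assumptions chosen for this example; the options themselves are standard wget long options, but check the manual above for the version installed on your system.

```python
import subprocess
import sys


def wget_command(url, depth=2, dest="mirror", log="wget.log", wait=1):
    """Build a wget command for a polite, background, recursive download."""
    return [
        "wget",
        "--background",                # go to background immediately after startup
        "--output-file=" + log,        # write progress messages to a log file
        "--recursive",                 # follow links found in retrieved pages
        "--level=" + str(depth),       # ... but only this many levels deep
        "--wait=" + str(wait),         # pause (in seconds) between retrievals
        "--directory-prefix=" + dest,  # store downloaded files under this directory
        url,
    ]


if __name__ == "__main__":
    # Usage (assumed): python fetch.py http://www.example.com/
    if len(sys.argv) > 1:
        subprocess.run(wget_command(sys.argv[1]))
```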

============
Web Crawlers
============

The Web Robots Pages
http://info.webcrawler.com/mak/projects/robots/robots.html
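
The Web Robots Pages describe the robots.txt convention, which a well-behaved crawler should consult before fetching pages from a site. Below is a minimal Python sketch using the standard urllib.robotparser module. The robots.txt content, the crawler name "MyCrawler", and the example.com URLs are made up for illustration; a real crawler would download robots.txt from the site itself.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, kept in memory so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""


def make_checker(robots_txt):
    """Parse robots.txt text and return a parser that answers can_fetch()."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = make_checker(ROBOTS_TXT)
# Allowed: the path matches no Disallow rule.
print(rules.can_fetch("MyCrawler", "http://www.example.com/index.html"))
# Not allowed: the path starts with /private/.
print(rules.can_fetch("MyCrawler", "http://www.example.com/private/a.html"))
```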
 
 

=====================================
Writing Web Crawler in Java Tutorials
=====================================
Contributor(s):
   Noppadon (Koo^)         noppadon@cs.utexas.edu
 

Writing a Web Crawler in the Java Programming Language
http://developer.java.sun.com/developer/technicalArticles/ThirdParty/WebCrawler/

Automating Web exploration
http://www.javaworld.com/javaworld/jw-11-1996/jw-11-webcrawler.html
 

=========
WebSPHINX
=========
Contributor(s):
   Noppadon (Koo^)         noppadon@cs.utexas.edu
 

A Personal, Customizable Web Crawler

WebSPHINX (Website-Specific Processors for HTML INformation eXtraction) is a Java class library and interactive development environment for Web crawlers. A Web crawler (also called a robot or spider) is a program that browses and processes Web pages automatically.

http://www.cs.cmu.edu/~rcm/websphinx/

There is also an extension to WebSPHINX for personalized search:
http://www.cis.upenn.edu/~lrossey/websphinx.html
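
To make the idea of a crawler concrete, here is a minimal breadth-first crawler sketch in Python. It is not WebSPHINX code and does not use its API; the class and function names (LinkExtractor, crawl) and the tiny in-memory "Web" are made up so that the example runs without network access. A real crawler would replace the fetch function with an HTTP request and should also honor robots.txt.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl from start_url; fetch(url) returns HTML or None."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:               # page could not be retrieved
            continue
        visited.append(url)            # "process" the page; here we only record it
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            link = urljoin(url, href)  # resolve relative links against the page URL
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# A tiny made-up Web, so the sketch runs without network access.
PAGES = {
    "http://example.com/": '<a href="a.html">A</a> <a href="b.html">B</a>',
    "http://example.com/a.html": '<a href="/">home</a> <a href="c.html">C</a>',
    "http://example.com/b.html": '<a href="a.html">A</a>',
    "http://example.com/c.html": "no links here",
}

print(crawl("http://example.com/", PAGES.get))
```

The set of seen URLs prevents the crawler from looping on pages that link back to each other, and the queue gives breadth-first order.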
 

==========
JavaCC 2.0
==========
Contributor(s):
   Ted Wild

JavaCC - The Java Parser Generator
http://www.metamata.com/JavaCC/

Java Compiler Compiler (JavaCC) is described on its Web site as the most popular parser generator for use with Java applications. A parser generator is a tool that reads a grammar specification and converts it to a Java program that can recognize matches to the grammar. In addition to the parser generator itself, JavaCC provides other standard capabilities related to parser generation such as tree building (via a tool called JJTree included with JavaCC), actions, debugging, etc.

The latest release of JavaCC listed here is Version 2.0, which was released jointly by Metamata and Sun Microsystems on October 26, 2000.
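
To show what "a program that can recognize matches to the grammar" looks like, here is a small hand-written recursive-descent recognizer in Python. It is only a sketch of the kind of top-down parser a generator produces, not JavaCC output and not JavaCC grammar syntax; the grammar for arithmetic expressions and all names below are chosen for illustration.

```python
import re

# Grammar recognized by this sketch:
#   expr   := term   (("+" | "-") term)*
#   term   := factor (("*" | "/") factor)*
#   factor := NUMBER | "(" expr ")"


class ParseError(Exception):
    pass


class Recognizer:
    def __init__(self, text):
        # Tokens are runs of digits or single non-space characters.
        self.tokens = re.findall(r"\d+|\S", text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        self.pos += 1

    def expr(self):
        self.term()
        while self.peek() in ("+", "-"):
            self.advance()
            self.term()

    def term(self):
        self.factor()
        while self.peek() in ("*", "/"):
            self.advance()
            self.factor()

    def factor(self):
        tok = self.peek()
        if tok is not None and tok.isdecimal():
            self.advance()
        elif tok == "(":
            self.advance()
            self.expr()
            if self.peek() != ")":
                raise ParseError("expected ')'")
            self.advance()
        else:
            raise ParseError("unexpected token: %r" % (tok,))


def matches(text):
    """Return True if text is a complete expression in the grammar above."""
    parser = Recognizer(text)
    try:
        parser.expr()
    except ParseError:
        return False
    return parser.peek() is None       # all input must be consumed


print(matches("1 + 2 * (3 - 4)"))
print(matches("1 + * 2"))
```

Each grammar rule becomes one method, which is why writing such code by hand gets tedious for large grammars and why a generator is useful.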

=============
WebKB Project
=============

http://www.cs.cmu.edu/~webkb/
http://www.cs.cmu.edu/~tom/

This is work at CMU by Prof. Tom Mitchell and his research group. Data sets and papers are available on-line. The text learning techniques primarily used in this project are based on Bayesian learning.

Goal:
To develop a probabilistic, symbolic knowledge base that mirrors the content of the World Wide Web. If successful, this will make text information on the Web available in computer-understandable form, enabling much more sophisticated information retrieval and problem solving.
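
As a rough illustration of Bayesian text learning, the Python sketch below implements a small multinomial naive Bayes classifier with add-one (Laplace) smoothing. It is a generic textbook-style sketch, not the WebKB group's code; the class name NaiveBayesText, the labels "course" and "faculty", and the training texts are all made up for this example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Multinomial naive Bayes over bags of words, with add-one smoothing."""

    def __init__(self):
        self.class_docs = Counter()              # training documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        self.vocabulary = set()

    def train(self, words, label):
        self.class_docs[label] += 1
        self.word_counts[label].update(words)
        self.vocabulary.update(words)

    def classify(self, words):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)        # log prior
            for w in words:
                if w not in self.vocabulary:
                    continue                             # ignore unseen words
                # log P(w | label) with add-one smoothing
                score += math.log((counts[w] + 1) /
                                  (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training pages, reduced to a few words each.
nb = NaiveBayesText()
nb.train("homework lecture exam syllabus".split(), "course")
nb.train("lecture notes homework assignment".split(), "course")
nb.train("professor research publications".split(), "faculty")
nb.train("research interests professor students".split(), "faculty")

print(nb.classify("lecture homework schedule".split()))
print(nb.classify("professor research".split()))
```

Working with log probabilities avoids numeric underflow when documents contain many words.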

============
WEKA Project
============
http://www.cs.waikato.ac.nz/~ml/weka/index.html

Contributor(s):
   Dr. Ray Mooney
 

Weka is a collection of machine learning algorithms for solving real-world data mining problems. It is implemented in Java and is open source. There is a companion book on data mining and machine learning.

Implemented schemes for classification include:

                               decision tree inducers
                               rule learners
                               naive Bayes
                               decision tables
                               locally weighted regression
                               support vector machines
                               instance-based learners
                               logistic regression
                               voted perceptrons

Implemented schemes for numeric prediction include:

                               linear regression
                               model tree generators
                               locally weighted regression
                               instance-based learners
                               decision tables

Implemented "meta-schemes" include:

                               bagging
                               stacking
                               boosting
                               regression via classification
                               classification via regression
                               cost sensitive classification
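
To give a feel for what a meta-scheme does, here is a minimal Python sketch of bagging wrapped around a one-nearest-neighbour (instance-based) learner. It is not Weka code and does not use the Weka API (Weka itself is written in Java); the function names, the number of rounds, the random seed, and the one-dimensional data are assumptions made for illustration.

```python
import random
from collections import Counter


def nearest_neighbor(train):
    """Return a 1-nearest-neighbour classifier built from (x, label) pairs."""
    def predict(x):
        _, label = min(train, key=lambda pair: abs(pair[0] - x))
        return label
    return predict


def bagging(train, make_classifier, rounds=25, seed=0):
    """Train one base classifier per bootstrap sample; predict by majority vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(rounds):
        # Bootstrap sample: draw len(train) examples with replacement.
        sample = [rng.choice(train) for _ in train]
        models.append(make_classifier(sample))

    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]
    return predict


# Made-up one-dimensional data: small values are "low", large values are "high".
DATA = [(x, "low") for x in range(0, 10)] + [(x, "high") for x in range(20, 30)]

ensemble = bagging(DATA, nearest_neighbor)
print(ensemble(3), ensemble(27))
```

The base learner is passed in as a function, which mirrors the idea that a meta-scheme can wrap any of the classification schemes listed above.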