Maintainer(s):
Noppadon (Koo^) noppadon@cs.utexas.edu
==================
Web Tech Tutorials
==================
Contributor(s):
Noppadon (Koo^) noppadon@cs.utexas.edu
There are so many Web technologies out there: too much to learn, too little time. These two Web sites provide concise, practical explanations and demonstrations of several important Web technologies, e.g. HTML, XML, CSS, JavaScript, WAP, DOM, ASP, etc.
http://www.w3schools.com/
http://www.w3scripts.com/
The most authoritative resource for Web technologies contains quite a few tutorial links:
=========
HTML Tidy
=========
Contributor(s):
Noppadon (Koo^) noppadon@cs.utexas.edu
Dr. Ray Mooney
HTML Tidy automatically tidies up messy, possibly invalid HTML documents and generates valid HTML or XML files. The cleaned-up output is easier to parse and process.
http://www.w3.org/People/Raggett/tidy/
Getting and Compiling HTML Tidy in Linux
----------------------------------------
1. Go to the URL above and scroll down to the source code section.
2. Download the file ending in .tgz (the gzipped tar file of the source code, with Unix line ends), e.g. tidy4aug00.tgz.
3. Unpack it with gunzip and tar -xf:
gunzip tidy4aug00.tgz
tar -xf tidy4aug00.tar
4. Go to the directory with the Tidy source files and run:
make all
5. The build produces an executable file called "tidy". Run it as described below.
Running Tidy
------------
tidy [[options] filename]*
Examples:
tidy -f index.err index.html > indextidy.html
tidy -asxml -f indexxml.err index.html > indextidy.xml
The first example produces a valid HTML file; the second produces a well-formed XML file. In both cases, errors and warnings go to the .err file.
Useful options:
-f [errorfile] to write errors and warnings to a file
-asxml to convert HTML to well-formed XML
-help to get more options
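When many files need tidying, the two example commands above can be driven from a script. The following is only a sketch in Python: the helper names tidy_command and run_tidy are made up for illustration, the "tidy" executable is assumed to be on the PATH, and only the -f and -asxml options listed above are used.

```python
import subprocess


def tidy_command(input_file, error_file, as_xml=False, tidy_path="tidy"):
    """Build the argument list for one Tidy run (hypothetical helper)."""
    cmd = [tidy_path]
    if as_xml:
        cmd.append("-asxml")           # convert HTML to well-formed XML
    cmd += ["-f", error_file]          # write errors and warnings to a file
    cmd.append(input_file)
    return cmd


def run_tidy(input_file, output_file, error_file, as_xml=False):
    """Run Tidy and save its standard output, like `tidy ... > output_file`."""
    with open(output_file, "wb") as out:
        # No check=True here: Tidy is assumed to report warnings through a
        # non-zero exit status, which should not abort a batch run.
        result = subprocess.run(
            tidy_command(input_file, error_file, as_xml), stdout=out
        )
    return result.returncode


if __name__ == "__main__":
    # Same command lines as the two examples above, printed rather than run.
    print(" ".join(tidy_command("index.html", "index.err")))
    print(" ".join(tidy_command("index.html", "indexxml.err", as_xml=True)))
```

Calling run_tidy("index.html", "indextidy.html", "index.err") would correspond to the first example, provided Tidy has been compiled and installed.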
======
XERCES
======
Contributor(s):
Dr. Ray Mooney
XML parsers in Java, C++ (with Perl and COM bindings)
====
WGET
====
Contributor(s):
Noppadon (Koo^) noppadon@cs.utexas.edu
Puay puay@cs.utexas.edu
GNU Wget is a freely available network utility for retrieving files from the World Wide Web using HTTP and FTP, the two most widely used Internet protocols. It works non-interactively, so it can keep running in the background after the user has logged off.
http://www.gnu.org/manual/wget/index.html
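As a rough illustration of the non-interactive use described above, the sketch below builds a Wget command line from Python. The helper name wget_command and the default log file name are assumptions made for this example. The -b option (go to background after startup) and the -o option (write messages to a log file) are standard Wget options, but check the manual for the version installed.

```python
import subprocess


def wget_command(url, log_file="wget.log", background=True):
    """Build a Wget argument list (hypothetical helper)."""
    cmd = ["wget"]
    if background:
        cmd.append("-b")               # detach and keep running in background
    cmd += ["-o", log_file]            # log progress messages to a file
    cmd.append(url)
    return cmd


def fetch(url, log_file="wget.log"):
    """Start a background download; assumes wget is on the PATH."""
    return subprocess.run(wget_command(url, log_file)).returncode


if __name__ == "__main__":
    # Printed rather than executed, so no network access is needed.
    print(" ".join(wget_command("http://www.gnu.org/manual/wget/index.html")))
```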
============
Web Crawlers
============
The Web Robots Pages
http://info.webcrawler.com/mak/projects/robots/robots.html
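Well-behaved crawlers check a site's robots.txt file before fetching pages. As a small sketch of that check, the example below uses Python's standard urllib.robotparser module on a made-up robots.txt. The crawler name "ExampleCrawler" and the example.com URLs are placeholders, not anything taken from the pages linked above.

```python
from urllib.robotparser import RobotFileParser


def build_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical robots.txt: every robot is kept out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rules = build_rules(ROBOTS_TXT)

if __name__ == "__main__":
    for url in ("http://example.com/index.html",
                "http://example.com/private/data.html"):
        verdict = "allowed" if rules.can_fetch("ExampleCrawler", url) else "blocked"
        print(url, verdict)
```

A real crawler would fetch the file from the site itself (RobotFileParser also offers set_url and read for that) and call can_fetch before every request.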
=====================================
Writing Web Crawler in Java Tutorials
=====================================
Contributor(s):
Noppadon (Koo^) noppadon@cs.utexas.edu
Writing a Web Crawler in the Java Programming Language
http://developer.java.sun.com/developer/technicalArticles/ThirdParty/WebCrawler/
Automating Web exploration
http://www.javaworld.com/javaworld/jw-11-1996/jw-11-webcrawler.html
=========
WebSPHINX
=========
Contributor(s):
Noppadon (Koo^) noppadon@cs.utexas.edu
A Personal, Customizable Web Crawler
WebSPHINX (Website-Specific Processors for HTML INformation eXtraction) is a Java class library and interactive development environment for Web crawlers. A Web crawler (also called a robot or spider) is a program that browses and processes Web pages automatically.
http://www.cs.cmu.edu/~rcm/websphinx/
There is also an extension to WebSPHINX for personalized search:
http://www.cis.upenn.edu/~lrossey/websphinx.html
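To make the idea of a crawler concrete, here is a minimal breadth-first crawler sketch in Python. It does not use or imitate the WebSPHINX API. The function crawl, the class LinkExtractor, and the in-memory PAGES table are all invented for this example, and pages are "fetched" from a dictionary so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the <a href="..."> tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def crawl(start_url, fetch, max_pages=10):
    """Visit pages breadth-first; fetch(url) returns HTML text or None."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:               # page missing or not retrievable
            continue
        visited.append(url)
        extractor = LinkExtractor(url)
        extractor.feed(html)
        for link in extractor.links:
            if link not in seen:       # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical three-page site standing in for the real Web.
PAGES = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "http://example.com/a.html": '<a href="/">home</a>',
    "http://example.com/b.html": '<a href="/a.html">A</a>',
}

if __name__ == "__main__":
    print(crawl("http://example.com/", PAGES.get))
```

Replacing PAGES.get with a function that downloads a URL would turn this into a real crawler; at that point it should also honor robots.txt and limit its request rate.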
==========
JavaCC 2.0
==========
Contributor: Ted Wild
JavaCC - The Java Parser Generator
http://www.metamata.com/JavaCC/
Java Compiler Compiler (JavaCC) is the most popular parser generator for use with Java applications. A parser generator is a tool that reads a grammar specification and converts it to a Java program that can recognize matches to the grammar. In addition to the parser generator itself, JavaCC provides other standard capabilities related to parser generation such as tree building (via a tool called JJTree included with JavaCC), actions, debugging, etc.
The latest release of JavaCC is Version 2.0, released jointly by Metamata and Sun Microsystems on October 26, 2000.
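To show what "recognizing matches to a grammar" means, the sketch below is a small hand-written recognizer in Python for a toy grammar. It is not JavaCC output and does not use JavaCC's grammar syntax. The grammar (expr := term (("+" | "-") term)*, term := NUMBER | "(" expr ")") and the function names are made up for illustration. A parser generator produces code of roughly this shape from the grammar specification, so nobody has to write it by hand.

```python
import re

# One token: an integer, or one of the characters + - ( )
TOKEN = re.compile(r"\s*(\d+|[-+()])")


def tokenize(text):
    """Split the input into tokens; raise ValueError on a bad character."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise ValueError("unexpected character at position %d" % pos)
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def recognizes(text):
    """Return True if text matches the toy expression grammar."""
    try:
        tokens = tokenize(text)
    except ValueError:
        return False
    pos = 0

    def term():
        # term := NUMBER | "(" expr ")"
        nonlocal pos
        if pos < len(tokens) and tokens[pos].isdigit():
            pos += 1
            return True
        if pos < len(tokens) and tokens[pos] == "(":
            pos += 1
            if expr() and pos < len(tokens) and tokens[pos] == ")":
                pos += 1
                return True
        return False

    def expr():
        # expr := term (("+" | "-") term)*
        nonlocal pos
        if not term():
            return False
        while pos < len(tokens) and tokens[pos] in ("+", "-"):
            pos += 1
            if not term():
                return False
        return True

    # The whole input must be consumed for a match.
    return expr() and pos == len(tokens)


if __name__ == "__main__":
    for sample in ("1 + (2 - 3)", "1 +", "(1"):
        print(repr(sample), recognizes(sample))
```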
=============
WebKB Project
=============
http://www.cs.cmu.edu/~webkb/
http://www.cs.cmu.edu/~tom/
Work at CMU by Prof. Tom Mitchell and his research group. Data sets and papers are available on-line. The text learning techniques primarily used in this project are based on Bayesian learning.
Goal:
To develop a probabilistic, symbolic knowledge base that mirrors the
content of the world wide web. If successful, this will make text information
on the web available in computer-understandable form, enabling much more
sophisticated information retrieval and problem solving.
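As a rough idea of what Bayesian text learning looks like, here is a minimal naive Bayes text classifier sketch in Python with add-one smoothing. It is not the WebKB group's code. The function names, the two page labels, and the four training texts are toy assumptions made up for this example.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """docs is a list of (label, text) pairs; returns the counts needed."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def classify(model, text):
    """Return the label with the highest log posterior for the text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_counts.items():
        score = math.log(doc_count / total_docs)          # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:                         # skip unseen words
                continue
            # Add-one (Laplace) smoothed estimate of P(word | label)
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data: two hypothetical kinds of university Web page.
TRAINING = [
    ("course", "homework syllabus lecture exam"),
    ("course", "lecture notes homework assignment"),
    ("faculty", "professor research publications"),
    ("faculty", "professor office research students"),
]

model = train(TRAINING)

if __name__ == "__main__":
    print(classify(model, "lecture homework due"))
    print(classify(model, "research professor"))
```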
============
WEKA Project
============
http://www.cs.waikato.ac.nz/~ml/weka/index.html
Contributor: Dr. Ray Mooney
Weka is an open-source collection of machine learning algorithms, implemented in Java, for solving real-world data mining problems. There is a companion book on data mining and machine learning.
Implemented schemes for classification include:
decision tree inducers
rule learners
naive Bayes
decision tables
locally weighted regression
support vector machines
instance-based learners
logistic regression
voted perceptrons
Implemented schemes for numeric prediction include:
linear regression
model tree generators
locally weighted regression
instance-based learners
decision tables
Implemented "meta-schemes" include:
bagging
stacking
boosting
regression via classification
classification via regression
cost sensitive classification