XHTML Well-formedness Validation with Prolog


Here I present a Prolog program that checks the well-formedness of an XHTML document. An XHTML document is well-formed when its markup follows all the syntactic rules that the XHTML specification labels as well-formedness rules.

 

Detailed Working Process

My Prolog program reads its input from a file, xhtml_nodes.txt, which contains a plain sequence (not a Prolog list) of Prolog terms. It reads those terms one by one and collects them in a Prolog list. The result is the XHTML document represented as a Prolog list, with each HTML element translated into Prolog terms.
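The reading step can be sketched as follows. This is only a compact illustration, assuming SWI-Prolog (for setup_call_cleanup/3); the predicate names read_terms/2 and read_all/2 are hypothetical, and the version the program actually uses is readFile.pl in Appendix 2.

```prolog
% read_terms(+File, -Terms): collect every term in File, in file order.
% Hypothetical helper; setup_call_cleanup/3 closes the stream even if
% reading fails or throws.
read_terms(File, Terms) :-
    setup_call_cleanup(open(File, read, Stream),
                       read_all(Stream, Terms),
                       close(Stream)).

% read/1 returns the atom end_of_file once the input is exhausted.
read_all(Stream, Terms) :-
    read(Stream, Term),
    (   Term == end_of_file
    ->  Terms = []
    ;   Terms = [Term|Rest],
        read_all(Stream, Rest)
    ).
```

A query such as ?- read_terms('xhtml_nodes.txt', Nodes). would bind Nodes to the list of terms in the order they appear in the file.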

To run the validator, load the file main.pl and call the main predicate with the following commands:

?- [main].

?- main.

 

The file main.pl contains an ensure_loaded/1 directive that loads four other files: readFile.pl, ncount.pl, dcg_rules.pl and dcg_lexicon.pl.

 

:- ensure_loaded([readFile, ncount, dcg_rules, dcg_lexicon]).

 

Each file does what its name implies:

readFile.pl reads xhtml_nodes.txt and returns a list of nodes.

ncount.pl just counts the number of elements in the list.

dcg_lexicon.pl contains the lexicon entries needed by the DCG rules. Note that lexicon entries are not generated for every XHTML element.

All the elements are grouped according to their behavior. The elements covered in this project are listed in Appendix 1.

dcg_rules.pl is the heart of the program, where the validation rules derived from the XHTML specification are stored. They are written as Prolog DCG rules that validate the list of nodes, which represents the order of the XHTML elements in the input file. Note that not all the rules of the XHTML specification are implemented.
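To illustrate how a DCG rule validates a list of nodes: Prolog translates each --> rule into an ordinary predicate with two extra list arguments, which is why the grammar can be called as doc(Y,[]). The sketch below uses two lexicon entries in the style of dcg_lexicon.pl plus a toy rule, empty_html, which is hypothetical and far smaller than the real grammar.

```prolog
% Lexicon entries: each nonterminal consumes one token of the same name.
html_start --> [html_start].
html_stop  --> [html_stop].

% Toy rule (hypothetical, not part of dcg_rules.pl): an html element
% with nothing inside it.
empty_html --> html_start, html_stop.

% Both queries succeed; phrase/2 is the portable way to call a DCG.
% ?- empty_html([html_start, html_stop], []).
% ?- phrase(empty_html, [html_start, html_stop]).
```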

 

The rules that I have covered are as follows:

  1. An XHTML document should be constructed from html, head and body element nodes.
  2. The root element of the document must be html.
  3. An html element must have a head and a body.
  4. A document must have a title in the head.
  5. XHTML elements must be closed.
  6. XHTML elements must be properly nested.
  7. Empty elements must be terminated.
  8. The body element must contain a block element or a series of block elements, or can be empty.
  9. A block element can contain a series of block elements or a series of inline elements, or can be empty.
  10. Tables (a special type of block element) should only use table, tr, th and td elements, and optionally caption.
  11. An inline element can contain a series of inline elements or can be empty, but cannot contain a block element.
  12. Some special inline elements cannot contain other inline elements (input, textarea, br).
  13. The select element (a special type of inline element) shall contain zero or more option elements.
  14. The anchor (a) element can contain only certain elements, such as img, strong, span, i, em, b, caption and label. Among them, some are self-contained inline elements and some are container inline elements.
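Rules 5 and 6 (every element closed, elements properly nested) are the kind of constraint a DCG expresses naturally. The following is a minimal sketch, not the program's grammar: the nonterminals nested and element are hypothetical names, and only div and span are covered.

```prolog
% A sequence of zero or more properly nested elements.
nested --> [].
nested --> element, nested.

% An element is a start token, nested content, and the matching stop token.
element --> [div_start],  nested, [div_stop].
element --> [span_start], nested, [span_stop].

% Succeeds: span is closed before div.
% ?- phrase(nested, [div_start, span_start, span_stop, div_stop]).
% Fails: div is closed while span is still open.
% ?- phrase(nested, [div_start, span_start, div_stop, span_stop]).
```

Because the stop token is written in the same rule as its start token, an unclosed or wrongly nested element simply leaves the grammar with no rule that matches, and the query fails.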

 

The implementation of these rules is commented in the Prolog program code. All the code is in Appendix 2.

 

There is also a C#.NET program that reads a raw XHTML document (e.g. .html or .htm) and prepares the input for the Prolog program. That input is a file containing a plain sequence of Prolog terms.

For example, the following XHTML chunk is transformed as shown below.

 

<html lang="en-US" xmlns="http://www.w3.org/1999/xhtml">

<head>

<title>HTML Tutorial</title>

<link rel="stylesheet" type="text/css" href="./sample1_files/stdtheme.css" />

</head>

<body>

</body>

</html>

 

The translated Prolog terms:

html_start.

head_start.

title_start.

title_stop.

link_start.

link_stop.

head_stop.

body_start.

body_stop.

html_stop.

 

There are normally two types of terms:

  • An opening term (xxx_start.) marks the start of an element.
  • A closing term (xxx_stop.) marks the end of an element.
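Because every element is reduced to an xxx_start / xxx_stop pair, a generic balance check needs nothing more than a stack of open element names. The sketch below is an illustration under that assumption and is not part of the program (which uses DCG rules instead); balanced/1 and balanced/2 are hypothetical names, and atom_concat/3 is used to split the element name from its suffix.

```prolog
% balanced(+Tokens): every Name_start is closed by the matching
% Name_stop, and the element opened last is closed first.
balanced(Tokens) :- balanced(Tokens, []).

% balanced(+Tokens, +OpenStack)
balanced([], []).
balanced([T|Ts], Stack) :-
    atom_concat(Name, '_start', T), !,   % opening term: push the name
    balanced(Ts, [Name|Stack]).
balanced([T|Ts], [Name|Stack]) :-
    atom_concat(Name, '_stop', T),       % closing term: must match the top
    balanced(Ts, Stack).

% ?- balanced([html_start, head_start, head_stop, html_stop]).   % succeeds
% ?- balanced([html_start, head_start, html_stop, head_stop]).   % fails
```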

 

The UI of the program is shown in Fig. 1. Browse to an .html or .htm file with the File button, then click the Read button to produce the output, which is shown in the text box and the list below the Read button. The main output, a text file called xhtml_nodes.txt, is saved in the “Prolog” directory under “My Documents” (e.g. C:\Users\rizvis\Documents\Prolog\xhtml_nodes.txt).

Figure_1

 

 

How to run

To start a test, pick an XHTML file; a normal HTML file will also work. Start the C#.NET application and generate an xhtml_nodes.txt file. The input and output file paths are shown in the application. If the xhtml_nodes.txt file is not in your Prolog directory, move it there. Now load the main.pl file and run the main predicate (?- main.). Make sure all the supporting files (readFile.pl, ncount.pl, dcg_rules.pl, dcg_lexicon.pl) are in the same directory as main.pl. If the document is valid, you will get the text “Valid Document” and the query returns true; otherwise it returns false.

Figure_2
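As written, main simply fails for an invalid document. A small wrapper could print an explicit message in both cases. The sketch below is self-contained and uses a hypothetical toy grammar, toy_doc; with the real program one would pass doc instead. The name report/2 and the 'Invalid Document' message are assumptions, not part of the original code.

```prolog
% Toy grammar standing in for the real doc rule (hypothetical).
toy_doc --> [html_start], [html_stop].

% report(+Grammar, +Tokens): print a verdict instead of failing silently.
report(Grammar, Tokens) :-
    (   phrase(Grammar, Tokens)
    ->  write('Valid Document'), nl
    ;   write('Invalid Document'), nl
    ).

% ?- report(toy_doc, [html_start, html_stop]).
% ?- report(toy_doc, [html_start]).
```

The trade-off is that report/2 always succeeds, so a caller that needs the truth value should keep calling the grammar directly.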

 

Conclusion

The XHTML specification has so many rules that it is hard to implement all of them in a one-person project. But I have tried my best to cover the major rules that dictate XHTML structure and well-formedness.

 

References

Tutorial by Paul Brna – http://homepages.inf.ed.ac.uk/pbrna/prologbook/index.html

Tutorial Learn Prolog Now – http://www.learnprolognow.org/lpnpage.php?pageid=top

Sources of validation rules are W3C organization – http://www.w3.org/TR/xhtml2/mod-document.html

XHTML 1 and 2 specification – http://www.w3.org/TR/xhtml1/ and http://www.w3.org/TR/xhtml2/

List of Block Elements – http://www.cs.sfu.ca/CourseCentral/165/sbrown1/wdgxhtml10/block.html

List of Inline element – http://www.cs.sfu.ca/CourseCentral/165/sbrown1/wdgxhtml10/inline.html

 

 

 

 

Appendix 1

All the elements that have been covered in the project

Block level Elements

  • div – Generic block-level container
  • h1 – Level-one heading
  • h2 – Level-two heading
  • h3 – Level-three heading
  • h4 – Level-four heading
  • h5 – Level-five heading
  • h6 – Level-six heading
  • hr – Horizontal rule
  • p – Paragraph
  • pre – Preformatted text

 

Special block level element

  • table – Table (with tr, th, td and an optional caption)

 

Inline elements

  • b – Bold text
  • code – Computer code
  • em – Emphasis
  • i – Italic text
  • span – Generic inline container
  • strong – Strong emphasis
  • caption – Placeholder for caption text

 

Self-inline

  • img – Inline image
  • textarea – Multi-line text input
  • input – Form input
  • br – Line break
  • option – An option in the select element
  • label – Form field label
  • a – Anchor

Special inline

  • select – Option selector

 

 

 

 

Appendix 2

Prolog program code

File main.pl

% Main file to start the program.

% Coded by Rizvi Hasan

% Date 20141013

 

% Load the predicates of other files

:- ensure_loaded([readFile, ncount, dcg_rules, dcg_lexicon]).

 

% Run the XHTML test in a line of code

main:- readFile('xhtml_nodes.txt',Y),nl,nl,write(Y),nl,nl,ncount(Y),doc(Y,[]),nl,nl,write('Valid Document'),nl,nl,!.

 

% Frequently used commands

% [main]. main.

% [dcg_rules, main]. main.

 

 

File readFile.pl

% For reading a file of a list of terms

% Coded by Rizvi Hasan

% Date 20141013

% readFile(+,-).  example: readFile('xhtml_nodes.txt',Y),nl,write(Y),nl,nl

 

 

% Open a file in read mode, read all its terms, then close the stream.
readFile(F,Out):- open(F, read, Strm),
    reading(Strm,Out),
    close(Strm).

 

% Read from the stream and store in a list.

reading(Strm,Out):- reading1(Strm,[],Out1),reverse(Out1,Out).

 

reading1(_,[end_of_file|Acc],Acc).

reading1(Strm,Acc,Out):-

read(Strm,X1),

reading1(Strm,[X1|Acc],Out),!.

 

% Reverse the list.

reverse(X,X1) :- reverse(X,[],X1).

reverse([],A,A).

reverse([H|T],A,Y):- reverse(T,[H|A],Y).

 

%   \+ X1 == end_of_file,

 

File ncount.pl

% Extra information about the XHTML document node count

% Coded by Rizvi Hasan

% Date 20141013

 

% nodecount(+List,+Acc,-N) counts the total number of nodes.

ncount(Y):-nodecount(Y,0,N),write(['Node Count', N]).

nodecount([],Acc,Acc).

nodecount([_|T],Acc,N):-  Acc1 is Acc + 1, nodecount(T,Acc1,N).

 

File dcg_rules.pl

% Grammar rules for XHTML validation.

% Coded by Rizvi Hasan

% Date 20141013

 

% example execution: doc([html_start,head_start,title_start,title_stop,head_stop,body_start,body_stop,html_stop],[])

 

doc --> html_start, htmlCont, html_stop.

htmlCont --> head, body.
head --> head_start, headElm, title, headElm, head_stop.
title --> title_start, title_stop.

headElm --> [].
headElm --> meta_start, meta_stop, headElm.
headElm --> link_start, link_stop, headElm.

body --> body_start, seris0, body_stop.

 

% Block element and inline element validation.

% Series of mixed block and inline elements
seris --> [].
seris --> blockElm, seris.
seris --> inlineElm, seris.

% Series of block elements
seris0 --> [].
seris0 --> blockElm, seris0.

% Series of inline elements
seris1 --> [].
seris1 --> inlineElm, seris1.

 

% Series of the elements that an <a> element can contain, specifically
% img, strong, label etc.
serisA --> [].
serisA --> inlineElm(A1), { A1 == strong;
                            A1 == label;
                            A1 == em;
                            A1 == i;
                            A1 == b;
                            A1 == span;
                            A1 == caption;
                            A1 == img }, serisA.

 

 

% A block element must have a start element and an end element
% and can contain N* block or inline elements. <table> is a special
% block level element.
blockElm --> blockElm_Start(other,X), seris, blockElm_End(other,X).
blockElm --> blockElm_Start(table), caption, tableHeader, serisRow, blockElm_End(table).

% An inline element must have a start element and an end element and
% can contain N* inline elements.
inlineElm --> inlineElm_Start(other,Y), seris1, inlineElm_End(other,Y).
% Some inline elements should not contain any other inline elements.
inlineElm --> inlineElm_Start(self,Z), inlineElm_End(self,Z).
% <select> is a special type of inline element which should contain
% only options.
inlineElm --> inlineElm_Start(select), serisOption, inlineElm_End(select).
% <a> is a special type of inline element which should contain some
% specific inlines.
inlineElm --> inlineElm_Start(a), serisA, inlineElm_End(a).
inlineElm(Y) --> inlineElm_Start(_,Y), serisA, inlineElm_End(_,Y).
% Block element and inline element validation End

 

% <table> validation
caption --> caption_start, caption_stop.
caption --> [].

tableHeader --> [].
tableHeader --> tr_start, serisHeader, tr_stop.
serisHeader --> [].
serisHeader --> th, serisHeader.
th --> th_start, seris, th_stop.

serisRow --> [].
serisRow --> row, serisRow.
row --> tr_start, serisCol, tr_stop.
serisCol --> [].
serisCol --> col, serisCol.
col --> td_start, seris, td_stop.
% Table validation End

% <select> validation
serisOption --> [].
serisOption --> option_start, option_stop, serisOption.
% select validation End

 

% Grouping of all the elements according to their properties.
blockElm_Start(table) --> table_start.
blockElm_Start(other,X) --> pre_start,{X=pre};
                            p_start,{X=p};
                            div_start,{X=div};
                            hr_start,{X=hr};
                            h1_start,{X=h1};
                            h2_start,{X=h2};
                            h3_start,{X=h3};
                            h4_start,{X=h4};
                            h5_start,{X=h5};
                            h6_start,{X=h6}.

blockElm_End(table) --> table_stop.
blockElm_End(other,X) --> pre_stop,{X=pre};
                          p_stop,{X=p};
                          div_stop,{X=div};
                          hr_stop,{X=hr};
                          h1_stop,{X=h1};
                          h2_stop,{X=h2};
                          h3_stop,{X=h3};
                          h4_stop,{X=h4};
                          h5_stop,{X=h5};
                          h6_stop,{X=h6}.

 

inlineElm_Start(other,Y) --> label_start,    {Y=label};
                             code_start,     {Y=code};
                             caption_start,  {Y=caption};
                             span_start,     {Y=span};
                             strong_start,   {Y=strong};
                             em_start,       {Y=em};
                             i_start,        {Y=i};
                             b_start,        {Y=b}.
inlineElm_Start(self,Z) --> img_start,      {Z=img};
                            br_start,       {Z=br};
                            input_start,    {Z=input};
                            textarea_start, {Z=textarea}.
inlineElm_Start(select) --> select_start.
inlineElm_Start(a) --> a_start.

inlineElm_End(other,Y) --> label_stop,     {Y=label};
                           code_stop,      {Y=code};
                           caption_stop,   {Y=caption};
                           span_stop,      {Y=span};
                           strong_stop,    {Y=strong};
                           em_stop,        {Y=em};
                           i_stop,         {Y=i};
                           b_stop,         {Y=b}.
inlineElm_End(self,Z) --> img_stop,      {Z=img};
                          br_stop,       {Z=br};
                          input_stop,    {Z=input};
                          textarea_stop, {Z=textarea}.
inlineElm_End(select) --> select_stop.
inlineElm_End(a) --> a_stop.

 

 

 

 

 

 

 

File dcg_lexicon.pl

% Lexicons

% Coded by Rizvi Hasan

% Date 20141013

 

 

% Head elements
html_start -->  [html_start].
html_stop -->   [html_stop].
head_start -->  [head_start].
head_stop -->   [head_stop].
meta_start -->  [meta_start].
meta_stop -->   [meta_stop].
title_start --> [title_start].
title_stop -->  [title_stop].
link_start -->  [link_start].
link_stop -->   [link_stop].

% Body elements
body_start -->  [body_start].
body_stop -->   [body_stop].

% Block elements
h1_start -->    [h1_start].
h1_stop  -->    [h1_stop].
h2_start -->    [h2_start].
h2_stop  -->    [h2_stop].
h3_start -->    [h3_start].
h3_stop  -->    [h3_stop].
h4_start -->    [h4_start].
h4_stop  -->    [h4_stop].
h5_start -->    [h5_start].
h5_stop  -->    [h5_stop].
h6_start -->    [h6_start].
h6_stop  -->    [h6_stop].
div_start -->   [div_start].
div_stop -->    [div_stop].
p_start -->     [p_start].
p_stop -->      [p_stop].
hr_start -->    [hr_start].
hr_stop -->     [hr_stop].
table_start --> [table_start].
table_stop -->  [table_stop].
th_start -->    [th_start].
th_stop -->     [th_stop].
tr_start -->    [tr_start].
tr_stop -->     [tr_stop].
td_start -->    [td_start].
td_stop -->     [td_stop].
pre_start -->   [pre_start].
pre_stop -->    [pre_stop].

 

% Inline elements
caption_start --> [caption_start].
caption_stop -->  [caption_stop].
strong_start -->  [strong_start].
strong_stop -->   [strong_stop].
em_start -->      [em_start].
em_stop -->       [em_stop].
i_start -->       [i_start].
i_stop -->        [i_stop].
b_start -->       [b_start].
b_stop -->        [b_stop].
span_start -->    [span_start].
span_stop -->     [span_stop].
code_start -->    [code_start].
code_stop -->     [code_stop].
select_start -->  [select_start].
select_stop -->   [select_stop].

% SelfInline elements
img_start -->      [img_start].
img_stop -->       [img_stop].
br_start -->       [br_start].
br_stop -->        [br_stop].
input_start -->    [input_start].
input_stop -->     [input_stop].
textarea_start --> [textarea_start].
textarea_stop -->  [textarea_stop].
option_start -->   [option_start].
option_stop -->    [option_stop].
a_start -->        [a_start].
a_stop -->         [a_stop].
label_start -->    [label_start].
label_stop -->     [label_stop].

 

 

 

 

 

 

 

 

 

 

C#.NET program code

// Coded by Rizvi Hasan

// Date 2014-09-28

 

using HtmlAgilityPack;

using System;

using System.Collections.Generic;

using System.Linq;

using System.Text;

using System.Threading.Tasks;

using System.Windows;

using System.Windows.Controls;

using System.Windows.Data;

using System.Windows.Documents;

using System.Windows.Input;

using System.Windows.Media;

using System.Windows.Media.Imaging;

using System.Windows.Navigation;

using System.Windows.Shapes;

using System.Xml;

using System.IO;

 

namespace XHTMLReader
{
    /// <summary>
    /// Interaction logic for MainWindow.xaml
    /// </summary>
    public partial class MainWindow : Window
    {
        private string _strExcelFilename = "";

        public MainWindow()
        {
            InitializeComponent();
            _strExcelFilename = @"%userprofile%\documents";
            // Comment out in production
            //_strExcelFilename = @"C:\Users\rizvis\Documents\Mina Mapp\Dropbox\ID2213\ProjectProlog\Input Files\sample1.htm";
            lblInputFile.Text = _strExcelFilename;
        }

 

        private void btnFile_Click(object sender, RoutedEventArgs e)
        {
            //strExcelFilename = System.IO.Path.GetDirectoryName(System.Diagnostics.Process.GetCurrentProcess().MainModule.FileName)
            //_strExcelFilename = "%userprofile%\\documents";

            // Do not import the System.Windows.Forms namespace; its identifiers clash with the WPF ones.
            var f = new System.Windows.Forms.OpenFileDialog();
            f.Filter = "XHTML files (*.html, *.htm, *.xml)|*.html;*.htm;*.xml";
            f.InitialDirectory = _strExcelFilename;
            if (f.ShowDialog() == System.Windows.Forms.DialogResult.OK)
            {
                if (f.FileName != null && f.CheckFileExists == true)
                {
                    _strExcelFilename = f.FileName;
                    lblInputFile.Text = _strExcelFilename;
                    txtFileName.Text = f.SafeFileName;
                }
            }
        }

 

        private void btnRead_Click(object sender, RoutedEventArgs e)
        {
            xmlTextBlock.Text = string.Empty;
            // Good source for the XHTML definition: http://www.w3schools.com/html/html_xhtml.asp

            HtmlDocument doc = new HtmlDocument();
            doc.Load(_strExcelFilename);

            //var myNodes = doc.DocumentNode.SelectNodes("//a[starts-with(@id,'menu-item-')]");
            List<HtmlNode> myNodes = doc.DocumentNode.Elements("html").ToList();
            ExploreNodes(myNodes);

            //xmlTextBlock.Text = xmlTextBlock.Text + " STOP!!!";

            write_to_File(xmlTextBlock.Text);
        }

 

 

        private void OutputLog(HtmlNode node, string indicator)
        {
            // Text nodes carry no element structure, so skip them.
            if (node.Name == "#text") return;
            xmlTextBlock.Text = xmlTextBlock.Text + node.Name + "_" + indicator + " # ";
            xmlList.Items.Add(node.Name + "_" + indicator);
        }

 

        private void ExploreNodes(List<HtmlNode> nodes)
        {
            foreach (var item in nodes)
            {
                OutputLog(item, "start");
                // Recurse into the child nodes, if any, so that an element
                // with a single child element is still explored.
                if (item.ChildNodes.Count > 0)
                {
                    ExploreNodes(item.ChildNodes.ToList());
                }
                OutputLog(item, "stop");
            }
        }

 

        private void write_to_File(string p)
        {
            string mydocpath = Environment.GetFolderPath(Environment.SpecialFolder.MyDocuments);
            StringBuilder sb = new StringBuilder();

            // Split the logged text into individual node names
            var _nodes = p.Split(new string[] { " # " }, StringSplitOptions.RemoveEmptyEntries).ToList();

            // Build the string for the file: one Prolog term per line
            foreach (string node in _nodes)
            {
                sb.AppendLine(node.ToString().Trim() + ".");
                //sb.AppendLine(node.ToString().Trim() + " --> [" + node.ToString().Trim() + "].");
            }

            // Write the string builder string to a file.
            using (StreamWriter outfile = new StreamWriter(mydocpath + @"\Prolog\xhtml_nodes.txt"))
            {
                outfile.Write(sb.ToString());
            }
            lblOutFile.Text = mydocpath + @"\Prolog\xhtml_nodes.txt";
            lblNodeCount.Text = "Nodes : " + _nodes.Count.ToString();
        }
    }
}

 
