I am supposed to complete a Java assignment that satisfies the following constraints:
1) Read an .html file given as input (not a web URL, just a file stored on the hard disk).
2) Count the occurrences of all the tags present in the HTML file, excluding the closing tags.
3) Prompt the user to enter a tag name, then display all the attributes of that tag.
4) Replace the old attribute value with a new value and write the result to a new HTML file.

P.S.: I couldn't work out how to exclude the closing tags from the count. I would be happy if someone could give me directions for completing this assignment.

All 11 Replies

Opening tags have stuff between less than and greater than symbols. Closing tags ditto, but the first thing after the opening less than is a forward slash. That should be enough for you to tell which is which and process or ignore them accordingly.
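
As a minimal sketch of that check (the class and method names here are made up for illustration, not part of any library), the test for a closing tag can be a single startsWith call on the text that begins at the less than symbol:

```java
class TagKindSketch {
    // Hypothetical helper: true if the tag text starts with "</", i.e. it is a closing tag.
    static boolean isClosingTag(String tag) {
        return tag.startsWith("</");
    }

    public static void main(String[] args) {
        String[] samples = {"<html>", "</html>", "<body bg=\"pink\">"};
        for (String s : samples) {
            // Closing tags are ignored, opening tags would be counted.
            System.out.println(s + " -> " + (isClosingTag(s) ? "closing, ignore" : "opening, count"));
        }
    }
}
```

This assumes the string has already been cut so that it starts at the less than symbol; leading whitespace would need a trim() first.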

James, so should I use the startsWith() and endsWith() methods in an if statement, or a regex utility?

How can I read the attributes?

For counting occurrences, can I use a HashMap, or is there a simpler way? Kindly comment on the possible ways of solving it.

You should start by doing some research of your own, and trying things to see what works best. I know from your recent posts that you are very capable of doing that. If you get stuck you can always get help from DaniWeb, but you should not rely on this site to guide every step of your learning process - that's your teacher's job.

:) OK James! Yes, I did some searching before posting. So far I have found these ideas for the assignment:
1. To read an HTML file I can use a BufferedReader.
2. For counting the occurrences of a tag I can use a HashMap.
But I couldn't find a good way to detect whether a tag is a closing tag, or to extract the attributes of a tag. While searching I found many answers advising against using regex to parse HTML, and some suggest jsoup, but one of my constraints is that I must not use any external API. That's why I posted my question here.
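
For the counting idea, a rough sketch could look like the following. It assumes the tag names have already been extracted somehow; the class name, the method name, and the sample array are invented for illustration. Map.merge (available since Java 8) does the "insert 1 or add 1" step in one call:

```java
import java.util.HashMap;
import java.util.Map;

class TagCountSketch {
    // Hypothetical helper: counts how often each tag name occurs in the array.
    static Map<String, Integer> count(String[] tagNames) {
        Map<String, Integer> counts = new HashMap<>();
        for (String name : tagNames) {
            // Store 1 for a new name, otherwise add 1 to the existing count.
            counts.merge(name, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample names only; in the assignment they would come from the HTML file.
        String[] names = {"html", "head", "title", "body", "div", "div"};
        System.out.println(count(names));
    }
}
```

A HashMap does not keep any particular order when printed; a TreeMap would print the names sorted if that matters.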

One place to start is to split the lines using greater than OR less than as the delimiter, and look at what that gives you...
or you could use indexOf to find the greater thans and less thans and use substring to pick pieces out of the string...
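
The indexOf/substring idea could be sketched roughly like this. The class and method names are invented, and the sketch makes simplifying assumptions: it does not treat comments, doctype declarations, or self-closing tags such as <br/> specially, and it assumes no greater than symbol appears inside an attribute value.

```java
import java.util.ArrayList;
import java.util.List;

class TagScanSketch {
    // Hypothetical helper: collects the names of opening tags, skipping closing tags.
    static List<String> openingTagNames(String html) {
        List<String> names = new ArrayList<>();
        int lt = html.indexOf('<');
        while (lt != -1) {
            int gt = html.indexOf('>', lt + 1);
            if (gt == -1) {
                break; // no matching '>', stop scanning
            }
            // Everything between '<' and '>': tag name plus any attributes.
            String inside = html.substring(lt + 1, gt).trim();
            if (!inside.isEmpty() && !inside.startsWith("/")) {
                // The tag name ends at the first whitespace.
                names.add(inside.split("\\s+")[0]);
            }
            lt = html.indexOf('<', gt + 1);
        }
        return names;
    }

    public static void main(String[] args) {
        String html = "<html><body bg=\"pink\"><div id=\"4\" ></div></body></html>";
        System.out.println(openingTagNames(html));
    }
}
```

The same "inside" string also holds the attributes, so whatever follows the tag name can be picked out with another substring call.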

OK James, I have found something interesting named HTMLEditorKit. I am learning to use it and will post my snippet soon :)

Here is the code I have so far; it needs more work to be complete.

import java.io.*;
import java.util.*;

class HtmlAnalyser {
    public static void main(String[] dinesh) {
        long start, end;
        String line;
        String[] tags;
        start = System.currentTimeMillis();
        Properties props = new Properties();
        try {
            FileInputStream fis = new FileInputStream("HtmlAnalyser.properties");
            props.load(fis);
        } catch (IOException e) {
            System.out.println(e);
        }
        File file = new File(props.getProperty("htmlPath"));
        try {
            BufferedReader br = new BufferedReader(new FileReader(file));
            while ((line = br.readLine()) != null) {
                tags = line.split(">");
                for (String read : tags) {
                    if (!(read.startsWith("</"))) {
                        System.out.println(read);
                    }
                }
            }
        } catch (IOException e) {
            System.out.println(e);
        }
    }
}

HtmlAnalyser.properties:

htmlPath=/home/dinesh/test.html
tagName=img
attributeName=class
attributeValue=newAttributeValue
newHtmlFile=/home/dinesh/new.html

test.html:

<html>
<head>
<title> Sample 
</title>
</head>
<body bg=”pink“>
</body>
</html>

output:

java HtmlAnalyser
<html
<head
<title
 Sample 
<body bg=”pink“

That's a good start!

I think I should not use split("<"): it doesn't let me calculate the end index, so I couldn't find the substring between < and >. Any other suggestions?

The output you posted above shows you have it 90% solved already. Just look at those split sections: if they begin with </ they are an end tag, if they begin with < they are a start tag, and the rest are values.
(Your parsing may not be 100% if the source HTML has weird line breaks, but maybe you can worry about that later.)
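
One possible way to sidestep the line-break concern, sketched under assumptions: join all lines into a single string before splitting, so a tag that spans two lines stays in one piece. The class and method names below are invented, and a StringReader stands in for the FileReader so the sketch runs without a file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class WholeFileSplitSketch {
    // Hypothetical helper: reads every line and joins them with single spaces.
    static String readAll(Reader in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(in)) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append(' ');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Hypothetical helper: labels each split section as start tag, end tag, or value.
    static List<String> classify(String joined) {
        List<String> out = new ArrayList<>();
        for (String piece : joined.split(">")) {
            int lt = piece.indexOf('<');
            // Text before the '<' (or the whole piece if there is none) is a value.
            String text = (lt == -1 ? piece : piece.substring(0, lt)).trim();
            if (!text.isEmpty()) {
                out.add("value: " + text);
            }
            if (lt != -1) {
                String tag = piece.substring(lt);
                out.add((tag.startsWith("</") ? "end tag: " : "start tag: ") + tag);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The opening div tag is deliberately broken across two lines.
        String html = "<div id=\"2\"\n     bg=\"black\">\n</div>";
        for (String s : classify(readAll(new StringReader(html)))) {
            System.out.println(s);
        }
    }
}
```

In the real program the StringReader would be replaced by a FileReader on the HTML file. Like the split approach itself, this still assumes no greater than symbol appears inside an attribute value.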

So I have made some good improvements. Here is the code:

import java.io.*;
import java.util.*;

class HtmlAnalyser {

    public static void main(String[] dinesh) {
        long start, end;
        String line;
        String[] tags;
        Map<String, Integer> TagCounter = new HashMap<String, Integer>();
        start = System.currentTimeMillis();
        Properties props = new Properties();
        try {
            FileInputStream fis = new FileInputStream("HtmlAnalyser.properties");
            props.load(fis);
        } catch (IOException e) {
            System.out.println(e);
        }
        File file = new File(props.getProperty("htmlPath"));
        try {
            BufferedReader br = new BufferedReader(new FileReader(file));
            String tagName = props.getProperty("tagName");
            while ((line = br.readLine()) != null) {
                if ((line.contains(tagName)) && !(line.startsWith("</"))) {
                    int startIndex = line.indexOf(tagName);
                    int endIndex = line.indexOf(">");
                    startIndex += tagName.length();
                    if (startIndex + 1 <= endIndex) {
                        System.out.println(line.substring(startIndex + 1, endIndex));
                    }
                }
                tags = line.split("[\\n ]+");
                for (String read : tags) {
                    if (read.startsWith("<") && !(read.startsWith("</"))) {
                        int startIndex = read.indexOf("<");
                        int endIndex = read.indexOf(">");
                        if (endIndex == -1) //The tag has attributes
                        {

                            Integer freq = TagCounter.get(read.substring(startIndex + 1));
                            TagCounter.put(read.substring(startIndex + 1), (freq == null) ? 1 : freq + 1);
                            System.out.println();

                        } else //It does'nt
                        {
                            Integer freq = TagCounter.get(read.substring(startIndex + 1, endIndex));
                            TagCounter.put(read.substring(startIndex + 1, endIndex), (freq == null) ? 1 : freq + 1);
                        }
                    }
                }
            }
            System.out.println(TagCounter);
            br.close();
        } catch (IOException e) {
            System.out.println(e);
        }

    }
}

After parsing this HTML file

<html>
<head>
<title> Sample 
</title>
</head>
<body bg=”pink“>
<div id="4" >
</div>
<div id="2" bg="black">
</div>
</body>
</html>

I have got this output

java HtmlAnalyser

id="2"

{body=1, title=1, div=2, html=1, head=1}

But it has to be

id="4"
id="2" bg="black"
{body=1, title=1, div=2, html=1, head=1}

I couldn't understand why this is going wrong. Any suggestions or corrections?
