Hello everyone,
What I am trying to do is get the following data from an HTML file:

Friday 31 - 1 - 2014

Created:    31/01/2014 2:32 PM
Updated:    31/01/2014 7:03 PM
Location:   37°35'13 S  145°7'41 E

Shift 143
Car 7008
Bus 280
564 / 9
Car 7008
Car 7011
Bus 361
572 / 20

The HTML file looks like this:

<html>
<head>
  <title>Evernote Export</title>
  <basefont face="Tahoma" size="2" />
  <meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
  <meta name="exporter-version" content="Evernote Windows/276152; Windows/6.3.9600;"/>
  <style>
    body, td {
      font-family: Tahoma;
      font-size: 10pt;
    }
  </style>
</head>
<body>
<a name="655"/>
<h1>Friday 31 - 1 - 2014</h1>
<div>
<table bgcolor="#D4DDE5" border="0">
<tr><td><b>Created:</b></td><td><i>31/01/2014 2:32 PM</i></td></tr>
<tr><td><b>Updated:</b></td><td><i>31/01/2014 7:03 PM</i></td></tr>
<tr><td><b>Location:</b></td><td><a href="http://maps.google.com/maps?z=6&q=-37.587000,145.128000"><i>37°35'13 S  145°7'41 E</i></a></td></tr>
</table>
</div>
<br/>

<div><span style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><div>Shift 143<br/>Car 7008<br/>Bus 280<br/>564 / 9<br/>Car 7008<br/>Car 7011<br/>Bus 361<br/>572 / 20<br/></div></span>
</div></body></html>

My code looks like this at the moment:

    Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
        If My.Computer.FileSystem.FileExists("H:\michael-documents\friday31.html") Then
            Dim str As String = My.Computer.FileSystem.ReadAllText("H:\michael-documents\friday31.html")

            ' Heading: the text between <h1> and </h1>.
            ' IndexOf returns a position, so the index variables are Integers.
            Dim Findstr As Integer = str.IndexOf("<h1>")
            Dim Findstr2 As Integer = str.IndexOf("</h1>")
            Dim Findstr3 As String = str.Substring(Findstr + 4, Findstr2 - Findstr - 4)

            ' "Created:" label: the text between the first <b> and </b>.
            Dim Bstr As Integer = str.IndexOf("<b>")
            Dim Bstr2 As Integer = str.IndexOf("</b>")
            Dim Bstr3 As String = str.Substring(Bstr + 3, Bstr2 - Bstr - 3)

            ' Created date and time: the text between the first <i> and </i>.
            Dim istr As Integer = str.IndexOf("<i>")
            Dim istr2 As Integer = str.IndexOf("</i>")
            Dim istr3 As String = str.Substring(istr + 3, istr2 - istr - 3)
            Dim Createdstr As String = Bstr3 & " " & istr3

            ' "Updated:" label, then the 18-character date that starts 24 characters after it.
            Dim B1str As Integer = str.IndexOf("Updated")
            Dim B1str3 As String = str.Substring(B1str, 8)
            Dim B1str4 As String = str.Substring(B1str + 24, 18)
            Dim Updatedstr As String = B1str3 & " " & B1str4

            ' "Location" label, then the 22-character coordinates that follow "<i>".
            Dim Locstr As Integer = str.IndexOf("Location")
            Dim Locstr2 As String = str.Substring(Locstr, 8)
            Dim Locstr3 As Integer = str.IndexOf("<i>37")
            Dim Locstr4 As String = str.Substring(Locstr3 + 3, 22)
            Dim Locationstr As String = Locstr2 & " " & Locstr4

            TextBox1.Text = Locstr3.ToString()
            TextBox2.Text = Locstr4

            ListBox1.Items.Add(Findstr3)
            ListBox1.Items.Add(Createdstr)
            ListBox1.Items.Add(Updatedstr)
            ListBox1.Items.Add(Locationstr)
        Else
            MsgBox("File not found.")
        End If
    End Sub

I think there has to be a better, simpler way of doing this.

Thanks for reading.
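
The repeated IndexOf/Substring pairs in the code above all do the same thing: take the text between an opening marker and the next closing marker. A minimal sketch of that idea as one helper, written in Python for brevity (the helper name `between` and the sample row are illustrative, not part of the original code):

```python
def between(text, start, end, from_index=0):
    """Return the text between the first `start` and the next `end`, or None."""
    begin = text.find(start, from_index)
    if begin == -1:
        return None
    begin += len(start)
    stop = text.find(end, begin)
    if stop == -1:
        return None
    return text[begin:stop]


# One table row from the exported note.
row = "<tr><td><b>Created:</b></td><td><i>31/01/2014 2:32 PM</i></td></tr>"
label = between(row, "<b>", "</b>")
value = between(row, "<i>", "</i>")
print(label, value)
```

Because the length is computed from the markers themselves, there are no hand-counted offsets such as 24 or 18 to keep in sync with the file.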


I don't know what happened there.

When I checked, it was a properly formatted message.

It looks like I will have to try again.

I tried to put my code into a code block, but it would not allow me to paste.

How do I paste code into my messages now? Clicking the "</>Code" button does not allow me to paste code.

When I click on </>code I am presented with a textbox where I can paste the code that I want to post. What happens when you do it?

I'm not sure what went wrong, but I attempted to edit the post, I copy/pasted it all into Notepad, and then copy/pasted it all back into the post editor. There must have been some type of funky character that broke the database that Notepad stripped away.


Oh goodness gracious! I resubmitted the post and it looked just fine, and now it's corrupted again. I'll investigate this bug within the next day :)

Thank you, I thought it was me. Hehe.

OK fixed. It got messed up by the little degree symbol in the Location. Our parser couldn't handle it.

Just for the record, what I saw in the original post was a few lines consisting of nothing but less-than and greater-than symbols. Nothing like what is there now.

While this has been happening, I have been doing some research
and found out about regular expressions.
I think they might be just what I want. I am learning about them at the moment.

If anybody has any ideas, I am always willing to learn.

' Regex needs "Imports System.Text.RegularExpressions" at the top of the file.
Private Sub Button2_Click(sender As Object, e As EventArgs) Handles Button2.Click
    ' Input.
    Dim html As String = My.Computer.FileSystem.ReadAllText("H:\michael-documents\friday31.html")

    ' Call Function.
    Dim tagless As String = StripTags(html)

    ' Write.
    TextBox3.Text = tagless

End Sub

Function StripTags(ByVal html As String) As String
    ' Remove HTML tags.
    Return Regex.Replace(html, "<.*?>", "")
End Function

I take it that this is fine for pasting code, because if I click on the "</>Code" button I cannot paste anything into it. When I right-click, there is no option to paste.

Every line that is indented by at least four spaces will be posted as code so you don't actually need the code tool to post code. When the cursor is in any part of a line, pressing the tab key will indent the entire line. Shift-tab will outdent it. Selecting multiple lines allows you to in/outdent multiple lines at once.

OK, this is my code so far:

        Private Sub Button2_Click(sender As Object, e As EventArgs) Handles Button2.Click
            ' Dim html As String = My.Computer.FileSystem.ReadAllText("H:\michael-documents\friday31.html")
            'use StreamReader to read file into textbox
            Dim SR As StreamReader = File.OpenText("H:\michael-documents\friday31.html")
            Do While SR.Peek() <> -1

                Dim LineofText As String = ""
                LineofText = LineofText & SR.ReadLine
                Dim Tagless As String = StripTags(LineofText)

                Dim regex As Regex = New Regex("F\w+4")
                Dim Match As Match = regex.Match(Tagless)
                If Match.Success Then
                    ListBox1.Items.Add(Match.Value)
                End If

            Loop

            SR.Close()
        End Sub

The problem is that nothing is happening. When I step through the code, the variable Tagless contains the text without the HTML tags, but regex.Match does not recognise the text. So what am I doing wrong?

I found my problem: I needed to replace the "4" with a "y".
That found the day, but by using "\d+4"
I have found all the text and numbers that I want.

Now all I have to do is find out how to put a "cr" after the shift number and the vehicle numbers.
Please refer to my first post.
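
For anyone wondering why "F\w+4" found nothing: \w matches letters, digits and underscores but not spaces, so the match cannot get past the space after "Friday" to reach a "4". A small sketch of that in Python, whose regex syntax agrees with .NET for these patterns (the third, whole-heading pattern is only an illustrative assumption about the heading layout):

```python
import re

heading = "Friday 31 - 1 - 2014"

# \w+ stops at the space after "Friday", so no "4" can follow it.
no_match = re.search(r"F\w+4", heading)
print(no_match)

# Ending the pattern on "y" matches the day name only.
day = re.search(r"F\w+y", heading)
print(day.group())

# Hypothetical pattern for the whole heading: a day name followed by
# three numbers separated by " - ".
whole = re.search(r"[A-Z]\w+day \d+ - \d+ - \d+", heading)
print(whole.group())
```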

Why are you not using Beautiful Soup to parse the data instead of writing your own parser?

I just had a search and found it.

  1. Beautiful Soup is for Python. I am using Windows and VB.NET.

  2. It gives me practice coding.

Thank you for your reply.

Beautiful Soup is for Python. I am using Windows and VB.NET.

You should find a parser for VB.NET, like Html Agility Pack.

I have been doing some research
and found out about regular expressions.

Regex is a bad tool for HTML; read this funny answer by bobince.
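
A small illustration of the point, using only the Python standard library (the sample markup and the helper names are made up for the demonstration): a tag-stripping regex such as "<.*?>" stops at the first ">" it sees, even when that ">" sits inside a quoted attribute, while a real parser does not.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects only the text nodes of a document."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_with_regex(markup):
    # Same pattern as the StripTags function earlier in the thread.
    return re.sub(r"<.*?>", "", markup)


def strip_with_parser(markup):
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.parts)


# Hypothetical markup with a ">" inside an attribute value.
sample = '<a title="next > page">Car 7008</a>'
print(repr(strip_with_regex(sample)))   # attribute debris is left behind
print(repr(strip_with_parser(sample)))  # only the link text remains
```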

It gives me practice coding.

Yes, and a lot of frustration if you are going to do this every time you want to parse stuff.
Beautiful Soup was mentioned, so I can spend a few minutes and make a demo.

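from bs4 import BeautifulSoup

# Read the exported note; the path is the one used in the original post.
with open(r'H:\michael-documents\friday31.html', encoding='utf-8') as f:
    html = f.read()

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(html, 'html.parser')
h1_tag = soup.find('h1')
table_tag = soup.find_all('table')
span_tag = soup.find_all('span', style=True)
span_tag = span_tag[0].find_all('div')

print(h1_tag.text)
for item in table_tag:
    print(item.text)
for item in span_tag:
    # Turn each <br/> into a line break, then drop the wrapping <div> tags.
    r = str(item).replace('<br/>', '\n')
    print(r.replace('<div>', '').replace('</div>', '').strip())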

Output:

Friday 31 - 1 - 2014

Created:31/01/2014 2:32 PM
Updated:31/01/2014 7:03 PM
Location:37°35'13 S  145°7'41 E

Shift 143
Car 7008
Bus 280
564 / 9
Car 7008
Car 7011
Bus 361
572 / 20
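
If installing Beautiful Soup is not an option, roughly the same result can be sketched with Python's built-in html.parser. This is only an illustration under assumptions: the class name `LineCollector`, the choice of which tags end a line, and the shortened sample markup are all made up here, not taken from the demo above.

```python
from html.parser import HTMLParser


class LineCollector(HTMLParser):
    """Collects body text, starting a new line at <br> and at block ends."""

    def __init__(self):
        super().__init__()
        self.lines = [""]
        self.in_head = False  # text inside <head> (title, CSS) is skipped

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "br":
            self.lines.append("")

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False
        elif tag == "td":
            self.lines[-1] += " "  # keep label and value apart
        elif tag in ("h1", "tr", "div"):
            self.lines.append("")

    def handle_data(self, data):
        if not self.in_head:
            self.lines[-1] += data


def extract_lines(markup):
    collector = LineCollector()
    collector.feed(markup)
    collector.close()
    return [line.strip() for line in collector.lines if line.strip()]


# Shortened version of the exported note.
sample = (
    "<head><title>Evernote Export</title></head>"
    "<h1>Friday 31 - 1 - 2014</h1>"
    "<table><tr><td><b>Created:</b></td>"
    "<td><i>31/01/2014 2:32 PM</i></td></tr></table>"
    "<div>Shift 143<br/>Car 7008<br/>Bus 280<br/></div>"
)
for line in extract_lines(sample):
    print(line)
```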


Cool!

You mean that I can import this program into VB.NET?

Pray tell, how do I do that?

Many thanks,

Michael

This could also be done using just VB.NET and regex. I create a lookup list first to filter out the lines that get used, then clean it all up.

Imports System.IO
Imports System.Text.RegularExpressions
Public Class Form1
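    ' Path of the exported note (assumed here; taken from the original post).
    Dim mpath As String = "H:\michael-documents\friday31.html"
    Dim lookList As New List(Of String)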
    Dim indexlist As New List(Of Integer)
    Dim found As Integer = 0
    Private Sub Form1_Load(sender As Object, e As EventArgs) Handles MyBase.Load
        If My.Computer.FileSystem.FileExists(mpath) Then
            Dim readText() As String = File.ReadAllLines(mpath)
            For Each lin As String In readText
                TextBox1.Text = TextBox1.Text & vbCrLf & lin
            Next
        End If
        lookList.Add("Monday")
        lookList.Add("Tuesday")
        lookList.Add("Wednesday")
        lookList.Add("Thursday")
        lookList.Add("Friday")
        lookList.Add("Saturday")
        lookList.Add("Sunday")
        lookList.Add("Created:")
        lookList.Add("Updated:")
        lookList.Add("Location:")
        lookList.Add("Shift")
        lookList.Add("Car")
        lookList.Add("Bus")
        lookList.Add("Created:")
        'Apply filter
        Dim str As String = ""
        For Each words As String In lookList
            For Each lin As String In TextBox1.Text.Split(vbNewLine)
                If lin.Contains(words) Then
                    str = str & vbCrLf & lin
                End If
            Next
        Next
        str = str.Replace("�", "°")
        TextBox1.Text = str
    End Sub
    Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
        Application.Exit()
    End Sub
    Function StripTags(ByVal html As String) As String
        ' Remove HTML tags.
        Return Regex.Replace(html, "<.*?>", " ")
    End Function
    Private Sub Button5_Click(sender As Object, e As EventArgs) Handles Button5.Click
        Dim str As String = Trim(TextBox1.Text)
        str = str.Replace("<br/>", vbCrLf)
        TextBox1.Text = str
        'Remove tags
        Dim html As String = TextBox1.Text
        Dim Tagless As String = StripTags(html)
        TextBox1.Text = Tagless
        'Move to Listbox
        str = Trim(TextBox1.Text)
        str = str.Replace(" ", "")
        TextBox1.Text = str
        '-------------
        Dim match As String = (vbLf & vbCr)
        Dim lineIndex As Integer = 0
        For Each lin As String In TextBox1.Text.Split(vbNewLine)
            If Len(lin) < 3 Then
                indexlist.Add(lineIndex)
            End If
            lineIndex += 1
        Next
        found = indexlist(1) + 1
        '-------------
        lineIndex = 0
        For Each lin As String In TextBox1.Text.Split(vbNewLine)
            lineIndex += 1
            If lineIndex <= found Then
                ListBox1.Items.Add(lin)
            End If
        Next
        ListBox1.Items.RemoveAt(0)
        TextBox1.Text = ""
    End Sub
End Class
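
The lookup-list idea can be shown in a few lines. The sketch below is in Python and is not a translation of the class above: it keeps the lines in file order rather than keyword order, and the function name `filter_lines` and the sample lines are illustrative assumptions.

```python
# Keywords from the lookup list in the code above.
KEYWORDS = (
    "Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
    "Saturday", "Sunday", "Created:", "Updated:", "Location:",
    "Shift", "Car", "Bus",
)


def filter_lines(lines, keywords=KEYWORDS):
    """Keep the lines that contain at least one keyword, in file order."""
    return [line for line in lines if any(word in line for word in keywords)]


# A few lines in the style of the exported note.
sample_lines = [
    "<html>",
    "<h1>Friday 31 - 1 - 2014</h1>",
    "<br/>",
    "<div>Shift 143<br/>Car 7008</div>",
]
kept = filter_lines(sample_lines)
print(kept)
```

Looping over the lines on the outside and the keywords on the inside means a line that contains two keywords is kept once, not twice.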

Thanks to Minimallst, this is going great.
I have added more entries with lookList.Add.
It was going great until I hit a snag: the "&" is causing an "index out of range" error.
The file in question is:

<html>
<head>
  <title>Evernote Export</title>
  <basefont face="Tahoma" size="2" />
  <meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
  <meta name="exporter-version" content="Evernote Windows/276152; Windows/6.3.9600;"/>
  <style>
    body, td {
  font-family: Tahoma;
  font-size: 10pt;
}
  </style>
</head>
<body>
<a name="1147"/>
<h1>Tuesday 2012-07-31</h1>
<div>
<table bgcolor="#D4DDE5" border="0">
<tr><td><b>Created:</b></td><td><i>23/08/2013 10:19 AM</i></td></tr>
<tr><td><b>Location:</b></td><td><a href="http://maps.google.com/maps?z=6&q=-37.652400,145.021000"><i>37°39'9 S  145°1'16 E</i></a></td></tr>
</table>
</div>
<br/>

<div><span style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><div>V/line <br/>Vechile 126 & 353<br/>Start 1430<br/>Finish 2000</div></span>
</div></body></html>

I am going to have to look into how to remove "amp;" with regex.
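
If the aim is to turn "&amp;" back into "&", entity decoding is usually a better fit than a regex. A minimal sketch in Python using the standard library (the sample string is adapted from the file above; in .NET, System.Net.WebUtility.HtmlDecode does the equivalent job):

```python
import html

# Text as it would appear in the raw HTML source.
raw = "Vechile 126 &amp; 353"
decoded = html.unescape(raw)
print(decoded)
```

Decoding also covers other entities, such as "&lt;" and "&gt;", without a separate rule for each one.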


Since you have a few indexes (looklist.index, listbox1.index and the one from string.split), you need to be more specific about which index is going out of range and show the code where the error occurs. Otherwise we can't even guess what is happening.

Assuming you added "Vehicle" to the list, you can replace the code between the two '------------- lines with this snippet.

 '-------------

        Dim lineIndex As Integer = 0
        For Each lin As String In TextBox1.Text.Split(vbNewLine)
            If Len(lin) < 3 Then
                indexlist.Add(lineIndex)
            End If
            lineIndex += 1
        Next
        If indexlist.Count > 1 Then
            found = indexlist(1) + 1
        Else
            found = 100
        End If
        '-------------

It works, but now I get 4 sets of text.

The old way

 found = indexlist(1)

works for every file that I have tried except "tuesday2012-07-31".
Why?


How very true... hehe

I will have to figure out a way to get around this. Checking for a CRLF at the end of the file is one way to go.

    'Enter into an Array
    Dim ArrayCount As Integer = 0
    For Each words As String In lookList
        For Each StrLine As String In TextBox1.Text.Split(vbNewLine)
            If StrLine.Contains(words) Then
                TextArray(ArrayCount) =  & vbCrLf & StrLine
                ArrayCount = ArrayCount + 1
            End If
        Next
    Next


    'Apply filter
    Dim str As String = ""
    For Each words As String In lookList
        For Each lin As String In TextBox1.Text.Split(vbNewLine)
            If lin.Contains(words) Then
                'str = str & vbCrLf & lin
            End If
        Next
    Next

I am thinking that if I put the file into an array, I can check whether the last line has a vbCrLf and, if not, add one.

I am having a problem: I keep getting the error "Property 'Chars' is ReadOnly" on the TextArray line.
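
That error usually appears when an index assignment is made on a plain String rather than on an array or list, so it may be that TextArray was declared as a String; this is an assumption, because its declaration is not shown. The idea of normalising the end of the file before splitting it into a list can be sketched as follows, in Python for brevity (both function names are illustrative):

```python
def ensure_trailing_newline(text, newline="\r\n"):
    """Return text unchanged if it already ends with newline, else fix the ending."""
    if text.endswith(newline):
        return text
    # Remove any stray trailing CR or LF characters, then add one CRLF.
    return text.rstrip("\r\n") + newline


def to_lines(text):
    """Split text into a list of lines; CRLF, LF and CR endings are all accepted."""
    return text.splitlines()


# The last lines of the Tuesday file, which has no final line break.
body = "Start 1430\r\nFinish 2000"
print(repr(ensure_trailing_newline(body)))
print(to_lines(body))
```

Once the text is held as a list of lines, a missing final line break no longer affects the count of lines or blank-line positions.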
