# Tutorial - Content extraction using Apache Tika

From the official website:

> The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.

In this tutorial we will try and implement the four most important features of Apache Tika (as of version 1.14).

## Table of contents

1. Is this tutorial for me?
1. Requirements
1. How can I detect a file's type?
1. …


I am trying to use Xerces on Ubuntu 13.10. It is installed, and I can see the files in the usr folder, but I have had no luck including it in Eclipse CDT. I found this thread, "[Click Here](http://www.daniweb.com/hardware-and-software/linux-and-unix/threads/409769/ubuntu-11.10-xerces-c)", but it is dead and the answer is not clear to me. Could anyone help me?


What's a better way to do this? I need to sift through this file and store the following values Cpl File name Content kind Package type Encryption status Container File size Duration Timed text/png Number of audio channels 2d/3d Fps 1) I'm not sure if I should just store these as lists? 2) Should I use dictionaries ? Here's one sample file. [Click Here](http://tny.cz/aeca9b6c) And here's how I was attempting to sort through the file, with not as much luck as I would like. import glob import os import sys print(os.getcwd()) dir=raw_input(["Please enter directory location of dcp_inspect output"]) print (dir) …
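On the lists-versus-dictionaries question, a dictionary keyed by field name is often easier to work with than parallel lists, because each value is looked up by its label. The sketch below is in Python 3 and assumes, purely for illustration, that each report line looks like `Label: value`. The real dcp_inspect output may be laid out differently, and `parse_report`, `WANTED`, and the sample text are hypothetical.

```python
# Hypothetical subset of the labels listed in the post.
WANTED = ["Content kind", "Package type", "Duration", "Fps"]


def parse_report(text, wanted):
    """Collect 'Label: value' lines whose label is in `wanted` into a dict.

    ASSUMPTION: one 'Label: value' pair per line; adjust the split logic
    to whatever the real report looks like.
    """
    by_lower = {name.lower(): name for name in wanted}
    result = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        key = label.strip().lower()
        if sep and key in by_lower:
            result[by_lower[key]] = value.strip()
    return result


# Made-up sample text, only to exercise the function.
sample = "Content kind: feature\nFps: 24\nSomething else: ignored\n"
print(parse_report(sample, WANTED))
```

One dictionary per inspected file, collected in a list, keeps the values for each file together and makes missing fields easy to spot with `dict.get`.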


Hi, I tried parsing a multi-record genbank file (from this site: http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.gbk) using the code below. The code returned an error: readline() on unopened filehandle at parser.pl line 62. The code: #!/usr/local/bin/perl -w use strict; my $record; print "Please type in the name of a file\n"; my $file = <STDIN>; chomp $file; while( $record = get_next_record($file) ) { my ($annotation, $seq) = get_dna ($record); open my $fh, '>', 'oufile.txt', or die "cant't open outfile:$!"; print "Sequence:\n\n", $seq, "\n"; close $fh or die "cant't open outfile:$!"; } sub get_dna { my ($file) = @_; my @annotation = (); my $seq = …
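The "readline() on unopened filehandle" message usually means a read was attempted on something that was never opened. In the visible part of the code, the file name string is passed to `get_next_record` without an `open`, which may be the cause, though the truncated part cannot be checked. As a language-neutral illustration of the record-splitting logic, here is a minimal Python 3 sketch. It relies on two facts about GenBank flat files: each record ends with a line containing only `//`, and the sequence follows the `ORIGIN` line. The function names and the tiny sample are hypothetical.

```python
def split_records(text):
    """Split multi-record GenBank text into records (terminator '//' dropped)."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "//":
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    return records


def split_annotation_and_sequence(record):
    """Return (annotation, sequence) for one record.

    Everything before the ORIGIN line is treated as annotation; letters on
    the lines after it are joined into the sequence (numbers and spaces are
    discarded).
    """
    annotation, _, origin = record.partition("\nORIGIN")
    seq_lines = origin.splitlines()[1:]  # skip the rest of the ORIGIN line
    seq = "".join(ch for line in seq_lines for ch in line if ch.isalpha())
    return annotation, seq


# Made-up two-record sample, only to exercise the functions.
SAMPLE = (
    "LOCUS       TEST1\nORIGIN\n        1 acgt acgt\n//\n"
    "LOCUS       TEST2\nORIGIN\n        1 ttga\n//\n"
)

for rec in split_records(SAMPLE):
    annotation, seq = split_annotation_and_sequence(rec)
    print(annotation, seq)
```

In real use the text would come from a file opened once, before the loop, for example `open(path).read()`.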


Can someone help me with solving boolean expressions with the help of forward chaining? A good tutorial will also help me.

Example: A.(A + B) = A
A.(A + B) => A.A + A.B [Applying distributive law]
A.A + A.B => A + A.B [Applying idempotency law]
A + A.B => A.(1 + B)
A.(1 + B) => A.(1) => A

I have made huge efforts but am still unable to do this. The procedure would require parsing the boolean expression and then recursive rule checking. I was thinking about creating a binary tree of the expression and then doing …
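The binary-tree idea can be sketched as a small rewriting loop: simplify the children first, then keep applying rules at the root until none fires. The Python 3 sketch below makes several assumptions. Expressions are already parsed into nested tuples such as `("and", "A", ("or", "A", "B"))`, only three rules are included, and rules match only in the orientation written (no commutativity). The absorption rule `A + A.B => A` stands in for the last two steps of the derivation above. All names are hypothetical.

```python
def rewrite_once(expr):
    """Try each rule at the root; return (new_expr, rule_name) or None."""
    if not isinstance(expr, tuple):
        return None  # a bare variable such as "A"
    op, left, right = expr
    if left == right:
        return left, "idempotency"  # X.X => X and X+X => X
    if op == "and" and isinstance(right, tuple) and right[0] == "or":
        # X.(Y+Z) => X.Y + X.Z
        return ("or", ("and", left, right[1]), ("and", left, right[2])), "distributive"
    if op == "or" and isinstance(right, tuple) and right[0] == "and" and right[1] == left:
        return left, "absorption"  # X + X.Y => X
    return None


def simplify(expr, trace=None):
    """Simplify children first, then apply rules at the root until none fires."""
    if isinstance(expr, tuple):
        expr = (expr[0], simplify(expr[1], trace), simplify(expr[2], trace))
    fired = rewrite_once(expr)
    if fired is None:
        return expr
    new_expr, rule = fired
    if trace is not None:
        trace.append((rule, new_expr))  # record which rule produced what
    return simplify(new_expr, trace)


steps = []
result = simplify(("and", "A", ("or", "A", "B")), steps)
print(result)
for rule, expr in steps:
    print(rule, expr)
```

A fuller system would add the constant rules (`X.1 => X`, `1 + X => 1`), try both operand orders, and guard against rules that undo each other, such as distribution and factoring.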


(edit) This is solved; it was a Unicode issue. Hi, I'm hoping someone has used the library [pugixml](http://pugixml.org/). I'm just trying to use a simple example provided, but I'm not getting the result I expect. int _tmain(int argc, _TCHAR* argv[]) { pugi::xml_document doc; pugi::xml_parse_result result = doc.load_file("tree.xml"); //pugi::char_t * c = "Fail\0"; // used with as_string() method std::cout << "Load result: " << result.description() // output = "Load result: No Error " << ", mesh name: " << doc.child("mesh").attribute("name").value() // output = ", mesh name: " << std::endl; return 0; } I was expecting "Load result: No Error, mesh name: mesh_root" …


I am going to create a graphical interface to the intermediate code generated by gcc. The output from gcc looks like ;; Function main (main, funcdef_no=1, decl_uid=2162, cgraph_uid=1) main () { int i; int c[10]; int b; int a; int D.2177; <bb 2>: a = 1; b = 20; if (a < b) goto <bb 3>; else goto <bb 7>; <bb 3>: i = 0; goto <bb 5>; <bb 4>: c[i] = 1; i = i + 1; ... <L9>: return D.2177; } and I want to parse it using **python** and to have a graphical interface as graphviz's **.dot language …
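A possible starting point is to pull the basic-block labels and `goto` targets out with regular expressions and emit one DOT edge per jump. The Python 3 sketch below assumes the dump has one statement per line, as gcc normally writes it, and it only follows explicit `goto <bb N>` jumps. Fall-through edges and `<L9>`-style labels are ignored. The names `cfg_edges` and `to_dot` are hypothetical.

```python
import re

BLOCK_RE = re.compile(r"^\s*<bb (\d+)>:")  # a block header such as "<bb 2>:"
GOTO_RE = re.compile(r"goto <bb (\d+)>")   # a jump target such as "goto <bb 3>"

# Fragment of the dump from the post, laid out one statement per line.
SAMPLE = """\
<bb 2>:
a = 1;
b = 20;
if (a < b) goto <bb 3>; else goto <bb 7>;
<bb 3>:
i = 0;
goto <bb 5>;
"""


def cfg_edges(dump_text):
    """Return (source_block, target_block) pairs for every explicit goto."""
    edges = []
    current = None
    for line in dump_text.splitlines():
        header = BLOCK_RE.match(line)
        if header:
            current = int(header.group(1))
            continue
        if current is not None:
            for target in GOTO_RE.findall(line):
                edges.append((current, int(target)))
    return edges


def to_dot(edges):
    """Render the edge list as a Graphviz DOT digraph."""
    lines = ["digraph cfg {"]
    lines += [f"  bb{src} -> bb{dst};" for src, dst in edges]
    lines.append("}")
    return "\n".join(lines)


print(to_dot(cfg_edges(SAMPLE)))
```

The resulting text can be saved to a file and rendered with the Graphviz `dot` tool.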


Dear forum members, I am a total noob in Visual Basic programming, but I want to make a program that opens an ASCII file that I choose and reads the file line by line. The ASCII file looks as follows:

0.0000 Start of measurement
time id rx dlc b0 b1 b2 b3 b4 b5 b6 b7
0.3480 C010327x Rx d 8 CF FF FF FF FF FF FF FF
0.3480 18FEF027x Rx d 8 FF FF FF FF FF F0 CC FF
0.3490 18EF4E27x Rx d 8 00 00 00 FF FF FF FF FF
0.3500 18FEF117x Rx d 8 …
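The per-line logic is the same in any language: split the line on whitespace, skip lines that are not frames, then convert the fields. Here is a minimal sketch in Python 3 rather than Visual Basic. It assumes the column order shown in the sample (time, id, `Rx`, `d`, byte count, then the data bytes in hex), and `parse_frame` is a hypothetical name.

```python
def parse_frame(line):
    """Parse one trace line into a dict, or return None for non-frame lines.

    ASSUMPTION: columns are time, id, 'Rx', 'd', dlc, then dlc hex bytes,
    as in the sample shown in the post.
    """
    parts = line.split()
    if len(parts) < 5 or parts[2] != "Rx":
        return None  # header lines and the "Start of measurement" line
    try:
        time = float(parts[0])
        dlc = int(parts[4])
        data = [int(byte, 16) for byte in parts[5:5 + dlc]]
    except ValueError:
        return None  # a field was not numeric
    return {"time": time, "id": parts[1], "dlc": dlc, "data": data}


sample_lines = [
    "0.0000 Start of measurement",
    "time id rx dlc b0 b1 b2 b3 b4 b5 b6 b7",
    "0.3480 C010327x Rx d 8 CF FF FF FF FF FF FF FF",
]
for text in sample_lines:
    print(parse_frame(text))
```

In Visual Basic the same steps map onto reading each line, calling `Split` on it, and converting the pieces with `Convert.ToInt32(value, 16)` for the hex bytes.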


I am using a JSON function to parse a link and return a JSON object. The code is below: public JSONObject getJSONFromUrl(String url) { // Making HTTP request try { // defaultHttpClient DefaultHttpClient httpClient = new DefaultHttpClient(); HttpPost httpPost = new HttpPost(url); Log.v("url", " " + url); HttpResponse httpResponse = httpClient.execute(httpPost); HttpEntity httpEntity = httpResponse.getEntity(); is = httpEntity.getContent(); } catch (UnsupportedEncodingException e) { e.printStackTrace(); } catch (ClientProtocolException e) { e.printStackTrace(); } catch (IOException e) { e.printStackTrace(); } try { BufferedReader reader = new BufferedReader(new InputStreamReader( is, "iso-8859-1"), 8); StringBuilder sb = new StringBuilder(); String line = null; while ((line …


I am developing a program that gets the HTML source code of certain web pages on a website. I already developed one program that does so; here's the sample code: Dim request As System.Net.HttpWebRequest = System.Net.HttpWebRequest.Create(TextBox2.Text) Dim response As System.Net.HttpWebResponse = request.GetResponse() Dim sr As System.IO.StreamReader = New System.IO.StreamReader(response.GetResponseStream()) Dim sourcecode As String = sr.ReadToEnd() TextBox1.Text = sourcecode Recently, I found out that I could do the same using Sockets. This time I want to parse the HTML of those web pages SIMULTANEOUSLY. I tried parsing simultaneously in my previous program using multithreading, but my bandwidth keeps decreasing as threads increase …


Here is an amazing article. Peter Norvig does not let a small detail, that the language was never implemented in its time, keep him from debugging code that was written without an implementation; he single-handedly implements it with a Python parser. Frustration warning: this is frustratingly amazing stuff. [Prescient but Not Perfect: A Look Back at a 1966 Scientific American Article on Systems Analysis By Peter Norvig | August 23, 2011](http://blogs.scientificamerican.com/at-scientific-american/2011/08/23/systems-analysis-look-back-1966-scientific-american-article/)


I wrote an XML parser that works great to fit my needs, but I can't retrieve the root attribute nodes no matter what I try! So far I have the following code that works great to retrieve the child tag values: import xml.dom.minidom def parse(filename): xmlDoc = xml.dom.minidom.parse(filename) tag = xmlDoc.getElementsByTagName('date_time')[0].childNodes[0].nodeValue return tag Here is the XML data I'm parsing: <?xml version="1.0"?> <order type="buy" subtype="limit" orderID="7659"> <date_time>03:23:2012:11:50.35</date_time> <userID>Miss.Wanda.Sleeplate</userID> <stock_symbol>BROK</stock_symbol> <shares_ammount>50</shares_ammount> <limit>23.87</limit> </order> What I'm trying to get is the value of the attributes of the root **order** tag (**type, subtype, orderID**) Something like: node = tag.attributes['type'].value print node **>>> buy** …
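For the root attributes, `xml.dom.minidom` exposes the root element as `documentElement`, and element attributes can be read with `getAttribute`. A minimal sketch, written for Python 3 and using a shortened copy of the XML from the post (`root_attributes` is a hypothetical helper name):

```python
import xml.dom.minidom

# Shortened copy of the XML shown in the post.
SAMPLE = """<?xml version="1.0"?>
<order type="buy" subtype="limit" orderID="7659">
  <date_time>03:23:2012:11:50.35</date_time>
</order>"""


def root_attributes(xml_text):
    """Return the attributes of the document's root element as a dict."""
    doc = xml.dom.minidom.parseString(xml_text)
    root = doc.documentElement  # the <order> element
    return {name: root.getAttribute(name) for name in root.attributes.keys()}


attrs = root_attributes(SAMPLE)
print(attrs["type"], attrs["subtype"], attrs["orderID"])
```

When reading from a file as in the original function, `xml.dom.minidom.parse(filename).documentElement` gives the same root element. The attempt in the post probably fails because `tag` there holds the text of `date_time`, a string, rather than an element.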


Hey all, I have a problem reading an XML file using XmlReader. I have this code: [CODE] Dim xCount As Integer = 0 Using xReader As XmlReader = XmlReader.Create("tekno.xml") TreeView1.Nodes.Clear() Do While xReader.Read() With TreeView1 .Nodes.Add(xReader.Item("title")) .Nodes(xCount).Nodes.Add("(link: " & xReader.Item("link") & ")") .Nodes(xCount).Nodes.Add("Date: " & xReader.Item("pubDate")) .Nodes(xCount).Nodes.Add(xReader.Item("description")) xCount = xCount + 1 End With Loop TreeView1.ExpandAll() End Using [/CODE] And the XML file, named ‘tekno.xml’: [CODE] <?xml version="1.0" encoding="iso-8859-1"?> <rss version="2.0"> <channel> <item> <title>Tablet Windows Tak Akan Dukung Flash</title> <link>http://tekno.kompas.com/read/2011/09/20/11432939/Tablet.Windows.Tak.Akan.Dukung.Flash</link> <pubDate>Tue , 20 Sep 2011 11:43:29 +0000</pubDate><description>&lt;img src=&quot;http://stat.k.kidsklik.com/data/photo/2011/09/16/1430299t.jpg&quot; align=&quot;left&quot; hspace=&quot;7&quot; width=&quot;120&quot; height=&quot;90&quot;&gt;Pengguna perangkat tablet yang menggunakan Windows 8 bakal …


Hi guys, I'm kind of new to XML. I have an XML file in the following format, with 592 records: [CODE]<?xml version="1.0" encoding="ISO-8859-1"?> <data> <item> <buildingName>Chateau</buildingName> <address>81 North Broadway, White Plains, NY</address> <mail>I</mail> <laundry>I</laundry> <garbage>I</garbage> <parking>I</parking> <rating>No Access</rating> <municipality>White Plains</municipality> </item> <item> <buildingName>44 Franklin Avenue</buildingName> <address>44 Franklin Avenue, New Rochelle, NY</address> <mail>A</mail> <laundry>A</laundry> <garbage>A</garbage> <parking>A</parking> <rating>Good Access</rating> <municipality>New Rochelle</municipality> </item> </data>[/CODE] I'm using the following code in order to read all the addresses: [CODE] var xml = GXml.parse(data); //this function is from google map API var addresses =xml.documentElement.getElementsByTagName("address"); for (var i = 0; i < addresses.length; i++) { showAddress(addresses[i].childNodes[0].nodeValue); } …


Hello, I'm trying to convert a double (3.16) and take its integer part (3). However, I'm encountering problems doing this. Can someone please help me out? Here is my code: [CODE]main: li $v0, 7 #asking the user for a double syscall mov.d $f2, $f0 cvt.s.d $f4, $f2 #converting from double to single precision. I have also tried to convert to word li $v0, 3 #print the value mov.d $f12, $f4 #move da converted value (ie.from 3.1 to 3) to f12 syscall[/CODE] All I want is to convert 3.16 to 3 so I can then only have the floating …

