I'd like to be able to make those boolean. Suggestions?
So I feel rather foolish for this, but I haven't used the command line often. I'm using sys.argv to gather arguments from the command line, and I'm saving one in particular to a variable called test. It is supposed to be a boolean; however, passing in True or False still causes if(test): to evaluate as true. It's something simple, I'm sure. Ideas?
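For context, everything in sys.argv arrives as a string, and any non-empty string (including "False") is truthy. A minimal sketch of converting the argument explicitly (the helper name str_to_bool is just for illustration):

```python
import sys

def str_to_bool(value):
    """Convert a command-line string like 'True'/'False' to a real boolean."""
    if value == "True":
        return True
    if value == "False":
        return False
    raise ValueError("expected 'True' or 'False', got %r" % value)

# Command-line arguments are always strings, so 'False' is a
# non-empty string and therefore truthy:
flag_was_truthy = bool("False")

# Converting explicitly gives the intended boolean:
test = str_to_bool("False")
```

In a real script you would call str_to_bool(sys.argv[1]) instead of hard-coding the string.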
I am attempting to encode using a module called Beautiful Soup. All I need is some direction on solving the problem. The encoding maps to <undefined>, so the Unicode is not defined within the charmap.
The error I get is: UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-34: character maps to <undefined>
The sequence being encoded is: u'\u0411\u044a\u043b\u0433\u0430\u0440\u0441\u043a\u0438 \u043f\u0440\u0435\u0432\u043e\u0434 \u043d\u0430 \u0440\u0430\u0437\u0433\u043b\u0435\u0436\u0434\u0430\u0447\u0430 \u041c\u043e\u0437\u0438\u043b\u043b\u0430.'
The test should be: Български превод на разглеждача Мозилла.
The code being used is pieced together below. Without an understanding of BeautifulSoup it may not make much sense; however, the encoding error above is where I need help:
#parses the long name for a project from index page
def parse_project_longname(html):
    p = re.compile('>Name: <strong>.+?</strong>')
    results = p.findall(html)
    if results:
        name = results[0]
        name = name[15:len(name) - 9]
        name = BeautifulSoup(name, convertEntities=BeautifulSoup.HTML_ENTITIES)
        name = name.contents[0]
    else:
        name = None
    return name
def test():
    utils = FLOSSmoleutils('dbInfoTest.txt')
    select = 'SELECT project_name, indexhtml FROM sv_project_indexes WHERE datasource_id=2'
    utils.cursor.execute(select)
    results = utils.cursor.fetchall()
    for result in results:
        name = result[0]
        html = result[1]
        print("Name: " + name)
        id = SavannahParsers.parse_project_longname(html)
        print(id)

test()
Any help or direction would be appreciated. Thank you.
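For context, this error typically appears when Unicode text is encoded (often implicitly, when printing) with a charmap codec such as Windows cp1252, which has no mapping for Cyrillic characters. A minimal illustration, making no assumptions about the original environment:

```python
# -*- coding: utf-8 -*-
text = u'\u0411\u044a\u043b\u0433\u0430\u0440\u0441\u043a\u0438'  # 'Български'

# Encoding to a charmap codec such as cp1252 raises UnicodeEncodeError,
# because the codec has no entry for Cyrillic code points:
try:
    text.encode('cp1252')
    raised = False
except UnicodeEncodeError:
    raised = True

# Encoding to UTF-8 succeeds, since UTF-8 can represent any code point:
utf8_bytes = text.encode('utf-8')
```

Whether the fix is to encode explicitly to UTF-8 or to change the output stream's encoding depends on where the implicit encode happens in the original program.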
Actually, it was not a print error. The cursor.execute was running, but failing out. However, it was difficult to notice because the failure was due to a duplication. I had to print traceback.format_exc() to find the error. It's working fine now. Thanks for the help.
I'm running a portion of code in a loop. It accesses a global variable only to print it, and that variable is never changed after it is set in the __init__. However, the first print statement will occasionally fail, but the second one does not. Any ideas as to why this is happening? The method is below:
EDIT: it's not the print statement that is failing.
Sometimes the except clause runs for no apparent reason, even after the self.cursor.execute completes.
def db_insert(self, query_string, *params):
    try:
        self.cursor.execute(query_string, params)
        print("Inserted into: " + self.database + ".")
    except:
        print("!!!!WARNING!!!! Insertion into " + self.database + " failed.\n")
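A bare except like the one above hides the actual cause of the failure. A sketch of a variant that reports the full traceback instead (the function and stand-in cursor names are hypothetical, for illustration only):

```python
import traceback

def insert_with_diagnostics(cursor, query_string, params):
    """Hypothetical variant of db_insert: instead of a bare except
    that swallows the cause, report the full traceback."""
    try:
        cursor.execute(query_string, params)
        return True
    except Exception:
        # traceback.format_exc() returns the same text the interpreter
        # would have printed, including the exception type and message.
        print(traceback.format_exc())
        return False

# A stand-in cursor that always fails, to demonstrate the behaviour:
class FailingCursor(object):
    def execute(self, query, params):
        raise ValueError("Duplicate entry")

ok = insert_with_diagnostics(FailingCursor(), "INSERT ...", ())
```

With a real database cursor, a duplicate-key failure would show up in the printed traceback rather than disappearing behind a generic warning.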
I'm not sure of the terminology you've used, as I know nothing about MySQL. I was commissioned to write Python code, but I'm helping a small bit with the MySQL planning just as a helpful hand. I know that the way they run the process is multiple machines accessing the same database over the internet. Due to this, any sort of update that labelled the row "in progress" is out of the question. Mere milliseconds is just enough time for more than one machine to try to access the same row, creating confusion and chaos. I don't know if your ideas fall into that category, but it sounded like they might. I wish I could understand the speak a bit better. My apologies.
Problem solved. I needed to compile the pattern with the re.DOTALL flag.
I'm attempting to search a string for a certain sequence. I have not done anything with re, and it is... a bit confusing.
I'm looking for this string:
<div class="indexcenter"> (there's a portion of text here using newlines and characters) <!-- end indexcenter -->
I am thinking something along these lines:
re.findall('<div class=\"indexcenter\">.+<!-- end indexcenter -->',html)
suggestions?
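For context, by default `.` in a regex does not match newlines, so a pattern like the one above fails as soon as the div's content spans multiple lines. A minimal sketch of the difference re.DOTALL makes, with a non-greedy `.+?` so the match stops at the first closing comment:

```python
import re

html = ('<div class="indexcenter">\n'
        'some text here\n'
        'more text\n'
        '<!-- end indexcenter -->')

# Without re.DOTALL, '.' does not match newlines, so nothing is found:
without_flag = re.findall('<div class="indexcenter">.+?<!-- end indexcenter -->',
                          html)

# With re.DOTALL, '.' matches newlines too; the non-greedy .+? stops at
# the first closing comment instead of swallowing the rest of the page:
with_flag = re.findall('<div class="indexcenter">.+?<!-- end indexcenter -->',
                       html, re.DOTALL)
```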
I am using MySQL to develop a database collected from internet sources. Python is used for the spider coding, but I'm having issues with the MySQL portion. I'm pretty new to MySQL, so I'm a bit lost.
I need a way to lock a single row of a table and then unlock it. We have multiple machines accessing a single job table and picking the jobs labelled pending. Unfortunately, without locking and unlocking, the machines will, within milliseconds, pick the same jobs and screw up the whole process. We are currently locking the whole table, selecting a job, writing its status as in progress, then unlocking the table. That wastes time. Any ideas on how to make this more efficient? I was thinking: lock the single row being accessed, so the other rows could still be accessed by the other machines. I do not, however, have any clue how to do that. Thanks in advance.
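One common approach, assuming the table uses the InnoDB engine (MyISAM does not support row-level locks): inside a transaction, SELECT ... FOR UPDATE locks only the matching rows, so other connections can claim different rows concurrently. A sketch of the statement sequence, with a hypothetical jobs(id, status) table and a stand-in cursor so it runs without a real database:

```python
def claim_pending_job(cursor, commit):
    """Sketch: claim one pending job using row-level locking.
    Assumes an InnoDB table 'jobs(id, status)' and autocommit off."""
    # FOR UPDATE locks just the matching row inside the transaction,
    # so other machines can still lock and claim different rows.
    cursor.execute("SELECT id FROM jobs WHERE status = 'pending' "
                   "LIMIT 1 FOR UPDATE")
    row = cursor.fetchone()
    if row is None:
        commit()          # end the (empty) transaction
        return None
    job_id = row[0]
    cursor.execute("UPDATE jobs SET status = 'in progress' WHERE id = %s",
                   (job_id,))
    commit()              # committing releases the row lock
    return job_id

# Minimal stand-in cursor to demonstrate the call sequence:
class FakeCursor(object):
    def __init__(self):
        self.statements = []
    def execute(self, sql, params=None):
        self.statements.append(sql)
    def fetchone(self):
        return (7,)

cur = FakeCursor()
claimed = claim_pending_job(cur, lambda: None)
```

With MySQLdb, commit would be db.commit() on the connection; the key point is that the lock is held only from the SELECT ... FOR UPDATE until the commit, and only on that one row.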
Scratch that. Minor error.
Wow. I knew it would be something simple like that. The database requests are still failing, though. They worked fine when I put the code directly into main, but that was repetitive and ugly. Any suggestions?
If you use the HTMLParser class provided in Python's HTMLParser module, you can create methods that parse through the tags to find certain tags, links, and whatnot. That might help.
If that doesn't, you can also use the re module provided with Python to search for patterns. Those are the simpler methods. It seems like re would be your best option.
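To illustrate the HTMLParser route, here is a minimal sketch of a subclass that collects the href of every anchor tag (the import path is HTMLParser in Python 2 and html.parser in Python 3; Python 3 shown):

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag fed to the parser."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

collector = LinkCollector()
collector.feed('<p><a href="/projects/foo">foo</a> and '
               '<a href="/projects/bar">bar</a></p>')
```

After feed() returns, collector.links holds the hrefs in document order.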
I'm having a simple problem. I'm using this class to do some stuff with another module. Now, for some reason, self.database and the other variables cannot be accessed by the other methods. I thought _init_ was supposed to work as a constructor, so the other methods should have access to any self. variables that are set in _init_. Help? Please?
'''
Created on Jun 5, 2009
@author: Steven Norris
This module provides basic utilities for the FLOSS mole spiders.
'''
import MySQLdb

class FLOSSmoleutils:
    '''
    This method provides the ability to gather a page
    '''
    def _init_(self):
        try:
            dbfile = open("dbInfo.txt", 'r')
        except:
            raise Exception("Database file error: dbinfo.txt")
        try:
            self.host = dbfile.readline().strip()
            self.port = int(dbfile.readline().strip())
            self.username = dbfile.readline().strip()
            self.password = dbfile.readline().strip()
            self.database = dbfile.readline().strip()
            self.db = MySQLdb.connect(host=self.host, user=self.username,
                                      passwd=self.password, db=self.database)
            self.cursor = self.db.cursor()
        except:
            print("Database connection failed.")

    def get_page(self, url, conn):
        try:
            conn.request("GET", url)
            resp = conn.getresponse()
            html_page = resp.read()
            html_page = str(html_page)
            return html_page
        except:
            print("The page request failed.")

    '''
    This method provides the ability to insert into a database
    '''
    def db_insert(self, query_string, *params):
        try:
            self.cursor.execute(query_string, params)
            print("Inserted into: " + self.database + ".\n")
        except:
            print("Database insertion failed for: " + self.database + "\n")
I'm using HTMLParser to find something on the page given below. The link I'm looking to find and follow is in red. I'm using the code, also provided below, to find this link, but it doesn't seem to find it at all. The portion of code that isn't working correctly is in red. There is no error, so to speak, but I am not getting the output I believe I should. In fact, in the handle_starttag portion of the spider, re.search(etc.) is coming up None. Help please!
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="content-type" content="text/html;charset=UTF-8" />
<title>hamish's glazesorg-new at master - GitHub</title>
<link rel="search" type="application/opensearchdescription+xml" href="/opensearch.xml" title="GitHub" />
<link rel="fluid-icon" href="http://github.com/fluidicon.png" title="GitHub" />
<link href="http://assets1.github.com/stylesheets/bundle.css?c5a62b10ab8ad45bf9f3fa776adae8395d8222a4" media="screen" rel="stylesheet" type="text/css" />
<script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.3.2/jquery.min.js"></script>
<script src="http://assets3.github.com/javascripts/bundle.js?c5a62b10ab8ad45bf9f3fa776adae8395d8222a4" type="text/javascript"></script>
<link href="http://github.com/feeds/hamish/commits/glazesorg-new/master" rel="alternate" title="Recent Commits to glazesorg-new:master" type="application/atom+xml" />
<meta name="description" content="" />
<script type="text/javascript">
github_user = null
</script>
</head>
<body>
<div id="main">
<div id="header" class="">
<div class="site">
<div class="logo">
<a href="http://github.com"><img src="/images/modules/header/logov3.png" alt="github" /></a>
</div>
<div class="topsearch">
<form action="/search" id="top_search_form" method="get">
<input type="search" class="search" name="q" /> <input type="submit" value="Search" />
<input type="hidden" name="type" value="Everything" />
<input type="hidden" name="repo" value="" />
<input type="hidden" name="langOverride" value="" />
<input type="hidden" name="start_value" value="1" />
</form>
<div class="links">
<a href="/repositories">Browse</a> | <a href="/guides">Guides</a> | <a href="/search">Advanced</a>
</div>
</div>
<div class="actions">
<a href="http://github.com">Home</a>
<a href="/plans"><b><u>Pricing and Signup</u></b></a>
<a href="http://github.com/popular/forked">Repositories</a>
<a href="/blog">Blog</a> …
It was because of the connection close in the utilities method.
When I use the external module that runs the same code (the line commented out), the program terminates. However, when I run the code inside the current module, it seems to work just fine. Can anyone tell me where I'm going wrong? I'd like to be able to use the utilities module in several programs.
'''
Created on Jun 5, 2009
@author: Steven Norris
This module provides the spider capability to be used to collect pages from github.com.
'''
import FLOSSmoleutils
from HTMLParser import HTMLParser
import httplib
import re
import time
import MySQLdb

BASE_SITE = "github.com"

'''
This class is used to check every page of the repository for a projects list
'''
class GitHubSpider(HTMLParser):
    #Used to store the links needing to be checked
    check_links = []

    #Used to reset check_links after every feed()
    def reset_link_list(self):
        self.check_links = []

    #Used to handle the start tags of the main page
    def handle_start_tag(self, tag, attrs):
        if tag == 'a':
            link = attrs[0][1]
            if re.search('/tree', link) != None:
                check_links.append(link)

'''
This method runs the spider sequence needed to collect the information from github.com
'''
def main():
    try:
        #Establish the connection and get the base_page
        conn = httplib.HTTPConnection(BASE_SITE)
        try:
            print("http://" + BASE_SITE + "/repositories")
            conn.request("GET", "http://" + BASE_SITE + "/repositories")
            resp = conn.getresponse()
            base_page = resp.read()
            base_page = str(base_page)
            print(base_page)
#            base_page=FLOSSmoleutils.get_page("http://"+BASE_SITE+"/repositories",conn)
            #Create the spider and begin the feed
            print('making spider')
            spider = GitHubSpider()
            print('feed')
            spider.feed(base_page)
            print(spider.check_links)
            for link in spider.check_links:
                print(link)
            conn.close()
        except:
            print("Base site request failed.")
    except:
        print("Connection failed.")

main()
'''
Created on Jun 5, 2009
@author: Steven Norris
This module provides basic utilities for the FLOSS mole spiders.
'''
def get_page(url, conn):
    try:
        conn.request("GET", url)
        resp = conn.getresponse()
        html_page = resp.read()
        conn.close()
        html_page = str(html_page)
        return html_page …
It would help if you could somehow make those error lines stand out a bit. Maybe a highlight or a comment or something. Do that, and I'll give it another look over.
Depends. If you mean taking certain elements from images and putting them together in one image, you'll have to find a way to isolate the pixels needed from the images and transfer them into the other image. You'll also have to take image size into account (hopefully you only want one size throughout; that makes it easier). If you want to take the images and combine the pixels into some strange mixed/artistic sort of composition, you'll get the colour values from each pixel, add them, then set the value on the corresponding pixel in the new image. I'm a little rusty on the coding, so I would have to look it up, but does this help any?
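The add-the-colour-values idea can be sketched with plain nested lists standing in for images; a real program would use an imaging library (such as PIL) and the same per-pixel logic. All names here are hypothetical:

```python
def blend_images(img_a, img_b):
    """Combine two same-sized 'images' (nested lists of (r, g, b) tuples)
    by adding the colour values of corresponding pixels, capping each
    channel at 255 so the result stays a valid colour."""
    blended = []
    for row_a, row_b in zip(img_a, img_b):
        row = []
        for (r1, g1, b1), (r2, g2, b2) in zip(row_a, row_b):
            row.append((min(r1 + r2, 255),
                        min(g1 + g2, 255),
                        min(b1 + b2, 255)))
        blended.append(row)
    return blended

# Two 1x2 'images': one reddish, one bluish
red = [[(200, 0, 0), (100, 0, 0)]]
blue = [[(0, 0, 200), (0, 0, 200)]]
result = blend_images(red, blue)
```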
When the .request method is called, an error occurs. The page does exist, but for some reason the request is failing. Any idea as to why?
'''
Created on May 29, 2009
@author: snorris4
This program will spider the repo.or.cz site for information on their open source projects, gathering the html pages of each
project and adding them to a database.
'''
from http import client
import re
import time

INDEX_SITE = 'repo.or.cz/w?a=project_index'
BASE_SITE = "http://repo.or.cs/w"

class RP_spider_projectslist():
    check_links = []

    def get_page(self, site, page):
        try:
            conn = client.HTTPConnection(site)
            try:
                conn.request("GET", "http://" + site + page)  #error occurs here
                resp = conn.getresponse()
                html_page = resp.read()
                return html_page
            except:
                print("The page request failed.")
        except:
            print("The connection failed.")

    def find_projects(self, page):
        return re.findall("*.git")

    def add_to_database(self, links):
        for link in links:
            page = self.get_page(BASE_SITE, link[3:len(link)])
            #add page to database here.

def main():
    spider = RP_spider_projectslist()
    page = spider.get_page(BASE_SITE, '/')
    page_string = str(page, "UTF-8")
    spider.feed(page_string)
    for i in spider.check_links:
        print(i)

def test():
    spider = RP_spider_projectslist()
    page = spider.get_page(INDEX_SITE, '')
    page_string = str(page, "UTF-8")
    strings = find_projects(page_string)
    print(strings)

test()
First off, using the MySQLdb module will help you access databases and whatnot (if your teacher will allow you to use a separate module for that; if not, look at the code in the module to figure out your own coding for it).
Other than that, I'm not exactly sure what you're doing. If you could reword your explanation and possibly give some sort of example, it would be helpful.
A quick search found this site, which provides a Python module and tutorials for handling and manipulating media files. It might be of some help.
I'm new to the forum and to web programming. I appreciate all the help thus far, and I have a couple more questions.
1. What is the purpose of the HTTPConnection object if I can get a page simply using an HTTPResponse object formed from a .request("GET", URL) call? What is the difference between the URL used to form the HTTPConnection object and the one used to form the HTTPResponse object?
2. I'm having trouble with error handling. If an HTTPConnection object does not make a connection, I would like to print my own error message or throw an exception, but I'm uncertain what to check for. The same goes for an HTTPResponse.
Any help would be greatly appreciated. If I can get one or two of these things under my belt, I'll be fine coding them from here on. They are just proving more problematic than I first thought.
Thank you for the help!
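On the connection/request split: the connection object is made once per host, and each request then names a path on that host. A sketch of both pieces with specific exception handling, using stand-in objects so it runs without touching the network (the fetch function and the Fake* classes are hypothetical, for illustration only):

```python
import http.client  # 'httplib' in Python 2

def fetch(conn, path):
    """Request a path over an already-created connection object.
    The connection targets a host (e.g. 'github.com'); each request
    then names a path on that host (e.g. '/repositories')."""
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.read()
    except (http.client.HTTPException, OSError) as exc:
        # OSError covers refused connections and DNS failures;
        # HTTPException covers protocol-level problems.
        print("Request for %s failed: %s" % (path, exc))
        return None

# Stand-in objects so the sketch runs offline:
class FakeResponse(object):
    def read(self):
        return b"<html>ok</html>"

class FakeConnection(object):
    def request(self, method, path):
        if not path.startswith("/"):
            raise http.client.HTTPException("bad path: " + path)
    def getresponse(self):
        return FakeResponse()

page = fetch(FakeConnection(), "/repositories")
bad = fetch(FakeConnection(), "no-leading-slash")
```

With a real http.client.HTTPConnection the same try/except shape applies; a failed connection surfaces as one of the caught exception types rather than something you poll for on the object.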
Thank you for your help.
I am coding a web spider for research purposes and have run into an error I am uncertain about. I am fairly new to web programming and need a bit of guidance. I use http.client to get a connection, request a site, get the response, and read the response into a variable. Then, using HTMLParser, I attempt to feed() the variable, but am given this error:
Traceback (most recent call last):
File "C:\Users\snorris4\Desktop\FLOSSmoleSpiderSavannah\src\SavannahSpider.py", line 45, in <module>
main()
File "C:\Users\snorris4\Desktop\FLOSSmoleSpiderSavannah\src\SavannahSpider.py", line 41, in main
spider.feed(page)
File "C:\Python31\lib\html\parser.py", line 107, in feed
self.rawdata = self.rawdata + data
TypeError: Can't convert 'bytes' object to str implicitly
Any help would be very much appreciated. Thank you.
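For context, the traceback shows the mismatch: resp.read() returns bytes, while feed() expects str and internally concatenates it with a string. A minimal illustration of the error and the usual decode step (the page's encoding is assumed to be UTF-8 here):

```python
raw = b"<html><body>hello</body></html>"  # what resp.read() returns: bytes

# Concatenating str with bytes raises TypeError, which is exactly what
# HTMLParser.feed() runs into internally (self.rawdata + data):
try:
    "" + raw
    raised = False
except TypeError:
    raised = True

# Decoding the bytes first yields a str that feed() can accept:
page = raw.decode("utf-8")
```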
'''
Created on May 26, 2009
@author: Steven Norris
This program runs as a spider for the savannah.gnu.org to add information about
both the GNU projects and non-GNU projects to a database for further investigation.
'''
from html import parser
from http import client
import re

class SpiderSavannahProjectsList(parser.HTMLParser):
    check_links = []

    def get_page(self, site, page):
        conn = client.HTTPConnection(site)
        conn.request("GET", "http://" + site + page)
        resp = conn.getresponse()
        html_page = resp.read()
        return html_page

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            link = attrs[0][1]
            if re.search('\.\./projects/', link) != None:
                self.check_links.append(link)

    def add_to_database(self, links):
        for link in links:
            page = self.get_page('savannah.gnu.org', link[3:len(link)])
            #add page to database here.

def main():
    spider = SpiderSavannahProjectsList()
    page = spider.get_page('savannah.gnu.org', '/search/?type_of_search=soft&words=%2A&type=1&offset=0&max_rows=400#results')
    print(page)
    spider.feed(page)
    for i in spider.check_links:
        print(i)

main()