I would like to start building a search database by providing a portal for users to search an existing search engine, like Google.

Is it possible for a visitor to the portal site to present their own IP address to Google, instead of the portal site's IP address?

Is it also possible for the portal to access the search results, so that it can use them to build a search database?


So, if I understand you correctly, you would like to create a simple website that lets people do searches, for example to search Google.

You then want to store the search results in your own database, in an effort to begin populating your own search index?

I'm assuming you want the search to use the end-user's IP address so that your server's IP address won't be hit by Google's flood control limits.

You can use Google's Custom Search JavaScript to have the end user perform a search from their own IP. You can then use JavaScript to scrape the results. You can create an engine at https://programmablesearchengine.google.com/cse/all
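A minimal sketch of that setup, assuming the search element's documented callback hook (the gs-title class name is what the element currently renders and may change, and /api/collect is a hypothetical endpoint on the portal's own server):

    <script>
    // Must be defined before cse.js loads: hook the element's "rendered"
    // callback so the results can be read out of the DOM after each search.
    window.__gcse = {
      searchCallbacks: {
        web: {
          rendered: function () {
            // Collect title + URL from each rendered result link.
            var results = Array.prototype.map.call(
              document.querySelectorAll('a.gs-title'),
              function (a) { return { title: a.textContent, url: a.href }; }
            ).filter(function (r) { return r.url; });
            // Ship the scraped results to the portal's own server
            // (hypothetical endpoint) to build up the search database.
            fetch('/api/collect', {
              method: 'POST',
              headers: { 'Content-Type': 'application/json' },
              body: JSON.stringify(results)
            });
          }
        }
      }
    };
    </script>
    <!-- Replace YOUR_CX with the engine ID from the control panel. -->
    <script async src="https://cse.google.com/cse.js?cx=YOUR_CX"></script>
    <div class="gcse-search"></div>

Because the browser runs all of this, every request to Google comes from the end user's IP, not the portal's.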

Another option is to use the Custom Search API at https://developers.google.com/custom-search/v1/overview but that would use your own IP address, and it begins to cost money past 100 queries per day.
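For comparison, a server-side call to the JSON API looks something like this (a Node 18+ sketch; the key and engine ID are assumed to be in environment variables, and every call counts against the 100-queries/day free tier and originates from your server's IP):

    // Node 18+ (built-in fetch). Each invocation is one billable query.
    async function googleSearch(query) {
      const url = new URL('https://www.googleapis.com/customsearch/v1');
      url.searchParams.set('key', process.env.GOOGLE_API_KEY); // your API key
      url.searchParams.set('cx', process.env.GOOGLE_CSE_ID);   // your engine ID
      url.searchParams.set('q', query);
      const res = await fetch(url);
      if (!res.ok) throw new Error('Custom Search API error: ' + res.status);
      const data = await res.json();
      return (data.items || []).map(function (item) {
        return { title: item.title, url: item.link, snippet: item.snippet };
      });
    }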

I'm going with no since, the way things work, the reply from the server would go back to the IP the request came from.

Now, if you wanted to use a VPN, you might be able to craft a system that used the searcher's geolocation, but again, no to your initial question.

I'm going with no since, the way things work, the reply from the server would go back to the IP the request came from.

Not true. This can be done with Google's Custom Search JavaScript, in which case the end user's web browser will be making the request, not the OP's server.

commented: I'm outdated here. It was about a decade ago that someone tried this, and it was costly, so no for them.

Hi,

I read your description of the project on GitHub, and I personally think the premise is flawed. A single-domain website is not the same thing as non-commercialized content. It's much closer to old 1990s content that hasn't been updated for the modern web.

Today, most sites use third-party CDNs to speed up their performance. I can't see any downsides to this.

If you are actually interested in continuing with this project, then unfortunately using Google's Custom Search JavaScript won't solve your dilemma, because the unmodified Google search results are user-facing. You can't manipulate them or remove any of them from the list.

You can, however, use Google's Custom Search JavaScript for free, and either limit the search results to a fixed list of domains, or to only web pages that include specific Schema markup, such as Article or CreativeWork markup.

On the other hand, if you want to go down your current path, you will need to use their Custom Search JSON API in order to iterate over the results until you find enough that meet your criteria. However, there is a cost of $5 per 1,000 queries attached to that, so it could get pricey, especially if you're executing multiple API queries per search result page. And there is no easy way (or any way that doesn't violate their terms of service) to fake the requests so they don't come from your server's IP address in order to bypass the throttling limit.
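To make the cost concrete, the iteration might look like the sketch below: the API returns at most 10 results per call, so finding enough results that pass a filter means paging with the start offset, and every page fetched is a separate billable query (passesFilter stands in for whatever test the project applies):

    // Keep paging until `wanted` results pass the filter, or give up.
    async function filteredSearch(query, passesFilter, wanted, maxCalls) {
      const kept = [];
      for (let call = 0; call < maxCalls && kept.length < wanted; call++) {
        const url = new URL('https://www.googleapis.com/customsearch/v1');
        url.searchParams.set('key', process.env.GOOGLE_API_KEY);
        url.searchParams.set('cx', process.env.GOOGLE_CSE_ID);
        url.searchParams.set('q', query);
        url.searchParams.set('start', String(call * 10 + 1)); // 1, 11, 21, ...
        const data = await (await fetch(url)).json();
        (data.items || []).forEach(function (item) {
          if (passesFilter(item)) {
            kept.push({ title: item.title, url: item.link });
          }
        });
        if (!data.queries || !data.queries.nextPage) break; // no further pages
      }
      return kept.slice(0, wanted);
    }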

No, they don't want to alter or manipulate Google's actual search result pages.

They want to create their own unique search engine that uses Google's API to search only a subset of pages that are in alignment with their mission (non-commercialized pages only). The problem is that the API is rate limited and keeps throttling their IP address, and they want to see if there are any workarounds beyond spending $5 for every 1,000 API requests.

@Dani. To me that reads as altering the results. A subset is altering. This brings me back to no, and reminds me of another person who wanted to use the results but ran headlong into the $5 per 1,000 cost. I think something creative with a VPN might extend your 1,000 calls, since you get that many per VPN, but again, you can exhaust that method, and you have the cost, both dollars and your time creating and maintaining the solution.

Sometimes something isn't worth doing.

Does big business use the same CDNs as sites created by individuals? If not, the script has a whitelist. I browse the web blocking all content that isn't from the originating domain, and then whitelist on a global or per-site basis to get the functionality I need. Maybe you could look at the global whitelist and, from your experience, identify some of the domains that would be good to whitelist for this project? My results are on this page: https://mekineer.com/information-technology/2020-ublock-origin-extension#rules-to-bypass-medium-mode-and-function-in-easy-mode-for-all-sites

Dani wrote: "Google's Custom Search javascript won't solve your dilemma, because the unmodified Google search results are user-facing. You can't manipulate them or remove any of them from the list."

You also mentioned that I would have access to scrape the results. I can then present the user with a filtered version of the results, either in place of Google's results, or alongside.
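For example, something like this inside the rendered callback could hide the non-matching entries (a sketch: the gsc-webResult / gs-title class names are whatever the element currently renders, and the whitelist is a placeholder):

    // Hide any rendered result whose hostname isn't whitelisted,
    // leaving a filtered view in place of Google's full list.
    var WHITELIST = ['wikipedia.org', 'schema.org']; // placeholder list
    document.querySelectorAll('div.gsc-webResult').forEach(function (div) {
      var link = div.querySelector('a.gs-title');
      if (!link || !link.href) return;
      var host = new URL(link.href).hostname;
      var ok = WHITELIST.some(function (d) {
        return host === d || host.endsWith('.' + d);
      });
      if (!ok) div.style.display = 'none';
    });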

Dani wrote: "... limit the search results ... to only web pages that include specific Schema markup, such as Article or CreativeWork markup."

I wasn't familiar with the concept, but I will look into it (https://schema.org/CreativeWork). This can be yet another filter.
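From a first look at the API docs, the JSON API seems to attach whatever structured data it found to each result's pagemap, so a filter for the paging sketch earlier in the thread might be as simple as this (the key names are my guess at typical pagemap output and depend on each page's markup):

    // Candidate passesFilter(): keep only results whose pagemap
    // reports Article or CreativeWork structured data.
    function hasCreativeWorkMarkup(item) {
      var pagemap = item.pagemap || {};
      return 'article' in pagemap || 'creativework' in pagemap;
    }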

@Dani. To me that reads as altering the results.

Altering the results sounds malicious. When I think of altering the results, I think of a Chrome plugin or a man-in-the-middle attack that modifies the Google search result pages.

In this case, we're talking about an open source project to build a novel search engine that is powered under the hood by Google's API.

Doing something creative with a VPN in order to circumvent API restrictions is obviously not the right path because that would violate Google's terms of service.

Does big business use the same CDNs as sites created by individuals?

Yes, there are a handful of popular CDNs out there that most sites, both big and small, use. Cloudflare (which we use at DaniWeb) is very popular with blogs because it has a powerful free tier.

Of course, I may be biased because I make my living off of ad sales, but I cannot think of a single site that uses only one domain. I think your strategy would be depriving your search engine of most of the world's information: Wikipedia, etc., etc., etc.

My website only uses its own domain for resources, so I am biased LOL.

I'm going to explain this in a way that may be partially incorrect, but it's what I understand. uBlock Origin does not see man-in-the-middle CDNs, like Cloudflare, as website resources. For your website, the resource domains other than daniweb.com are: buysellads.net, doubleclick.net, and googletagmanager.com

Using the script and searching Google for "genomics", I filtered the following single domain websites:
https://www.jgenomics.com/
https://health.utah.gov/genomics/
https://molbiol-tools.ca/Genomics.htm
https://ec.europa.eu/jrc/en/event/conference/my-genome-our-future

These results seem okay to me, and they complement Google's results. I haven't tested the filter much beyond that, as work is underway on improvements to reduce false positives and to avoid starting over from scratch when the script is cut off by Google. It's hard to say how useful it will be, considering other filters could be used besides single-domain-with-a-whitelist. Also, thank you; we'll look into using the custom Google search.
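For context, the single-domain test in the script boils down to something like this simplified sketch (crude regex extraction of src/href attributes; CDN_WHITELIST stands in for the actual whitelist):

    // Fetch a candidate page and check that every resource it references
    // is served from its own hostname or a whitelisted CDN domain.
    var CDN_WHITELIST = ['cloudflare.com']; // placeholder whitelist
    async function isSingleDomain(pageUrl) {
      var host = new URL(pageUrl).hostname;
      var html = await (await fetch(pageUrl)).text();
      var refs = Array.from(
        html.matchAll(/(?:src|href)=["'](https?:\/\/[^"']+)["']/g),
        function (m) { return m[1]; }
      );
      return refs.every(function (u) {
        var h = new URL(u).hostname;
        return h === host || h.endsWith('.' + host) ||
               CDN_WHITELIST.some(function (d) { return h.endsWith(d); });
      });
    }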

PS. I have wanted to somehow include forums, as there is a lot of user-created content in them which is valuable. Google used to have its own filter for this, but they trashed it and it went to Google Cemetery. There's a site called boardreader.com that tries to take its place but doesn't have Google's AI: their results are pretty random. I saw another way on the MakeUseOf site that just tacks syntax onto a Google search, and that may be the best way.
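(The "tacks on syntax" approach is just string concatenation before the query is sent; the operator list below is my guess at the kind of thing that article suggests.)

    // Bias a Google query toward forum pages by appending URL operators.
    function forumQuery(q) {
      return q + ' (inurl:forum | inurl:viewtopic | inurl:showthread)';
    }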

@mekineer.

Let's skip to your end game. How will you monetize this work? Do people pay for such results?

Also, you had to outsource the work so far, if I read your top posts and links correctly. How will you deal with outages or changes by Google?
Maybe you should try other search engines? Maybe DuckDuckGo?

Let's skip to your end game. How will you monetize this work? Do people pay for such results?

If you read his GitHub, it basically says that he is against commercializing or monetizing the web in any way. To that end, he is trying to build an open source search engine that shares his mentality. As he's not a strong developer, he self-funded the beginning stages of development, but now he is hoping that other developers will contribute to the open source project.

How will you deal with outages or changes by Google?

Why would arguably one of the most popular public APIs in the world make breaking changes without providing months of advance notice to its developers, along with an upgrade path? Also, if Google goes down, the world will come to a grinding halt. Why would DuckDuckGo have fewer server outages than Google?!

Sorry, Dani, I think I should not have mentioned outages but left it at changes by any player here.

As to the end game, it has to be considered that even if the code is open source, somewhere this code has to live, as in some server. Get a lot of traffic and your free hosting tends to not be free.

I wonder if they are confusing free services with open source?

As to Google making changes without notice, I am not a historian, but Google has pulled fast ones, like ending YouTube support on smart TVs (which may or may not be back), as well as other changes which my web dev friends tell me caught them scrambling.

@mekineer. This thought occurred to me.

If you limit the results to just certain domains, it sounds like the results could suffer from confirmation bias.

This is already an issue with Google, and why you may want to compare your searches with DuckDuckGo and the old-fashioned search at the college library.

which my web dev friends tell me caught them scrambling.

I've never known Google to sunset any projects, or make any breaking API changes, with less than 6 months' notice. Depends on your friends' definition of scrambling, I guess ;)

As to the end game, it has to be considered that even if the code is open source, somewhere this code has to live, as in some server.

I suspect that, just like most web-based open source Github projects, they are free to be forked and for every contributor / user to maintain their own instance of the project.

I suspect that, just like most web-based open source Github projects, they are free to be forked and for every contributor / user to maintain their own instance of the project.

If that is true, then a person running their own search would be fine with what may be 1000? searches per hour. The limits look ample for development and private use. If it's the basis for a product or web portal, then you get into where it costs, and you are faced again with how to feed the beasts.

I could be mistaken, but I get the impression it's meant to be a hobby project designed for private / limited use.

For private and limited use, the limits are fine. Free beer is not the best, but it's free.

That would be great if the limit were, indeed, 1,000. But it's 100 per day.

Additionally, the way their code is designed, it takes multiple API calls to fulfill one search request.

Therefore, the current API limits would translate to roughly 15 searches a day (100 queries divided by six or seven API calls per search), which is why they're running into difficulty.

To develop, then, the VPN could be used. Once it's done, they publish and note the limits on the free searches.

My mind still comes back to "is it worth doing?" I've been using search engines since they came out. (Was it Veronica and Jughead?)

Why not just do the search without the overhead and filter it later? I think I've done hundreds of searches with Chrome, and it never gave me a message saying "You search too much."

Are they working too hard at this?
Could simple work?

To develop, then, the VPN could be used.

I disagree. Whether it's for development purposes or not, you cannot circumvent Google's terms of service. If it's for your own development purposes, then probably nothing bad will happen. But it's still unethical.

My mind still comes back to "is it worth doing?"

Personally, I don't think it is. But this seems to be a passion project for the OP, and I'm certainly not one to pooh-pooh anyone's personal passion projects / hobbies.

Why not just do the search without the overhead and filter it later?

Because the filtering, whether done immediately or later as a background task, would still require multiple API calls to make it happen.

PS. Here's an example of updates that happen fast and furious.
Apple developers were given only 24 hours' notice to prepare their apps for the update (September 17, 2020).
The last round had a week's notice.

My take is that this is going to be very rough on the smaller, namely one-person, companies.

What Apple updates are you referring to? I have never heard of Apple giving only one week's notice to its developers before breaking changes would prevent existing apps from functioning.

It happened this week: https://nandy0140.com/ios-14-accelerated-rollout-threatens-app-glitches-frustrates-apple-developers/ among other news.
Here's a tweet: https://twitter.com/AppleTerminal/status/1306744086523715584 It seems Apple has already issued updates. Remember, I'm not a historian here. The office has moved away from our iOS app because of too few users.

I'm sure you've been watching the Apple Fortnite contest too.

rproffitt, the article you linked to does not say that Apple gave short notice to its developers before breaking changes would prevent existing apps from functioning properly. Instead, it says that Apple gave short notice to its developers before releasing a new OS version, with support for new features, out of beta and into gold release.

Developers were upset because they had been using the developer build of the upcoming OS to create apps that supported the new, soon-to-be-released functionality. Apple announced that the OS with the new features was going to go gold and be released to the public on short notice. Therefore, developers felt rushed to get the next versions of their apps (which they had been developing to take advantage of the new features) ready and submitted to the App Store by launch day, to capitalize on the hype around the new features.

commented: I did not polish my replies. See James' reply for a better explanation. We don't have this problem anymore!

Re: iOS GM/release

The reason developers are upset is that every major release breaks something, and until you have the GM you cannot test for regression bugs or other problems. That is, users are updating to an OS for which you have not been able to perform proper testing of your app.
I've been deluged with emails for production apps that I use, advising me not to update to 14 until they have completed testing and released (gotten through the App Store's release process) updated versions as necessary.

It's not about having new features on day 1; it's about having an app that continues to work.

commented: Thanks, James, for the clarification.