Eastern Civilization / EasternCulture
The Dao is the balance of Yin and Yang - The balance of two types of energy - Yin and Yang
Refined Batch Text Extraction: Emails and URLs (Regular-Expression-Formularized...) 2012-06-14 00:29:36

Extraction of targets one by one

A search application for refined batch extraction of text information - mechanized information collection

Regular-Expression-Formularized Information Capture

- A tailored solution for concrete, target-focused searches only

- Suitable for extracting website URLs and email addresses

(Some of the positive feedback received on an international platform can be seen at the link below. Every project there carries a service fee of nearly 10%, which rules out artificially inflated ratings.)

https://www.odesk.com/users/~~2503215aaec18b32

When you look at thousands of potential targets, have you ever wondered how to pull together the website links and email addresses for all of them? With a tailored, refined batch search that extracts text one entry at a time, it can be done. The approach can be called Regular-Expression-Formularized Information Capture.

The application is a programmed routine that repeats the mechanical steps of extraction, combining regular expressions with tailored Visual Basic code. It first collects the target links in batches, using a regular expression written for the concrete task. It then visits those links one by one to check whether each target has a website on its own independent domain, and extracts that website link. Finally, it visits each independent-domain site and collects the email addresses published there. The source listings may come from any kind of information platform, whether a comprehensive e-commerce directory or one focused on a specific industry.
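The three steps described above can be sketched roughly as follows. This is only an illustration in Python, not the Visual Basic code the post refers to. The patterns `LINK_RE` and `EMAIL_RE`, the function names, and the `fetch` callback are all assumptions made for this example; a real task would need patterns tailored to the actual page layout.

```python
import re
from urllib.parse import urljoin, urlparse

# Illustrative patterns only (assumptions); a real task would tailor them.
LINK_RE = re.compile(r'href=["\']([^"\']+)["\']', re.IGNORECASE)
EMAIL_RE = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}')


def host(url):
    """Return the lower-cased host part of a URL."""
    return urlparse(url).netloc.lower()


def extract_links(html, base_url):
    """Return unique absolute http(s) links from href attributes, in page order."""
    links = []
    for href in LINK_RE.findall(html):
        url = urljoin(base_url, href)
        if urlparse(url).scheme in ("http", "https") and url not in links:
            links.append(url)
    return links


def extract_emails(text):
    """Return unique email-like strings (lower-cased), in order of appearance."""
    emails = []
    for match in EMAIL_RE.findall(text):
        email = match.lower()
        if email not in emails:
            emails.append(email)
    return emails


def capture(listing_url, fetch):
    """Run the three steps; fetch(url) -> str is supplied by the caller."""
    platform = host(listing_url)
    results = {}
    # Step 1: collect entry links (same host as the platform) from the listing.
    listing_links = extract_links(fetch(listing_url), listing_url)
    entry_urls = [u for u in listing_links if host(u) == platform]
    for entry_url in entry_urls:
        # Step 2: on each entry page, look for links to an independent domain.
        for site_url in extract_links(fetch(entry_url), entry_url):
            if host(site_url) == platform or site_url in results:
                continue
            # Step 3: visit the independent site and collect its emails.
            results[site_url] = extract_emails(fetch(site_url))
    return results
```

Passing `fetch` in as a parameter keeps the sketch testable without network access; in practice it might wrap something like `urllib.request.urlopen`, with whatever pacing and error handling the task calls for.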

For reference, people in Europe, the Americas, and Asia commonly use professional platforms such as the following:

http://www.thomasnet.com/

http://www.kellysearch.com

http://www.alibaba.com/

http://china.alibaba.com/

http://www.made-in-china.com

http://www.globalsources.com/

You can run a search on one of these professional platforms with your own keywords and define the list of thousands of potential targets from the result pages. Those targets can then be collected entry by entry and page by page, in batches, using Regular-Expression-Formularized Information Capture. Alternatively, you can designate a preferred website or a list of websites (technical forums or other types) and let the same customized solution visit and sort out the potential targets there. In each case the solution combines a regular expression written for the specific task with tailored Visual Basic code.
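As a rough illustration of working through result pages "page by page in batches", the sketch below builds the list of result-page URLs for a keyword and splits it into fixed-size batches. It is written in Python rather than Visual Basic, and the search URL and the query parameter names `q` and `page` are hypothetical; each real platform uses its own scheme.

```python
from urllib.parse import urlencode


def result_page_urls(base_url, keyword, pages):
    """Build one URL per result page (parameter names are assumptions)."""
    return [
        f"{base_url}?{urlencode({'q': keyword, 'page': n})}"
        for n in range(1, pages + 1)
    ]


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


urls = result_page_urls("http://platform.example/search", "steel pipe", 5)
for batch in batches(urls, 2):
    # Each batch would be fetched and parsed before moving on to the next.
    print(batch)
```

Working in small batches makes it easy to pause between groups of pages and to resume from a known point if a run is interrupted.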

…………………………………………………………………..

The link below describes some relevant details of the approach. Each concrete information-collection task should be evaluated on its own terms.

http://item.taobao.com/item.htm?id=17195436407

(The linked page is in Chinese; its main ideas are summarized here in English.)

The approach applies to text in various file formats, such as TXT, HTML, HTM, RTF, Word, and Excel:

1) It can extract email addresses from the files in a folder.

2) It can batch-extract email addresses, website links, or other text information from web pages.

In each case, a customized regular expression extracts the required information.
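A minimal sketch of the folder case might look like the following. It is in Python rather than Visual Basic, and it only reads the plain-text formats (TXT, HTML, HTM); RTF, Word, and Excel files would need a format-specific reader that is not shown here. The function name `scan_folder` and both patterns are assumptions made for this example.

```python
import re
from pathlib import Path

# Illustrative patterns only (assumptions); tailor them to the actual data.
EMAIL_RE = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}')
URL_RE = re.compile(r'https?://[^\s"\'<>]+')
TEXT_SUFFIXES = {".txt", ".html", ".htm"}


def scan_folder(folder):
    """Map each plain-text file under `folder` to the emails and URLs in it."""
    root = Path(folder)
    found = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in TEXT_SUFFIXES:
            continue
        # Undecodable bytes are replaced rather than aborting the whole scan.
        text = path.read_text(encoding="utf-8", errors="replace")
        emails = sorted(set(EMAIL_RE.findall(text)))
        urls = sorted(set(URL_RE.findall(text)))
        if emails or urls:
            found[path.relative_to(root).as_posix()] = {
                "emails": emails,
                "urls": urls,
            }
    return found
```

Calling `scan_folder("some/folder")` returns a dictionary keyed by relative file path, which can then be written out to a spreadsheet or text file as needed.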

Remarks:

A friend familiar with VB applications said that, so far, no one seems to have developed general-purpose software that can collect information in high-density batches across the various information platforms. A large part of the reason is that such software would have to carry a complex system of parameter settings matched to each platform's page layout. Once that kind of collection encroaches on the protection a platform needs for its own information, the platform will adjust its page layout and change the relevant parameters. So no such general-purpose software is likely to exist.

By this logic, so long as there is room for innovation, there will be no absolutely perfect general-purpose software.
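The point about per-platform parameters can be illustrated with a small sketch. It is written in Python, not Visual Basic, and the platform names, HTML snippets, and patterns are all invented for this example. Each platform gets its own pattern, and a layout change on the platform's side is enough to make that pattern stop matching.

```python
import re

# Hypothetical per-platform parameters: one entry-link pattern per layout.
PLATFORM_PATTERNS = {
    "platform-a.example": re.compile(r'<a class="company" href="([^"]+)"'),
    "platform-b.example": re.compile(r'<div class="supplier">\s*<a href="([^"]+)"'),
}


def entry_links(platform, html):
    """Extract entry links using the pattern registered for this platform."""
    pattern = PLATFORM_PATTERNS.get(platform)
    if pattern is None:
        raise KeyError(f"no pattern configured for {platform}")
    return pattern.findall(html)


old_layout = '<a class="company" href="/c/1">Acme</a>'
new_layout = '<a data-role="company" href="/c/1">Acme</a>'

print(entry_links("platform-a.example", old_layout))  # ['/c/1']
print(entry_links("platform-a.example", new_layout))  # []
```

Keeping the patterns in one table means that only the affected entry has to be rewritten when a platform changes its pages, which is consistent with the post's view that each task needs its own tailored solution.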

........................

The translation was done by our friends David and Daniel. Their contact details are below:

Mailbox for translation:

easternculture88@gmail.com;

824693961@qq.com;

(The QQ mailbox can also be used for email correspondence.)

----------------------------------

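Eastern Civilization: the Dao is the balance of Yin and Yang

The balance of two types of energy - EasternCulture

Contact channel

The revival of Chinese medicine is one part of the revival of Chinese civilization

Compendium of Materia Medica (本草纲目): the essence of Chinese medicine, a part of Chinese civilization

Refined batch text extraction: emails and URLs

Target-focused website and email extraction

Mailbox: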

easternculture88@gmail.com;

Author: easternculture (from Mainland China; registered 2012-06-14)