Web Crawling with Python

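In a recent project I implemented image classification with a CNN, and realized that actually getting the images to train the model on is the hard part. For simple practice you can use the datasets that tensorflow provides, but if the images you need aren't there you have to build the dataset yourself, and that is exactly where web crawling comes in handy.
Before crawling, download chromedriver, which is needed for browser automation, and remember the path where you saved it.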

1. Opening the web page

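import time
from urllib.parse import quote_plus

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

ko_name = "고양이"  # placeholder search keyword; replace with your own

# bing.com
baseUrl = "https://www.bing.com/images/search?q="
baseUrl2 = "&form=HDRSC2&first=1&scenario=ImageBasicHover"
url = baseUrl + quote_plus(ko_name) + baseUrl2

# Run headless so that no browser window is shown
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--window-size=1920,1080')

# Selenium 4 style; Selenium 3 took executable_path= and chrome_options= instead
driver = webdriver.Chrome(
    service=Service("C:/~~~~~~~~~~/chromedriver.exe"),  # chromedriver's path
    options=options)

driver.get(url)
time.sleep(1)  # wait for the page to load
SCROLL_PAUSE_TIME = 1.0  # pause between scrolls in step 2

# ... the crawling in step 2 goes here ...

driver.close()  # close the browser once crawling is finished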

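selenium is a library that enables and supports automating web browsers.

Before the crawl itself can start, the web page has to be opened.

Get the URL of the page you want to crawl.
Use webdriver to open the page on screen. (If you don't want a window to appear, you can hide the browser with headless. In that case, pass the options object to webdriver.Chrome when you create the driver with the local chromedriver path.)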
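
As a small illustration of the first step, the search URL can be assembled in a helper. This is only a sketch: the function name build_search_url and the constant names are hypothetical, while the two URL fragments are the ones used in the code above. quote_plus percent-encodes the keyword, which matters for Korean search terms.

```python
from urllib.parse import quote_plus

# URL fragments taken from the post; the names are illustrative only
BASE_URL = "https://www.bing.com/images/search?q="
QUERY_SUFFIX = "&form=HDRSC2&first=1&scenario=ImageBasicHover"


def build_search_url(keyword: str) -> str:
    """Return the Bing image-search URL for a keyword (hypothetical helper)."""
    # quote_plus percent-encodes non-ASCII text as UTF-8 and turns spaces into '+'
    return BASE_URL + quote_plus(keyword) + QUERY_SUFFIX


print(build_search_url("red panda"))
```

Wrapping this in a function makes it easy to loop over a list of keywords and build one URL per class of image.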

2. Running the crawl

import time

from bs4 import BeautifulSoup

# driver and SCROLL_PAUSE_TIME come from step 1
url_path = []  # image URLs collected so far
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Parse the page as currently loaded and collect the image URLs
    pageString = driver.page_source
    bsObj = BeautifulSoup(pageString, 'lxml')
    for line in bsObj.find_all(name='div', attrs={"class": "img_cont hoff"}):
        img = line.find(name="img")
        page = img.get("src") if img else None
        # Skip missing sources and inline base64 thumbnails
        if page and "data:image/jpeg" not in page:
            url_path.append(page)

    # Scroll to the bottom so that more images get loaded
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(SCROLL_PAUSE_TIME)
    new_height = driver.execute_script("return document.body.scrollHeight")

    if new_height == last_height:
        # The height did not change: retry once before giving up
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(SCROLL_PAUSE_TIME)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
    last_height = new_height

time.sleep(0.3)
url_path = list(set(url_path))  # remove duplicate URLs

BeautifulSoup is a Python package for parsing HTML.

Collect the image URLs, appending each one to url_path.
Scroll down so that more of the images on the page get loaded.
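
The code above removes duplicates with list(set(url_path)), which does not keep the URLs in the order they were found. If the order matters, one possible variant is sketched below. The helper name clean_urls and the example.com URLs are made up for illustration, and unlike the original it skips every data: URI, not only data:image/jpeg.

```python
def clean_urls(candidates):
    """Drop inline data: URIs and duplicates, keeping first-seen order."""
    kept = [u for u in candidates if u and not u.startswith("data:")]
    # dict.fromkeys removes duplicates while preserving insertion order
    return list(dict.fromkeys(kept))


# Made-up sample input: one inline thumbnail and one duplicate
sample = [
    "https://example.com/a.jpg",
    "data:image/jpeg;base64,/9j/4AAQ",
    "https://example.com/b.jpg",
    "https://example.com/a.jpg",
]
print(clean_urls(sample))
```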

The crawl simply keeps repeating the two steps above, over and over.
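
The repeat-until-nothing-new idea can be sketched on its own, without a browser. The version below is an assumption-laden illustration, not the code from the repository: scroll_to_bottom and FakeDriver are hypothetical names, max_rounds is a safety cap that the original does not have, and FakeDriver only imitates the single Selenium method used here (execute_script).

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=50):
    """Scroll until the page height stops growing; return the number of scrolls."""
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more images
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            return rounds  # nothing new was loaded, so stop
        last_height = new_height
    return max_rounds  # safety cap reached (an addition, not in the original)


class FakeDriver:
    """Stand-in for a Selenium driver: the page grows twice, then stops."""

    def __init__(self):
        self.heights = [1000, 2000, 3000]
        self.index = 0

    def execute_script(self, script):
        if script.startswith("return"):
            return self.heights[self.index]
        # A scroll: move on to the next height if one is left
        self.index = min(self.index + 1, len(self.heights) - 1)
        return None


print(scroll_to_bottom(FakeDriver(), pause=0))
```

With a real driver, the URL collection would run between scrolls, so that images loaded by each scroll are picked up before the next one.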
The full code is available at the link below.

https://github.com/silverjjj/web-crawling/blob/master/crawling.py
