Web Scraping with Python

민사민서 2024. 2. 16. 00:17

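XPath

- A path expression for reaching a specific element or attribute in an XML document

- Also gives you a syntax for locating specific elements in an HTML document easily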
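To make the path idea concrete, here is a minimal sketch that uses the standard library's xml.etree.ElementTree, which only supports a limited subset of XPath. The markup, the class names and the id are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, made up for this example.
DOC = """
<html>
  <body>
    <ul id="menu">
      <li class="item"><a href="/a">First</a></li>
      <li class="item"><a href="/b">Second</a></li>
      <li class="ad"><a href="/c">Sponsored</a></li>
    </ul>
  </body>
</html>
""".strip()

root = ET.fromstring(DOC)

# .//li[@class='item'] : every <li> at any depth whose class attribute is "item"
items = root.findall(".//li[@class='item']")
texts = [li.find("a").text for li in items]

# .//ul[@id='menu']/li/a : <a> elements directly under an <li> of the menu <ul>
hrefs = [a.get("href") for a in root.findall(".//ul[@id='menu']/li/a")]

print(texts)
print(hrefs)
```

Real-world HTML is often not well-formed XML, so an HTML-aware parser is usually the better fit for actual pages; the path syntax carries over.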

re module - regular expressions

import re
p = re.compile("ca.e")
# . matches any single character > care, cafe, case
# ^ matches the start of the string > ^de : desk, destination
# $ matches the end of the string > se$ : case, base

m = p.match("case")
if m:
    print(m.group()) # prints "case"; match() only checks whether the string fits from its start, so "caseless" counts as a match too
else:
    print("no match")
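Since match() only anchors at the start, it helps to see it next to search() and fullmatch(). A short sketch with the same pattern (the sample strings are arbitrary):

```python
import re

p = re.compile("ca.e")

# match(): only the start of the string has to fit, trailing text is ignored
starts_ok = p.match("caseless") is not None       # True
starts_bad = p.match("good care") is not None     # False, the string starts with "g"

# search(): scans the whole string and returns the first hit
hit = p.search("good care")
found = hit.group() if hit else None              # "care"

# fullmatch(): the whole string must fit the pattern
whole_ok = p.fullmatch("case") is not None        # True
whole_bad = p.fullmatch("caseless") is not None   # False

print(starts_ok, starts_bad, found, whole_ok, whole_bad)
```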

requests vs. selenium

- requests is for static web pages; it is fast and can also fetch raw data such as image files

- selenium is for dynamic web pages; it is slower, but gives you browser automation and many other features

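How to use beautifulsoup

pip install beautifulsoup4 # for scraping
pip install lxml # parser

Install the required modules.

from bs4 import BeautifulSoup
soup = BeautifulSoup(res.text, "lxml")

Parses res.text (the body of a requests response) with the lxml parser.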

print(soup.title) # the <title> element
print(soup.title.get_text()) # only the text inside <title>
print(soup.div) # the first <div> found in the document
print(soup.div.attrs) # attributes as a dictionary, e.g. {'id': 'root'}
print(soup.div['class']) # read one attribute value this way (raises KeyError if the tag has no such attribute)

print(soup.find("div", attrs={"class": "u_skip"}))  # the first element that satisfies the condition
ul = soup.find("ul", attrs={"class": "WeekdayMainView__daily_list--R52q0"})
print(ul.a)
r1 = soup.find("li", attrs={"class": "TripleRecommendList__item--Uc4sT"})
r2 = r1.next_sibling.next_sibling
print(r1.find("span", attrs={"class":"text"}))

next_sibling moves to the next node at the same depth.
.previous_sibling works too. (A line break between tags counts as a node of its own, so you sometimes have to go next_sibling.next_sibling to reach the next element.)
.parent moves up to the parent.

r1.find_next_sibling("li") lets you put a condition on the sibling you are looking for.

find_all returns a list of every element that matches the condition

cartoons = soup.find_all("a", attrs={"class":"EpisodeListList__link--DdClU"})
for cartoon in cartoons:
    print(cartoon.span.get_text())
    print("https://comic.naver.com"+cartoon['href'])

https://www.crummy.com/software/BeautifulSoup/bs4/doc.ko/

Downloading images from a website with requests

- Use the requests library: fetch the bytes with res.content and write them to a file opened in "wb" (binary) mode

import requests

imgs = soup.find_all("img", attrs={"class": "thumb_img"})

for idx, img in enumerate(imgs):
    img_url = img["src"]
    # when src holds an inline data: URI, take the real URL from data-original-src
    if img_url.startswith("data:image/"):
        img_url = img["data-original-src"]
    img_res = requests.get(img_url)

    with open(f"웹스크래핑/movies/movie{idx+1}.webp", "wb") as f:
        f.write(img_res.content)
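The URL-picking step in the loop above can be pulled into a small helper. This is a minimal sketch: pick_image_url and file_extension are hypothetical names, example.com is a placeholder, and resolving relative URLs against the page URL is an addition the original loop did not need.

```python
import os
from urllib.parse import urljoin, urlparse


def pick_image_url(attrs, page_url, fallback_attr="data-original-src"):
    """Return an absolute URL for an <img> tag's attributes, or None."""
    src = attrs.get("src", "")
    # An inline data: URI is only a placeholder; look at the fallback attribute.
    if not src or src.startswith("data:"):
        src = attrs.get(fallback_attr, "")
    if not src:
        return None
    # Resolve relative and protocol-relative URLs against the page URL.
    return urljoin(page_url, src)


def file_extension(url, default=".jpg"):
    """Take the extension from the URL path instead of hard-coding one."""
    ext = os.path.splitext(urlparse(url).path)[1]
    return ext or default


page = "https://example.com/movies/list"
lazy = {"src": "data:image/gif;base64,AAAA", "data-original-src": "//img.example.com/a/b.webp?x=1"}
url = pick_image_url(lazy, page)
print(url, file_extension(url))
```

With BeautifulSoup you would pass img.attrs as the first argument.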

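How to use Selenium

- Download chromedriver, put it somewhere convenient, and pass its path, e.g. webdriver.Chrome(service=Service("./chromedriver.exe")) (older Selenium versions also accepted webdriver.Chrome("./chromedriver.exe"))

- The webdriver_manager library can install and configure a matching driver automatically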

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)

>>> driver.get("https://www.naver.com/")  # go to Naver

>>> elem = driver.find_element("class name", "MyView-module__link_login___HpHMW")  # find the login button
>>> elem.click()  # click it to move to the login page

>>> driver.find_element("id", "query")  # look up the search box
<selenium.webdriver.remote.webelement.WebElement (session="0f5ef1bb3778ce94fbedfb4c64e92656", element="B8F1B5436BCA0F440E6FF273573DEB2A_element_6765")>
>>> driver.find_element("id", "query").send_keys("김민서")  # type a search term

>>> from selenium.webdriver.common.keys import Keys
>>> driver.find_element("id", "query").send_keys(Keys.ENTER)  # press Enter

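The find_element_by_* methods were deprecated in Selenium 4 and have since been removed. Use the forms below instead.

from selenium.webdriver.common.by import By

driver.find_element(By.CLASS_NAME, "")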
driver.find_element(By.ID, "")
driver.find_element(By.CSS_SELECTOR, "")
driver.find_element(By.NAME, "")
driver.find_element(By.TAG_NAME, "")
driver.find_element(By.XPATH, "")
driver.find_element(By.LINK_TEXT, "")
driver.find_element(By.PARTIAL_LINK_TEXT, "")

driver.close() => closes only the current tab
driver.quit() => shuts down the whole browser session

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".ContentTitle__title_area--x24vt"))
)

- Waits up to 10 seconds for the expected condition to hold; if it never does, a TimeoutException is raised
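Conceptually, the wait above keeps re-checking a condition until it succeeds or the time limit runs out. The sketch below illustrates that polling idea in plain Python. It is not Selenium's implementation, and wait_until, ready and the 0.5 second default interval are names and values chosen for the example.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # hand back whatever the condition produced
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Hypothetical condition: becomes truthy on the third check.
calls = {"n": 0}


def ready():
    calls["n"] += 1
    return "element" if calls["n"] >= 3 else None


result = wait_until(ready, timeout=2.0, poll_interval=0.01)
print(result, calls["n"])
```

This is why an explicit wait is usually preferable to a fixed time.sleep(): it returns as soon as the element shows up instead of always paying the full delay.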

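options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # options.headless = True was deprecated and later removed in newer Selenium 4 releases
driver = webdriver.Chrome(service=service, options=options)

- With the headless option set like this, the Chrome engine runs in the background without opening a window

driver.get_screenshot_as_file("test.png")

- You can also save a screenshot to a file like this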

Troubleshoot 1 - what if requests.get() does not return the page?

- Add a "User-Agent" header

- Search Google for "user agent string" to find yours (it differs by browser and by mobile/PC)

import requests

url = "https://cse.snu.ac.kr/department-notices?c%5B%5D=1&keys="
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"}
res = requests.get(url, headers=headers)

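Troubleshoot 2 - a dynamic page that also needs scrolling (feat. Naver Webtoon)

- On the Naver Webtoon page (https://comic.naver.com/webtoon), requests returns only the HTML skeleton

- The content is filled in as JS runs, and simply opening the page with selenium did not trigger the dynamic load either

driver.get(url)
driver.execute_script("window.scrollTo(0, 0);")

- Running an otherwise meaningless JS snippet like this made the dynamic content load

- But the content further down the scrollable page was still not all loaded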

TYPE1. Infinite-scroll pages

import time

SCROLL_PAUSE_TIME = 0.5

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

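- Useful for infinite-scroll pages such as Facebook, where the total page height changes dynamically as you scroll down

- Read document.body.scrollHeight, then scroll to the full page height in one go

- Compare with the last height to see whether more content was loaded; if nothing was added, break out of the loop (it means the last page was reached)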

(https://stackoverflow.com/questions/20986631/how-can-i-scroll-a-web-page-using-selenium-webdriver-in-python)
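The same loop can be wrapped in a function with an upper bound on the number of rounds, so a page that never stops growing cannot hang the script. This is a sketch under assumptions: scroll_until_stable, max_rounds and FakeDriver are names invented here, and FakeDriver only imitates the execute_script calls used, so the logic can be tried without a browser.

```python
import time


def scroll_until_stable(driver, pause=0.5, max_rounds=50):
    """Scroll to the bottom until the page height stops growing; return rounds used."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # nothing new was loaded
        last_height = new_height
    return max_rounds  # gave up: the page kept growing


class FakeDriver:
    """Stand-in for a Selenium driver: each scroll 'loads' the next height."""

    def __init__(self, heights):
        self._heights = list(heights)
        self._index = 0

    def execute_script(self, script):
        if script.startswith("return"):
            return self._heights[self._index]
        self._index = min(self._index + 1, len(self._heights) - 1)
        return None


rounds_used = scroll_until_stable(FakeDriver([1000, 2000, 3000]), pause=0)
print(rounds_used)
```

With a real driver the call would simply be scroll_until_stable(driver).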

TYPE2. Scrollable pages with lazy-loaded content

screen_height = driver.execute_script("return window.innerHeight;")  # height of the browser viewport
i = 1
while True:
    # scroll down one viewport height at a time
    driver.execute_script(f"window.scrollTo(0, {screen_height*i});")
    i += 1
    time.sleep(0.1)
    scroll_height = driver.execute_script("return document.body.scrollHeight;")
    if (screen_height * i) > scroll_height:
        break  # stop once the next scroll target lies past the bottom of the page

- Scrolling step by step is needed to trigger loading of the content in the middle of the page

- Read the viewport height, then scroll by that amount on every iteration until scrollHeight is reached

=> This is how the Naver Webtoon page has to be fetched

Troubleshoot 3 - failing to fetch Coupang page content

- With requests, the data only comes back when accept-language is set in addition to user-agent

import requests

url = "https://www.coupang.com/np/search?q=%EB%85%B8%ED%8A%B8%EB%B6%81&channel=recent&component=&eventCategory=SRP&trcid=&traid=&sorter=scoreDesc&minPrice=&maxPrice=&priceRange=&filterType=&listSize=36&filter=&isPriceRange=false&brand=&offerCondition=&rating=0&page=1&rocketAll=false&searchIndexingToken=1=9&backgroundColor="
headers = {
    # the language header must be added for a result to come back
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36",
    "Accept-Language": "ko-KR,ko;q=0.8,en-US;q=0.5,en;q=0.3"
}
res = requests.get(url, headers=headers)
res.raise_for_status()

- With selenium, headless mode has to be turned off for the page to be fetched properly

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
# headless mode is left off here (no --headless argument); adjust as needed
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
options.add_argument("lang=ko_KR")
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=options)

i = 1  # page number to request
url = "https://www.coupang.com/np/search?q=%EB%85%B8%ED%8A%B8%EB%B6%81&channel=recent&component=&eventCategory=SRP&trcid=&traid=&sorter=scoreDesc&minPrice=&maxPrice=&priceRange=&filterType=&listSize=36&filter=&isPriceRange=false&brand=&offerCondition=&rating=0&page={}&rocketAll=false&searchIndexingToken=1=9&backgroundColor=".format(i)
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, "lxml")
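Rather than pasting the long URL and calling .format() on it, the query string can be built with urllib.parse.urlencode, which also takes care of percent-encoding the Korean search term. A minimal sketch: build_search_url is a hypothetical helper, and whether the site accepts this shortened set of parameters is an untested assumption (the parameter names themselves come from the URL above).

```python
from urllib.parse import urlencode

BASE_URL = "https://www.coupang.com/np/search"


def build_search_url(query, page, list_size=36):
    """Build a search URL for one result page (parameter names taken from the original URL)."""
    params = {
        "q": query,
        "channel": "recent",
        "sorter": "scoreDesc",
        "listSize": list_size,
        "page": page,
    }
    # urlencode percent-encodes non-ASCII text as UTF-8
    return f"{BASE_URL}?{urlencode(params)}"


urls = [build_search_url("노트북", page) for page in range(1, 6)]
print(urls[0])
```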

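Troubleshoot 4 - failing to fetch content from multiple Coupang pages

- When visiting several pages in a for loop, opening the next page failed with a browser error saying the page may be temporarily down or may have moved permanently to a new address

TRY1. Create the web driver with driver = webdriver.Chrome(service=service, options=options) on every iteration and shut it down with driver.quit()

=> Very inefficient

TRY2. Explicitly clear the web driver's cookies/cache after visiting each page

- Prevents problems caused by state carried over between pages

- Maybe there is a security policy that detects reuse of previously issued cookies? (just a guess)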

import re

for i in range(1,6):
    url = "https://www.coupang.com/np/search?q=%EB%85%B8%ED%8A%B8%EB%B6%81&channel=recent&component=&eventCategory=SRP&trcid=&traid=&sorter=scoreDesc&minPrice=&maxPrice=&priceRange=&filterType=&listSize=36&filter=&isPriceRange=false&brand=&offerCondition=&rating=0&page={}&rocketAll=false&searchIndexingToken=1=9&backgroundColor=".format(i)
    driver.get(url)
    html = driver.page_source
    soup = BeautifulSoup(html, "lxml")
    driver.delete_all_cookies()

    items = soup.find_all("li", attrs={"class": re.compile("^search-product")})
    for item in items:
        # skip sponsored (ad) products
        ad_badge = item.find("span", attrs={"class": "ad-badge-text"})
        if ad_badge:
            continue
        # skip Apple products
        name = item.find("div", attrs={"class":"name"}).get_text().strip()
        if "Apple" in name:
            continue
        price = item.find("strong", attrs={"class":"price-value"}).get_text()
        # skip products without a rating
        rate = item.find("em", attrs={"class":"rating"})
        if rate:
            rate = rate.get_text()
        else:
            continue
        # skip products without reviews
        rate_total = item.find("span", attrs={"class":"rating-total-count"})
        if rate_total:
            rate_total = rate_total.get_text()[1:-1]  # drop the first and last character (the surrounding parentheses)
        else:
            continue

        if float(rate) >= 4.5 and int(rate_total) >= 50:
            print(f"page {i}")
            print(f"product : {name}")
            print(f"price : {price}")
            print(f"rating {rate} ({rate_total} reviews)")
            print("https://www.coupang.com"+item.find("a", attrs={"class":"search-product-link"})["href"])
            print("-"*100)
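The text-to-number conversions in the filter are the fragile part: int() fails on anything that is not plain digits, for example a count with a thousands separator. Below is a sketch of slightly more defensive helpers. parse_int and passes_filter are hypothetical names, and the sample strings are made up, not taken from a real response.

```python
def parse_int(text):
    """'(1,234)' or '1,299,000' -> int; returns None when the text has no digits."""
    digits = "".join(ch for ch in text if ch in "0123456789")
    return int(digits) if digits else None


def passes_filter(rating_text, count_text, min_rating=4.5, min_reviews=50):
    """Apply the same thresholds as the loop above, tolerating missing values."""
    try:
        rating = float(rating_text)
    except (TypeError, ValueError):
        return False  # no rating, or not a number
    count = parse_int(count_text or "")
    return count is not None and rating >= min_rating and count >= min_reviews


print(parse_int("(1,234)"), parse_int("1,299,000"))
print(passes_filter("4.5", "(50)"), passes_filter("4.0", "(500)"))
```

In the loop this would replace the float(rate) and int(rate_total) comparison with passes_filter(rate, rate_total).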

Troubleshoot 5 - getting flagged by the captcha when logging in to Naver

driver.find_element(By.ID, "id").send_keys("my_id")
driver.find_element(By.ID, "pw").send_keys("my_pw")
driver.find_element(By.ID, "log.login").click()

- Entering the id/password like this and then logging in gets flagged by the Naver captcha

import time

# go to Naver
driver.get("https://www.naver.com/")

# login button
ele = driver.find_element(By.XPATH, '//*[@id="account"]/div/a')
ele.click()

# fill in id and pw, then log in
input_js = ' \
        document.getElementById("id").value = "{id}"; \
        document.getElementById("pw").value = "{pw}"; \
    '.format(id = "my_id", pw = "my_pw")

driver.execute_script(input_js)
driver.find_element(By.ID, "log.login").click()

# skip registering this device
time.sleep(1)
driver.find_element(By.ID, "new.dontsave").click()

- Setting the field values by running JS through driver.execute_script() like this logs in without any problem
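One caveat with building the JS through str.format(): a quote or backslash in the id or password breaks the script. execute_script() accepts extra arguments that the script can read as arguments[0], arguments[1], and so on, which avoids the quoting problem. A minimal sketch: fill_login_form and RecordingDriver are hypothetical names, RecordingDriver only records the call so the snippet runs without a browser, and whether this variant is treated the same way by the captcha has not been checked.

```python
LOGIN_SCRIPT = """
document.getElementById("id").value = arguments[0];
document.getElementById("pw").value = arguments[1];
"""


def fill_login_form(driver, user_id, password):
    """Set both fields via JS, passing the values as script arguments."""
    driver.execute_script(LOGIN_SCRIPT, user_id, password)


class RecordingDriver:
    """Stand-in for a Selenium driver that just records execute_script calls."""

    def __init__(self):
        self.calls = []

    def execute_script(self, script, *args):
        self.calls.append((script, args))


fake = RecordingDriver()
fill_login_form(fake, 'my"id', "my_pw")  # the quote in the id needs no escaping
print(fake.calls[0][1])
```

With a real driver, fill_login_form(driver, "my_id", "my_pw") would take the place of the input_js lines.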
