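When the file is read with BeautifulSoup as shown below, the input encoding is detected automatically as utf-8 (soup.originalEncoding reports u'utf-8').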
In [1]: file_name = "d:\\sdd_word\\result\\xml\\class_c_c_i_status_manager.xml"
In [2]: f = open(file_name, 'r')
In [3]: contents = f.read()
In [4]: from BeautifulSoup import BeautifulSoup
In [5]: soup = BeautifulSoup(contents)
In [6]: soup.originalEncoding
Out[6]: u'utf-8'
At this point, evaluating a memberdef lookup at the prompt, which echoes the result to stdout, raised the following error.
In [31]: soup.find('memberdef')
Out[31]: ---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
C:\Documents and Settings\70735\<ipython console> in <module>()
D:\Developer\Python_26\lib\site-packages\IPython\Prompts.pyc in __call__(self, arg)
550
551 # and now call a possibly user-defined print mechanism
--> 552 manipulated_val = self.display(arg)
553
554 # user display hooks can change the variable to be stored in
D:\Developer\Python_26\lib\site-packages\IPython\Prompts.pyc in _display(self, arg)
576 return IPython.generics.result_display(arg)
577 except TryNext:
--> 578 return self.shell.hooks.result_display(arg)
579
580 # Assign the default display method:
D:\Developer\Python_26\lib\site-packages\IPython\hooks.pyc in __call__(self, *args, **kw)
139 #print "prio",prio,"cmd",cmd #dbg
140 try:
--> 141 ret = cmd(*args, **kw)
142 return ret
143 except ipapi.TryNext, exc:
D:\Developer\Python_26\lib\site-packages\IPython\hooks.pyc in result_display(self, arg)
169
170 if self.rc.pprint:
--> 171 out = pformat(arg)
172 if '\n' in out:
173 # So that multi-line strings line up with the left column of
D:\Developer\Python_26\lib\pprint.pyc in pformat(self, object)
109 def pformat(self, object):
110 sio = _StringIO()
--> 111 self._format(object, sio, 0, 0, {}, 0)
112 return sio.getvalue()
113
D:\Developer\Python_26\lib\pprint.pyc in _format(self, object, stream, indent, allowance, context, level)
127 self._readable = False
128 return
--> 129 rep = self._repr(object, context, level - 1)
130 typ = _type(object)
131 sepLines = _len(rep) > (self._width - 1 - indent - allowance)
D:\Developer\Python_26\lib\pprint.pyc in _repr(self, object, context, level)
221 def _repr(self, object, context, level):
222 repr, readable, recursive = self.format(object, context.copy(),
--> 223 self._depth, level)
224 if not readable:
225 self._readable = False
D:\Developer\Python_26\lib\pprint.pyc in format(self, object, context, maxlevels, level)
233 and whether the object represents a recursive construct.
234 """
--> 235 return _safe_repr(object, context, maxlevels, level)
236
237
D:\Developer\Python_26\lib\pprint.pyc in _safe_repr(object, context, maxlevels, level)
318 return format % _commajoin(components), readable, recursive
319
--> 320 rep = repr(object)
321 return rep, (rep and not rep.startswith('<')), False
322
UnicodeEncodeError: 'ascii' codec can't encode characters in position 305-306: ordinal not in range(128)
In [32]:
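The encoding error occurred while the result of the soup('memberdef') lookup was being written to the screen. The traceback ends at repr(object) inside pprint, where the non-ASCII characters in the result could not be encoded with the ascii codec.
If the result is not sent to stdout, as in lists = soup('memberdef'), the problem does not occur.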
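The failure can be reproduced without BeautifulSoup or IPython. The sketch below is an illustration under assumptions, not the code path from the traceback: the text value is a hypothetical stand-in for the Korean text in the XML, and can_encode is a helper introduced here. It shows that holding text in a variable involves no encoding step, while encoding the same text with the ascii codec fails with the message seen in the traceback above. It should run on either Python 2.6+ or Python 3.

```python
# Hypothetical stand-in for non-ASCII (Korean) text inside the parsed XML.
# The two escaped characters are Hangul syllables, written as escapes so the
# source file itself stays ASCII.
text = u"<para>\uc11c\ubc84</para>"

# Assigning the text to a name performs no encoding, so nothing can fail here.
held = text


def can_encode(value, codec):
    """Return True if every character of value is representable in codec."""
    try:
        value.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Capture the error message produced by the ascii codec.
try:
    text.encode("ascii")
    message = ""
except UnicodeEncodeError as exc:
    message = str(exc)

print(can_encode(text, "ascii"))
print(can_encode(text, "utf-8"))
print(can_encode(text, "cp949"))
print(message)
```

The ascii codec only covers code points below 128, so any display path that falls back to it will fail on Korean text, whereas utf-8 and cp949 can both represent it.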
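In any case, there has to be a way to check the contents on screen, and this can be done as follows.
Call renderContents(encoding name), passing the name of the codec to render with; the session below uses 'cp949', and the output shown is the rendered contents of a single memberdef element.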
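As a rough illustration of what the codec name controls: the result is a byte string in that encoding, which is why the Korean text appears as \x escapes in the output that follows. The sketch below takes the bytes found inside the <para> element of that output, decodes them with cp949, and compares them with a UTF-8 encoding of the same text. It uses only built-in string methods rather than BeautifulSoup, and the variable names are illustrative.

```python
# Bytes copied from inside the <para> element of the session output.
raw = b"\xbc\xad\xb9\xf6 \xc1\xd6\xbc\xd2 \xba\xaf\xbc\xf6"

# Decoding with the codec that produced the bytes recovers the text.
decoded = raw.decode("cp949")

# Re-encoding with the same codec reproduces the original bytes exactly.
round_trip = decoded.encode("cp949")

# The same text encoded as UTF-8 is a different, longer byte sequence.
as_utf8 = decoded.encode("utf-8")

print(repr(raw))
print(repr(as_utf8))
print(round_trip == raw)
```

The escapes in the output are therefore not corruption; they are the repr of a byte string whose encoding no longer matches what the ascii fallback can handle, and choosing a codec that matches the console (cp949 here) is what makes the text displayable.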
In [39]: lists.renderContents('cp949')
Out[39]: '\n<type>struct sockaddr_in</type>\n<definition>struct sockaddr_in CMsg
Handler::m_sServAddr</definition>\n<argsstring></argsstring>\n<name>m_sServAddr<
/name>\n<briefdescription>\n<para>\xbc\xad\xb9\xf6 \xc1\xd6\xbc\xd2 \xba\xaf\xbc
\xf6 </para> </briefdescription>\n<detaileddescription>\n</detaileddescription>\
n<inbodydescription>\n</inbodydescription>\n<location file="C:/JTDLS_IDMS2/MsgHa
ndler/CMsgHandler.h" line="45" bodyfile="C:/JTDLS_IDMS2/MsgHandler/CMsgHandler.h
" bodystart="45" bodyend="-1"></location>\n'