Program to extract 汉字 from mixed text?

calkins
August 14, 2008, 03:37 PM posted in General Discussion

Does anyone know of an online program that will do this?  Basically, I would like to be able to extract characters from text that includes 汉字, pinyin, and English.  Example...I would like a program that I can paste the following into:

建筑声学     jiànzhù shēngxué    architectural acoustics

...and the program will get rid of the pinyin and English and return a result of:

建筑声学

If anyone knows of a program that does this, please let me know.  Thanks!

goulnik
August 14, 2008, 03:58 PM

If your text layout is always the same, I put together a simple webpage (based on JavaScript) that I can send you to run locally.

I use something similar, though a little more complex, to clean up Wenlin definitions, and it can easily be adapted to different settings. It is based on very simple parsing, so you need systematic delimiters:

{hanzi} {tab} {pinyin} {tab} {english}

or

{hanzi} {spaces} {pinyin} {spaces} {english}

or some such combination, to make sure you don't break your English definition in the middle because it contains successive spaces.
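
For illustration only, here is a minimal Python sketch of that kind of delimiter-based parsing. It is not the JavaScript page mentioned above; the name `split_entry` and the rule "a tab, or two or more spaces, separates fields" are assumptions made for this example.

```python
import re

# Assumption: fields are separated by a tab (with optional padding) or by a
# run of two or more spaces, so single spaces inside a field are left alone.
FIELD_SEP = re.compile(r"\s*\t\s*| {2,}")

def split_entry(line):
    """Split one vocabulary line into (hanzi, pinyin, english)."""
    # maxsplit=2 stops after the second delimiter, so repeated spaces
    # inside the English definition do not break it into extra fields.
    parts = FIELD_SEP.split(line.strip(), maxsplit=2)
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {line!r}")
    return tuple(part.strip() for part in parts)

hanzi, pinyin, english = split_entry(
    "建筑声学     jiànzhù shēngxué    architectural acoustics"
)
print(hanzi)
```

Limiting the split to the first two delimiters is what keeps the English definition in one piece, which is the concern raised above.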

If what you need is to slice just about any text with no particular order, that can also be done; it's just a little trickier, as you need to pattern-match against pinyin tables (accented and/or numbered). I actually wrote a few other functions to do this, but I forget exactly what they do and how, so I would have to take another close look.
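
As a rough idea of what token-by-token classification could look like, here is a Python sketch. It does not use a real pinyin syllable table: it swaps in a much cruder heuristic (tone-marked vowels, or letters followed by a tone digit 1-5), and the names `classify`, `TONE_MARKED`, and `NUMBERED` are made up for this example.

```python
import re
import unicodedata

# Assumption: any tone-marked vowel signals accented pinyin. This misfires on
# words such as "café" and misses neutral-tone syllables such as "de".
TONE_MARKED = "āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ"
HANZI = re.compile(r"[\u4e00-\u9fff]")
# Numbered pinyin: letters followed by a tone digit, repeated (e.g. "jian4zhu4").
NUMBERED = re.compile(r"(?:[a-zü]+[1-5])+")

def classify(token):
    """Return 'hanzi', 'pinyin', or 'other' for one whitespace-separated token."""
    if HANZI.search(token):
        return "hanzi"
    # Compose accents so "a" + combining mark compares equal to "à".
    word = unicodedata.normalize("NFC", token).strip(".,;:!?()").lower()
    if NUMBERED.fullmatch(word):
        return "pinyin"
    if any(ch in TONE_MARKED for ch in word):
        return "pinyin"
    return "other"

line = "建筑声学 jiànzhù shēngxué architectural acoustics"
print([(token, classify(token)) for token in line.split()])
```

A real implementation would check each candidate against a full table of valid syllables, as described above, to avoid the false positives this shortcut allows.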

calkins
August 14, 2008, 04:04 PM

Thanks goulniky.  Ideally, I'd like a program that would extract hanzi from any type of text layout, but that's probably unlikely.

I'd definitely give your webpage a try.  Appreciate it.

goulnik
August 14, 2008, 04:05 PM

If all you want is to remove anything that's not hanzi, it's really very easy. I'll try to post it here tonight.
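
To show why this case is simple, here is a minimal Python sketch (not the page posted later in this thread). It assumes "hanzi" means the main CJK Unified Ideographs block plus Extension A; `keep_hanzi` is a name invented for the example.

```python
import re

# Assumption: only U+4E00-U+9FFF and U+3400-U+4DBF count as hanzi here;
# rarer extension blocks are deliberately left out of this sketch.
NON_HANZI = re.compile(r"[^\u4e00-\u9fff\u3400-\u4dbf]+")

def keep_hanzi(text, sep=""):
    """Drop everything that is not a Chinese character.

    Runs of hanzi that were separated by other text are joined with `sep`.
    """
    runs = [run for run in NON_HANZI.split(text) if run]
    return sep.join(runs)

print(keep_hanzi("建筑声学     jiànzhù shēngxué    architectural acoustics"))
# prints: 建筑声学
```

Passing `sep=" "` or `sep="\n"` keeps separate words apart when the input has several entries.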

goulnik
August 14, 2008, 08:56 PM

Here you go. It should work on pretty much any text:

Copy and paste this page, for instance, into the text box at the top, click on [remove], and you get the result at the bottom.

You can select what to keep and what to remove, but because Unicode encodes simplified and traditional characters in the same blocks, you cannot separate traditional from simplified characters.
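
A small Python check illustrates the point. The helper `in_main_cjk_block` is hypothetical, and the character pairs are just sample simplified/traditional forms; telling the two apart would need a separate mapping table, which a range check does not provide.

```python
def in_main_cjk_block(ch):
    """True if ch lies in the main CJK Unified Ideographs block (U+4E00-U+9FFF)."""
    return 0x4E00 <= ord(ch) <= 0x9FFF

# Each pair is (simplified, traditional). Both forms pass the same range
# test, so a range-based filter cannot tell them apart.
for simplified, traditional in [("学", "學"), ("声", "聲")]:
    print(
        simplified, f"U+{ord(simplified):04X}", in_main_cjk_block(simplified),
        traditional, f"U+{ord(traditional):04X}", in_main_cjk_block(traditional),
    )
```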

You can download the file (save as) onto your machine.

goulnik.com/chinese/parse/index.htm

calkins
August 14, 2008, 10:54 PM

This is awesome, goulniky!  Exactly what I was looking for, thanks so much.

The options on what to keep and what to remove are really helpful.  I will be using this a lot.

You poddies who create these little programs never cease to amaze me.